stellahsr 2024-03-27 11:31:34 +08:00
parent f99c7b354e
commit 3b8c83db3b
5 changed files with 4 additions and 4 deletions

@@ -0,0 +1,85 @@
# Dataset Description
The `sub_swebench` index is a subset of SWE-bench with two columns, each containing 50 instance IDs.
The instance IDs are a balanced sample of the pass and fail cases reported by CognitionAI on SWE-bench.
The `scikit-learn-68` index is another subset of the CognitionAI results on SWE-bench (all tasks from the scikit-learn repository), also with two columns:
- pass: 12
- fail: 56

Sampling list: https://github.com/CognitionAI/devin-swebench-results/tree/main/
Original dataset: https://huggingface.co/datasets/princeton-nlp/SWE-bench/
## Fail Dataset Description
There are a total of 491 txt files listed.
In the original dataset, the distribution of fail case categories is:
- astropy: 24
- django: 160
- matplotlib: 42
- mwaskom: 4
- pallets: 3
- psf: 9
- pydata: 29
- pylint-dev: 13
- pytest-dev: 20
- scikit-learn: 56
- sphinx-doc: 46
- sympy: 85
### After balanced sampling:
There are a total of 50 txt files listed.
- django: 16
- scikit-learn: 6
- sympy: 10
- sphinx-doc: 5
- matplotlib: 4
- pydata: 3
- astropy: 2
- pytest-dev: 2
- psf: 1
- pylint-dev: 1
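
The description does not say how the balanced sample was drawn. As a rough illustration only, the sketch below allocates 50 slots across the fail categories in proportion to their size using a largest-remainder rule. The function name `allocate_proportional` is hypothetical, and this rule is an assumption: it lands close to the counts listed above but does not reproduce them exactly.

```python
from math import floor


def allocate_proportional(counts: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` slots across categories in proportion to `counts`.

    Largest-remainder rule: floor each exact share, then hand the
    leftover slots to the categories with the largest fractional parts.
    """
    population = sum(counts.values())
    exact = {name: n * total / population for name, n in counts.items()}
    quota = {name: floor(share) for name, share in exact.items()}
    leftover = total - sum(quota.values())
    by_remainder = sorted(counts, key=lambda name: exact[name] - quota[name], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1
    return quota


# Fail-case counts copied from the distribution listed in this section.
fail_counts = {
    "astropy": 24, "django": 160, "matplotlib": 42, "mwaskom": 4,
    "pallets": 3, "psf": 9, "pydata": 29, "pylint-dev": 13,
    "pytest-dev": 20, "scikit-learn": 56, "sphinx-doc": 46, "sympy": 85,
}
quota = allocate_proportional(fail_counts, 50)
```

Very small categories such as mwaskom and pallets receive no slots under this rule, which matches their absence from the sampled list.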
## Pass Dataset Description
There are a total of 79 txt files listed.
In the original dataset, the distribution of pass case categories is:
- astropy: 4
- django: 38
- matplotlib: 3
- pydata: 3
- pytest-dev: 6
- scikit-learn: 12
- sphinx-doc: 2
- sympy: 11
### After balanced sampling:
There are a total of 50 txt files listed.
- django: 23
- scikit-learn: 8
- sympy: 7
- pytest-dev: 4
- astropy: 3
- pydata (xarray): 2
- matplotlib: 2
- sphinx-doc: 1
## scikit-learn-68 Dataset Description
- instance_id_pass: 12
- instance_id_fail: 56
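
A minimal sketch of reading such a two-column index, assuming the CSV header uses the column names `instance_id_pass` and `instance_id_fail` shown above. The helper `read_id_columns` and the instance IDs in the sample are placeholders, not values from the real file. Because the two columns have different lengths (12 and 56), empty cells are skipped.

```python
import csv
import io


def read_id_columns(handle) -> dict[str, list[str]]:
    """Return each CSV column as a list of non-empty, stripped values."""
    reader = csv.DictReader(handle)
    columns: dict[str, list[str]] = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Short rows yield None values; surplus fields land under a None key.
            if name in columns and value and value.strip():
                columns[name].append(value.strip())
    return columns


# Placeholder content standing in for scikit-learn-68.csv.
sample = io.StringIO(
    "instance_id_pass,instance_id_fail\n"
    "example__repo-0001,example__repo-0002\n"
    ",example__repo-0003\n"
    ",example__repo-0004\n"
)
ids = read_id_columns(sample)
```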

Binary file not shown.

Binary file not shown.

@@ -20,13 +20,13 @@ def load_oracle_dataset(dataset_name_or_path: str = "", split: str = "test", exi
     lens = np.array(list(map(len, dataset["text"])))
     dataset = dataset.select(np.argsort(lens))
-    if len(existing_ids) > 0:
+    if existing_ids:
         dataset = dataset.filter(
             lambda x: x["instance_id"] not in existing_ids,
             desc="Filtering out existing ids",
             load_from_cache_file=False,
         )
-    if len(SCIKIT_LEARN_IDS) > 0:
+    if SCIKIT_LEARN_IDS:
         dataset = dataset.filter(
             lambda x: x["instance_id"] in SCIKIT_LEARN_IDS,
             desc="Filtering out subset_instance_ids",
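
The hunk above swaps `len(x) > 0` for a plain truthiness test. The sketch below (helper names are hypothetical) checks that the two forms agree for built-in containers and shows where they differ: `None` is accepted by the truthiness form but raises `TypeError` under `len`. Note also that a NumPy array with more than one element raises `ValueError` when used in a boolean context, so the shorter form assumes the IDs are held in a list, set, or similar built-in container.

```python
def filter_needed_len(ids) -> bool:
    # Old form from the hunk: explicit length comparison.
    return len(ids) > 0


def filter_needed_truthy(ids) -> bool:
    # New form from the hunk: rely on container truthiness.
    return bool(ids)


containers = [[], ["id-1"], set(), {"id-1"}, (), ("id-1",), {}, {"id-1": True}]
agree = all(filter_needed_len(c) == filter_needed_truthy(c) for c in containers)

# None has no length, so the old form fails where the new one returns False.
try:
    filter_needed_len(None)
    len_accepts_none = True
except TypeError:
    len_accepts_none = False

truthy_on_none = filter_needed_truthy(None)
```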

@@ -5,8 +5,8 @@ import pandas as pd
 from metagpt.const import METAGPT_ROOT
-SUBSET_DATASET = METAGPT_ROOT / "sub_swebench_dataset" / "sub_swebench.csv"
-SUBSET_DATASET_SKLERARN = METAGPT_ROOT / "sub_swebench_dataset" / "scikit-learn-68.csv"
+SUBSET_DATASET = METAGPT_ROOT / "benchmark" / "swe_bench" / "sub_swebench_dataset" / "sub_swebench.csv"
+SUBSET_DATASET_SKLERARN = METAGPT_ROOT / "benchmark" / "sub_swebench_dataset" / "scikit-learn-68.csv"
 TESTBED = METAGPT_ROOT / "benchmark" / "swe_bench" / "data" / "repos"
 # SCIKIT_LEARN_IDS: A list of instance identifiers from 'sub_swebench.csv' within SUBSET_DATASET.
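
As written in this hunk, the two CSV constants point at different parent directories (`benchmark/swe_bench/sub_swebench_dataset` and `benchmark/sub_swebench_dataset`). A small existence check can surface a path that does not resolve. The sketch below is illustrative only: `missing_files` is a hypothetical helper, and a temporary directory stands in for `METAGPT_ROOT`.

```python
import tempfile
from pathlib import Path


def missing_files(paths: dict[str, Path]) -> list[str]:
    """Return the names of the constants whose files do not exist."""
    return [name for name, path in paths.items() if not path.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # hypothetical stand-in for METAGPT_ROOT
    subset = root / "benchmark" / "swe_bench" / "sub_swebench_dataset" / "sub_swebench.csv"
    sklearn = root / "benchmark" / "sub_swebench_dataset" / "scikit-learn-68.csv"

    # Create only the first file to show how a missing path is reported.
    subset.parent.mkdir(parents=True)
    subset.write_text("placeholder\n")

    report = missing_files({
        "SUBSET_DATASET": subset,
        "SUBSET_DATASET_SKLERARN": sklearn,
    })
```

Running a check like this before the CSVs are read turns a late file-not-found failure into an early, named report.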