Rate Difficulty

To see how SWE-smith compares against real-world tasks (e.g., SWE-bench), we LoRA fine-tuned a Qwen 2.5 Coder 32B Instruct model on 1.5k human ratings of the difficulty of real-world bugs.

Given the issue text and patch associated with a task instance, the model will rate the task as "easy" (< 15 min), "medium" (15 min - 1 hour), or "hard" (1+ hours).

Inference

You can rate the difficulty of your own task instances by following these steps:

  1. Download the HuggingFace checkpoint (a download sketch follows these steps).

  2. Use sglang to serve the checkpoint. The training scripts available in the SWE-smith repository use Modal as a compute service for hosting inference.

N_HOURS=4 N_GPUS=4 modal run --detach swesmith/train/serve_sglang.py \
    --model-path /path/to/checkpoint \
    --served-model-name gpt-4o \
    --tokenizer-path /path/to/Qwen2.5-Coder-32B-Instruct
  3. Run the following script:
python swesmith/train/difficulty_rater/get_difficulties.py \
    --base_url <URL where model is hosted> \
    --dataset_path path/to/dataset.json
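
For step 1, a minimal sketch of fetching the checkpoint with the huggingface_hub library. The checkpoint repo ID below is a placeholder, not the actual released name; the base-model repo ID used for the tokenizer is real:

from huggingface_hub import snapshot_download

# Placeholder repo ID -- substitute the checkpoint released with SWE-smith.
checkpoint_path = snapshot_download(
    repo_id="your-org/difficulty-rater",
    local_dir="checkpoints/difficulty-rater",
)
# The tokenizer comes from the base model (Qwen/Qwen2.5-Coder-32B-Instruct).
tokenizer_path = snapshot_download(
    repo_id="Qwen/Qwen2.5-Coder-32B-Instruct",
    allow_patterns=["tokenizer*", "*.json"],
    local_dir="checkpoints/Qwen2.5-Coder-32B-Instruct",
)
print(checkpoint_path, tokenizer_path)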

The get_difficulties.py script generates a .json file mapping each task instance to a difficulty score. You can then compute the dataset's difficulty score as the average of all task instance scores.
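
As a sketch of that aggregation, assuming the output file (here named difficulties.json, a hypothetical path) maps instance IDs either to numeric scores or to "easy"/"medium"/"hard" labels, scored 1/5/9 per the convention in the next section:

import json

LABEL_TO_SCORE = {"easy": 1, "medium": 5, "hard": 9}

# Load the per-instance difficulties produced by get_difficulties.py.
with open("difficulties.json") as f:
    difficulties = json.load(f)

# Accept either numeric scores or raw labels, then average.
scores = [
    v if isinstance(v, (int, float)) else LABEL_TO_SCORE[v]
    for v in difficulties.values()
]
print(f"Dataset difficulty: {sum(scores) / len(scores):.3f}")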

Prior Datasets

Using our model, we've assessed the difficulty of existing datasets. Each instance is assigned a score of 1 (easy), 5 (medium), or 9 (hard); the table below reports each dataset's average score along with the number of instances at each difficulty level.

Dataset                | # Instances | Avg. Score | Easy | Medium | Hard
-----------------------|-------------|------------|------|--------|-----
SWE-bench              | 2294        | 5.014      | 438  | 1408   | 446
└── Lite               | 300         | 3.893      | 93   | 197    | 10
└── Verified           | 500         | 3.960      | 173  | 284    | 43
SWE-bench Multimodal   | 510         | 6.036      | 55   | 265    | 186
SWE-gym                | 2438        | 5.625      | 288  | 1456   | 664
└── Lite               | 230         | 3.890      | 67   | 156    | 4
SWE-smith (LM Modify)  | 1000        | 3.304      | 441  | 542    | 17
SWE-smith (LM Rewrite) | 1000        | 5.272      | 68   | 796    | 136
SWE-smith (Procedural) | 1000        | 3.596      | 374  | 603    | 23
SWE-smith (PR Mirror)  | 1000        | 4.876      | 206  | 619    | 175
SWE-smith (Combine)    | 1000        | 5.720      | 52   | 716    | 232

The table shows that SWE-smith task instances are comparable in difficulty to real-world tasks, and that our bug generation techniques produce a wide range of task difficulties.