Generate Issue Text
You have a bunch of task instances with executable environments. You're very close to training SWE-agents on this data. There's one last step - let's generate issue text.
We primarily use LMs to generate issue text:
```bash
python swesmith/issue_gen/generate.py logs/task_insts/<repo>.json \
    --config_file configs/issue_gen/ig_v2.yaml \
    --model anthropic/claude-3-7-sonnet-20250219 \
    --n_workers 4 \
    --experiment_id ig_v2 \
    --use_existing
```
This will generate issue text for each task instance, producing several artifacts along the way:
- Under `logs/issue_gen/ig_v2/<repo>`, there will be a folder for each task instance, containing:
    - `messages.json`: The messages fed to the LM to generate the issue text.
    - `metadata.json`: Contains the issue text + inference cost.
- In the same directory as `logs/task_insts/<repo>.json`, a `logs/issue_gen/<repo>__ig_v2_n1.json` file will be created, which is a copy of the original file with issue text added to each task instance (as the `problem_statement` field).
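To sanity-check the output, a minimal sketch like the one below loads the merged file and prints a few problem statements. It assumes the file is a JSON list of task-instance dictionaries and that each instance carries an `instance_id` field; adjust if your schema differs.

```python
import json

# Placeholder path; substitute your actual repo name.
path = "logs/issue_gen/<repo>__ig_v2_n1.json"

with open(path) as f:
    instances = json.load(f)  # assumed: a list of task-instance dicts

for inst in instances[:3]:
    # `problem_statement` is the field generate.py adds, per the docs above.
    print(inst.get("instance_id", "<no id>"))
    print(inst.get("problem_statement", "")[:200])
    print("---")
```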
Alternatives
In our paper, we discuss several alternatives for generating issue text. While our experiments suggest that LM-generated issue text is the best proxy for real issue text, we provide instructions for the alternatives below.
Static Issue Text
The problem statement is generated by randomly selecting one of 7 static issue text templates.
```bash
python swesmith/issue_gen/get_static.py logs/task_insts/<repo>.json
```
Produces a `logs/issue_gen/<repo>__ig_static.json` file.
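Conceptually, this amounts to drawing one template uniformly at random and filling it in. A hedged sketch follows; the template strings and helper below are illustrative stand-ins, not the actual seven templates shipped with SWE-smith.

```python
import random

# Illustrative stand-ins; the real script ships its own 7 templates.
TEMPLATES = [
    "Something is broken in this repository. The following tests fail:\n{failing_tests}",
    "Unexpected behavior observed. These tests should pass but do not:\n{failing_tests}",
]

def make_static_issue(failing_tests: list[str]) -> str:
    template = random.choice(TEMPLATES)  # uniform random choice over templates
    return template.format(failing_tests="\n".join(failing_tests))
```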
Random F2P Test Case
The problem statement shows a randomly selected Fail-to-Pass test case from the task instance.
```bash
python swesmith/issue_gen/get_from_tests.py logs/task_insts/<repo>.json
```
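In essence, this strategy picks one test at random from the instance's fail-to-pass list. A minimal sketch, assuming instances store that list under a `FAIL_TO_PASS` key (the key name is an assumption; check your task instance schema):

```python
import json
import random

with open("logs/task_insts/<repo>.json") as f:
    instances = json.load(f)  # assumed: a list of task-instance dicts

inst = instances[0]
f2p_tests = inst["FAIL_TO_PASS"]  # assumed key name for fail-to-pass tests
problem_statement = f"The following test fails:\n\n{random.choice(f2p_tests)}"
print(problem_statement)
```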
Original Issue Text
Note

This strategy only works for PR mirrors whose underlying pull request has one or more issues associated with it.
```bash
python swesmith/issue_gen/get_from_pr.py logs/task_insts/<repo>.json
```
Produces a `logs/issue_gen/<repo>__ig_orig.json` file.