Generate Issue Text

You have a bunch of task instances with executable environments. You're very close to training SWE-agents on this data. There's one last step - let's generate issue text.

We primarily use LMs to generate issue text.

python swesmith/issue_gen/generate.py logs/task_insts/<repo>.json \
    --config_file configs/issue_gen/ig_v2.yaml \
    --model anthropic/claude-3-7-sonnet-20250219 \
    --n_workers 4 \
    --experiment_id ig_v2 \
    --use_existing

This will generate issue text for each task instance, producing several artifacts along the way:

  • Under logs/issue_gen/ig_v2/<repo>, there will be a folder for each task instance, containing:
    • messages.json: The messages fed to the LM to generate the issue text.
    • metadata.json: Contains the issue text + inference cost.
  • In the same directory as logs/task_insts/<repo>.json, a logs/issue_gen/<repo>__ig_v2_n1.json file will be created, which is a copy of the original file with issue text added to each task instance (as the problem_statement field).
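Once the run finishes, you can sanity-check the output with a short script. This is a minimal sketch, assuming the augmented file is a JSON list of task-instance records with instance_id and problem_statement fields:

import json

# Load the augmented task instance file (substitute your repo name for <repo>).
path = "logs/issue_gen/<repo>__ig_v2_n1.json"
with open(path) as f:
    task_insts = json.load(f)

# Each record should now carry the generated issue text in `problem_statement`.
missing = [inst["instance_id"] for inst in task_insts if not inst.get("problem_statement")]
print(f"{len(task_insts)} instances, {len(missing)} missing a problem statement")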

Alternatives

In our paper, we discuss several alternatives for generating issue text. While our experiments suggest that LM-generated issue text is the best proxy for real issue text, we provide instructions for the alternatives below.

Static Issue Text

The problem statement is generated by randomly selecting one of 7 static issue text templates.

python swesmith/issue_gen/get_static.py logs/task_insts/<repo>.json

Produces a logs/issue_gen/<repo>__ig_static.json file.
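The selection logic itself is simple; here is a minimal sketch of the idea, with placeholder templates and illustrative field names (the actual 7 templates live in the script):

import random

# Placeholder templates -- the real script ships its own set of 7.
TEMPLATES = [
    "Something is wrong in {repo}. The following tests fail:\n{tests}",
    "After a recent change, these tests no longer pass:\n{tests}",
]

def static_problem_statement(instance: dict) -> str:
    """Pick a random template and fill it from the task instance."""
    template = random.choice(TEMPLATES)
    return template.format(
        repo=instance["repo"],                              # illustrative field name
        tests="\n".join(instance.get("FAIL_TO_PASS", [])),  # illustrative field name
    )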

Random F2P Test Case

The problem statement shows a randomly selected Fail-to-Pass test case from the task instance.

python swesmith/issue_gen/get_from_tests.py logs/task_insts/<repo>.json
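Conceptually, this amounts to quoting one failing test in the issue body. A minimal sketch, with an illustrative field name for the Fail-to-Pass list:

import random

def f2p_problem_statement(instance: dict) -> str:
    """Use one randomly chosen Fail-to-Pass test as the issue text."""
    test_id = random.choice(instance["FAIL_TO_PASS"])  # illustrative field name
    return (
        "The following test currently fails and should pass:\n\n"
        f"    {test_id}\n\n"
        "Please fix the underlying bug so that this test passes."
    )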

Original Issue Text

Note

This strategy only works for PR mirror task instances whose underlying pull request has one or more associated issues.

python swesmith/issue_gen/get_from_pr.py logs/task_insts/<repo>.json

Produces a logs/issue_gen/<repo>__ig_orig.json file.
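Because only some PR mirrors have linked issues, the resulting file typically covers a subset of your task instances. A quick coverage check, assuming the same list-of-records layout as above:

import json

with open("logs/task_insts/<repo>.json") as f:          # substitute your repo name
    all_insts = json.load(f)
with open("logs/issue_gen/<repo>__ig_orig.json") as f:
    orig_insts = json.load(f)

covered = {inst["instance_id"] for inst in orig_insts if inst.get("problem_statement")}
print(f"{len(covered)}/{len(all_insts)} task instances have original issue text")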