SWE-bench
Plain Explanation
Traditional coding tests for AI often focus on one small function, which misses the real work of software engineering: reading a large codebase, interpreting a vague issue, changing multiple files, and not breaking anything else. Teams needed a way to compare models and agents on tasks that look like real GitHub tickets, not toy snippets. SWE-bench does this by turning resolved GitHub issues into evaluation tasks: the AI gets the repository as it existed before the human fix, plus the original issue text, and must produce a patch. Success is measured by whether hidden tests that used to fail now pass, and whether unrelated regression tests still pass.

Mechanically, each task includes two test sets: fail-to-pass tests that should only pass once the bug is fixed, and pass-to-pass tests that must stay green. The model or agent never sees these tests, so it cannot simply "fit to" them. The original benchmark drew its tasks from 12 open-source Python repositories; later variants introduced curated subsets and public leaderboards to track progress and keep results comparable.
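To make the pass criterion concrete, here is a minimal sketch of the resolution check, assuming a checked-out pre-fix commit and pytest-selectable test IDs. The official harness actually runs each task inside a per-repository containerized environment, so treat this as an illustration of the logic, not the real implementation.

```python
import subprocess

def apply_patch(patch_file: str) -> None:
    # Apply the model-generated diff to the pre-fix checkout.
    subprocess.run(["git", "apply", patch_file], check=True)

def tests_pass(test_ids: list[str]) -> bool:
    # Exit code 0 from pytest means every selected test passed.
    return subprocess.run(["python", "-m", "pytest", *test_ids]).returncode == 0

def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # Resolved = the bug-revealing tests now pass AND the regression
    # tests that passed before the fix still pass.
    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)
```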
Examples & Analogies
- Model selection for a coding assistant: A platform team compares candidate models using the Lite and Verified leaderboards to see which ones resolve a higher share of real GitHub issues. They prefer a model whose patches pass both the fail-to-pass and regression checks.
- Agent evaluation on long projects: A research group tests an autonomous coding agent on SWE-bench Pro to see if it can handle long-horizon tasks that require multiple coordinated steps across a repository. They analyze not just final pass rates but whether the agent can navigate complex repo context.
- Reality check with mutated prompts: An engineering org applies a benchmark-mutation approach that rewrites formal GitHub issues into chat-style queries. They observe drops versus the original tasks, revealing that public benchmarks can overestimate performance for interactive IDE use.
At a Glance
| | HumanEval | SWE-bench (Lite/Verified) | SWE-bench Pro |
|---|---|---|---|
| Task scope | Single-function coding | Repo-level bug/issue patches | Long-horizon repo tasks |
| Test visibility | Often explicit/known | Hidden fail-to-pass + regression | Hidden; multi-step challenges |
| Interaction style | One-shot generation | Single-turn patch generation | Extended, multi-step agent flows |
| Coverage | Standalone Python toy tasks | Python only, 12 repos | Repo-level, long-horizon tasks |
| Data concerns | Saturated on top models | Contamination concerns reported | Recommended to reduce contamination risk |
SWE-bench moves beyond toy snippets to repo-level fixes with hidden tests, while SWE-bench Pro targets longer, multi-step software tasks and is recommended when contamination of public sets is a concern.
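To get a feel for what one task instance contains, the official datasets are published on Hugging Face. Below is a minimal sketch that loads SWE-bench Lite with the `datasets` library; the dataset ID is the published one, and in the released schema the two hidden test lists are stored as JSON-encoded strings, so the decoding step reflects that assumption.

```python
import json
from datasets import load_dataset  # pip install datasets

# The 300 curated task instances live in the "test" split of SWE-bench Lite.
ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
task = ds[0]

print(task["repo"], task["base_commit"][:8])  # source repo + pre-fix commit
print(task["problem_statement"][:200])        # original GitHub issue text
# Evaluation-only test lists (hidden from the model at inference time):
print(json.loads(task["FAIL_TO_PASS"])[:3])   # must flip from fail to pass
print(json.loads(task["PASS_TO_PASS"])[:3])   # must stay green
```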
Where and Why It Matters
- Frontier evaluation practice: OpenAI's SWE-bench Verified subset has since drawn audits reporting test-design flaws and contamination, with a recommendation to pair it with, or shift frontier reporting toward, SWE-bench Pro.
- Public leaderboards as a norm: the Lite (300 tasks) and Verified (500 tasks) leaderboards became central places to compare solutions across common harnesses, with "% resolved" as the headline metric (sketched after this list).
- Agent benchmarking expectations: Long-horizon behavior is now evaluated with SWE-bench Pro to reflect multi-step, repo-wide changes rather than single-turn patches.
- Misuse warning in procurement: Improvements on public Verified can reflect training exposure, so buyers increasingly ask for Pro results or alternative, uncontaminated sets.
- Research focus shift: Studies now examine contamination, test design narrowness/wideness, and more realistic prompt styles to align with IDE chat usage.
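The leaderboard numbers above boil down to one ratio. A hedged sketch, assuming a hypothetical per-instance report mapping each instance ID to whether its patch passed both hidden test sets:

```python
def percent_resolved(report: dict[str, bool]) -> float:
    # Share of task instances whose patch made the fail-to-pass tests
    # pass without breaking any pass-to-pass (regression) tests.
    return 100.0 * sum(report.values()) / len(report)

# Instance IDs follow SWE-bench's owner__repo-issue naming pattern.
print(percent_resolved({
    "django__django-11039": True,
    "sympy__sympy-20590": False,
}))  # -> 50.0
```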
Common Misconceptions
- ❌ Myth: "A higher SWE-bench Verified score always means better real-world coding." → ✅ Reality: Verified is useful but has reported contamination and test-design concerns; treat it as one signal and compare with newer or less-exposed sets such as Pro.
- ❌ Myth: "SWE-bench measures how well chat agents collaborate over many steps." → ✅ Reality: The classic setup checks single-turn patches; long-horizon evaluation is the focus of SWE-bench Pro.
- ❌ Myth: "SWE-bench covers many languages evenly." → ✅ Reality: It is Python-only and weighted toward specific repositories; results can be skewed by that focus.
How It Sounds in Conversation
- "Before we lock a provider, can we see SWE-bench Pro numbers, not just Verified?"
- "Ops flagged that contamination may inflate our % resolved on the public set—let’s validate on Pro."
- "Our agent nails single-turn patches, but stalls on long-horizon tasks; Pro will expose that gap."
- "The board wants one metric; we’ll report SWE-bench Lite/Verified plus a Pro run for due diligence."
- "Let’s replicate using the same hidden test harness and document any environment drift we see."
References
- What’s in a Benchmark? The Case of SWE-Bench in Automated Program Repair (ICSE-SEIP '26)
Study of Lite and Verified leaderboards, submissions, and ecosystem trends.
- Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation
Shows public benchmarks can overestimate agent ability; proposes mutated tasks.
- Overview - SWE-bench
Official benchmark overview: task setup, datasets, and leaderboards.
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Open-source repo and context for evaluating long-horizon software tasks.
- Why SWE-bench Verified no longer measures frontier coding capabilities
Audit of Verified: test flaws and contamination; recommendation to report Pro.
- SWE-bench Explained: How We Measure Real-World Coding
Plain-language comparison vs HumanEval and noted limitations.