SWE-bench
Plain Explanation
Traditional coding tests for AI often focus on one small function, which misses the real work of software engineering: reading a large codebase, interpreting a vague issue, changing multiple files, and not breaking anything else. Teams needed a way to compare models and agents on tasks that look like real GitHub tickets, not toy snippets. SWE-bench does this by turning resolved GitHub issues into evaluation tasks: the AI gets the repository as it existed before the human fix, plus the original issue text, and must produce a patch. Success is measured by whether hidden tests that used to fail now pass, and whether unrelated regression tests still pass.

Mechanically, each task includes two test sets: fail-to-pass tests that should only pass once the bug is fixed, and pass-to-pass tests that must stay green. The model or agent never sees these tests, so it cannot simply "fit to" them. The original benchmark drew its tasks from 12 open-source Python repositories; later variants introduced curated subsets and public leaderboards to track progress and keep results comparable.
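To make the pass criterion concrete, here is a minimal sketch of the resolution check, assuming a checked-out pre-fix commit and pytest-selectable test IDs. The official harness actually runs each task inside a per-repository containerized environment, so treat this as an illustration of the logic, not the real implementation.

```python
import subprocess

def apply_patch(patch_file: str) -> None:
    # Apply the model-generated diff to the pre-fix checkout.
    subprocess.run(["git", "apply", patch_file], check=True)

def tests_pass(test_ids: list[str]) -> bool:
    # Exit code 0 from pytest means every selected test passed.
    return subprocess.run(["python", "-m", "pytest", *test_ids]).returncode == 0

def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # Resolved = the bug-revealing tests now pass AND the regression
    # tests that passed before the fix still pass.
    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)
```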
Examples & Analogies
- Model selection for a coding assistant: A platform team compares candidate models using the Lite and Verified leaderboards to see which ones resolve a higher share of real GitHub issues. They prefer a model whose patches pass both the fail-to-pass and regression checks.
- Agent evaluation on long projects: A research group tests an autonomous coding agent on SWE-bench Pro to see if it can handle long-horizon tasks that require multiple coordinated steps across a repository. They analyze not just final pass rates but whether the agent can navigate complex repo context.
- Reality check with mutated prompts: An engineering org applies a benchmark-mutation approach that rewrites formal GitHub issues into chat-style queries. They observe drops versus the original tasks, revealing that public benchmarks can overestimate performance for interactive IDE use.
At a Glance
| | HumanEval | SWE-bench (Lite/Verified) | SWE-bench Pro |
|---|---|---|---|
| Task scope | Single-function coding | Repo-level bug/issue patches | Long-horizon repo tasks |
| Test visibility | Often explicit/known | Hidden fail-to-pass + regression | Hidden; multi-step challenges |
| Interaction style | One-shot generation | Single-turn patch generation | Extended, multi-step agent flows |
| Coverage | Standalone Python toy tasks | Python only, 12 repos | Repo-level, long-horizon tasks |
| Data concerns | Saturated on top models | Contamination concerns reported | Recommended to reduce contamination risk |
SWE-bench moves beyond toy snippets to repo-level fixes with hidden tests, while SWE-bench Pro targets longer, multi-step software tasks and is recommended when contamination of public sets is a concern.
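To get a feel for what one task instance contains, the official datasets are published on Hugging Face. Below is a minimal sketch that loads SWE-bench Lite with the `datasets` library; the dataset ID is the published one, and in the released schema the two hidden test lists are stored as JSON-encoded strings, so the decoding step reflects that assumption.

```python
import json
from datasets import load_dataset  # pip install datasets

# The 300 curated task instances live in the "test" split of SWE-bench Lite.
ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
task = ds[0]

print(task["repo"], task["base_commit"][:8])  # source repo + pre-fix commit
print(task["problem_statement"][:200])        # original GitHub issue text
# Evaluation-only test lists (hidden from the model at inference time):
print(json.loads(task["FAIL_TO_PASS"])[:3])   # must flip from fail to pass
print(json.loads(task["PASS_TO_PASS"])[:3])   # must stay green
```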
Where and Why It Matters
- Frontier evaluation practice: OpenAI's SWE-bench Verified subset has since drawn audits reporting test-design flaws and contamination, with a recommendation to pair it with, or shift frontier reporting toward, SWE-bench Pro.
- Public leaderboards as a norm: the Lite (300 tasks) and Verified (500 tasks) leaderboards became central places to compare solutions across common harnesses, with "% resolved" as the headline metric (sketched after this list).
- Agent benchmarking expectations: Long-horizon behavior is now evaluated with SWE-bench Pro to reflect multi-step, repo-wide changes rather than single-turn patches.
- Misuse warning in procurement: Improvements on public Verified can reflect training exposure, so buyers increasingly ask for Pro results or alternative, uncontaminated sets.
- Research focus shift: Studies now examine contamination, test design narrowness/wideness, and more realistic prompt styles to align with IDE chat usage.
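The leaderboard numbers above boil down to one ratio. A hedged sketch, assuming a hypothetical per-instance report mapping each instance ID to whether its patch passed both hidden test sets:

```python
def percent_resolved(report: dict[str, bool]) -> float:
    # Share of task instances whose patch made the fail-to-pass tests
    # pass without breaking any pass-to-pass (regression) tests.
    return 100.0 * sum(report.values()) / len(report)

# Instance IDs follow SWE-bench's owner__repo-issue naming pattern.
print(percent_resolved({
    "django__django-11039": True,
    "sympy__sympy-20590": False,
}))  # -> 50.0
```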
Common Misconceptions
- ❌ Myth: "A higher SWE-bench Verified score always means better real-world coding." → ✅ Reality: Verified is useful but has reported contamination and test-design concerns; treat it as one signal and compare with newer or less-exposed sets such as Pro.
- ❌ Myth: "SWE-bench measures how well chat agents collaborate over many steps." → ✅ Reality: The classic setup checks single-turn patches; long-horizon evaluation is the focus of SWE-bench Pro.
- ❌ Myth: "SWE-bench covers many languages evenly." → ✅ Reality: It is Python-only and weighted toward specific repositories; results can be skewed by that focus.
How It Sounds in Conversation
- "Before we lock a provider, can we see SWE-bench Pro numbers, not just Verified?"
- "Ops flagged that contamination may inflate our % resolved on the public set—let’s validate on Pro."
- "Our agent nails single-turn patches, but stalls on long-horizon tasks; Pro will expose that gap."
- "The board wants one metric; we’ll report SWE-bench Lite/Verified plus a Pro run for due diligence."
- "Let’s replicate using the same hidden test harness and document any environment drift we see."
References
- What’s in a Benchmark? The Case of SWE-Bench in Automated Program Repair (ICSE-SEIP '26)
Study of Lite and Verified leaderboards, submissions, and ecosystem trends.
- Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation
Shows public benchmarks can overestimate agent ability; proposes mutated tasks.
- Overview - SWE-bench
Official benchmark overview: task setup, datasets, and leaderboards.
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Open-source repo and context for evaluating long-horizon software tasks.
- Why SWE-bench Verified no longer measures frontier coding capabilities
Audit of Verified: test flaws and contamination; recommendation to report Pro.
- SWE-bench Explained: How We Measure Real-World Coding
Plain-language comparison vs HumanEval and noted limitations.