Vol.01 · No.10 CS · AI · Infra May 14, 2026

AI Glossary

LLM & Generative AI

SWE-bench

Plain Explanation

Traditional coding tests for AI often focus on one small function, which misses the real job of software engineering: reading a large codebase, understanding a vague issue, changing multiple files, and not breaking anything else. Teams needed a way to compare models and agents on tasks that look like real GitHub tickets, not toy snippets.

SWE-bench solves this by turning resolved GitHub issues into evaluation tasks. The AI gets the repository as it existed before the human fix, plus the original issue text, and must produce a patch. Success is measured by whether hidden tests that used to fail now pass, and whether unrelated regression tests still pass.

Mechanically, each task includes two test sets: fail-to-pass tests that should only pass if the bug is fixed, and pass-to-pass tests that must remain green. The model or agent never sees these tests, so it cannot simply "fit to" them. The original benchmark drew tasks from 12 open-source Python repositories; later variants introduced curated subsets and leaderboards to track progress and improve comparability.
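The two-test-set logic above can be sketched in a few lines. This is a minimal illustration of the resolution criterion, not the official SWE-bench harness API; the type and function names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of running a repo's test suite after applying a candidate patch.

    Field names are illustrative, not the official harness schema.
    """
    fail_to_pass: dict[str, bool]  # tests that failed before the fix: test name -> passes now?
    pass_to_pass: dict[str, bool]  # regression tests: test name -> still passing?

def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every fail-to-pass test now passes
    AND every pass-to-pass (regression) test still passes."""
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())

# Example: the patch fixes the reported bug but breaks one regression test,
# so the task is NOT counted as resolved.
broken = TaskResult(
    fail_to_pass={"test_issue_fix": True},
    pass_to_pass={"test_existing_a": True, "test_existing_b": False},
)
print(is_resolved(broken))  # False
```

The key design point is the conjunction: fixing the issue is necessary but not sufficient, because a patch that breaks any previously green test fails the task.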

Examples & Analogies

  • Model selection for a coding assistant: A platform team compares candidate models using the Lite and Verified leaderboards to see which ones resolve a higher share of real GitHub issues. They prefer a model whose patches pass both the fail-to-pass and regression checks.
  • Agent evaluation on long projects: A research group tests an autonomous coding agent on SWE-bench Pro to see if it can handle long-horizon tasks that require multiple coordinated steps across a repository. They analyze not just final pass rates but whether the agent can navigate complex repo context.
  • Reality check with mutated prompts: An engineering org applies a benchmark-mutation approach that rewrites formal GitHub issues into chat-style queries. They observe drops versus the original tasks, revealing that public benchmarks can overestimate performance for interactive IDE use.
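When teams compare candidates as in the first example, the headline leaderboard number is the share of tasks resolved. A minimal sketch of that aggregation, assuming each task has already been judged resolved or not (the model names and outcomes below are made up):

```python
def resolved_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved -- the '% resolved' figure on SWE-bench leaderboards."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Hypothetical per-task outcomes for two candidate models on the same task set.
model_a = [True, False, True, True, False]  # resolves 3 of 5 tasks
model_b = [True, True, False, True, True]   # resolves 4 of 5 tasks
print(f"A: {resolved_rate(model_a):.0%}, B: {resolved_rate(model_b):.0%}")
```

Because each outcome already encodes both the fail-to-pass and regression checks, the rate rewards patches that fix the issue without collateral breakage.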

At a Glance

| | HumanEval | SWE-bench (Lite/Verified) | SWE-bench Pro |
|---|---|---|---|
| Task scope | Single-function coding | Repo-level bug/issue patches | Long-horizon repo tasks |
| Test visibility | Often explicit/known | Hidden fail-to-pass + regression | Hidden; multi-step challenges |
| Interaction style | One-shot generation | Single-turn patch generation | Extended, multi-step agent flows |
| Languages | General coding toy tasks | Python-only, 12 repos | Repo-level, long-horizon focus |
| Data concerns | Saturated on top models | Contamination concerns reported | Recommended to reduce contamination risk |

SWE-bench moves beyond toy snippets to repo-level fixes with hidden tests, while SWE-bench Pro targets longer, multi-step software tasks and is recommended when contamination of public sets is a concern.

Where and Why It Matters

  • OpenAI evaluation practice: OpenAI reported test-design and contamination concerns in SWE-bench Verified and recommended pairing or shifting frontier reporting toward SWE-bench Pro.
  • Public leaderboards as a norm: Lite (~300 tasks) and Verified (500 tasks) leaderboards became central places to compare solutions across common harnesses.
  • Agent benchmarking expectations: Long-horizon behavior is now evaluated with SWE-bench Pro to reflect multi-step, repo-wide changes rather than single-turn patches.
  • Misuse warning in procurement: Improvements on public Verified can reflect training exposure, so buyers increasingly ask for Pro results or alternative, uncontaminated sets.
  • Research focus shift: Studies now examine contamination, test design narrowness/wideness, and more realistic prompt styles to align with IDE chat usage.

Common Misconceptions

  • ❌ Myth: "A higher SWE-bench Verified score always means better real-world coding." → ✅ Reality: Verified is useful but has reported contamination and test-design concerns; treat it as one signal and compare with newer or less-exposed sets such as Pro.
  • ❌ Myth: "SWE-bench measures how well chat agents collaborate over many steps." → ✅ Reality: The classic setup checks single-turn patches; long-horizon evaluation is the focus of SWE-bench Pro.
  • ❌ Myth: "SWE-bench covers many languages evenly." → ✅ Reality: It is Python-only and weighted toward specific repositories; results can be skewed by that focus.

How It Sounds in Conversation

  • "Before we lock a provider, can we see SWE-bench Pro numbers, not just Verified?"
  • "Ops flagged that contamination may inflate our % resolved on the public set—let’s validate on Pro."
  • "Our agent nails single-turn patches, but stalls on long-horizon tasks; Pro will expose that gap."
  • "The board wants one metric; we’ll report SWE-bench Lite/Verified plus a Pro run for due diligence."
  • "Let’s replicate using the same hidden test harness and document any environment drift we see."
