Vol.01 · No.10 CS · AI · Infra May 14, 2026

AI Glossary

LLM & Generative AI

Agent Evaluation


Plain Explanation

Traditional model tests grade a single answer to a single prompt. But agents don’t work that way—they plan over multiple steps, call tools, observe results, and adapt. The problem is that a single “final answer” can hide whether the agent took a fragile path, wasted time, or simply retried until it got lucky. Agent evaluation solves this by looking at the whole trajectory. A helpful analogy is reviewing a project, not just the final slide: you inspect the steps taken, resources used, and changes made along the way.

In practice, this means checking the sequence of actions, whether the environment actually changed as intended, and how efficiently the agent reached the goal. Mechanically, two layers are scored: the reasoning layer (planning/decision quality) and the action layer (tool invocation quality and state updates). Evaluations add efficiency metrics—like number of steps, execution time, and token consumption—and run inside a standardized sandbox so prompt templates, tool lists, and environment dynamics don’t secretly drive the outcome.
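The two-layer scoring described above can be sketched in a few lines. This is a minimal illustration, not a real framework's API: `Step`, `TrajectoryScore`, and the toy scoring proxies are all hypothetical names invented for this example.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One step of an agent trajectory (hypothetical schema)."""
    thought: str     # reasoning-layer record: the agent's plan at this step
    tool: str        # action-layer record: which tool was invoked
    tokens: int      # tokens consumed at this step
    state_ok: bool   # did the environment change as intended?

@dataclass
class TrajectoryScore:
    reasoning_score: float   # planning/decision quality
    action_score: float      # tool-use / state-update quality
    steps: int               # efficiency: trajectory length
    tokens: int              # efficiency: token consumption

def score_trajectory(steps: list[Step], goal_reached: bool) -> TrajectoryScore:
    """Score the reasoning and action layers separately, plus efficiency."""
    if not steps:
        return TrajectoryScore(0.0, 0.0, 0, 0)
    # Reasoning layer: fraction of steps with an explicit plan (toy proxy).
    reasoning = sum(1 for s in steps if s.thought.strip()) / len(steps)
    # Action layer: fraction of tool calls that updated state as intended,
    # zeroed out if the final goal was never reached.
    action = (sum(1 for s in steps if s.state_ok) / len(steps)) if goal_reached else 0.0
    return TrajectoryScore(reasoning, action, len(steps), sum(s.tokens for s in steps))
```

Keeping the two layer scores separate is what lets you attribute a failure to planning versus execution, while the step and token counts feed the efficiency metrics.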

Examples & Analogies

  • "Web task execution in a sandboxed browser": The agent must navigate a controlled website to complete tasks (e.g., locate a page, fill a form, submit). Evaluation records if the task state changes as required and how many steps, tool calls, and tokens it used.
  • "Clinical calculation workflow": An agent receives patient data, plans steps, calls a calculation tool, and outputs a structured result. The run is checked for correct formula application, valid formatting, and whether tool calls were necessary or excessive.
  • "Program repair with capped retries": In a coding benchmark, the agent plans edits, runs tests, and iterates. The evaluation gates accuracy by a strict budget on attempts/tokens to prevent score inflation from unlimited retries.
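The capped-retry pattern in the last example can be sketched as a small harness. The interface here (`attempt_fn`, `check_fn`, and the specific budget numbers) is an assumption for illustration, not a benchmark's actual API.

```python
def evaluate_with_budget(attempt_fn, check_fn, max_attempts=3, token_budget=2000):
    """Run an agent's attempt loop under strict retry and token caps.

    attempt_fn() -> (candidate, tokens_used); check_fn(candidate) -> bool.
    Returns (success, attempts_used, tokens_used).
    Exceeding the token budget scores as failure, so unlimited retries
    cannot inflate accuracy.
    """
    tokens = 0
    for attempt in range(1, max_attempts + 1):
        candidate, used = attempt_fn()
        tokens += used
        if tokens > token_budget:
            return False, attempt, tokens   # over budget: gate the score
        if check_fn(candidate):
            return True, attempt, tokens
    return False, max_attempts, tokens
```

Reporting attempts and tokens alongside success makes the cost of each pass explicit rather than hiding it in an unbounded loop.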

At a Glance

Aspect         | Agent evaluation             | Model evaluation
What’s scored  | Trajectories + final state   | Single output
Environment    | Interactive, stateful sandbox| Static prompt–response
Metrics        | Success + steps/time/tokens  | Accuracy/quality on items
Error tracing  | Split reasoning vs action    | Whole output at once
Fairness risks | Prompts/tools/env confounds  | Data contamination/overfit

Agent evaluation adds interaction- and efficiency-aware scoring and isolates where failures occur, while model evaluation focuses on static-response correctness.

Where and Why It Matters

  • Shift toward standardized sandboxes: Benchmarks define tools and environment dynamics so differences reflect agent capability, not framework quirks.
  • Cost-aware reporting: Studies emphasize capping retries and plotting accuracy vs cost, discouraging designs that buy gains with unbounded tokens.
  • Multi-metric dashboards: Beyond success rate, teams track trajectory length, execution time, and token consumption to compare agents fairly.
  • Reproducibility focus: Evaluations document prompts, toolsets, and environment seeds to reduce variance and enable apples-to-apples comparisons.
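The accuracy-versus-cost reporting mentioned above often reduces to a Pareto frontier: keep only runs that no cheaper, at-least-as-accurate run dominates. A minimal sketch, assuming runs are given as `(accuracy, token_cost)` pairs:

```python
def pareto_frontier(runs):
    """Keep (accuracy, token_cost) runs not dominated by a cheaper run
    with equal or better accuracy. Illustrative reporting helper."""
    frontier = []
    # Sort cheapest first; break cost ties by higher accuracy.
    for acc, cost in sorted(runs, key=lambda r: (r[1], -r[0])):
        # A run survives only if it beats every cheaper run's accuracy.
        if not frontier or acc > frontier[-1][0]:
            frontier.append((acc, cost))
    return frontier
```

Plotting the frontier makes it obvious when a small accuracy gain was bought with a disproportionate increase in tokens.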

Common Misconceptions

  • ❌ Myth: A correct final answer means the agent is good. → ✅ Reality: It may have overfit the setup or brute-forced with retries; check trajectory and cost.
  • ❌ Myth: Benchmarks are comparable across papers by default. → ✅ Reality: Different prompts, tools, and environments can skew results without a unified framework.
  • ❌ Myth: Higher accuracy always means a better agent. → ✅ Reality: Without cost controls, small gains can hide massive increases in steps and tokens.

How It Sounds in Conversation

  • "Let’s separate the reasoning layer score from tool-use so we know if planning or execution is failing."
  • "Please rerun on the same sandbox config; the tool list changed and that makes the result incomparable."
  • "Plot success vs tokens—we need to see if those gains came from unbounded retries."
  • "Add trajectory traces and environment diffs so we can attribute errors to planning vs API failures."
  • "Lock the system prompt and seed; variance across runs is masking real regressions."
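“Lock the system prompt and seed” and “rerun on the same sandbox config” both come down to pinning every evaluation knob and checking it before comparing runs. One way to sketch that, with an invented `SandboxConfig` schema (these field names are illustrative, not any framework’s actual settings):

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class SandboxConfig:
    """Everything that must be pinned for an apples-to-apples rerun.
    Field names are hypothetical, chosen for illustration."""
    system_prompt: str
    tool_list: tuple      # tools exposed to the agent, in order
    env_seed: int         # environment randomness seed
    max_attempts: int     # retry cap

def config_fingerprint(cfg: SandboxConfig) -> str:
    """Stable short hash so two runs can be checked for comparability."""
    blob = json.dumps(asdict(cfg), sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]
```

Refusing to compare runs whose fingerprints differ is a cheap guard against the prompt/tool/environment confounds listed under fairness risks.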
