Agent Loop
Plain Explanation
Teams needed a way to turn a one-shot model reply into a multi-step task that adapts to what actually happens. A single answer isn’t enough when the app must fetch data, call APIs, wait for a human, and try again after errors. The agent loop solves this by repeating a small cycle: think about the goal, choose an action, execute it, see what happened, and then decide the next move until done.

Think of it like a careful chef following a recipe with taste checks. The chef reads the step (plan), adds ingredients or uses a tool (act), tastes the result (observe), and then adjusts seasoning before the next step (repeat). This prevents serving a dish that “looked right on paper” but tastes wrong, much as the loop avoids trusting a single unverified model response.

Concretely, platforms document this loop as a series of durable steps: checkpoint each iteration so a crash at iteration 15 of 20 can resume there (not from scratch); optionally pause for human approval; run tool calls with retries and timeouts; stream partial replies; and cap iterations or tokens to control cost. Some systems serialize runs per session to avoid races, and many expose hooks to inspect or modify prompts, tools, or results before the next turn.
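The cycle is small enough to sketch in a few lines. The TypeScript below is illustrative only: `Step`, `callModel`, and `runTool` are hypothetical stand-ins (stubbed so the snippet runs), not any platform’s API. A production loop would add the retries, timeouts, and checkpoints described above.

```typescript
// All names here (Step, callModel, runTool, agentLoop) are hypothetical
// stand-ins, not any specific platform's API.
type Step =
  | { kind: "tool"; name: string; args: string }
  | { kind: "final"; answer: string };

// Stub model: asks for one tool call, then finishes. A real client would send
// `history` as the prompt and parse the model's chosen action from its reply.
async function callModel(history: string[]): Promise<Step> {
  return history.length < 2
    ? { kind: "tool", name: "search", args: "agent loop" }
    : { kind: "final", answer: `done after ${history.length - 1} observation(s)` };
}

// Stub tool executor; a real one would dispatch on `name` with retries and timeouts.
async function runTool(name: string, args: string): Promise<string> {
  return `result of ${name}(${args})`;
}

async function agentLoop(goal: string, maxIterations = 10): Promise<string> {
  const history = [`goal: ${goal}`];
  for (let i = 0; i < maxIterations; i++) {
    const step = await callModel(history);           // think: choose the next action
    if (step.kind === "final") return step.answer;   // stop condition reached
    const obs = await runTool(step.name, step.args); // act
    history.push(`${step.name} -> ${obs}`);          // observe, then repeat
  }
  throw new Error("iteration cap hit without a final answer"); // budget guard
}

agentLoop("summarize the incident").then(console.log);
```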
Examples & Analogies
- Durable crash recovery: An agent plans an action, calls a tool, and updates memory each turn inside a DO_WHILE loop. If the service restarts at iteration 15, the workflow resumes from that exact checkpoint with the full prompt, response, and timing history intact.
- Budget-guarded tool use: A loop processes a user message, invokes an LLM, executes tool calls, and feeds tool_result blocks back. It enforces a max of 10 iterations for simple tasks and stops early if accumulated input tokens exceed a set budget. A sketch combining this budget guard with per-turn checkpointing follows this list.
- Safe concurrency with streaming: A single serialized run per session avoids tool/session races while the agent streams assistant deltas and tool events. A lifecycle wait endpoint can return bounded status while long-running work continues in the background.
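The first two examples both reduce to “checkpoint every turn, stop when a budget trips.” Here is a minimal sketch of that shape, assuming hypothetical `saveCheckpoint`/`loadCheckpoint` helpers backed by an in-memory map; a durable platform would write the same record to persistent storage so a restart resumes mid-loop.

```typescript
// Hypothetical checkpoint record and storage; a durable platform would persist
// this instead of the in-memory Map used here.
interface Checkpoint {
  iteration: number;
  tokensUsed: number;
  history: string[];
}

const store = new Map<string, Checkpoint>();

async function loadCheckpoint(runId: string): Promise<Checkpoint | undefined> {
  return store.get(runId);
}
async function saveCheckpoint(runId: string, cp: Checkpoint): Promise<void> {
  store.set(runId, { ...cp, history: [...cp.history] });
}

async function resumableLoop(runId: string, maxIterations = 10, tokenBudget = 50_000) {
  // Resume mid-loop if a checkpoint exists; otherwise start at iteration 0.
  const cp = (await loadCheckpoint(runId)) ?? { iteration: 0, tokensUsed: 0, history: [] };

  while (cp.iteration < maxIterations) {
    if (cp.tokensUsed > tokenBudget) {
      return { status: "stopped", reason: "token budget exceeded", cp };
    }
    // One turn: model call plus tool execution, stubbed here as a fixed token cost.
    cp.history.push(`turn ${cp.iteration}: plan -> act -> observe`);
    cp.tokensUsed += 4_000;
    cp.iteration += 1;
    await saveCheckpoint(runId, cp); // a crash after this line loses no work
  }
  return { status: "done", cp };
}

resumableLoop("run-42").then((r) => console.log(r.status, r.cp.iteration));
```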
At a Glance
| Aspect | Durable workflow loop | Event-driven tool loop | Session-serialized loop |
|---|---|---|---|
| Durability | DO_WHILE checkpoints; resume mid-loop | step.run retries + checkpoints | Serialized runs; lifecycle events |
| Human gate | HUMAN task for durable approval | waitForEvent patterns | Hooks before_tool_call; guard/cancel |
| Tool calls | CALL_MCP_TOOL, HTTP, retries/backoff | Each tool in its own step; can parallelize | Streams tool start/update/end events |
| Budget/limits | Iteration caps, cost checks in loopCondition | Max iterations; track token usage | Default timeouts; long runtime window |
| Observability | Full audit trail per task | Granular step logs per tool/LLM call | Assistant/tool/lifecycle streams |
All three implement the same think–act–observe loop, differing mainly in how they checkpoint, gate with humans, stream events, and enforce budgets.
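The “Human gate” row is the easiest to under-specify, so here is one hedged sketch of a pause-until-approved step. `waitForApproval` is a hypothetical helper stubbed to auto-approve after 100 ms; the durable systems above persist this wait (a HUMAN task, a waitForEvent) so it survives restarts and resolves when a reviewer posts a decision or the deadline passes.

```typescript
// `waitForApproval` is a hypothetical helper, stubbed here; a durable runtime
// would park the run and persist this wait across restarts.
type Decision = "approved" | "rejected" | "timeout";

function waitForApproval(requestId: string, timeoutMs: number): Promise<Decision> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    setTimeout(() => { clearTimeout(timer); resolve("approved"); }, 100); // stub reviewer
  });
}

async function gatedStep(action: string): Promise<string> {
  const decision = await waitForApproval(`approve:${action}`, 60_000);
  if (decision !== "approved") {
    return `skipped ${action}: ${decision}`; // fail closed on reject or timeout
  }
  return `executed ${action}`; // runs only with explicit sign-off
}

gatedStep("push-config-change").then(console.log);
```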
Where and Why It Matters
- Durable workflow pattern: Checkpointed loops with human approval and retries make multi-step agents recoverable and auditable across failures.
- Event-driven workflow pattern: Wrapping each LLM/tool step as an isolated task enables retries, iteration caps, and token tracking to control spend.
- Session-serialized runtime pattern: Running one active loop per session prevents tool/session races while streaming assistant and tool events (a minimal serialization sketch follows this list).
- Shift toward guardrails: Iteration limits, token budgets, stuck-loop detection, and human approval have become standard gates that stop agents from “running away” on cost or risk.
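The session-serialized pattern can be approximated in-process with a per-session promise chain, as sketched below; `runSerialized` and `sessionTails` are illustrative names, not a specific runtime’s API.

```typescript
// Per-session promise chain: each new run waits for the previous run on the
// same session, so two tool calls never race on one thread.
const sessionTails = new Map<string, Promise<unknown>>();

function runSerialized<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
  const tail = sessionTails.get(sessionId) ?? Promise.resolve();
  // Chain after the current tail; swallow the predecessor's failure so one
  // bad run does not wedge the whole session.
  const next = tail.catch(() => {}).then(task);
  sessionTails.set(sessionId, next);
  return next;
}

// Runs on the same session execute in order; different sessions interleave.
runSerialized("session-a", async () => console.log("a: run 1"));
runSerialized("session-a", async () => console.log("a: run 2"));
runSerialized("session-b", async () => console.log("b: run 1"));
```

Real runtimes typically enforce this with a queue or lock keyed by session rather than an in-process map, so the one-run-per-session guarantee holds across processes.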
Common Misconceptions
- ❌ Myth: An agent loop is just a chat bot answering multiple times. → ✅ Reality: It is a structured control cycle with tool execution, observation, state updates, and clear stop conditions.
- ❌ Myth: If the server restarts, the agent must start over. → ✅ Reality: With checkpointed iterations, the loop resumes from the last completed step with a full audit trail.
- ❌ Myth: More looping always improves results. → ✅ Reality: Loops are capped by iteration and token budgets, and systems detect stuck behavior to stop safely (a small detector sketch follows this list).
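A minimal version of that stuck-loop detection: count consecutive identical tool calls and break past a threshold. The threshold and call-signature scheme below are assumptions to tune per application.

```typescript
// Simple stuck-loop detector: if the agent issues the same tool call with the
// same arguments N times in a row, break out and return a safe fallback.
function makeStuckDetector(threshold = 3) {
  let lastSignature = "";
  let repeats = 0;
  return (toolName: string, args: unknown): boolean => {
    const signature = `${toolName}:${JSON.stringify(args)}`;
    repeats = signature === lastSignature ? repeats + 1 : 1;
    lastSignature = signature;
    return repeats >= threshold; // true => stop the loop safely
  };
}

const isStuck = makeStuckDetector();
for (let turn = 0; turn < 10; turn++) {
  // Stub: the model keeps choosing the same call from turn 2 onward.
  const call = turn < 2 ? { tool: "search", args: { q: turn } } : { tool: "search", args: { q: 2 } };
  if (isStuck(call.tool, call.args)) {
    console.log(`stuck after turn ${turn}; returning fallback answer`);
    break;
  }
}
```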
How It Sounds in Conversation
- "Let’s add a HUMAN approval gate in the DO_WHILE so the agent can’t push changes without sign-off."
- "Wrap the LLM call and each tool in step.run — I want retries and per-step traces in the dashboard."
- "We’re hitting the token budget after 6 turns; cap iterations at 10 and prune older messages."
- "Keep one active run per session; we were seeing races between two tools writing the same thread."
- "If the model keeps picking the same tool, trigger our stuck-loop break and return a safe fallback."
References
- Agent Loop (OpenClaw): end-to-end loop with streaming and hooks
Defines the loop as intake→context→inference→tools→streaming→persistence; serialized per session.
- Build an Agent Tool Loop (Inngest)
ReAct-style loop with durable step.run, iteration caps, token tracking, and stuck-loop detection.
- Production agent architecture: durable execution for workflows and agents
Checkpointed DO_WHILE agent loop with HUMAN gates, retries, and full audit trail.
- Agent Architecture and Control Loops in AI System Design
Explains model, tools, instructions, and the continuous perceive→recall→reason→act→store loop.
- What Is the AI Agent Loop? The Core Architecture Behind Autonomous AI Systems
Plain-language overview of perceive→reason/plan→act→observe cycles and stop conditions.
- Mastering AI agent observability: A comprehensive guide
Perspective on designing tracing and observability points at the level of individual loop steps.