One model, faster tokens: MARS boosts throughput without extra heads
A lightweight fine-tune lets standard autoregressive models emit multiple tokens per step, delivering up to 1.71x speedups without a second draft model. Meanwhile, a new RL study shows why agents can look 'diverse' yet ignore inputs—and how to fix it.
One-Line Summary
Models speed up by generating several tokens per step without extra components, while new diagnostics reveal hidden reasoning failures in agents and a bold blueprint reframes models as computers.
Research Papers
MARS: Multi-token generation for autoregressive models
This work shows how a standard autoregressive model can output several tokens per step using a lightweight fine-tune, instead of generating strictly one token at a time. The method, called MARS, keeps the original architecture and parameters, matches or exceeds baseline accuracy on six instruction benchmarks at 1-token mode, and delivers about 1.5–1.7x throughput when accepting multiple tokens per step. On Qwen2.5-7B with a block-level KV cache for batch inference, the authors report up to 1.71x wall-clock speedup compared to baseline AR with KV cache. 1
Unlike speculative decoding—which runs a separate small draft model to propose several tokens that the target model verifies in parallel—MARS needs no auxiliary model, new heads, or API changes: it’s a single fine-tuned model called exactly like the original. This makes it operationally simpler than draft-model speculation or multi-head methods like Medusa, while preserving output quality at baseline levels. A confidence threshold lets serving systems adjust, in real time, how many tokens to accept per step—trading tiny accuracy shifts for higher throughput under load. 1
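To make the confidence threshold concrete, here is a minimal sketch of one way such a knob could work. It is not taken from the MARS paper: the function name `accept_prefix`, the per-token confidence scores, and the rule of always keeping the first token (so a step never yields less than ordinary one-token decoding) are all assumptions made for illustration.

```python
from typing import List


def accept_prefix(tokens: List[int], confidences: List[float], threshold: float) -> List[int]:
    """Return the prefix of a proposed token block to keep for this step.

    Hypothetical acceptance rule (an assumption, not the paper's exact method):
    the first token is always kept, and each following token is kept only
    while its confidence stays at or above `threshold`.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    if not tokens:
        return []

    accepted = [tokens[0]]  # behaves like plain one-token decoding in the worst case
    for token, confidence in zip(tokens[1:], confidences[1:]):
        if confidence < threshold:
            break  # stop at the first low-confidence token; later ones are discarded
        accepted.append(token)
    return accepted


# Illustrative values only: a block of four proposed token IDs and their confidences.
proposed = [101, 102, 103, 104]
scores = [0.99, 0.95, 0.60, 0.90]
strict = accept_prefix(proposed, scores, threshold=0.9)   # keeps 2 tokens
relaxed = accept_prefix(proposed, scores, threshold=0.5)  # keeps all 4 tokens
```

Under this sketch, raising the threshold moves the model toward one token per step, and lowering it accepts longer blocks, which is the trade-off a serving system would tune under load.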
For context, speculative decoding in production commonly reports 2–3x speedups by proposing 5–8 tokens and verifying them in parallel, with some NVIDIA H200 results citing around 3.6x throughput; MARS’s value is offering consistent gains without a second model to manage. Teams deciding between approaches can weigh infrastructure complexity, memory headroom, and acceptance-rate sensitivity versus MARS’s single-model simplicity and on-the-fly confidence control. 2
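The acceptance-rate sensitivity mentioned above can be illustrated with the standard back-of-the-envelope estimate for draft-model speculation. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification; the function name and parameters are illustrative and do not come from any of the cited sources.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model step in speculative decoding.

    Assumes `gamma` drafted tokens, each accepted independently with
    probability `alpha`; the target model always contributes one token.
    The closed form is (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every drafted token is accepted
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Compare how the estimate changes with acceptance rate for a 5-token draft.
for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_step(rate, gamma=5), 2))
```

This counts tokens per step only; it ignores the cost of running the draft model, so the real wall-clock speedup is lower than the token count suggests.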
Practically, MARS’s block-level KV caching and batching matter for real systems where token budgets, rate limits, and latency targets collide; adopting a method that speeds generation without extra model footprints can simplify capacity planning. Its “latency-quality knob” is also aligned with live traffic management, where operators need levers beyond static batch sizes and caching to keep p99s steady. 3
RAGEN-2: Diagnosing reasoning collapse in agentic RL
This paper explains why multi-turn RL training for agents can look stable but still ignore the input, and introduces a way to spot and fix it. The authors show that entropy—a popular diversity metric—only measures variation within the same prompt, missing a hidden failure mode where models rely on “fixed templates” that look varied but don’t actually respond to different inputs (template collapse). 4
They decompose reasoning quality into within-input diversity (entropy) and cross-input distinguishability, measured by mutual information (MI), and propose online MI proxies that correlate more strongly with final task performance across planning, math, web navigation, and code execution. Mechanistically, they trace collapse to low reward variance, which shrinks task gradients until regularization dominates and erases input-dependent reasoning. Their SNR-Aware Filtering selects high-signal prompts using reward variance, improving both input dependence and task performance. 4
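A toy calculation shows why entropy alone can miss template collapse. The sketch below uses a plain plug-in estimate over discrete reasoning labels; it is not the paper's online MI proxy, and the function names and sample data are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def entropy_bits(counts: Counter) -> float:
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def diversity_and_mi(samples: Dict[str, List[str]]) -> Tuple[float, float]:
    """Return (within-input entropy H(Z|X), mutual information I(X;Z)).

    `samples` maps each input to the reasoning labels sampled for it.
    Uses I(X;Z) = H(Z) - H(Z|X) with inputs weighted by sample count.
    """
    total = sum(len(outputs) for outputs in samples.values())
    marginal = Counter()
    conditional = 0.0
    for outputs in samples.values():
        per_input = Counter(outputs)
        marginal.update(per_input)
        conditional += (len(outputs) / total) * entropy_bits(per_input)
    return conditional, entropy_bits(marginal) - conditional


# Collapsed: both inputs draw from the same four templates, so outputs look varied
# but carry no information about which input was given.
collapsed = {"x1": ["a", "b", "c", "d"], "x2": ["a", "b", "c", "d"]}
# Grounded: each input has its own reasoning patterns.
grounded = {"x1": ["a", "b", "a", "b"], "x2": ["c", "d", "c", "d"]}

print(diversity_and_mi(collapsed))  # high entropy, zero MI
print(diversity_and_mi(grounded))   # lower entropy, positive MI
```

In this toy setup the collapsed policy scores higher on entropy than the grounded one, yet its MI is zero, which is the pattern the paper describes as looking diverse while ignoring the input.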
For builders, this aligns with broader lessons from retrieval and agent systems: prioritize signal over noise. In practice, pruning noisy context and guarding against stale world models materially raise downstream quality and reduce quiet failures that compound over long horizons. An analysis of long-running agents highlights how context drift and goal drift drive a large share of failures in 30+ step tasks. Treating reward variance as a signal-to-noise proxy during training is a complementary lever to keep agents input-grounded. 5 6
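As a rough illustration of using reward variance as a signal-to-noise proxy, the sketch below ranks prompts by the variance of their rollout rewards and keeps the top fraction. This is a simplified stand-in rather than the paper's SNR-Aware Filtering procedure; `select_high_signal`, `keep_fraction`, and the reward values are hypothetical.

```python
import math
from statistics import pvariance
from typing import Dict, List


def select_high_signal(rewards_by_prompt: Dict[str, List[float]], keep_fraction: float = 0.5) -> List[str]:
    """Keep the prompts whose rollout rewards vary the most.

    Assumption: a prompt where every rollout gets the same reward gives
    little learning signal, so prompts are ranked by population variance
    and the top `keep_fraction` (rounded up, at least one) are kept.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(
        rewards_by_prompt.items(),
        key=lambda item: pvariance(item[1]),
        reverse=True,
    )
    keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return [prompt for prompt, _ in ranked[:keep]]


# Illustrative rollout rewards (four rollouts per prompt).
rollouts = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],  # variance 0: no contrast between rollouts
    "coin_flip": [0.0, 1.0, 0.0, 1.0],      # variance 0.25
    "rarely_solved": [0.0, 0.0, 0.0, 1.0],  # variance 0.1875
}
selected = select_high_signal(rollouts, keep_fraction=0.5)
print(selected)
```

With these numbers the zero-variance prompt is dropped and the two prompts with mixed outcomes are kept for the update.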
Neural Computers: Toward models that are the computer
This position paper proposes “Neural Computers,” models that unify compute, memory, and I/O inside a learned runtime so the model itself acts like the running computer. As an initial step, the authors instantiate NCs as video models that roll out screen frames from instructions, pixels, and user actions in CLI and GUI settings, showing early interface primitives like I/O alignment and short-horizon control—while noting that routine reuse, controlled updates, and symbolic stability remain open challenges. 7
Unlike conventional programs, tool-using agents, or learned world models, Neural Computers aim for durable capability reuse and explicit reprogramming inside a single neural runtime. The long-term goal, a Completely Neural Computer (CNC), would provide stable execution and editable “programs” within the model; today’s results are a first probe trained only from I/O traces without instrumented program state. 7
The roadmap sketches research threads for making NCs practical: stabilizing long executions, enabling safe, controlled updates of internal routines, and achieving symbolic reliability. For readers tracking compute paradigms, this reframes the boundary between “model,” “program,” and “OS,” suggesting a future where interface control policies and memory live natively in neural state rather than in external orchestrators. 7
Why It Matters
A recurring theme emerges: speed, stability, and system simplicity. MARS shows you can gain 1.5–1.7x throughput—and up to 1.71x in batched settings—without a draft model or extra heads; RAGEN-2 shows why agents “look” smart yet miss inputs and how MI-based proxies and SNR-aware filtering keep reasoning grounded; Neural Computers question where computation should live at all. Together, these push toward faster responses, more reliable long-horizon behavior, and new ways to package logic and memory. 1 4 7
Try This Week
- MARS vs. speculation explainer: Read a practical guide on token budgets, caching, and latency trade-offs to frame when multi-token decoding or speculation fits your stack. 3
- Agent reliability primer: Skim a field report on “stale world models” to spot quiet failure modes in long-running agents before they hit production. 6