New research shifts AI scaling toward test-time compute and signal quality
Three papers propose attractor-based reasoning, a Shannon scaling law, and staged vision training—pointing to better accuracy by tuning compute and reducing noise. Here’s what it means for budgets, prompts, and vendor evaluations.
One-Line Summary
New research pushes AI scaling toward smarter test-time compute, signal-to-noise-aware training, and perception-first vision models.
Industry & Biz
Attractor-based 'equilibrium reasoners' scale accuracy with inference steps
Equilibrium Reasoners (EqR) are iterative models that update a latent state until they converge to "attractors"—stable points that correspond to valid solutions—letting systems improve answers by simply running more steps or aggregating multiple runs. The paper shows EqR can scale depth (more iterations) and breadth (multiple stochastic initializations) at test time without external verifiers or task-specific priors. 1
In experiments, simple cases converge within 1–5 steps, while difficult ones benefit from unrolling up to the equivalent of 40,000 layers, lifting accuracy from 2.6% in feedforward baselines to over 99% on Sudoku-Extreme. For teams, this frames a practical dial: allocate more inference iterations or multi-sample on hard tasks to trade time for accuracy. 1
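To make the depth and breadth dials concrete, here is a toy sketch, not the paper's model. It iterates an update rule until the state stops changing (depth) and repeats that from several random starting points, then votes on the converged answers (breadth). The function names, the tolerance, and the use of `math.cos` as a stand-in for a learned update are all assumptions made for illustration.

```python
import math
import random
from collections import Counter
from typing import Callable, Optional, Tuple


def iterate_to_fixed_point(
    update: Callable[[float], float],
    z0: float,
    max_steps: int,
    tol: float = 1e-9,
) -> Tuple[float, int, bool]:
    """Depth: apply `update` until the state stops changing or the budget runs out.

    Returns (final_state, steps_used, converged).
    """
    z = z0
    for step in range(1, max_steps + 1):
        z_next = update(z)
        if abs(z_next - z) < tol:
            return z_next, step, True
        z = z_next
    return z, max_steps, False


def solve_with_restarts(
    update: Callable[[float], float],
    n_restarts: int,
    max_steps: int,
    seed: int = 0,
) -> Optional[float]:
    """Breadth: run several random initializations and vote on converged answers."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_restarts):
        z0 = rng.uniform(-2.0, 2.0)  # hypothetical initialization range
        z, _, converged = iterate_to_fixed_point(update, z0, max_steps)
        if converged:
            # Round so runs that reach the same attractor count as one answer.
            answers.append(round(z, 6))
    if not answers:
        return None  # nothing converged within the step budget
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # A small step budget does not converge on this toy problem; a larger one does.
    print(iterate_to_fixed_point(math.cos, 1.0, max_steps=5))
    print(iterate_to_fixed_point(math.cos, 1.0, max_steps=200))
    print(solve_with_restarts(math.cos, n_restarts=5, max_steps=200))
```

In this toy setting the step budget is the only thing that changes between a failed and a successful run, which is the same trade of time for accuracy described above, at a vastly smaller scale.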
Shannon scaling law explains when bigger models get worse
LLMs as Noisy Channels proposes a Shannon Scaling Law that treats model training like transmitting information over a noisy channel, mapping parameters to bandwidth and training tokens to signal power. The framework explains non-monotonic effects—catastrophic overtraining and quantization degradation—by showing that scaling size or data without preserving signal-to-noise ratio can induce U-shaped performance. 2
In experiments on Pythia and OLMo2 covering Gaussian noise, quantization, and supervised fine-tuning on math, QA, and code, the law fits models up to 6.9B parameters on up to 180B tokens and extrapolates to a 12B model with up to 307B tokens at a pooled R^2 of 0.847, beating classical power-law fits. For decision-makers, it backs a "right-size-and-clean" approach over "just add more." 2
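For orientation, the textbook Shannon-Hartley capacity formula is shown below. Reading B as parameter count and S as training tokens follows the mapping described above, but this is only the classical relation the analogy presumably builds on; the paper's fitted functional form is not reproduced here.

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right)
```

Here C is channel capacity, B is bandwidth, S is signal power, and N is noise power. With S/N held fixed, capacity grows linearly in B; if noise grows faster than signal, the logarithmic term shrinks and offsets some of the gain from added bandwidth.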
Staged training lifts vision-language accuracy by fixing perception first
From Seeing to Thinking reports that in vision-language models, weak visual perception—not long chain-of-thought reasoning—is often the main limiter, and recommends separating post-training into visual perception, visual reasoning, and textual reasoning stages. It finds perception needs targeted data and is more effectively learned with reinforcement learning than caption-style supervised fine-tuning. 3
Across multiple VLMs, the staged approach improves both perception and reasoning, delivering 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, plus gains of 5.2% on WeMath and 3.7% on RealWorldQA over the base model. Practically, prompting or training in two steps (first "what do you see?", then "what does it mean?") can reduce tokens while improving answers. 3
Community Pulse
Hacker News (1959↑): Mixed views on whether Claude Opus 4.7 improves over 4.6, with concerns about benchmarks that do not state the token window, regressions at the medium reasoning setting, and possible quality trade-offs made to serve a larger user base. 4
"Is anyone else noticing that the benchmarks for Claude 4.7 don't specify the token window? Cursor, and LiteLLM at my company, limit the token window to 200k. It feels like to me like 4.7 is not better, and is maybe worse than 4.6 when capped to 200k context window. Does anyone have stats on performance of 4.6 vs. 4.7 when context window is capped at 200k?" — Hacker News 4
"Medium reasoning has regressed since 4.6. While None and Max have improved since 4.6 in our benchmark. We suspect that this is how Claude tries to cope with the increased user base. Note, Google and OpenAI probably did something similar long ago." — Hacker News 4
What This Means for You
If your workflow tolerates a few extra seconds, treat inference as a budget you can dial up on hard problems. Iterating for more steps or running multiple samples can improve answers on tasks that need careful reasoning. This matches EqR’s finding that simple cases converge within a few steps while hard cases benefit from far more unrolled computation. 1
Resist the reflex to buy larger models or longer contexts before cleaning inputs. The Shannon view indicates performance can degrade when noise grows faster than signal, so focus first on tighter prompts and higher-signal corpora—deduping, compressing, and removing boilerplate—then re-measure before upgrading spend. 2
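As a starting point for the cleanup step, the sketch below drops paragraphs that match boilerplate patterns and removes duplicate paragraphs while preserving order. The function name `clean_context` and the example patterns are hypothetical, and exact-match deduplication after whitespace and case normalization is a deliberately simple assumption; near-duplicate detection would need something stronger.

```python
import re
from typing import Iterable, List


def clean_context(text: str, boilerplate_patterns: Iterable[str] = ()) -> str:
    """Drop boilerplate paragraphs and duplicate paragraphs, keeping first occurrences."""
    compiled = [re.compile(p, re.IGNORECASE) for p in boilerplate_patterns]
    seen = set()
    kept: List[str] = []
    # Treat blank-line-separated chunks as paragraphs.
    for paragraph in re.split(r"\n\s*\n", text):
        # Collapse whitespace so trivially different copies compare equal.
        normalized = " ".join(paragraph.split())
        if not normalized:
            continue
        if any(p.search(normalized) for p in compiled):
            continue  # matches a caller-supplied boilerplate pattern
        key = normalized.lower()
        if key in seen:
            continue  # duplicate of an earlier paragraph
        seen.add(key)
        kept.append(normalized)
    return "\n\n".join(kept)


if __name__ == "__main__":
    spec = "Scope: billing API\n\nConfidential - do not share\n\nscope:  billing api\n\nLimits: 100 rps"
    cleaned = clean_context(spec, [r"confidential", r"all rights reserved"])
    print(cleaned)
    print(f"{len(spec)} -> {len(cleaned)} characters")
```

Comparing answer quality and latency with the original versus the cleaned text gives a before/after measurement to make before deciding on a larger model or longer context.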
For image-heavy tasks, split work into two phases: perception first, reasoning second. A perception-first setup aligns with the staged-training results, which show accuracy gains and shorter reasoning traces—helpful for both quality and token/latency budgets. 3
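A two-phase prompt flow could be sketched as follows. The `ask` callable is a hypothetical stand-in for whatever vision-model client you use (it takes a prompt and image bytes and returns text); no real vendor API is assumed, and the prompt wording is only an example.

```python
from typing import Callable, Dict


def perceive_then_reason(
    ask: Callable[[str, bytes], str],
    image: bytes,
    question: str,
) -> Dict[str, str]:
    """Run a perception prompt first, then a reasoning prompt grounded in its output."""
    # Phase 1: perception only, with no interpretation requested.
    observations = ask(
        "List the objects, visible text, and attributes in this image. "
        "Do not interpret them.",
        image,
    )
    # Phase 2: reasoning, with the phase-1 observations included in the prompt.
    answer = ask(
        f"Observations:\n{observations}\n\n"
        f"Using these observations and the image, answer: {question}",
        image,
    )
    return {"observations": observations, "answer": answer}
```

Keeping the observations in the returned dictionary makes it easier to check whether a wrong answer came from a perception miss or from the reasoning step.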
When evaluating vendors, ask which context window each benchmark used, how performance holds up when the window is capped (e.g., 200k), and how quantization or traffic scaling affects quality. The live discussion around Claude 4.7 vs. 4.6 underscores why these specifics matter in practice. 4
Action Items
- Turn up test-time compute on tough tasks: For complex planning or puzzle-like prompts, increase max tokens, ask for step-by-step reasoning, and run 3–5 samples to see if aggregation improves reliability.
- Clean and condense your context: Take one 10-page spec you often paste into your assistant, remove boilerplate/duplicates, compress to 1–2 pages, and compare answer quality and latency.
- Split vision prompts into two steps: First ask the model to list objects/text/attributes in an image; then ask your question. Track whether accuracy rises and tokens fall.
- Press vendors for specifics: Request reported token-window sizes in benchmarks, accuracy under a capped window (e.g., 200k), and any iteration-vs-accuracy curves for reasoning tasks.