Vol.01 · No.10 Daily Dispatch May 26, 2026

Latest AI News


Reasoning, perception, and 3D agents: four papers reframe how models think

New research frames inference as converging to learned ‘attractors,’ treats model training as a noisy channel with capacity limits, shows vision-language models learn more by separating seeing from thinking, and turns language-driven virtual photography into an executable 3D agent task.


One-Line Summary

Four papers push scalable, measurable reasoning: stable attractors for inference, Shannon-style capacity limits, staged vision-language training, and a 3D photo agent benchmark.

Research Papers

Equilibrium Reasoners learn stable ‘attractors’ for scalable reasoning

This work treats reasoning as letting a hidden state settle into a valid solution, like a marble rolling into a bowl. The model updates an internal latent state until it converges, and performance improves when you spend more test-time compute by running extra iterations or combining multiple randomized starts. 1

The authors show that compute adapts to difficulty: easy cases converge in 1–5 steps, while hard ones benefit from massive unrolling, up to the equivalent of 40,000 layers. On Sudoku-Extreme, accuracy jumps from 2.6% for a feedforward baseline to over 99% when using this scalable latent reasoning, without external verifiers or task-specific priors. Gains track how cleanly states converge to solution-aligned attractors. 1
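To make the "settle into an attractor" idea concrete, here is a minimal sketch of fixed-point iteration with early stopping and random restarts. It is not the paper's model: the update rule is a toy (the Babylonian square-root step, whose attractor is sqrt(2)), and the names `settle` and `best_of_restarts`, the tolerance, and the rule for combining restarts (keep the run with the smallest final update) are all assumptions made for illustration.

```python
import random


def babylonian_sqrt2(z):
    """Toy update rule (hypothetical stand-in for a learned update).

    Its only positive fixed point is sqrt(2), which acts as the attractor.
    """
    return 0.5 * (z + 2.0 / z)


def settle(update, z0, tol=1e-9, max_steps=1000):
    """Iterate z <- update(z) until successive states differ by less than tol.

    Returns (state, steps_used, converged, last_change).
    """
    z = z0
    delta = float("inf")
    for n in range(1, max_steps + 1):
        z_next = update(z)
        delta = abs(z_next - z)
        z = z_next
        if delta < tol:
            return z, n, True, delta
    return z, max_steps, False, delta


def best_of_restarts(update, n_starts=4, seed=0, low=0.1, high=10.0):
    """Run several randomized starts and keep the most settled one.

    Selection uses only the internal change size, not an external checker
    (an assumption about how restarts might be combined).
    """
    rng = random.Random(seed)
    runs = [settle(update, rng.uniform(low, high)) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[3])


if __name__ == "__main__":
    for start in (1.0, 1000.0):
        state, steps, converged, _ = settle(babylonian_sqrt2, start)
        print(f"start={start}: state={state:.6f}, steps={steps}, converged={converged}")
```

In this toy, a start near the attractor stops after a handful of steps while a distant start needs more, which loosely mirrors the easy-versus-hard behavior described above.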

LLMs as noisy channels explains U-shaped scaling

The authors model training of Large Language Models (LLMs) as sending information over a noisy channel: parameters act like bandwidth, tokens like signal power, and the signal-to-noise ratio determines whether scaling helps or hurts. This Shannon Scaling Law explains non-monotonic effects, such as degradation from quantization or catastrophic overtraining, by showing how performance can turn U-shaped when noise grows faster than signal. 2
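For readers who want the underlying analogy in concrete form, the sketch below computes the classical Shannon-Hartley channel capacity, C = B * log2(1 + S/N). This is the textbook formula, not the paper's fitted scaling law, whose exact functional form is not given here. The function name `channel_capacity` is hypothetical, and treating bandwidth as parameters and signal power as tokens follows only the analogy stated above.

```python
import math


def channel_capacity(bandwidth, signal_power, noise_power):
    """Classical Shannon-Hartley capacity: C = B * log2(1 + S/N).

    With bandwidth in Hz the result is in bits per second. In the article's
    analogy, bandwidth plays the role of parameters and signal power the
    role of tokens (an illustrative mapping, not the paper's equation).
    """
    if bandwidth <= 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("bandwidth and noise must be positive, signal non-negative")
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


if __name__ == "__main__":
    baseline = channel_capacity(1.0, 1.0, 1.0)
    # Signal doubles but noise quadruples: the signal-to-noise ratio falls.
    noisier = channel_capacity(1.0, 2.0, 4.0)
    print(f"baseline={baseline:.3f}, more signal but much more noise={noisier:.3f}")
```

The second call adds signal yet ends with lower capacity than the baseline, which is the intuition behind scaling that hurts when noise outpaces signal.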

Validated on Pythia and OLMo2 under Gaussian noise, quantization, and supervised fine-tuning (SFT) for math, QA, and code, the law outperforms classical monotonic scaling laws. Fitted on models up to 6.9B parameters and 180B tokens, it predicts the unseen 12B model up to 307B tokens with pooled R^2 = 0.847, capturing loss basins missed by prior approaches. 2
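As a reminder of what an R^2 figure measures, here is a small sketch of the coefficient of determination. It assumes that "pooled" means all held-out points are concatenated before computing a single score; the article does not spell out the pooling procedure, and the function name `r_squared` is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    To pool, concatenate the observed and predicted losses from every
    held-out run into two flat lists first (an assumed reading of "pooled").
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot
```

A score of 1.0 means the predictions match the observed values exactly, while 0.0 means they do no better than always predicting the observed mean.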

From seeing to thinking: staged post-training boosts VLMs with shorter chains

This study splits post-training for vision-language models (VLMs) into three capability stages—visual perception, visual reasoning, and textual reasoning—with specialized data per stage. The key finding: visual perception is the main bottleneck and is learned more effectively via reinforcement learning (RL) than caption-based supervised fine-tuning (SFT). 3

Across multiple VLMs, staged training consistently improves both perception and reasoning: 1.5% higher reasoning accuracy with 20.8% shorter chain-of-thought (CoT) traces. As a concrete anchor, the approach reports +5.2% on WeMath and +3.7% on RealWorldQA among open-weight models, and combining this “ability staging” with difficulty-based curricula brings further additive gains. 3

PhotoFlow turns language-driven virtual photography into an executable 3D agent task

PhotoFlow asks an agent to enter an arbitrary Blender scene, interpret a language intent, pick camera parameters, and render a final photo. It uses a Director–Reviewer–Reflector loop: the Director drafts diverse camera candidates, the Reviewer checks rules and visual quality while selecting incumbents, and the Reflector turns failures into region memory, dead-zone suppression, and high-exploration relocation. 4

The team introduces VPhotoBench with 47 open-license Blender scenes and 141 language-conditioned missions. Under a six-round rendering budget, PhotoFlow achieves the strongest composite quality-alignment score and success rate compared with one-shot, single-chain reflection, anchor-bank selection, and random-search baselines; the authors position it as the first executable agent setup for language-conditioned virtual photography in arbitrary Blender scenes. 4
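The Director–Reviewer–Reflector loop under a six-round budget can be sketched as a simple propose, score, and learn-from-failure cycle. This is a toy under heavy assumptions, not PhotoFlow's implementation: the camera is reduced to a single azimuth angle, `toy_score` is a hypothetical stand-in for rendering plus rule and quality checks, and only dead-zone suppression is modeled (region memory and high-exploration relocation are omitted, as is angle wraparound at 0/360 degrees).

```python
import random


def director(rng, dead_zones, n_candidates=4, max_attempts=1000):
    """Draft camera azimuths (degrees), skipping suppressed dead zones."""
    candidates = []
    for _ in range(max_attempts):
        if len(candidates) == n_candidates:
            break
        angle = rng.uniform(0.0, 360.0)
        if not any(low <= angle <= high for low, high in dead_zones):
            candidates.append(angle)
    return candidates


def reviewer(candidates, score_fn, incumbent):
    """Score candidates; keep the best as incumbent and collect failures."""
    failures = []
    for angle in candidates:
        score = score_fn(angle)
        if score is None:  # None marks a rule violation in this toy
            failures.append(angle)
        elif incumbent is None or score > incumbent[1]:
            incumbent = (angle, score)
    return incumbent, failures


def reflector(failures, dead_zones, radius=10.0):
    """Turn each failure into a dead zone the Director will avoid."""
    for angle in failures:
        dead_zones.append((angle - radius, angle + radius))


def run_photo_loop(score_fn, rounds=6, seed=0):
    """Run the loop for a fixed number of rounds; return (incumbent, dead_zones)."""
    rng = random.Random(seed)
    dead_zones, incumbent = [], None
    for _ in range(rounds):
        candidates = director(rng, dead_zones)
        incumbent, failures = reviewer(candidates, score_fn, incumbent)
        reflector(failures, dead_zones)
    return incumbent, dead_zones


def toy_score(angle):
    """Hypothetical scorer: angles in [90, 180] break a rule; 45 is ideal."""
    if 90.0 <= angle <= 180.0:
        return None
    return -abs(angle - 45.0)


if __name__ == "__main__":
    best, zones = run_photo_loop(toy_score)
    print("incumbent (angle, score):", best)
    print("dead zones recorded:", len(zones))
```

Separating the three roles keeps proposal, judgment, and memory independent, so each can be replaced without touching the others.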

Community Pulse

Hacker News (1959↑) — Mixed reactions: users question whether Claude Opus 4.7 improves over 4.6, citing unclear context limits and possible medium-level reasoning regressions. 5

"Is anyone else noticing that the benchmarks for Claude 4.7 don't specify the token window? Cursor, and LiteLLM at my company, limit the token window to 200k. It feels like to me like 4.7 is not better, and is maybe worse than 4.6 when capped to 200k context window. Does anyone have stats on performance of 4.6 vs. 4.7 when context window is capped at 200k?" — Hacker News 5

"Medium reasoning has regressed since 4.6. While None and Max have improved since 4.6 in our benchmark. We suspect that this is how Claude tries to cope with the increased user base. Note, Google and OpenAI probably did something similar long ago." — Hacker News 5

Why It Matters

Two complementary mental models emerge for teams: treat reasoning as a dynamical system that should reliably converge (attractors), and budget test-time compute accordingly for hard cases. This clarifies when to spend extra iterations and when to stop. 1

A Shannon-style view warns that simply scaling parameters or data can backfire if signal-to-noise isn’t preserved, while staged perception-first training and 3D agent benchmarks point toward more grounded evaluation of “thinking” tied to what models actually see and do. 2

This Week to Try

  1. Equilibrium Reasoners, fast skim: read the abstract and figures to internalize the “attractor” idea. https://arxiv.org/abs/2605.21488
  2. Seeing→Thinking prompt drill: read the staged recipe, then try describing an image in bullet “perception” notes before asking the model to reason. https://arxiv.org/abs/2605.20177
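The prompt drill in item 2 can be sketched as two plain-text prompts: one that asks only for perception notes, and one that feeds those notes back before asking for reasoning. This mirrors the drill only, not the paper's staged post-training. The function names and prompt wording are hypothetical, and no model API is called.

```python
def perception_prompt():
    """First-stage prompt: ask for observations only, no interpretation."""
    return (
        "Describe only what is visible in the image as short bullet points. "
        "Do not interpret the scene or answer any question yet."
    )


def reasoning_prompt(perception_notes, question):
    """Second-stage prompt: reason over the recorded perception notes."""
    bullets = "\n".join(
        f"- {note.strip()}" for note in perception_notes if note.strip()
    )
    return (
        f"Perception notes:\n{bullets}\n\n"
        "Using only the notes above, reason step by step and answer:\n"
        f"{question}"
    )


if __name__ == "__main__":
    notes = ["two red cubes on a table", "one blue sphere behind the cubes"]
    print(perception_prompt())
    print()
    print(reasoning_prompt(notes, "How many objects are in the scene?"))
```

Send the first prompt with the image, paste the model's bullets into the second, and compare the answer with a single-shot prompt that skips the perception step.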

Sources 5
