Vol.01 · No.10 Daily Dispatch May 30, 2026

Latest AI News

Vision-language models mix up 'up' and 'far'; new benchmark exposes the bias

Researchers show a recurring photo-perspective shortcut across model families and release SpatialTunnel to separate true 3D reasoning from image-position cues.

One-Line Summary

Today's papers surface a built-in photo-perspective shortcut in spatial reasoning and deliver practical tools: confidence-based memory for long contexts, agents that learn skills online, and a permissive 28-trillion-pixel image corpus.

Research Papers

Why Far Looks Up: spatial shortcuts in vision-language models

This study asks whether systems that read images and text truly understand three-dimensional (3D) space or rely on camera-perspective shortcuts. The authors probe vision-language models (VLMs) by constructing minimal contrastive image pairs and measuring how vertical, horizontal, and depth axes are organized and disentangled inside the models’ representations. 1

Across multiple model families, they find a consistent “vertical–distance entanglement”: models conflate vertical image position with distance, reflecting the perspective bias of natural photographs, where objects farther away tend to sit higher in the frame. This shortcut produces a sizable accuracy gap between perspective-consistent and counter-heuristic cases, and the gap widens as data scale grows even while standard benchmark scores improve. Models with similar benchmark scores can nevertheless encode space differently, and those differences predict robustness on spatial reasoning tests. 1
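The accuracy gap described above can be illustrated with a minimal sketch. It assumes each evaluation item is labeled by whether the farther object also sits higher in the image (perspective-consistent) or lower (counter-heuristic). The `Trial` class, the `shortcut_gap` function, and the toy numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # True when the farther object also appears higher in the image
    # (perspective-consistent); False for counter-heuristic cases.
    farther_is_higher: bool
    # Whether the model answered the depth question correctly.
    correct: bool


def accuracy(trials):
    """Fraction of correct answers; NaN for an empty list."""
    trials = list(trials)
    if not trials:
        return float("nan")
    return sum(t.correct for t in trials) / len(trials)


def shortcut_gap(trials):
    """Accuracy on perspective-consistent cases minus counter-heuristic cases."""
    consistent = [t for t in trials if t.farther_is_higher]
    counter = [t for t in trials if not t.farther_is_higher]
    return accuracy(consistent) - accuracy(counter)


# Toy data (illustrative only): 9/10 correct when the heuristic agrees
# with the ground truth, 6/10 correct when it disagrees.
trials = (
    [Trial(True, True)] * 9 + [Trial(True, False)] * 1
    + [Trial(False, True)] * 6 + [Trial(False, False)] * 4
)
gap = shortcut_gap(trials)
print(f"shortcut gap: {gap:.2f}")
```

A gap near zero would suggest the model is not leaning on vertical position; a large positive gap is the pattern the study reports.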

To isolate the bias from dataset skew, the team introduces SpatialTunnel, a synthetic benchmark that strips common natural-image correlations. Experiments indicate the bias is model-intrinsic, and models with better-separated spatial axes are more robust. Code and benchmark are publicly available. 1

CONF-KV: using model confidence to shrink KV cache

Long responses make a model’s memory of previous tokens balloon, slowing generation and filling GPU memory; this memory is the key–value (KV) cache in large language models (LLMs). CONF-KV converts the next-token distribution into a scalar confidence score that sets a per-step cache budget, keeping more context when the model is uncertain and pruning hard when it is confident. It then ranks tokens by accumulated attention and recency, with a protected recent window. It also pairs with blockwise online-softmax attention and mixed 16-bit floating point (FP16)/8-bit integer (INT8) storage. 2
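A rough sketch of the idea, under stated assumptions: the summary does not say how the confidence score or the ranking is computed, so this example uses one minus the normalized entropy as the confidence, a linear map from confidence to budget, and a simple weighted sum of accumulated attention and recency. All function names and parameters here (`confidence`, `cache_budget`, `select_tokens`, `recency_weight`) are hypothetical and should not be read as the CONF-KV implementation.

```python
import numpy as np


def confidence(probs):
    """Assumed score in [0, 1]: 1 minus the normalized entropy of the
    next-token distribution (1 = peaked, 0 = uniform)."""
    p = np.asarray(probs, dtype=np.float64)
    p = p / p.sum()
    nonzero = p[p > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    max_entropy = np.log(len(p))
    return float(1.0 - entropy / max_entropy) if max_entropy > 0 else 1.0


def cache_budget(conf, min_budget, max_budget):
    """Assumed linear map: low confidence keeps more tokens."""
    return int(round(max_budget - conf * (max_budget - min_budget)))


def select_tokens(acc_attention, budget, recent_window, recency_weight=0.1):
    """Return sorted indices of cached tokens to keep.

    The last `recent_window` tokens are always kept; remaining slots go
    to older tokens with the highest attention-plus-recency score.
    """
    acc = np.asarray(acc_attention, dtype=np.float64)
    n = len(acc)
    if n <= budget:
        return np.arange(n)
    recent_window = min(recent_window, budget, n)
    protected = np.arange(n - recent_window, n)
    older = np.arange(0, n - recent_window)
    recency = older / max(n - 1, 1)  # newer tokens score closer to 1
    score = acc[older] + recency_weight * recency
    n_extra = budget - recent_window
    top = older[np.argsort(-score, kind="stable")[:n_extra]]
    return np.sort(np.concatenate([top, protected]))


# Toy usage: a peaked distribution yields a smaller budget than a flat one.
budget_confident = cache_budget(confidence([0.97, 0.01, 0.01, 0.01]), 2, 6)
budget_uncertain = cache_budget(confidence([0.25, 0.25, 0.25, 0.25]), 2, 6)
kept = select_tokens([0.9, 0.1, 0.5, 0.05, 0.2, 0.3], budget=4, recent_window=2)
print(budget_confident, budget_uncertain, kept.tolist())
```

The protected window guarantees that the most recent tokens survive pruning even when their accumulated attention is still small.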

Across four model families and generated lengths up to 4K tokens, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5–2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, it reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O. On 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8× lower peak memory, which points to confidence-aware caching as a practical lever for long-context apps. 2

PANDO: web agents that get cheaper as they learn

Many multimodal web agents get more expensive at inference time as teams add search rollouts, verifiers, and specialist stacks. PANDO aims for the opposite: a single-rollout online skill distillation framework that turns successful steps into reusable “skills,” maintains a structured Skill Library, and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. 3
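To make the Skill Library and confidence-based demotion concrete, here is a minimal sketch. The summary does not describe PANDO's data structures, so the `Skill` and `SkillLibrary` classes, the Laplace-smoothed success rate used as the confidence, and the `demote_below` and `min_uses` thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    steps: list          # recorded actions distilled from a successful run
    uses: int = 0
    successes: int = 0
    active: bool = True  # demoted skills are no longer offered to the agent

    @property
    def confidence(self):
        # Assumed estimate: Laplace-smoothed success rate.
        return (self.successes + 1) / (self.uses + 2)


class SkillLibrary:
    def __init__(self, demote_below=0.4, min_uses=3):
        self.skills = {}
        self.demote_below = demote_below
        self.min_uses = min_uses

    def add(self, name, steps):
        """Store a skill distilled from a successful trajectory."""
        self.skills[name] = Skill(name, list(steps))

    def record(self, name, success):
        """Update usage statistics and demote a skill that keeps failing."""
        skill = self.skills[name]
        skill.uses += 1
        skill.successes += int(success)
        if skill.uses >= self.min_uses and skill.confidence < self.demote_below:
            skill.active = False

    def candidates(self):
        """Active skills, most trusted first."""
        active = [s for s in self.skills.values() if s.active]
        return sorted(active, key=lambda s: -s.confidence)


# Toy usage: one skill keeps working, the other keeps failing.
lib = SkillLibrary()
lib.add("login", ["open form", "type credentials", "submit"])
lib.add("search", ["focus box", "type query", "press enter"])
for _ in range(3):
    lib.record("login", success=True)
    lib.record("search", success=False)
print([s.name for s in lib.candidates()])
```

Requiring a minimum number of uses before demotion keeps a single early failure from discarding a skill that might still be useful.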

On the full 910-task VisualWebArena suite, PANDO achieves a 58.3% success rate, surpassing SGV (54.0%) and a WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer than WALT, without any pre-evaluation discovery budget. A 300-task ablation shows rules and routines drive most gains, while routing, compression, and cache-aware prompting translate a larger skill library into lower marginal token cost. The paper also proposes trajectory-level efficiency metrics (Action Repetition Rate, Step Overhead Ratio, Prompt Cache Utilization) to make cost visible beyond the final success rate. 3

GPIC: a permissive 28-trillion-pixel image corpus

GPIC is a large, accessible, and stable dataset for training and benchmarking visual generation systems. It contains approximately 28 trillion pixels across 100M training, 200K validation, and 1M test images, all captioned by a state-of-the-art vision-language model and permissively licensed for both research and commercial use; images are safety-filtered, deduplicated, and centrally hosted. 4
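For a sense of scale, the reported figures imply a rough average image size. The calculation below assumes the approximately 28 trillion pixels cover all three splits combined, which the summary does not state explicitly, so treat the result as a back-of-the-envelope estimate.

```python
import math

# Split sizes as reported in the summary above.
train_images = 100_000_000
val_images = 200_000
test_images = 1_000_000

# "Approximately 28 trillion" pixels; assumed to span all splits.
total_pixels = 28e12

total_images = train_images + val_images + test_images
avg_pixels = total_pixels / total_images
avg_side = math.sqrt(avg_pixels)  # side length of an equivalent square image

print(f"images: {total_images:,}")
print(f"average pixels per image: {avg_pixels:,.0f}")
print(f"equivalent square side: {avg_side:.0f} px")
```

Under that assumption the average works out to roughly 0.28 megapixels per image, comparable to a square of about 526 pixels per side.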

The authors provide a generative modeling benchmark protocol on GPIC and a reference baseline for pixel-space flow matching, along with an evaluation toolkit and code, lowering friction for reproducible work at this data scale. 4
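As background on what a flow matching objective looks like in pixel space, here is a minimal NumPy sketch. It assumes the common linear-interpolation path between Gaussian noise and data with a constant-velocity target; the summary does not specify which variant the GPIC baseline uses, and `flow_matching_batch`, `flow_matching_loss`, and the zero-velocity placeholder model are hypothetical names for illustration.

```python
import numpy as np


def flow_matching_batch(x1, rng):
    """Build one training batch for a linear-path flow matching objective.

    x1: data batch of shape (batch, ...), e.g. images scaled to [-1, 1].
    Returns the noisy input x_t, the time t, and the target velocity.
    """
    x0 = rng.standard_normal(x1.shape)            # noise endpoint
    t_shape = (x1.shape[0],) + (1,) * (x1.ndim - 1)
    t = rng.uniform(size=t_shape)                 # one time per sample
    xt = (1.0 - t) * x0 + t * x1                  # point on the straight path
    target = x1 - x0                              # velocity along that path
    return xt, t, target


def flow_matching_loss(velocity_fn, x1, rng):
    """Mean squared error between predicted and target velocity."""
    xt, t, target = flow_matching_batch(x1, rng)
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))


# Toy usage with random "images" and a placeholder model that predicts zero.
rng = np.random.default_rng(0)
images = rng.uniform(-1.0, 1.0, size=(4, 8, 8, 3))
zero_model = lambda xt, t: np.zeros_like(xt)
loss = flow_matching_loss(zero_model, images, rng)
print(f"loss of the zero-velocity placeholder: {loss:.3f}")
```

In a real training loop `velocity_fn` would be a neural network and the loss would be minimized with a gradient-based optimizer; the sketch only shows how inputs and targets are constructed.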

Open Source & Repos

vLLM: high-throughput, memory-efficient LLM serving hits v0.22.0

vLLM is an inference and serving engine for large language models (LLMs) focused on easy, fast, and cost-efficient deployment. It targets high-throughput, memory-aware serving for production. 5

Release v0.22.0 (May 29, 2026) comprises 459 commits from 230 contributors, 63 of them new. Highlights include a hardening pass and a reorganization of DeepSeek V4 support under a dedicated package path, among other updates. 5

Why It Matters

The lead paper shows that better benchmark scores do not guarantee well-structured spatial understanding; scaling can amplify perspective shortcuts, so teams need representation-level probes and synthetic tests to validate 3D reasoning. 1

The supporting work points to practical levers (confidence-aware memory management, online skill reuse in agents, permissively licensed data at scale, and mature serving stacks) that let builders improve robustness and efficiency under real constraints. 2

Sources 5
