
4 min read 6/3/2026

video MLLM, predictive coding, robot affordance, mixture of experts, optimal transport, multimodal

Video AI speeds up by sending only what changes

AdaCodec compresses redundant frames, slashing token budgets and cutting time-to-first-token from 9.26s to 1.62s; plus fresh work on robot affordances and on turning dense models into expert mixtures.


One-Line Summary

Three papers show how smarter representations let AI do more with less: compress video changes, learn where and how to act, and convert dense models into efficient experts.

Research Papers

AdaCodec: predictive visual code speeds up video reasoning

AdaCodec changes how video assistants feed visuals to a model: instead of sending every frame as a full image, it sends a full reference frame only when needed and otherwise transmits compact difference tokens that capture motion and residuals. The authors call this a predictive visual code and implement it for video multimodal large language models (video MLLMs). ¹

Across eleven benchmarks, AdaCodec improves over a per-frame RGB baseline using Qwen3-VL-8B at the same visual-token budget. Even at one-seventh the budget (32k tokens), it surpasses the 224k-token baseline on all long-video benchmarks; on five general-video benchmarks, it also raises the average score while cutting time-to-first-token from 9.26 seconds to 1.62 seconds. ¹
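To put the quoted figures in proportion, here is a quick back-of-the-envelope calculation. It uses only the numbers reported above; the variable names are illustrative, and the two ratios are computed independently because the summary does not say whether the latency figure was measured at the 32k budget.

```python
# Figures quoted in the article; variable names are illustrative only.
baseline_tokens = 224_000   # per-frame RGB baseline visual-token budget
adacodec_tokens = 32_000    # reduced AdaCodec budget
baseline_ttft_s = 9.26      # baseline time-to-first-token, in seconds
adacodec_ttft_s = 1.62      # AdaCodec time-to-first-token, in seconds

budget_ratio = baseline_tokens / adacodec_tokens        # how many times smaller
ttft_speedup = baseline_ttft_s / adacodec_ttft_s        # how many times faster
ttft_reduction = 1 - adacodec_ttft_s / baseline_ttft_s  # fractional latency cut

print(f"token budget: {budget_ratio:.1f}x smaller")
print(f"time-to-first-token: {ttft_speedup:.2f}x faster ({ttft_reduction:.1%} lower)")
```

In other words, the reported latency drop is roughly a 5.7x speedup, or about an 82.5% reduction in waiting time before the first token.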

For practitioners, this means faster first answers and lower cost on long clips without sacrificing accuracy. Technically, AdaCodec spends full visual tokens only when predictive cost is high and packs inter-frame changes into compact P-tokens, an idea similar to sending keyframes plus deltas in video codecs. ¹
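The keyframe-plus-delta analogy can be sketched in a few lines. This is a toy illustration of the codec idea only, not AdaCodec's method: the real system works on learned visual tokens, whereas this sketch uses integer pixel lists, a mean-absolute-difference stand-in for "predictive cost", and a hand-picked threshold. All function names here are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Frame = List[int]  # a flattened toy frame of integer pixel intensities
Packet = Tuple[str, Union[Frame, Dict[int, int]]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Toy stand-in for predictive cost: mean absolute pixel change."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def encode(frames: List[Frame], threshold: float) -> List[Packet]:
    """Emit ("I", full frame) when change is large, else ("P", sparse delta)."""
    stream: List[Packet] = []
    prev = None
    for frame in frames:
        if prev is None or mean_abs_diff(frame, prev) > threshold:
            stream.append(("I", list(frame)))  # full reference frame
        else:
            # Store only the positions that changed since the previous frame.
            delta = {i: v - p for i, (v, p) in enumerate(zip(frame, prev)) if v != p}
            stream.append(("P", delta))
        prev = list(frame)
    return stream


def decode(stream: List[Packet]) -> List[Frame]:
    """Rebuild every frame from reference frames plus deltas."""
    frames: List[Frame] = []
    current: Frame = []
    for kind, payload in stream:
        if kind == "I":
            current = list(payload)
        else:
            current = list(current)
            for i, d in payload.items():
                current[i] += d
        frames.append(current)
    return frames


frames = [[10, 10, 10, 10], [10, 10, 11, 10], [10, 10, 11, 12], [90, 80, 70, 60]]
stream = encode(frames, threshold=5.0)
kinds = [kind for kind, _ in stream]
print(kinds)
```

The two nearly static middle frames become small delta packets, while the scene change at the end triggers a new full frame. Decoding the stream reproduces the original frames exactly in this integer toy setting.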

AFUN: affordance masks and motion from a single view

AFUN aims to teach robots both where to interact and how to move after contact, using a single RGB-D (color plus depth) observation and a language-described task. It predicts a task-conditional functional mask (where) and a 3D post-contact motion curve (how), pushing toward an affordance foundation model for functionality understanding. ²

AFUN is built on a standardized data pipeline that unifies robot, human, simulation, and real-world scan data into a shared schema with language, masks, and object-centric 3D motion labels. It reports large gains: mean generalized Intersection-over-Union (gIoU) and class-wise IoU (cIoU) improve by 23.9 and 26.3 points, respectively, across eight test sets; contact-point hit rate increases by 12.7–61.3%; and 3D motion performance leads on all three test sets. The model deploys to real robot manipulation without embodiment-specific finetuning or task-specific heuristics. ²
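For readers unfamiliar with mask metrics, the sketch below shows plain intersection-over-union between a predicted and a ground-truth binary mask, plus a simple "does the predicted contact point land inside the target region" check. These are generic illustrations; the paper's exact gIoU, cIoU, and hit-rate definitions may differ, and the function names and toy masks are assumptions.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 marks the functional region


def mask_iou(pred: Mask, target: Mask) -> float:
    """Plain IoU between two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (a convention, not a standard).
    return inter / union if union else 1.0


def contact_hit(point: Tuple[int, int], target: Mask) -> bool:
    """True if the (row, col) contact point falls inside the target mask."""
    row, col = point
    return bool(target[row][col])


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

iou = mask_iou(pred, target)
print(iou)
print(contact_hit((1, 2), target), contact_hit((0, 1), target))
```

Here two pixels overlap out of four covered by either mask, so the IoU is 0.5. Averaging such per-sample scores over a test set is one common way to report a mean IoU.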

DOT-MoE: turning dense layers into experts with optimal transport

DOT-MoE converts a pre-trained dense model into a sparse Mixture of Experts (MoE) by formulating neuron assignment in dense Feed-Forward Network (FFN) layers as a Differentiable Optimal Transport (DOT) problem. Instead of heuristic clustering or random splits, it uses Sinkhorn-Knopp iterations to enforce balanced expert capacity and Straight-Through Estimators (STE) to jointly learn discrete neuron-to-expert assignment and token-to-expert routing end to end. ³
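The balancing role of Sinkhorn-Knopp can be seen in a small, dependency-free sketch. It alternately normalizes the rows and columns of a neuron-by-expert affinity matrix so that each neuron's assignment mass sums to 1 and each expert receives an equal share. The scores, the temperature `eps`, and the function name are assumptions for illustration; the paper's actual cost function and the straight-through gradient step (which needs an autodiff framework) are not shown.

```python
import math
from typing import List


def sinkhorn(scores: List[List[float]], eps: float = 0.1, iters: int = 200) -> List[List[float]]:
    """Return a soft plan: rows sum to ~1, columns sum to n_neurons / n_experts."""
    n, m = len(scores), len(scores[0])
    # Gibbs kernel; subtracting each row's max keeps the exponentials stable.
    plan = [[math.exp((s - max(row)) / eps) for s in row] for row in scores]
    col_target = n / m  # equal capacity per expert
    for _ in range(iters):
        for i in range(n):  # row step: each neuron distributes mass 1
            row_sum = sum(plan[i])
            plan[i] = [v / row_sum for v in plan[i]]
        for j in range(m):  # column step: each expert receives col_target
            col_sum = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] *= col_target / col_sum
    return plan


# Hypothetical affinities of 4 neurons to 2 experts.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]

greedy = [row.index(max(row)) for row in scores]   # ignores capacity
plan = sinkhorn(scores)
balanced = [row.index(max(row)) for row in plan]   # hard pick from the soft plan

print(greedy)
print(balanced)
```

Greedy assignment sends three of the four neurons to the first expert, while the Sinkhorn plan moves the least-committed neuron to the second expert. Taking the argmax of a soft plan happens to be balanced in this toy case but is not guaranteed to be in general, which is part of why a method would need an explicit discrete-assignment mechanism.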

Across multiple architectures and benchmarks, DOT-MoE retains about 90% of the original dense model’s performance while halving active parameters, outperforming structured pruning, heuristic clustering, and random splits. Takeaway: you can get MoE-style efficiency without training from scratch, with a principled routing and capacity mechanism. ³
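One way a halving of active parameters can arise is by splitting an FFN's hidden neurons evenly across experts and routing each token to only some of them. The sketch below works through that arithmetic under stated assumptions: a two-matrix FFN without biases, equal-sized experts, router weights ignored, and layer sizes that are purely hypothetical rather than taken from the paper.

```python
from typing import Tuple


def ffn_param_counts(d_model: int, d_ff: int, n_experts: int, top_k: int) -> Tuple[int, int]:
    """Return (dense FFN params, params active per token after the split).

    Assumes up- and down-projection matrices only, no biases, and that
    d_ff divides evenly among the experts.
    """
    if d_ff % n_experts != 0:
        raise ValueError("d_ff must be divisible by n_experts")
    dense = 2 * d_model * d_ff
    per_expert = 2 * d_model * (d_ff // n_experts)
    return dense, top_k * per_expert


# Hypothetical layer: 8 experts carved from one FFN, 4 active per token.
dense, active = ffn_param_counts(d_model=4096, d_ff=16384, n_experts=8, top_k=4)
print(dense, active, active / dense)
```

With these illustrative settings, each token touches half of the original FFN weights. Different expert counts or routing widths would give a different active fraction.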

Why It Matters

All three works point to “smarter tokens and routing”: send only what changes in video (AdaCodec), specify where and how a robot should act (AFUN), and activate only the experts you need (DOT-MoE). For builders, this translates into lower latency, lower token or parameter use, and better task success on long or open-ended inputs. ¹ ² ³

Two practical dials to watch are time-to-first-token and active-parameter count: AdaCodec reports cutting time-to-first-token from 9.26s to 1.62s on general-video tasks, and DOT-MoE reports 50% fewer active parameters while keeping roughly 90% of dense-model performance. These are direct levers on user experience and cloud cost. ¹ ³

This Week, Try It

AdaCodec in 3 minutes: skim the figures and the predictive visual code section on arXiv to see how P-tokens summarize changes. https://arxiv.org/abs/2606.02569v1
Watch AFUN’s project clips: see masks and 3D motion curves on everyday objects. https://www.zhaoningwang.com/AFUN

Sources

[1] arXiv: AdaCodec: A Predictive Visual Code for Video MLLMs
[2] arXiv: AFUN: Towards an Affordance Foundation Model for Functionality Understanding
[3] arXiv: DOT-MoE: Differentiable Optimal Transport for MoEfication
