Vol. 01 · No. 10 · Daily Dispatch · May 5, 2026

Latest AI News


RL pushes vision-language agents past 100 steps; new tests expose weak spots

Odysseus trains a multimodal agent to make 100+ decisions in Super Mario Land and goes at least 3× farther than prior agents. Meanwhile, open models scale on tough exams and fresh benchmarks stress-test video lectures and visual honesty.


One-Line Summary

Long-horizon reinforcement learning turns multimodal models into game-playing agents while new multimodal benchmarks reveal gaps in execution, factuality, perspective, and honesty.

Research Papers

Odysseus trains vision-language agents to plan 100+ steps in Super Mario Land

Odysseus is a training framework that teaches a vision-language model to play Super Mario Land over 100+ turns, achieving at least 3× the average game progress of prior frontier models. It adapts proximal policy optimization (PPO) with a lightweight turn-level critic to stabilize long-horizon reinforcement learning, and it outperforms critic-free methods like GRPO and Reinforce++ on both stability and sample efficiency. 1

Starting from a pretrained VLM provides strong action priors, cutting down on manual action engineering and improving efficiency during RL. The trained agent generalizes across game levels and even to other games, while keeping its general-domain capabilities — a signal that long-horizon, multimodal RL can scale without sacrificing breadth. The authors release Odysseus as an open training framework. 1
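
To make the critic's role concrete, here is a minimal sketch of a PPO update with turn-level value estimates, the ingredient the paper credits for stabilizing 100+-turn training. Everything below (the `policy.log_prob` interface, the rollout layout, the loss weighting) is an illustrative assumption, not Odysseus's actual code.

```python
# Hedged sketch: one PPO update with a turn-level critic, in the spirit of
# Odysseus. Advantages are estimated per turn (decision), not per token,
# which keeps variance manageable over 100+ turn episodes.
import torch
import torch.nn.functional as F

def ppo_turn_level_update(policy, critic, optimizer, rollout,
                          clip_eps=0.2, gamma=0.99):
    obs, actions, old_logp, rewards = rollout        # tensors indexed by turn
    values = critic(obs).squeeze(-1)                 # V(s_t), one value per turn
    # One-step TD targets; the final turn bootstraps to zero.
    next_values = torch.cat([values[1:], values.new_zeros(1)])
    targets = rewards + gamma * next_values.detach()
    advantages = (targets - values).detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    new_logp = policy.log_prob(obs, actions)         # log pi(a_t | s_t); assumed API
    ratio = torch.exp(new_logp - old_logp)
    # Standard clipped surrogate objective.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = F.mse_loss(values, targets)

    optimizer.zero_grad()                            # optimizer holds both networks' params
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()
```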

In parallel, open-source multimodal models scale up: InternVL 2.5 becomes the first public MLLM to pass 70% on MMMU, reaching 70.1% with chain-of-thought (a 3.7-point gain over direct answers). The team shows that larger vision encoders can cut token needs (reporting comparable performance with roughly one-tenth training tokens versus a 600M-encoder baseline) and that careful data curation and test-time scaling matter. A live demo and weights are available. 2

Fresh evaluations underscore where models still stumble. Video-MMLU compiles 1,065 lecture videos across math, physics, and chemistry and finds even strong models sit between roughly 10% and 50% accuracy on tasks that demand both perception and reasoning — a reminder that classroom-grade comprehension is still hard. 3

MASCing steers MoE experts to reconfigure model safety

MASCing gives developers a knob to re-route which experts in a Mixture-of-Experts (MoE) model are activated at inference time — without retraining — to meet different safety goals. An LSTM-based surrogate predicts how routing choices map to downstream behavior, then a learned steering matrix applies masks to the router so specific expert circuits are boosted or suppressed. On multi-turn jailbreak defense, the average defense success rate jumps from 52.5% to 83.9% (with gains up to 89.2%); for adult-content generation, success rises from 52.6% to 82.0% (up to 93.0%) across seven open-source MoE models, with negligible overhead. 4
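
The core mechanism can be sketched in a few lines: add a bias to the MoE router's logits so chosen experts are promoted or suppressed at inference time, with no retraining. The additive-bias form and all names below are illustrative assumptions; the paper's learned steering matrix and LSTM surrogate are more involved.

```python
# Hedged sketch of steering MoE expert routing via a logit bias,
# in the spirit of MASCing.
import torch

def steered_topk_routing(router_logits, steering_bias, k=2):
    """router_logits: [tokens, n_experts]; steering_bias: [n_experts].
    Negative bias entries suppress experts; positive entries promote them."""
    steered = router_logits + steering_bias                 # re-route, no retraining
    weights, experts = torch.topk(torch.softmax(steered, dim=-1), k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize top-k
    return weights, experts

# Example: on a layer with 8 experts, suppress expert 3 and promote expert 7.
logits = torch.randn(4, 8)
bias = torch.zeros(8)
bias[3], bias[7] = -10.0, 2.0
weights, experts = steered_topk_routing(logits, bias)
```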

Why this matters: sparse-activation MoE models can behave differently across scenarios because routing, not just weights, shapes outputs. MASCing offers a practical way to reconfigure safety behavior per use case while preserving general utility — a middle path between rigid blocking lists and costly fine-tunes. 4

Trust, meanwhile, must be measured. MoHoBench introduces 12k+ unanswerable visual questions spanning four types (e.g., false premises, context-dependent questions) to assess whether multimodal models honestly decline to answer when the image alone is insufficient. Benchmarking 28 MLLMs shows that most fail to refuse appropriately and that honesty depends on the visual input, motivating multimodal-specific alignment; the authors provide SFT and DPO baselines. 5

At the system level, RouteMoA points to a complementary control lever: instead of running every model and then choosing among the outputs, it predicts before inference which models in a 15-model pool to consult, cutting compute by 89.8% and latency by 63.6% while maintaining higher accuracy; its top-3 selection contained the critical models on about 98% of queries. 6
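
As a minimal sketch of that pre-inference selection (the `ModelRouter` class and its architecture are assumptions for illustration, not RouteMoA's actual router): a small predictor scores every model in the pool from a query embedding, and only the top-k are invoked.

```python
# Hedged sketch: score models before inference and consult only the top-k.
import torch
import torch.nn as nn

class ModelRouter(nn.Module):
    def __init__(self, embed_dim=768, n_models=15):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, n_models)
        )

    def select(self, query_embedding, k=3):
        scores = self.scorer(query_embedding)   # predicted usefulness per model
        return torch.topk(scores, k).indices    # only these models are invoked

router = ModelRouter()
query = torch.randn(768)                        # e.g., a sentence embedding
chosen = router.select(query)                   # indices of the top-3 models
```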

Stable-GFlowNet stabilizes red-teaming to find diverse attacks

Stable-GFlowNet (S-GFN) is a red-teaming method that generates diverse, high-impact attack prompts more reliably by removing the brittle partition-function (Z) estimation that makes standard Generative Flow Networks unstable. It replaces Z with pairwise comparisons, adds robust masking to handle noisy rewards, and uses a “fluency stabilizer” to avoid degenerate, gibberish prompts — improving both attack success and diversity across settings. 7
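
The Z-removal trick admits a compact sketch. Trajectory balance requires log Z + log P_F(tau) = log R(tau) + log P_B(tau), so the residual log P_F - log P_B - log R should equal -log Z for every trajectory; matching residuals pairwise cancels Z without ever estimating it. The loss below shows that idea in isolation, under our own assumptions; the paper's masking and fluency-stabilizer components are omitted.

```python
# Hedged sketch of a pairwise trajectory-balance loss in which the
# partition function Z cancels out.
import torch

def pairwise_tb_loss(log_pf, log_pb, log_r):
    """log_pf, log_pb, log_r: [batch] per-trajectory sums of forward
    log-probs, backward log-probs, and log-rewards."""
    delta = log_pf - log_pb - log_r            # should be constant (-log Z)
    # Squared differences over all pairs in the batch; Z never appears.
    # The diagonal contributes zeros and is harmless.
    diff = delta.unsqueeze(0) - delta.unsqueeze(1)
    return (diff ** 2).mean()
```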

The takeaway: effective safety evaluation needs breadth, not just the single “nastiest” prompt. By preventing mode collapse during attack generation, S-GFN finds more varied failure cases and yields a stronger picture of where a model breaks. 7

Related work on the defense side explores combining model understanding with behavior signals: Trident studies malware detection with LLM-derived and behavioral features, reflecting a broader push to pair semantic modeling with execution traces. 8

LLMs skip steps on long procedures; new tests probe factuality and instruction following

A diagnostic study shows that when asked to execute a step-by-step arithmetic algorithm, many LLMs fail to faithfully run the full procedure as it grows longer: across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step programs to 20% on 95-step programs, with frequent issues like premature answers, under-executed traces, and self-corrections after initial errors. 9
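
A minimal sketch of this style of probe, assuming a hypothetical `ask_model` stub and a prompt format of our own invention: build a k-step arithmetic program, compute the ground truth step by step, and check the model's first answer against it.

```python
# Hedged sketch of an execution-fidelity probe for long procedures.
import random

def make_program(k, seed=0):
    """Build a k-step arithmetic program and its ground-truth result."""
    rng = random.Random(seed)
    start = rng.randint(1, 9)
    value, steps = start, []
    for _ in range(k):
        op = rng.choice(["+", "-", "*"])
        operand = rng.randint(1, 9)
        steps.append(f"{op} {operand}")
        value = {"+": value + operand,
                 "-": value - operand,
                 "*": value * operand}[op]
    return start, steps, value

def probe(ask_model, k=95):
    start, steps, truth = make_program(k)
    prompt = (f"Start with {start}. Apply each step in order, showing every "
              "intermediate result, then state the final value:\n"
              + "\n".join(steps))
    return ask_model(prompt) == truth   # ask_model: stub returning the first parsed integer
```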

For video understanding, Video SimpleQA targets factuality — short, definitive answers backed by external sources — and evaluates 41 LVLMs. Even the top model, Gemini-1.5-Pro, reaches only a 54.4% F-score; test-time compute gives minor gains, while retrieval-augmented generation improves scores at an inference-time cost. 10

On instruction following, MM-IFEngine builds a 23k-sample multimodal dataset (for SFT and DPO) and a stricter MM-IFEval benchmark with both format constraints and perception-linked requirements. Fine-tuning Qwen2-VL-7B on MM-IFDPO-23k boosts scores by +10.2%, +7.6%, and +12.3% on MM-IFEval, MIA, and IFEval; on MM-IFEval, GPT-4o scores 64.6% while Qwen2-VL-72B reaches 50.8%, leaving ample headroom. 11
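
For readers less familiar with the DPO half of that recipe, here is the standard preference objective in minimal form. This is the generic formulation, not the paper's training code; the per-example token log-prob sums are assumed to be computed upstream.

```python
# Hedged sketch of the DPO loss used for preference tuning.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Inputs: per-example sums of token log-probs under the policy
    and under a frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Push the policy to prefer chosen over rejected, anchored to the reference.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```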

Geometry is another blind spot: MMPerspective assembles 2,711 images and 5,083 QA pairs to test vanishing points, line relations, and perspective robustness. Across 43 MLLMs, models do better on surface perception but falter on compositional reasoning and under simple perspective-preserving edits; chain-of-thought helps but does not close the gap. 12

Why It Matters

Today’s thread is control plus coverage: practical RL shows how to push agents through 100+ decisions, while steering (MASCing) and sharper evaluations (Video-MMLU, Video SimpleQA, MM-IFEval, MMPerspective) expose and tune what still fails — a useful map if you’re deciding where to invest in reliability versus capability next. 1

Try This Week

  1. InternVL 2.5 demo: Try the live Hugging Face Space to see open multimodal reasoning in action — https://huggingface.co/spaces/OpenGVLab/InternVL
  2. ColorBench sampler: Browse the tasks and repo to see where color trips up VLMs — https://github.com/tianyi-lab/ColorBench

