5/18/2026

Tags: RLVR, Reinforcement Learning, Omni-modal, Visual reasoning, Agents, macOS

LLMs learn from 128 random demos with a new few-shot RLVR method

A few-shot reinforcement approach matches full-data baselines with just 128 examples, while a cleaned omni-modal benchmark clarifies real gains — and a macOS app packages local AI agents for everyday use.

One-Line Summary

Today’s research shows how to do more with less: a few-shot reinforcement method trains large models using 128 random examples, omni‑modal evaluation is cleaned to curb visual shortcuts, and a macOS app makes local AI agents practical.

Research Papers

FEST: Few-shot demos boost verifiable-reward RL for LLMs

This work proposes a way to train models for math and coding that uses just 128 randomly selected correct examples to guide a reinforcement learning loop whose outputs can be checked automatically. The approach builds on Reinforcement Learning with Verifiable Rewards (RLVR), a setup where answers can be programmatically verified (e.g., unit tests or numeric checks) but where finding correct chain-of-thought rollouts through exploration alone is often sample-inefficient. ¹
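To make "programmatically verified" concrete, here is a minimal sketch of a numeric-check reward. The function name `numeric_reward`, the rule of treating the last number in the output as the final answer, and the tolerance are all illustrative assumptions, not details taken from the paper.

```python
import re


def numeric_reward(model_output: str, reference: float, tol: float = 1e-6) -> float:
    """Return 1.0 if the final number in the output matches the reference, else 0.0.

    Hypothetical helper: treating the last number in the text as the model's
    final answer is an assumption made for this sketch.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0  # nothing to verify, so no reward
    return 1.0 if abs(float(numbers[-1]) - reference) <= tol else 0.0
```

A binary reward like this is cheap to compute, but it only pays out when a rollout lands on the right answer, which is one reason exploration on hard problems can be slow.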

The method, called FEST (FEw-ShoT demonstration-guided RLVR), combines three ingredients: a supervised signal, an on‑policy signal, and decaying weights on the small few‑shot dataset to prevent overfitting across multi‑epoch training. In plain terms, it lightly anchors learning to a handful of known‑good solutions while still letting the model explore and improve through verifiable rewards. ¹
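One way to picture the three ingredients is a single training objective in which the supervised term on the few-shot demos is scaled by a weight that shrinks over training. The sketch below is an assumption-laden illustration: the linear schedule, the names `demo_weight` and `combined_loss`, and the simple additive form are hypothetical, since the digest does not state the exact formulation.

```python
def demo_weight(step: int, total_steps: int, w0: float = 1.0, w_min: float = 0.0) -> float:
    """Weight on the few-shot supervised loss at a given step.

    Assumed schedule: linear decay from w0 to w_min (the real schedule may differ).
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w0 + (w_min - w0) * frac


def combined_loss(sft_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """On-policy RL loss plus a decaying supervised anchor on the demos."""
    return rl_loss + demo_weight(step, total_steps) * sft_loss
```

Early in training the demos dominate; as the weight decays, repeated passes over the same 128 examples contribute less, which is the intuition behind limiting overfitting across epochs.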

On several benchmarks, FEST outperforms baselines while using far less supervised fine-tuning data, and it can match runs that rely on the full dataset. This points to a practical path for cutting data collection costs without sacrificing accuracy on hard problems. ¹

To support replication, the authors share rollout evaluation data on Hugging Face, which helps teams reproduce results and examine how guidance affects chain‑of‑thought trajectories. ²

OmniBoost + OmniClean: staged post-training on a de-biased omni-modal benchmark

This study asks a simple question: are omni‑modal gains real integration of audio‑vision‑language, or artifacts of visual shortcuts? The authors probe nine benchmarks with visual‑only tests, filter out items solvable from images alone, and produce OmniClean: 8,551 retained queries from 16,968 audited — a clearer evaluation set for genuine multi‑modal reasoning. ³
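The filtering idea and the resulting counts can be sketched in a few lines. The helper `filter_visual_shortcuts` and its `visual_only_correct` callback are hypothetical stand-ins for the paper's visual-only probe; only the two item counts come from the digest.

```python
def filter_visual_shortcuts(items, visual_only_correct):
    """Keep only items that a visual-only probe fails to answer.

    `visual_only_correct` is a hypothetical callback returning True when the
    item is solvable from images alone.
    """
    return [item for item in items if not visual_only_correct(item)]


# Counts reported in the digest.
audited, retained = 16_968, 8_551
removed = audited - retained
retention_rate = retained / audited
print(f"removed={removed}, retained={retention_rate:.1%}")
# prints: removed=8417, retained=50.4%
```

In other words, roughly half of the audited queries did not make it into the cleaned set.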

They then train OmniBoost (based on Qwen2.5‑Omni‑3B) with a three‑stage recipe: mixed bi‑modal supervised fine‑tuning (SFT), mixed‑modality RLVR, and SFT on self‑distilled data. Balanced bi‑modal SFT brings limited, uneven gains; mixed‑modality RLVR yields the first broad lift; self‑distillation reshapes the profile. After the final stage, the 3B model reaches performance comparable to — slightly above in aggregate — Qwen3‑Omni‑30B‑A3B‑Instruct, without a stronger teacher. The team also provides evaluation assets on Hugging Face to aid reproducibility. ³

ViMU: testing video models on metaphor, irony, and social meaning

Most video AIs excel at literal recognition (objects, actions), but struggle with subtext — humor, irony, or cultural cues. ViMU introduces a benchmark that evaluates whether models can infer implicit meaning while grounding interpretations in audio‑visual‑language evidence, with both open‑ended and multiple‑choice questions designed to be hint‑free. ⁴

Related work on “listening to the unspoken” in interviews shows how multi‑modal signals (video, audio, text) can be fused and scored across multiple responses and dimensions; one framework reports a multi‑dimensional average MSE of 0.1824 and first place in the AVI Challenge 2025, underscoring growing interest in robust subtext assessment. ⁵

ATLAS: one functional token for agentic and latent visual reasoning

ATLAS proposes a simple unit — a single discrete “functional token” — that stands in for a visual operation. Because the token is part of the normal tokenizer, models can do visual reasoning via next‑token prediction without generating heavy intermediate images or calling external tools. The paper also introduces Latent‑Anchored GRPO (LA‑GRPO), an update to Group Relative Policy Optimization (GRPO) that anchors these tokens with an auxiliary objective to stabilize reinforcement learning. ⁶
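For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses for the same prompt. The sketch below shows that standard group-relative advantage only; it does not implement LA-GRPO's auxiliary anchoring objective, whose form is not given in this digest. The function name and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation; implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, two of which passed verification.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so no separate value network is needed.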

The result is a training path compatible with standard supervised fine‑tuning and reinforcement learning (no architectural changes), offering clearer interpretability and fewer slow context switches typical of tool‑based agents. A separate position paper, BetaWeb, explores how a blockchain‑enabled “Agentic Web” could provide identity, auditability, and incentives for autonomous agents at Internet scale. ⁶ ⁷

Open Source & Repos

Osaurus: a native macOS harness for local and cloud AI agents

Osaurus is an open-source, Swift-based macOS app that lets you run AI agents with local models or connect to cloud providers, while keeping tools, files, and memory on your own machine. It supports Apple's on-device models and Liquid AI's LFM family, cloud backends such as OpenAI, Anthropic, Gemini, xAI/Grok, Venice AI, and OpenRouter, and local model servers such as Ollama and LM Studio. As a full Model Context Protocol (MCP) server, it can expose your tools to any MCP-compatible client, and it ships with 20+ native plug-ins (Mail, Calendar, Browser, Filesystem, Git, and more). ⁸
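MCP messages are JSON-RPC 2.0 requests, so the kind of traffic an MCP-compatible client sends to a tool server can be sketched without any SDK. The snippet below only builds the request payloads; the helper `mcp_request`, the tool name `read_file`, and its arguments are hypothetical, and the transport and the initialization handshake that a real session needs are omitted. Nothing here is taken from Osaurus's own code.

```python
import json


def mcp_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request string of the kind MCP clients send."""
    msg = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)


# Ask the server which tools it exposes.
list_tools = mcp_request("tools/list")

# Invoke one tool by name (tool name and arguments are hypothetical).
call_tool = mcp_request(
    "tools/call",
    {"name": "read_file", "arguments": {"path": "notes.txt"}},
    request_id=2,
)
```

Because the protocol is this uniform, any client that speaks it can discover and call the same tools, which is the sense in which a server exposes tools to "any MCP-compatible client".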

A recent profile notes more than 112,000 downloads, a hardware-isolated sandbox for safety, and new voice capabilities. To run local models, the team suggests at least 64 GB RAM (about 128 GB for larger ones). Osaurus positions itself as a consumer-friendly alternative to developer-centric harnesses, while remaining open source and able to run fully offline with local models. ⁹

Why It Matters

A small amount of guidance can now stretch reinforcement learning much further: a randomly selected 128-example few-shot set steers verifiable-reward training to match full-data results. At the same time, cleaning omni-modal evaluation (8,551 of 16,968 items retained) focuses progress on true cross-modal reasoning rather than visual shortcuts. Together, these shifts encourage teams to invest in smarter supervision and stronger measurement instead of just bigger datasets. ¹ ³

On the tools side, packaging agents into a local macOS harness like Osaurus gives non‑developers a safer way to try local models and mix them with cloud systems — useful for privacy‑sensitive tasks and offline workflows. It signals a broader move toward flexible, user‑controlled AI stacks that can run where data lives. ⁸ ⁹

This Week, Try It

Osaurus on your Mac: download the latest release, enable a local model, and attach a plug‑in like Browser or Filesystem to see tool‑use in action. ⁸
Read OmniBoost + OmniClean: skim the paper’s filtering method and three‑stage post‑training recipe; compare the 8,551 vs. 16,968 item counts to see how debiasing changes evaluation. ³

Sources

[1] arXiv: Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
[2] Hugging Face: shutingmi/rollout_eval_data_20260515_144708
[3] arXiv: Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
[4] arXiv: ViMU: Benchmarking Video Metaphorical Understanding
[5] arXiv: Listening to the Unspoken: Exploring “365” Aspects of Multimodal Interview Performance Assessment
[6] arXiv: ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
[7] arXiv: BetaWeb: Towards a Blockchain-enabled Trustworthy Agentic Web
[8] GitHub: osaurus-ai/osaurus: Own your AI. The native macOS harness for AI agents -- any model, persistent memory, autonomous exec
[9] TechCrunch: Osaurus brings both local and cloud AI models to your Mac
[10] Hugging Face: VmaxRL/swebench-pro-public-dev-eval-v0-image-refs-20260516
[11] Hugging Face: soonweihong0857/so101_smolvla_task_20260517_005923
