Vol.01 · No.10 Daily Dispatch May 7, 2026

Latest AI News

AI · Papers · Daily Curation · Open Access
AI News · Research
7 min read

A new playbook for training reasoning AIs — and a 3x Gemma speed boost

Two surveys codify how to design and govern the data flows behind RL-tuned reasoning models and evolving agent skills, while Google ships multi‑token prediction to speed Gemma 4 and developer webhooks for long jobs.

One-Line Summary

Today’s research shifts focus from bigger models to better pipelines — codifying RL rollouts and dynamic skill libraries — while Google makes Gemma 4 up to 3x faster and adds tooling for long-running agent jobs.

LLM & SOTA Models

Gemma 4 gets up to 3x faster with multi-token prediction

Google releases small “drafter” models that guess several next words at once so the main Gemma 4 model can approve them in bulk, cutting response time. This approach, called multi-token prediction with speculative decoding, delivers up to a 3x speedup with no degradation in output quality, in part by sharing the target model’s key–value (KV) cache; the drafters ship under the Apache 2.0 license and run with common stacks like Transformers, MLX, vLLM, SGLang, and Ollama. Google also offers hardware tips, such as batching 4–8 requests on Apple Silicon for its 26B mixture‑of‑experts model, which yields roughly a 2.2x local speedup. 1
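
If you want to try the drafter+target pattern locally, here is a minimal sketch using Hugging Face Transformers’ assisted generation, which implements speculative decoding with a separate draft model. The checkpoint names below are placeholders, not the actual Gemma 4 and drafter IDs, and the shared-KV-cache variant described above may need runtime-specific flags.

```python
# Sketch of drafter + target speculative decoding via Transformers' assisted
# generation. Model IDs are placeholders -- substitute the real Gemma 4
# target and drafter checkpoints from Google's release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-4-27b-it"    # assumed target checkpoint name
drafter_id = "google/gemma-4-drafter"  # assumed drafter checkpoint name

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(
    drafter_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one paragraph.", return_tensors="pt"
).to(target.device)

# The drafter proposes several tokens per step; the target verifies them in a
# single forward pass and accepts the longest matching prefix.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```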

For developers building agent workflows that can take minutes or hours, Google adds event‑driven webhooks to the Gemini API — a push system that replaces wasteful polling. Requests are signed, follow the Standard Webhooks spec, guarantee at‑least‑once delivery with automatic retries for up to 24 hours, and can be configured globally or per request using HMAC or JWKS, reducing operational friction for long‑running jobs. 2
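Verifying those signed deliveries is a few lines of standard-library Python. The sketch below follows the Standard Webhooks convention (signature = "v1," plus base64 of an HMAC-SHA256 over "id.timestamp.body"); the header names and secret format are assumptions from that spec, so check the Gemini API docs for the exact scheme.

```python
# Minimal sketch of verifying an HMAC-signed webhook delivery per the
# Standard Webhooks convention. Header names and the "whsec_" secret prefix
# are assumptions from that spec, not Gemini-specific details.
import base64
import hashlib
import hmac

def verify_webhook(secret: str, headers: dict, body: bytes) -> bool:
    msg_id = headers["webhook-id"]
    timestamp = headers["webhook-timestamp"]
    # The signature header may carry several space-separated "v1,<sig>" values.
    signatures = headers["webhook-signature"].split()

    signed_content = f"{msg_id}.{timestamp}.".encode() + body
    key = base64.b64decode(secret.removeprefix("whsec_"))
    expected = base64.b64encode(
        hmac.new(key, signed_content, hashlib.sha256).digest()
    ).decode()

    # Constant-time comparison against every provided signature.
    return any(
        hmac.compare_digest(expected, sig.split(",", 1)[1])
        for sig in signatures
        if sig.startswith("v1,")
    )
```

Once a delivery verifies, acknowledge it quickly and hand the payload to your job handler; unacknowledged deliveries are what trigger the automatic retries.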

Google also updates AI Mode and AI Overviews in Search to surface sources more clearly. New features include suggested next steps, inline links beside relevant sentences, highlights from your news subscriptions, previews of community perspectives, and on‑hover website previews, underpinned by techniques like query fan‑out to reach deeper into the web. 3

Research Papers

GFCR: a playbook for RL rollouts in reasoning LLMs

This survey explains how to design the “rollouts” — the step‑by‑step paths an AI takes from prompt to final answer — that reinforcement learning (RL) systems use to improve large language model (LLM) reasoning. It introduces a four‑stage lifecycle called Generate–Filter–Control–Replay (GFCR): Generate proposes candidate trajectories and structures, Filter builds intermediate signals using verifiers or judges, Control allocates compute and decides when to continue, branch, or stop under budgets, and Replay reuses artifacts and curricula without weight updates. 4
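
As a mental model (not code from the paper), the four stages compose into a simple loop. Everything in the sketch below is illustrative: the function names, scoring, and budget logic are stand-ins, since the survey describes a lifecycle rather than a concrete API.

```python
# Illustrative skeleton of a Generate-Filter-Control-Replay (GFCR) rollout
# loop. All names here are hypothetical stand-ins for real components.
from dataclasses import dataclass, field

@dataclass
class Rollout:
    prompt: str
    steps: list[str] = field(default_factory=list)
    score: float = 0.0

def generate(prompt: str, n: int) -> list[Rollout]:
    # Generate: propose n candidate trajectories (e.g. sampled reasoning chains).
    return [Rollout(prompt=prompt, steps=[f"step for sample {i}"]) for i in range(n)]

def filter_rollouts(rollouts: list[Rollout]) -> list[Rollout]:
    # Filter: score with a verifier or judge and keep only what passes a gate.
    for r in rollouts:
        r.score = float(len(r.steps) > 0)  # stand-in for a real verifier
    return [r for r in rollouts if r.score > 0.5]

def control(rollouts: list[Rollout], budget: int) -> list[Rollout]:
    # Control: allocate compute under a budget -- continue, branch, or stop.
    return sorted(rollouts, key=lambda r: r.score, reverse=True)[:budget]

replay_buffer: list[Rollout] = []

def replay(rollouts: list[Rollout]) -> None:
    # Replay: reuse artifacts and curricula without any weight updates.
    replay_buffer.extend(rollouts)

for prompt in ["prove that 17 is prime"]:
    candidates = generate(prompt, n=8)
    kept = filter_rollouts(candidates)
    scheduled = control(kept, budget=4)
    replay(scheduled)
```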

Beyond the pipeline, the authors define criteria of reliability, coverage, and cost sensitivity to describe trade‑offs, and synthesize practices spanning verifiable rewards, process supervision, judge‑based gating, tree/segment rollouts, adaptive compute, early‑exit/partial rollouts, throughput optimization, and replay/recomposition. Case studies in math, code/SQL, multimodal reasoning, and tool‑using agents anchor the taxonomy, and a diagnostic index maps common rollout failures to GFCR modules with mitigation levers. 4

A companion view from “They Are Not Static: A Survey of Dynamic Agent Skills” treats agent skills as evolving artifact libraries — code, procedures, SKILL.md packages, graphs, or adapters — and formalizes each skill as a seven‑tuple (applicability, policy, termination, interface, edit, verification, lineage). At the library level, a ten‑operator algebra (ADD, REFINE, MERGE, SPLIT, PRUNE, DISTILL, ABSTRACT, COMPOSE, REWRITE, RERANK) organizes a 94‑paper audit set and highlights that verifier quality and admission/repair policies often matter more than raw skill count, while flat retrieval degrades as libraries grow. 5
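
To make the formalism concrete, a skill record and the operator set might be encoded as below. Only the field and operator names come from the survey; the data-structure choices are our own illustration, not the paper’s implementation.

```python
# Illustrative encoding of the survey's skill seven-tuple and ten-operator
# library algebra. The representation is an assumption; only the names are
# taken from the paper.
from dataclasses import dataclass
from enum import Enum, auto

@dataclass
class Skill:
    applicability: str   # when the skill can be invoked
    policy: str          # how it acts (code, procedure, SKILL.md, adapter, ...)
    termination: str     # when it is considered done
    interface: str       # the inputs and outputs it exposes
    edit: str            # how it may be modified
    verification: str    # how its outputs are checked
    lineage: str         # provenance: what it was derived from

class LibraryOp(Enum):
    ADD = auto()
    REFINE = auto()
    MERGE = auto()
    SPLIT = auto()
    PRUNE = auto()
    DISTILL = auto()
    ABSTRACT = auto()
    COMPOSE = auto()
    REWRITE = auto()
    RERANK = auto()
```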

Taken together, the two surveys shift attention from “which RL optimizer” to “what data gets generated, filtered, scheduled, and reused — and how evolving skills are verified and governed,” offering a blueprint for reproducible, compute‑aware, and trustworthy rollout design. 5

Predictive Video VAE improves generation by learning to foresee motion

This work, Predictive Video VAE (PV‑VAE), trains a video autoencoder to reconstruct visible frames and predict future ones at the same time, so its latent space better captures motion and temporal structure. On UCF101, PV‑VAE reports 52% faster convergence and a 34.42 FVD (Fréchet Video Distance) improvement over the Wan2.2 VAE, indicating higher‑quality generations at lower training cost. 6

The key idea is a predictive reconstruction objective: randomly drop future frames, encode only partial past observations, and train the decoder to both rebuild observed frames and predict the missing future. This encourages temporally predictive latents that improve downstream generative modeling. 6
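
A minimal sketch of such a masked-future objective is shown below; the toy encoder, decoder, and equal loss weighting are placeholders for illustration, not the paper’s architecture.

```python
# Minimal sketch (not the paper's code) of a predictive reconstruction
# objective: encode only past frames, decode the full clip, and penalize
# errors on both the observed frames and the hidden future frames.
import torch
import torch.nn as nn

T, C, H, W = 16, 3, 32, 32           # clip length and frame size (toy values)

encoder = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(C * H * W, 128))
decoder = nn.Sequential(nn.Linear(128, C * H * W), nn.Unflatten(2, (C, H, W)))

video = torch.randn(4, T, C, H, W)   # (batch, time, channels, height, width)

# Randomly choose how many trailing frames to hide from the encoder.
n_future = torch.randint(1, T // 2, (1,)).item()
past = video[:, : T - n_future]

latents = encoder(past)                       # (B, T - n_future, 128)
# Pad so the decoder also predicts the hidden future; a real model would use
# a temporal module here rather than repeating the last latent.
pad = latents[:, -1:].expand(-1, n_future, -1)
full_latents = torch.cat([latents, pad], dim=1)

recon = decoder(full_latents)                 # (B, T, C, H, W)

recon_loss = (recon[:, : T - n_future] - video[:, : T - n_future]).pow(2).mean()
predict_loss = (recon[:, T - n_future:] - video[:, T - n_future:]).pow(2).mean()
loss = recon_loss + predict_loss              # weighting is a free choice
loss.backward()
```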

Analyses suggest scalability — generative performance continues to improve as the VAE trains — and consistent gains on video understanding tasks, a sign that the learned latents capture temporal coherence and motion priors useful beyond generation. 6

Agentic red teaming goes from weeks to hours with automation

An agentic red‑teaming framework built on the open‑source Dreadnode SDK lets operators describe goals in natural language and automatically assembles attack workflows, compressing weeks of manual library wiring into hours. It unifies 45+ adversarial attacks, 450+ transforms, and 130+ scorers, and in a case study on Meta Llama Scout achieves an 85% attack success rate (severity up to 1.0) using zero human‑written code. 7

ARIS (Auto‑Research‑in‑sleep) complements this by providing an open research harness for long‑horizon ML workflows with cross‑model adversarial collaboration by default: one model executes while a different‑family reviewer critiques artifacts and requests revisions. The system includes 65 reusable Markdown‑defined skills, Model Context Protocol (MCP) integrations, a persistent research wiki, deterministic figure generation, five end‑to‑end workflows, and an assurance layer with integrity checks, result‑to‑claim mapping, claim auditing, multi‑pass scientific editing, mathematical‑proof checks, and visual PDF inspection. 8

Together these approaches address a central failure mode — “plausible success” without adequate evidence — by tying claims to auditable artifacts and letting operators focus on what to probe rather than how to implement the probing workflow. 7

Reasoning‑intensive retrieval and training data for search agents

BRIGHT‑Pro expands evaluation for reasoning‑intensive retrieval with expert‑annotated, multi‑aspect gold evidence and protocols that test retrievers in both static and agentic search. The authors also build RTriever‑Synth — an aspect‑decomposed synthetic corpus with complementary positives and conditioned hard negatives — and LoRA fine‑tune RTriever‑4B from Qwen3‑Embedding‑4B, showing substantial gains over the base model under aspect‑aware metrics. 9
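
For reference, attaching LoRA adapters to an off-the-shelf embedding backbone takes only a few lines with the peft library. The rank, alpha, and target modules below are generic assumptions, not the RTriever-4B training recipe.

```python
# Generic sketch of LoRA-adapting an embedding backbone with peft. The
# hyperparameters and target modules are illustrative defaults, not the
# paper's configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-4B")

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# Train with a contrastive objective over (query, positive, hard-negative)
# triples from the synthetic corpus; only the adapter weights are updated.
```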

In parallel, OpenSeeker‑v2 shows that simple supervised fine‑tuning (SFT) on only 10.6k high‑difficulty trajectories can train a 30B ReAct‑style agent to state‑of‑the‑art results: 46.0% on BrowseComp, 58.1% on BrowseComp‑ZH, 34.6% on Humanity’s Last Exam, and 78.0% on xbench — surpassing Tongyi DeepResearch’s heavier CPT+SFT+RL pipeline (43.4%, 46.7%, 32.9%, 75.0%). The team emphasizes that these results come from an academic group using SFT alone, and that the model weights are openly released. 10

The takeaway: better evaluation (multi‑aspect evidence, agentic protocols) and targeted data (harder trajectories, richer tools) can rival industrial pipelines, and aspect‑aware metrics reveal behaviors that standard relevance scores might hide. 9

Why It Matters

Across today’s items, progress comes from shaping processes — how rollouts are generated, filtered, and allocated, how skills are admitted and repaired, and how evidence is retrieved — not just from scaling parameters. That reframes “training” as lifecycle management of data, verifiers, and compute. 4

For practitioners, two practical dials stand out: assurance (verifier quality, admission gates, claim‑to‑evidence mapping) and generation speed (tokens per second). Multi‑token prediction offers a concrete lever on the latter, while GFCR‑style pipelines make the former measurable. 1

This Week, Try

  1. Multi‑token prediction on Gemma 4: run a drafter+target pair with your favorite runtime (Transformers/MLX/vLLM/Ollama) using Google’s guide. 1
  2. Gemini API webhooks: set up a global webhook and verify signed deliveries to replace polling in long jobs. 2

Sources (11)
