Vol.01 · No.10 Daily Dispatch May 31, 2026

Latest AI News

Function-calling helps large language models fix their own prompts, boosting reasoning by up to 12.9 points

A new workflow turns models into their own prompt engineers by running full-set diagnostics and iterating on instructions. Alongside it, fresh papers push memory-based reasoning and slash video generation memory costs.

One-Line Summary

Researchers automate prompt editing with full-set diagnostics, while separate work unlocks in-model working memory and compresses video attention caches for longer, faster generation.

Research Papers

Reflective Prompt Tuning uses function calls to audit and fix prompts

Instead of hand-tuning instructions example by example, Reflective Prompt Tuning (RPT) runs a diagnostic over the full optimization set, summarizes the recurring mistakes, and then edits the prompt accordingly, using language model function-calling to automate the loop. In tests across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points and also strengthens confidence calibration. 1

How it works: an optimizer large language model (LLM) calls a diagnostic function that evaluates the target model on an entire optimization set, produces a structured failure report, and appends it to a memory of prior diagnostics. The optimizer then rewrites the prompt for the next round using this accumulated history, enabling targeted, history-aware edits rather than one-off tweaks. 1
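
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `diagnose`, `reflective_loop`, `toy_target`, and `toy_rewrite` are hypothetical names, plain Python callables stand in for the target model and the optimizer LLM's function calls, and the final prompt is chosen by accuracy alone, whereas the paper also uses calibration signals.

```python
from collections import Counter

def diagnose(prompt, examples, target_model):
    """Evaluate the target model on the whole optimization set and summarize failures."""
    failures = Counter()
    correct = 0
    for ex in examples:
        if target_model(prompt, ex["question"]) == ex["answer"]:
            correct += 1
        else:
            failures[ex["kind"]] += 1  # group mistakes by (hypothetical) failure category
    return {"prompt": prompt, "accuracy": correct / len(examples), "failures": dict(failures)}

def reflective_loop(initial_prompt, examples, target_model, rewrite, rounds=2):
    """Alternate full-set diagnosis and prompt rewriting, keeping every report in memory."""
    memory = []
    prompt = initial_prompt
    for _ in range(rounds):
        memory.append(diagnose(prompt, examples, target_model))
        prompt = rewrite(prompt, memory)  # the optimizer sees the accumulated history
    memory.append(diagnose(prompt, examples, target_model))  # score the last rewrite too
    best = max(memory, key=lambda report: report["accuracy"])
    return best["prompt"], memory

# Toy stand-ins (assumptions for illustration only, not real models).
def toy_target(prompt, question):
    a, b = map(int, question.split("+"))
    return str(a + b) if "show steps" in prompt else "?"

def toy_rewrite(prompt, memory):
    latest = memory[-1]
    if latest["failures"].get("arithmetic"):
        return prompt + " Always show steps."
    return prompt

examples = [
    {"question": "2+3", "answer": "5", "kind": "arithmetic"},
    {"question": "10+7", "answer": "17", "kind": "arithmetic"},
]
best_prompt, history = reflective_loop("Answer the question.", examples, toy_target, toy_rewrite)
print(best_prompt)
print([report["accuracy"] for report in history])
```

Keeping every report in `memory` is what lets the rewrite step react to patterns across rounds rather than to the latest failure only.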

Beyond accuracy, RPT supports confidence-aware optimization — it uses calibration signals both in feedback and in the final prompt selection. Analyses show the method is especially effective on multi-hop and mathematical reasoning, where the diagnosed failure patterns closely match the actual prompt revisions that drive gains. 1
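
The article does not say which calibration metric RPT uses. As one common possibility, the sketch below computes expected calibration error (ECE), which compares the model's stated confidence with its actual accuracy inside equal-width confidence bins. The function name and the choice of 10 bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece

# Two answers at 0.9 confidence (both right), two at 0.6 (one right, one wrong).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
print(round(ece, 3))
```

A lower value means stated confidence tracks real accuracy more closely, so a score like this could serve as a feedback signal or as a tie-breaker between prompts with similar accuracy.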

LLMSurgeon estimates a model’s training data mix from its outputs

LLMSurgeon tackles a simple but important question: from only a model’s generated text, can we infer what domain mix (for example, news, code, or encyclopedias) it was pretrained on? The paper formalizes this as Data Mixture Surgery (DMS) — estimating the domain-level distribution of a large language model’s pretraining corpus without access to the data itself. 2

Technically, LLMSurgeon treats DMS as an inverse problem under a label-shift assumption, estimating a calibrated soft confusion matrix to correct systematic domain confusion and recover the latent mixture prior. The authors also introduce LLMScan, a recipe-verifiable evaluation suite built from open-source models with known mixtures, and report high-fidelity recovery under fixed protocols — a practical path to post-hoc auditing of a model’s “digital DNA.” 2
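
To make the label-shift idea concrete, here is a minimal sketch of recovering a mixture prior from classifier predictions. It assumes the confusion matrix is already known, whereas the paper estimates a calibrated soft version, and it uses a standard EM-style fixed-point update rather than the authors' estimator. The function name, domain names, and all numbers are invented for illustration.

```python
def estimate_mixture(confusion, observed, iters=2000):
    """EM-style fixed-point update for the mixture prior under label shift.

    confusion[i][j]: P(classifier predicts domain i | text truly comes from domain j)
    observed[i]:     fraction of generated samples that the classifier labels as domain i
    """
    k = len(observed)
    prior = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iters):
        # Predicted-label distribution implied by the current prior.
        implied = [sum(confusion[i][j] * prior[j] for j in range(k)) for i in range(k)]
        # Reweight each domain by how well it explains the observed predictions.
        prior = [
            prior[j] * sum(observed[i] * confusion[i][j] / implied[i] for i in range(k))
            for j in range(k)
        ]
    return prior

domains = ["news", "code", "encyclopedia"]
confusion = [
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
]
observed = [0.45, 0.30, 0.25]  # what a domain classifier reports on the model's outputs
mixture = estimate_mixture(confusion, observed)
print({name: round(share, 3) for name, share in zip(domains, mixture)})
```

The raw classifier shares understate the dominant domain because some of its text is mislabeled as other domains; correcting for the confusion matrix undoes that systematic bias.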

Reasoning in Memory replaces token-by-token thoughts with fixed memory blocks

Reasoning in Memory (RiM) lets models do latent reasoning without emitting long chains of intermediate tokens: it allocates fixed “memory blocks” — sequences of special tokens — that the model processes in a single forward pass, then refines the final answer after each block. This decouples internal computation from autoregressive generation, aiming for compute-efficient reasoning. 3

RiM trains in two stages: first, it grounds the memory blocks by predicting explicit reasoning steps after each block; second, it drops step-level supervision and instead iteratively improves only the final answer. Across model families and sizes, RiM matches or beats existing latent reasoning methods while avoiding the cost of generating thoughts token by token. 3

VideoMLA compresses attention cache for long video diffusion

VideoMLA rethinks how the key-value (KV) cache is stored in streaming video diffusion by replacing per-head keys and values with a shared low-rank content latent and a shared decoupled 3D rotary position embedding (RoPE) key. This Multi-Head Latent Attention (MLA) design cuts per-token KV memory by 92.7% at every cached layer. 4
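
The memory saving follows from simple counting: standard multi-head attention caches a key and a value vector per head, while a latent design caches one shared latent plus one shared positional key. The sketch below works through that arithmetic. The dimensions are hypothetical and are not the paper's configuration, so the result is close to, but not the same as, the reported 92.7%.

```python
def mha_cache_per_token(n_heads, head_dim):
    """Values cached per token per layer with per-head keys and values."""
    return 2 * n_heads * head_dim

def mla_cache_per_token(latent_dim, rope_dim):
    """Values cached per token per layer with one shared latent and one shared RoPE key."""
    return latent_dim + rope_dim

def cache_reduction(n_heads, head_dim, latent_dim, rope_dim):
    """Fraction of per-token KV memory saved by the latent design."""
    baseline = mha_cache_per_token(n_heads, head_dim)
    compressed = mla_cache_per_token(latent_dim, rope_dim)
    return 1 - compressed / baseline

# Hypothetical sizes chosen only to illustrate the calculation.
saving = cache_reduction(n_heads=16, head_dim=128, latent_dim=256, rope_dim=64)
print(f"{saving:.1%}")
```

Because the saving applies per token and per cached layer, it compounds directly with video length, which is why it matters most for long rollouts.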

On VBench, VideoMLA matches strong short-horizon baselines and achieves the best overall score at longer horizons among evaluated methods, while improving throughput by 1.23× on a single B200 GPU. Interestingly, the authors find pretrained video attention is not low-rank (its 99%-energy effective rank exceeds practical latent sizes), and argue the MLA bottleneck — not the pretrained spectrum — sets the effective rank budget during training. 4

Open Source & Repos

oh-my-pi: a terminal-first AI coding agent with LSP integration

oh-my-pi is a command-line coding agent wired like an IDE: it performs hash-anchored edits, orchestrates an optimized tool harness, speaks the Language Server Protocol (LSP), drives a browser, and can spawn subagents — all from the terminal. It targets developers who want agentic code edits without jumping to web UIs. 5

The latest release notes (v15.6.0) mention a fix for Anthropic adaptive-thinking tool-use flows that previously triggered “thinking blocks in the latest assistant message cannot be modified” 400 errors — a sign the project is actively smoothing agent-tool interactions. Installation is available via npm under @oh-my-pi/pi-coding-agent. 5

Why It Matters

Automating prompt engineering with full-set diagnostics (RPT) points toward safer, cheaper adaptation of existing large language models: teams can iterate on instructions using failure patterns and calibration signals instead of spinning up new fine-tunes. In parallel, compute-efficient latent reasoning (RiM) suggests models can “think internally” without paying the cost of long chain-of-thought outputs. 1, 3

On the generative media front, compressing the attention cache for video diffusion (VideoMLA) reduces memory and latency pressure, opening room for longer rollouts on fixed hardware budgets. And on the tooling side, terminal-native coding agents like oh-my-pi show how agent workflows are moving into everyday developer environments rather than niche demos. 4, 5

Sources 5
