Vol.01 · No.10 · Daily Dispatch · March 25, 2026

Latest AI News


Speculative planning accelerates agentic multimodal LLMs by up to 3.35× without accuracy loss

A new agent-level speculation layer cuts the serial tool-use bottleneck in vision-language agents, while diffusion models reshape OCR and robust optical flow. Plus: an agent-native Lark/Feishu CLI for 200+ workflows.


One-Line Summary

Agentic multimodal systems get a blueprint for real-time speed: speculative planning cuts latency up to 3.35×, while diffusion reshapes OCR and robust optical flow.

Research Papers

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Agentic multimodal large language models (MLLMs) like OpenAI o3 rely on chains of visual tools (crop, zoom, OCR) that must run step by step. SpecEyes breaks this serial “agentic depth” by letting a lightweight, tool-free model speculate on the whole plan first, often answering immediately and skipping the heavy tool loop. In benchmarks (V* Bench, HR-Bench, POPE), it reports end-to-end speedups from 1.1× to 3.35× while preserving or even improving accuracy by up to +6.7%. In plain terms: many queries don’t need the slow toolbox, and SpecEyes confidently returns fast answers for those. 1 2 3

The core idea has three moving parts. First, a small MLLM acts as a speculative planner to predict whether tools are necessary, answering directly when possible. Second, a label-free confidence test—called an “answer separability” gate—measures how clearly the model prefers its top answer using its top-K logits, so it can self-verify without ground-truth labels. Third, a heterogeneous parallel funnel runs the small model at high concurrency to mask the large model’s slower, stateful execution. Together, this design boosts throughput under concurrent workloads without retraining the big model. 1 2
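The “answer separability” gate can be pictured as a simple margin test over the model’s top-K logits. The sketch below is illustrative (the paper’s exact statistic and threshold are not reproduced here): softmax the top-K logits and accept the direct answer only when the best option clearly dominates the runner-up.

```python
import math

def answer_separability(topk_logits, tau=1.0):
    """Illustrative confidence score: softmax over the top-K logits,
    then the margin between the best and second-best answer.
    A large margin means the model clearly prefers one answer."""
    exps = [math.exp(z / tau) for z in topk_logits]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    return probs[0] - probs[1]

def accept_direct_answer(topk_logits, threshold=0.5):
    """Skip the heavy tool loop when the gate clears the threshold."""
    return answer_separability(topk_logits) >= threshold

# A peaked distribution passes the gate; a flat one falls back to tools.
print(accept_direct_answer([9.0, 2.0, 1.0, 0.5]))  # True  (confident)
print(accept_direct_answer([2.1, 2.0, 1.9, 1.8]))  # False (ambiguous)
```

The key property is that the gate needs no ground-truth labels: it reads confidence straight off the model’s own output distribution.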

SpecEyes also formalizes where the gains come from using two rates: β, the fraction of queries that are truly tool-free, and α, the gate’s acceptance rate among those; when β is high and the gate is calibrated, the system hides most of the serial latency behind parallel small-model screening. Experiments use Qwen3-VL-2B as the small model and two agentic backbones (DeepEyes, Thyme), showing generalization: with Thyme, average speedup is ~1.4× with a slight accuracy bump. Code, evaluation scripts, and confidence analyses are released to ease replication. 1 3 4
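To see how α and β interact, here is a toy latency model (my own simplification, not the paper’s exact analysis): accepted queries cost only the small-model screen, while everything else pays the screen plus the serial tool loop.

```python
def expected_speedup(beta, alpha, t_fast, t_slow):
    """Toy latency model: a fraction beta of queries is tool-free and
    the gate accepts alpha of those. Accepted queries cost t_fast (the
    small-model screen); everything else pays the screen plus the
    serial tool loop t_slow. Returns speedup vs. always-tool-loop."""
    p_accept = beta * alpha
    avg = p_accept * t_fast + (1 - p_accept) * (t_fast + t_slow)
    return t_slow / avg

# With 70% tool-free queries, a well-calibrated gate (alpha = 0.9),
# and a 10x gap between the screen and the tool loop:
print(round(expected_speedup(0.7, 0.9, 1.0, 10.0), 2))  # ~2.13
```

The model also shows the failure mode: if β is near zero, every query pays the screening overhead on top of the tool loop, and the “speedup” dips below 1×, which is why calibration of the gate matters.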

From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents

This survey reframes LLM agent systems as agentic computation graphs: nodes (LLMs, tools, memory) and edges (data/control flow). It distinguishes static scaffolds—fixed, reusable workflows—from dynamic structures that are selected or edited per run, and organizes prior work by when structure is decided, which parts are optimized (nodes vs. graph), and what signals guide optimization (task metrics, verifiers, preferences, trace feedback). This gives builders a vocabulary to separate templates, realized graphs, and execution traces. 5 6
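The survey’s vocabulary is easy to make concrete. A minimal sketch (names are illustrative, not tied to any specific framework): nodes are LLM calls, tools, or memory; edges carry data/control flow; a static scaffold is just a fixed template graph reused across runs.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str  # "llm" | "tool" | "memory"

@dataclass
class AgentGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst) control/data flow

    def add(self, node):
        self.nodes[node.name] = node

    def connect(self, src, dst):
        self.edges.append((src, dst))

# A static scaffold: a fixed retrieve -> reason -> verify template.
g = AgentGraph()
for n in (Node("retrieve", "tool"), Node("reason", "llm"), Node("verify", "llm")):
    g.add(n)
g.connect("retrieve", "reason")
g.connect("reason", "verify")
print(len(g.nodes), len(g.edges))  # 3 2
```

In this framing, dynamic methods with high “graph plasticity” would mutate `nodes` and `edges` per input or even mid-execution, while static optimization searches offline over template graphs like the one above.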

A key takeaway: static optimization shines when the operator space is constrained, evaluation is reliable, and workloads are repetitive; in such settings, offline search (e.g., Monte Carlo Tree Search) can find templates that beat on-the-fly designs due to lower runtime cost and easier debugging. When tools or environments drift, dynamic methods with higher “graph plasticity” matter—pre-run selection or in-execution editing can adapt structures input-by-input. The survey also promotes structure-aware evaluation beyond task scores, including graph properties, cost, robustness, and structural variation. 5 7

For practitioners comparing prompt tuning to graph-level optimization: the survey argues graph-level changes can unlock larger gains when bottlenecks stem from control flow, coordination, or verification rather than instruction wording. It consolidates baselines and calls for reproducible standards so results are comparable across frameworks. 5 6

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Traditional OCR decodes characters left-to-right, which is slow and prone to cascading errors on long pages with tables and formulas. MinerU-Diffusion flips the script: it treats OCR as “inverse rendering,” asking what text would produce the image, and decodes in parallel using diffusion denoising. The result is up to 3.2× faster decoding with improved robustness, especially on complex layouts where sequential mistakes would otherwise compound. 8 9

Technically, it uses a block-wise diffusion decoder coupled with an uncertainty-driven curriculum to stabilize training and scale to long sequences. A new “Semantic Shuffle” benchmark stresses reduced dependence on language priors, showing the model relies more on the pixels than guessable text patterns—a good sign for math and table-heavy documents. Parallel denoising avoids the token-by-token latency of autoregressive decoders. 8 10
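The latency argument for parallel decoding can be demonstrated with a toy stand-in. The paper’s block-wise diffusion decoder is more involved; below is a MaskGIT-style confidence-ordered parallel fill (my illustrative substitute, with a random stand-in “model”): every position starts masked, each step proposes tokens for all masked slots at once, and the most confident share is committed, so the whole sequence resolves in a fixed number of steps rather than one step per token.

```python
import random

def parallel_decode(seq_len, steps, predict):
    """Toy parallel decoder: all positions start masked (None); each
    step, `predict` proposes a (token, confidence) for every masked
    slot at once, and the most confident fraction is committed."""
    tokens = [None] * seq_len
    for step in range(steps):
        proposals = {i: predict(i, tokens)
                     for i, t in enumerate(tokens) if t is None}
        if not proposals:
            break
        # Commit a share sized so all positions resolve within `steps`.
        k = max(1, len(proposals) // (steps - step))
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens, step + 1

def toy_predict(i, ctx):
    # Stand-in "model": deterministic token, random confidence.
    return i % 10, random.random()

random.seed(0)
out, used = parallel_decode(seq_len=64, steps=8, predict=toy_predict)
print(used, all(t is not None for t in out))  # 8 True
```

The point of the toy: a 64-token page resolves in 8 parallel steps instead of 64 sequential ones, which is where the up-to-3.2× decoding speedup on long, table-heavy pages comes from.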

Why it matters: document AI often dies by a thousand cuts—one mis-read cell corrupts a whole table. By decoding globally and in parallel, diffusion-based OCR contains local errors and speeds up high-throughput pipelines like invoice ingestion or scientific PDF parsing. Coverage on developer platforms highlights interest from practitioners handling messy real-world docs. 8 11

DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

In the wild, videos are blurry, noisy, and compressed—conditions where standard optical flow models stumble. DA-Flow taps into the internal features of image restoration diffusion models, which are naturally “corruption-aware,” and adds full spatio-temporal attention to make them “motion-aware.” These diffusion features are fused with CNN features in an iterative refinement loop, yielding state-of-the-art performance on degraded inputs. 12 13

On Sintel and Spring, DA-Flow cuts End-Point Error (EPE) substantially—e.g., Spring EPE 2.207 vs. 2.703 for the best baseline—and improves outlier rates across 1px/3px/5px thresholds. On TartanAir it shows lower outlier rates but a slightly higher EPE (8.866) than FlowSeek (7.694), implying more consistent flow with a few large misses. Ablations confirm the lifted diffusion features with spatio-temporal attention are key to the gains. 13 14
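The two metrics in those tables are standard and worth keeping straight: EPE averages the per-pixel error magnitude (so a few large misses inflate it), while the Npx outlier rate counts how often the error exceeds N pixels (so it rewards consistency). A small NumPy sketch of both, checked on a synthetic flow field:

```python
import numpy as np

def flow_metrics(pred, gt, thresholds=(1, 3, 5)):
    """End-Point Error and per-threshold outlier rates for dense
    optical flow fields shaped (H, W, 2). EPE is the mean Euclidean
    distance between predicted and ground-truth flow vectors; the
    Npx outlier rate is the fraction of pixels with error > N px."""
    err = np.linalg.norm(pred - gt, axis=-1)
    epe = float(err.mean())
    rates = {f"{t}px": float((err > t).mean()) for t in thresholds}
    return epe, rates

# Synthetic check: a constant 1-pixel horizontal offset everywhere.
gt = np.zeros((4, 4, 2))
pred = gt.copy()
pred[..., 0] += 1.0
epe, rates = flow_metrics(pred, gt)
print(epe, rates["3px"])  # 1.0 0.0
```

This pairing explains the TartanAir result: DA-Flow’s lower outlier rates with a higher EPE than FlowSeek is exactly the signature of mostly consistent flow punctuated by a few large misses.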

Practically, this suggests a recipe for robust motion estimation on low-quality video: borrow generative priors from restoration models, add temporal attention, and fuse with discriminative encoders. For robotics, surveillance, and autonomous driving in adverse weather, being degradation-aware can be the difference between usable and unusable flow. 12 15

Open Source & Repos

larksuite/cli: The official Lark/Feishu CLI for humans and AI agents

This Go-based CLI covers core Lark/Feishu domains—Messenger, Docs, Base, Sheets, Calendar, Mail, Tasks, Meetings—with 200+ commands and 19 AI Agent Skills. It’s designed “agent-native,” so scripts and LLM agents can automate org workflows via a consistent interface. The project is MIT-licensed, requires Go >= 1.23, and is also distributed via npm. 16

Adoption signals: about 4.8k GitHub stars, 225 forks, and active development (last commit ~1 day ago; repo created ~7 days ago). Trend dashboards echo recent activity even if it hasn’t hit Trending yet, suggesting strong initial community pull from internal automation and agent builders. 17 18

Why it matters: enterprise collaboration stacks are where many AI agents live day-to-day. Having a single CLI that unifies messaging, docs, and calendar with agent-friendly skills lowers the glue-code burden and makes it easier to prototype “compound AI” systems that act across multiple apps. 16

Why It Matters

Across today’s items, the theme is structural efficiency: SpecEyes accelerates entire agentic loops, MinerU-Diffusion parallelizes OCR decoding, and DA-Flow reuses restoration priors for robust motion. These are system-level moves that reduce latency or failure cascades without simply scaling model size. 1 8 12

As LLM agents proliferate, the survey’s static-vs-dynamic lens helps teams pick when to fix a template and when to let the graph adapt at runtime. Meanwhile, infra like larksuite/cli hints at where agents will actually operate: embedded in the enterprise stack, coordinating tools with predictable interfaces. 5 16

Sources (19)
