Vol.01 · No.10 Daily Dispatch March 26, 2026

Latest AI News


Google's TurboQuant slashes LLM KV cache 6x without quality loss; NVIDIA ships 120B MoE that runs like 12B

Lossless KV-cache quantization and hybrid MoE backbones are redefining AI efficiency: cheaper context, longer memories, and real throughput gains you can deploy today.


One-Line Summary

Google debuts TurboQuant for lossless AI memory compression while NVIDIA and Microsoft ship efficiency-focused models and Google expands Lyria to full song length.

LLM & SOTA Models

Google Lyria 3 Pro: From 30 Seconds to 3-Minute Songs

Think of Lyria 3 Pro as moving from jingles to full tracks: instead of topping out at 30 seconds, it now generates songs up to 3 minutes, with prompts that can specify intros, verses, choruses, and bridges—useful for creators who need structure, not just texture. Access rolls out to the Gemini app for paid subscribers, plus integrations in **Google Vids** and **ProducerAI**, and developer access via **Vertex AI** (public preview), **Gemini API**, and **AI Studio**. 1 2 3

Google stresses training on partner data and permissible YouTube/Google sources, and says it won’t mimic a specific artist; if an artist name appears in the prompt, the model only takes “broad inspiration.” To aid provenance, all outputs carry SynthID watermarks. This pushes toward responsible generative audio as platforms like Spotify and Deezer add tools to flag or review AI-generated music. 1 2 3

Why it matters: longer duration plus section-level control moves AI music from “sound demo” to “song draft,” and enterprise availability (Vertex AI, API) means it can slot into ad-tech and video workflows quickly. The next watch items are latency, style diversity at 3 minutes, and how watermarking and attribution guardrails interplay with creator tools. 1 3

NVIDIA Nemotron 3 Super: 120B Capacity, 12B Active for Agentic Reasoning

Nemotron 3 Super targets multi-agent "context explosion" with a 1M-token context window and a hybrid backbone: **Mamba-2** layers for linear-time sequence handling, interleaved **Transformer** attention for precise recall, and a latent **Mixture-of-Experts** (MoE) that activates about **12B** of **120B** total parameters per token. NVIDIA cites up to **5× throughput** over the prior Super, native **NVFP4** precision optimized for Blackwell GPUs, and **multi-token prediction** (MTP) for faster, more coherent long-form generation. Weights, datasets, and recipes are open. 4 5

Third-party write-ups underscore practicalities: active params **12.7B** (of **120.6B**), **512 experts** with **22 active** per pass, and reported throughput of **478 tok/s** on B200—about **7.5×** Qwen3.5‑122B under those conditions. It can run at **4-bit** (Q4) on a single **H100‑80GB** for moderate contexts, though pushing to the full **1M** context generally needs H200 or multi‑GPU. Benchmarks note strong showings on **AIME 2025**, **SWE‑Bench Verified**, and **Terminal Bench**. 6 7 4
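The sparsity math above (512 experts, 22 active per token) is what keeps a 120B-capacity model running like a 12B one: the router scores every expert but only executes the top-k. A minimal sketch of that top-k routing, with toy shapes and a tiny ReLU MLP standing in for the real expert networks (all dimensions here are illustrative, not Nemotron's actual configuration):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=22):
    """Route one token through a top-k mixture of experts.

    x: (d,) token hidden state; gate_w: (n_experts, d) router weights;
    experts: list of (W_in, W_out) per-expert MLP weight pairs.
    Only k of n_experts execute, so active parameters stay a small
    fraction of the total -- the property sparse MoE models exploit.
    """
    logits = gate_w @ x                       # (n_experts,) router scores
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the selected k only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (w_out @ np.maximum(w_in @ x, 0.0))  # tiny ReLU MLP expert
    return out

# Toy configuration echoing the reported 512 experts / 22 active per pass
rng = np.random.default_rng(0)
d, n_experts, k = 64, 512, 22
x = rng.standard_normal(d)
gate_w = rng.standard_normal((n_experts, d))
experts = [(rng.standard_normal((128, d)) * 0.05,
            rng.standard_normal((d, 128)) * 0.05) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, k)
print(y.shape)  # (64,)
```

Per-token compute scales with k, not with the expert count, which is why adding experts grows capacity without growing the "active" FLOPs budget.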

For agents that plan, code, and call tools over hours-long sessions, the combo of latent MoE and Mamba reduces the “thinking tax” (expensive reasoning per step) while the 1M context tames history replay. Open resources and NVFP4 efficiency point toward lower per-token cost on Blackwell; keep an eye on long-context coherence beyond 500K tokens and cost curves versus dense baselines. 5 6

Microsoft Phi‑4‑reasoning‑vision‑15B: Compact Multimodal Reasoning

Microsoft introduces Phi‑4‑reasoning‑vision‑15B, an open‑weight **15B** multimodal model tuned for math/science reasoning and UI grounding. It uses a mid‑fusion setup with **SigLIP‑2** (NaFlex) dynamic‑resolution vision encoding and the **Phi‑4‑Reasoning** backbone. The team reports accuracy competitive with much slower models that use **10× more compute** and tokens across benchmarks like ChartQA, MathVista, MMMU, and ScreenSpot; training leaned on about **200B** multimodal tokens plus Phi‑family language base models, with far less overall compute than many peers. 8

A key takeaway is data and process, not just scale: careful curation and targeted augmentation (e.g., fixing formatting, re‑generating wrong answers, diversifying prompts, dynamic resolution) boosted performance, especially on high‑resolution screen tasks (ScreenSpot‑Pro improved notably when the visual token budget was increased to ~3600). This aligns with broader best practices for **vision‑language models** (VLMs), which combine a visual encoder with an LLM to align image patches and text in a shared space. 8 9 10
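To make the "~3600 visual token budget" concrete: ViT-style encoders split an image into fixed-size patches, so the token count is just the patch-grid area, and a dynamic-resolution scheme rescales the image until the grid fits the budget. A rough sketch under assumed parameters (patch size 16 is an illustrative choice, not the model's documented config):

```python
import math

def visual_tokens(height, width, patch=16):
    """Patch tokens a ViT-style encoder produces for an image.

    NaFlex-style dynamic resolution keeps the native aspect ratio and
    scales the grid to fit a token budget; patch=16 is illustrative.
    """
    return (height // patch) * (width // patch)

def fit_to_budget(height, width, budget, patch=16):
    """Downscale (if needed) so the patch grid stays within `budget` tokens."""
    tokens = visual_tokens(height, width, patch)
    if tokens <= budget:
        return height, width
    scale = math.sqrt(budget / tokens)        # uniform scale preserves aspect
    return (int(height * scale) // patch * patch,
            int(width * scale) // patch * patch)

# A ~3600-token budget admits roughly a 960x960 grid at patch size 16:
print(visual_tokens(960, 960))        # 3600
print(fit_to_budget(2160, 3840, 3600))
```

This is why raising the budget helps screen tasks specifically: UI screenshots are dense and high-resolution, and a larger grid preserves small text and widgets that aggressive downscaling would blur away.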

Why it matters: smaller, faster multimodal models reduce deployment friction and cost while retaining reasoning skills. The open‑weight release also invites replication of their recipe—mid‑fusion, dynamic resolution, and selective data repair—which recent surveys highlight as levers for robust multimodal generalization. 8 11

Research Papers

TurboQuant: Lossless‑Quality Compression for the KV Cache

Large language models keep a temporary "cheat sheet" called the **key‑value (KV) cache** to avoid recomputing attention. Google's new TurboQuant compresses this cache dramatically—researchers report a **~6× reduction** in memory and up to **8× faster** attention score computation in some tests—without degrading output quality. Crucially, it targets inference memory, so it doesn't reduce training RAM needs, but it can make serving cheaper and enable longer contexts on the same hardware. 12 13
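To see why a ~6× shrink matters, it helps to size the KV cache directly: it grows linearly with sequence length and layer count, at keys-plus-values per attention head. A back-of-the-envelope calculator, using an illustrative Llama-7B-like shape rather than any model named in the article:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits=16):
    """Memory for one sequence's key-value cache.

    The factor of 2 covers keys and values; `bits` is the storage
    precision per element. Model shape is illustrative, not taken
    from any specific model discussed above.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

full = kv_cache_bytes(32, 32, 128, 128_000, bits=16)   # 16-bit baseline
quant = kv_cache_bytes(32, 32, 128, 128_000, bits=3)   # ~3-bit, TurboQuant-style
print(f"{full / 2**30:.1f} GiB -> {quant / 2**30:.1f} GiB "
      f"({full / quant:.1f}x smaller)")
# roughly 62.5 GiB -> 11.7 GiB, a ~5.3x reduction before any residual overhead
```

At 128K tokens the cache alone dwarfs most single-GPU memory, which is why cutting bits-per-entry translates directly into longer contexts or more concurrent sequences on the same hardware.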

What's inside: TurboQuant first applies **PolarQuant**, which represents vectors in polar form—storing a radius (magnitude) and angle (direction)—to achieve compact, normalized encodings that better survive compression. It then adds **Quantized Johnson–Lindenstrauss** (QJL), a **1‑bit** residual correction that preserves pairwise relationships (+1/−1) to tighten attention scores. Together, they quantize cache entries to as low as **3 bits** with "perfect downstream results" on long‑context evals using Gemma and Mistral open models in Google's report. 13 12
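The magnitude/direction split plus a 1-bit signed residual can be illustrated with a toy version: quantize the unit direction coarsely, then store one sign bit per coordinate to nudge the reconstruction back. This is loosely inspired by the PolarQuant + QJL pairing described above, not the paper's actual algorithm; every parameter here (3 angle bits, the 0.05 correction step) is an invented illustration.

```python
import numpy as np

def polar_quantize(v, angle_bits=3):
    """Toy polar-style quantization: store magnitude + coarse direction.

    Illustrative only -- the real PolarQuant/QJL scheme differs.
    """
    r = np.linalg.norm(v)
    direction = v / (r + 1e-12)
    levels = 2 ** angle_bits
    # Coarse per-coordinate quantization of the unit direction to `levels` steps
    q = np.round((direction + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
    residual_sign = np.sign(direction - q)    # 1-bit correction per coordinate
    return r, q, residual_sign

def polar_dequantize(r, q, residual_sign, step=0.05):
    """Reconstruct: coarse direction plus a small signed correction."""
    d = q + step * residual_sign
    d /= np.linalg.norm(d) + 1e-12            # re-normalize to the unit sphere
    return r * d

rng = np.random.default_rng(1)
v = rng.standard_normal(8)
r, q, s = polar_quantize(v)
v_hat = polar_dequantize(r, q, s)
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative error: {err:.3f}")
```

The intuition the toy captures: magnitudes compress well on their own, directions live on a bounded sphere (so coarse codes stay sane), and a cheap signed residual claws back much of the rounding error—three pieces that together make very low bit-widths viable.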

Google cites ICLR 2026 presentation plans and references prior components—**PolarQuant** for quantization and **QJL** for training/optimization—while early hardware results show **4‑bit** TurboQuant attention running **8× faster** than **32‑bit** keys on **NVIDIA H100** GPUs in specific paths. Industry chatter dubs it a "Pied Piper moment," but context matters: it's a lab result today and targets the inference‑time KV cache, not end‑to‑end RAM, so real‑world impact depends on integration into runtimes and compatibility with attention variants. 12 13

Community Pulse

Hacker News (575↑) — Technical optimism with curiosity about limits: how far can quantization go before it harms reasoning and accuracy?

"i wonder what is the limit of quantization when it starts to destroy the logic of weights?"

Hacker News (93↑) — Mixed on compact multimodal progress: benchmark gains are welcome, but some doubt real‑world understanding versus “world models.”

"I'm very happy to read about this progress but I don't find it particularly surprising. The big labs optimize for accuracy/high scores on benchmarks first; I automatically expect that (with some research effort) a model with 100x few parameters can achieve the same scores."

"yeah i know lol, that’s kind of my point. impressive that it runs on your gpu, but it still can’t tell you what happens if you tilt a glass..."

Why It Matters

TurboQuant’s reported KV‑cache shrink and attention speedup, if borne out in frameworks, could cut inference costs and unlock longer contexts on current GPUs; in parallel, Nemotron 3 Super and Phi‑4‑reasoning‑vision‑15B show a broader shift toward efficiency—MoE routing, state‑space layers, dynamic‑resolution vision—over brute‑force scaling. 13 4 8

For readers, the takeaway terms—KV cache, vector quantization (PolarQuant), 1‑bit residuals (QJL), latent MoE, state‑space models (Mamba), and mid‑fusion VLMs—are becoming the toolkit for running bigger ideas on smaller budgets. The next six months will test how these lab wins translate to production latency, stability past 500K–1M tokens, and actual per‑token pricing. 12 5 8

Sources 13
