Vol.01 · No.10 Daily Dispatch June 10, 2026

Latest AI News


vLLM update speeds AMD Zen CPU inference and adds Mellum v2 support

Patch v0.22.1 brings faster quantized inference on AMD Zen CPUs and new model compatibility, while two new papers outline practical paths to compress expert‑gated and low‑rank models with competitive accuracy.


One-Line Summary

Serving and shrinking large models get practical boosts: vLLM speeds CPU inference, and two papers show new paths to compress expert and dense models.

Open Source & Repos

vLLM v0.22.1 adds Mellum v2 support and faster AMD Zen CPU inference

vLLM is an engine to run Large Language Models (LLMs) quickly and with less memory. The new v0.22.1 patch release (Jun 5, 2026) adds targeted bug fixes, new model support for JetBrains' Mellum v2, and zentorch‑accelerated quantized linear inference on AMD Zen Central Processing Units (CPUs). 1

Why this matters: vLLM positions itself as “easy, fast, and cheap LLM serving for everyone” and a “high‑throughput and memory‑efficient inference and serving engine.” CPU‑side speedups broaden where teams can deploy models when accelerators are scarce or costly, and the added model support reduces integration friction. 1

What to watch: organizations running on AMD Zen servers can test the new quantized linear path, and projects depending on Mellum v2 gain built‑in compatibility. The v0.22.1 notes list 8 commits from 6 contributors (1 new), signaling an incremental maintenance cadence. 1

Research Papers

Converting MoE models into dense LLMs with pruning and distillation

This paper proposes a way to turn a trained Mixture of Experts (MoE) language model into a standard dense model for easier deployment. Experts are scored, selected, and grouped, then concatenated into a dense feed‑forward network (FFN) and refined via knowledge distillation from the MoE teacher. The authors evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across 350 configurations on Qwen3‑30B‑A3B, and report that their diversity‑aware scoring consistently outperforms prior methods on Qwen3‑30B‑A3B, DeepSeek‑V2‑Lite, and GPT‑OSS‑20B. 2
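The concatenation step can be illustrated with a toy sketch. It is not the authors' implementation: the ReLU feed‑forward form, the fixed per‑expert `scales` (a stand‑in for magnitude scaling), and all function names are assumptions made for illustration, and the scoring, grouping, and distillation stages are omitted.

```python
import random

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def relu(vec):
    return [max(0.0, v) for v in vec]

def expert_ffn(w_in, w_out, x):
    """Toy two-layer FFN: w_in is hidden x d_model, w_out is d_model x hidden."""
    return matvec(w_out, relu(matvec(w_in, x)))

def make_expert(rng, d_model, hidden):
    """Random toy expert weights (hypothetical helper for the demo)."""
    w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d_model)]
    return w_in, w_out

def concat_experts(experts, scales):
    """Fuse already-selected experts into one wider dense FFN.

    Input projections are stacked row-wise. Output projections are joined
    column-wise, with each expert's block multiplied by a fixed scale
    (an assumed stand-in for the router weight it had in the MoE).
    """
    dense_in = [row for w_in, _ in experts for row in w_in]
    d_model = len(experts[0][1])
    dense_out = [
        [s * val for (_, w_out), s in zip(experts, scales) for val in w_out[r]]
        for r in range(d_model)
    ]
    return dense_in, dense_out

rng = random.Random(0)
experts = [make_expert(rng, d_model=4, hidden=3) for _ in range(2)]
dense_in, dense_out = concat_experts(experts, scales=[0.7, 0.3])
print(len(dense_in), len(dense_out[0]))  # hidden width grows to 2 * 3 = 6
```

With fixed scales, the fused layer reproduces the scaled sum of the selected experts exactly. A real router weights experts per token, which is one reason a distillation pass from the MoE teacher is still needed after concatenation.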

Under a controlled comparison at matched parameter count, the MoE‑to‑dense approach beats dense‑to‑dense pruning by 6.3 percentage points in average downstream accuracy after distillation on about 4 billion tokens, and it trains 1.6× faster in wall‑clock time. Because an MoE model typically requires all expert parameters to be resident in memory, converting it to a single dense student can simplify serving when memory is tight. 2

SigmaScale: learned scaling improves SVD‑based low‑rank compression

SigmaScale compresses Large Language Models (LLMs) by learning auxiliary scaling matrices that enhance truncated Singular Value Decomposition (SVD) low‑rank decompositions. It optimizes two sets of vectors that define diagonal row and column scalings under an activation‑aware compression loss. The authors report that the learned scalings lower effective‑rank entropy and that this reduction correlates strongly with the compression loss. On Llama 3.1 8B Instruct and Qwen3‑8B, SigmaScale is competitive with similar SVD‑based methods across perplexity and zero‑shot benchmarks. 3
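One plausible way to write down the idea is sketched below. The notation is an assumption for illustration and is not taken from the paper: W is an m × n weight matrix, X holds calibration activations, r and c are the learned positive scaling vectors, k is the target rank, and the entropy uses a common definition of effective rank based on normalized singular values.

```latex
% Scaled truncated SVD: scale, truncate to rank k, then undo the scaling.
\[
\widehat{W}(r, c) = D_r^{-1}\,\operatorname{SVD}_k\!\left(D_r W D_c\right) D_c^{-1},
\qquad D_r = \operatorname{diag}(r),\quad D_c = \operatorname{diag}(c)
\]
% Activation-aware compression loss over the two scaling vectors.
\[
\min_{r,\,c}\; \bigl\lVert \bigl(W - \widehat{W}(r, c)\bigr) X \bigr\rVert_F^2
\]
% Effective-rank entropy of the scaled matrix, with singular values
% \sigma_i of D_r W D_c.
\[
p_i = \frac{\sigma_i}{\sum_j \sigma_j},
\qquad H = -\sum_i p_i \log p_i
\]
```

Under this reading, a lower H means the singular values of the scaled matrix are concentrated in fewer directions, so a rank‑k truncation discards less of what the activations actually exercise.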

The takeaway for practitioners: learned, activation‑aware transformations adapt compression to each weight matrix’s structure, offering a more flexible route to low‑rank LLM compression. The results suggest SigmaScale is a viable option when lowering inference compute cost is a priority. 3

Why It Matters

Tools that squeeze more work from the same hardware reduce deployment risk and cost. vLLM’s CPU‑side acceleration widens hardware options, while the two compression studies show credible ways to keep accuracy high even as models get smaller or simpler. 1

For teams balancing latency, memory, and budget, the two compression approaches complement inference engines like vLLM: converting MoE teachers into dense students and adding learned scaling to low‑rank decompositions reduce model footprint, while the engine raises serving efficiency. 2

Sources 3
