AI NewsResearch

5 min read 5/22/2026

audio generationdiffusion LLMMixture-of-Expertstime-series forecastinglocal inferenceopen weights

Stable Audio 3 generates minutes of music in seconds, releases smaller weights

The new diffusion-based system adds variable-length generation and targeted inpainting, trained on licensed and Creative Commons data, and runs on consumer hardware. The team reports under-2-second outputs on an H200 and a few seconds on a MacBook Pro M4.

Find in this article

Reading Mode

One-Line Summary

Minutes-long audio generated in seconds, faster diffusion-style text models with expert routing (up to 1.5x), and time-series models scaling to 2.5B parameters with open weights.

Research Papers

Stable Audio 3 brings near‑instant music generation and editing

Stable Audio 3 is a family of AI models that generate and edit music and sound effects at variable lengths, including continuing short clips, with near‑instant results on mainstream hardware. The authors report producing several minutes of audio in under 2 seconds on an H200 Graphics Processing Unit (GPU) and in less than a few seconds on a MacBook Pro M4. The small and medium model weights are released, along with training and inference pipelines, and the models are trained on licensed and Creative Commons data. ¹

Technically, the models are fast latent diffusion systems built on a new semantic‑acoustic autoencoder that compresses audio into a compact latent space while preserving fidelity and semantic structure. They also support inpainting for targeted edits, and use adversarial post‑training to both improve quality and reduce the number of inference steps — a combination aimed at better prompt adherence without slowing generation. ¹

Variable‑length generation avoids computing full tracks when you only need short sounds, which cuts costs for practical workflows like sound effects. Releasing the small and medium weights means creators and tool builders can run them on consumer‑grade machines, positioning Stable Audio 3 as a hands‑on option for local music and SFX pipelines. ¹

TIDE speeds diffusion MoE LLM inference without retraining

TIDE is a resource‑efficient way to run certain text models faster on limited hardware by reusing which “experts” are loaded instead of swapping them every step. It targets diffusion Large Language Models (dLLMs) with Mixture of Experts (MoE) architectures and, in a single GPU–CPU (Graphics Processing Unit–Central Processing Unit) system, shows up to 1.4x and 1.5x throughput gains over prior baselines on LLaDA 2.0‑mini and LLaDA 2.0‑flash, respectively — all without any model training. ²

Diffusion LLMs are an alternative to autoregressive (AR) models, decoding blocks in parallel with bidirectional context. TIDE exploits the temporal stability of expert activations within a block and introduces an interval‑based refresh strategy that updates expert placement in an input/output (I/O)‑aware fashion. The authors formulate scheduling as a mathematical program to pick an interval that minimizes I/O traffic and CPU compute, yielding “free‑lunch” acceleration for MoE diffusion inference. ²

Toto 2.0 scales time‑series forecasting from 4M to 2.5B parameters

Toto 2.0 is a family of five open‑weights time‑series forecasting models trained with a single recipe that delivers reliable quality improvements as size increases from 4 million to 2.5 billion parameters. ³

The models set new state of the art on three benchmarks — BOOM, GIFT‑Eval, and TIME — and the paper details architecture, data, and a u‑muP hyperparameter transfer pipeline. All five base checkpoints are released under the Apache 2.0 license, making them accessible for applied forecasting work. ³

Open Source & Repos

LocalAI aims to run many models locally without a GPU

LocalAI is an open‑source engine for running language, vision, voice, image, and video models on local machines, even without a Graphics Processing Unit (GPU), under the MIT license. It positions itself as “run any model on any hardware,” emphasizing offline and private workflows. ⁴

The project shows active maintenance with a v4.2.6 release dated 2026‑05‑16 that includes integration updates (e.g., llama‑cpp) and documentation changes. For teams experimenting with local deployments, it’s a single place to start testing multiple modalities without relying on cloud services. ⁴

Community Pulse

Hacker News (97↑) — Mixed: praise for speed and smaller models, but pushback on “General MIDI” sound and genre limits, with debate over production readiness. ⁵

"It is insanely fast. Less than 2 seconds for 120 seconds of audio in my 3090. It sounds too much like general midi. It is better for electronica than for any other genre. Impressive nonetheless" — Hacker News ⁵

Why It Matters

Today’s drops point to a practical shift: creative audio is getting fast and local, diffusion‑style text models can be made more efficient without retraining, and time‑series forecasting now comes with open, scalable baselines. Terms to know: variable‑length generation (compute only what you need), inpainting (targeted edits), diffusion Large Language Models (parallel, bidirectional decoding), and Mixture of Experts (routing work to specialized sub‑networks). ¹

Sources 5

[1] Arxiv Stable Audio 3 [2] Arxiv Toto 2.0: Time Series Forecasting Enters the Scaling Era [3] Arxiv TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload [4] Github mudler/LocalAI [5] Ycombinator Hacker News discussion: Stable Audio 3

Helpful?

0to1log Weekly

Latest AI News