Fine-tuning can make LLMs hallucinate more — new study shows how to curb it
Researchers tie fine-tuning–induced hallucinations to interference with a model’s existing knowledge and propose a self‑distillation recipe to stabilize outputs. Meanwhile, HyLo extends context up to 32× and Nvidia’s Nemotron 3 Nano Omni claims 9× higher multimodal throughput.
One-Line Summary
Training tweaks aim to cut LLM hallucinations while new architectures push longer context and faster multimodal agents.
LLM & SOTA Models
Nvidia debuts Nemotron 3 Nano Omni, a faster unified multimodal model
Nvidia releases Nemotron 3 Nano Omni, a single model that sees, hears and reads — unifying vision, audio and language to speed up agent workflows without juggling separate perception models. Nvidia says it reaches up to 9× higher throughput than other open “omni” models with similar interactivity and leads on document, video and audio understanding leaderboards; companies such as Palantir, Foxconn and Dell are testing or adopting it. 1
Under the hood, Nemotron 3 Nano Omni integrates vision and audio encoders into a 30B‑A3B hybrid mixture‑of‑experts design, removing cross‑model handoffs and reducing latency while maintaining strong multimodal perception for tasks like computer use and document intelligence. 1
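To make the A3B idea concrete, here is a minimal, generic sketch of top‑k mixture‑of‑experts routing, the mechanism by which only a few experts (and hence a small fraction of total parameters) run per token. The layer sizes, expert count, and k below are illustrative, not Nemotron’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        weights, idx = self.gate(x).topk(self.k, dim=-1)   # pick k experts per token
        weights = F.softmax(weights, dim=-1)               # renormalize their scores
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if tok.numel():
                out[tok] += weights[tok, slot, None] * expert(x[tok])
        return out

x = torch.randn(4, 512)
print(TopKMoE()(x).shape)  # torch.Size([4, 512]); only k of n_experts run per token
```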
A practitioner review characterizes it as a 30B‑parameter MoE with roughly 3B active parameters per token and notes an OpenAI‑compatible API, helping serve more concurrent users on the same GPU and easing integration into existing apps. 2
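Because the API is reported to be OpenAI‑compatible, integration can look like any other chat‑completions call. The base URL and model id below are placeholders for wherever you serve the model, not confirmed values.

```python
from openai import OpenAI

# base_url points at your own serving endpoint (placeholder, not an official
# address); the model id is likewise a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="nemotron-3-nano-omni",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize this invoice in one line."}],
)
print(resp.choices[0].message.content)
```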
Research Papers
Why fine-tuning increases hallucinations — and how to curb it
This paper shows that supervised fine‑tuning (SFT) can increase hallucinations relative to what a model learned during pretraining, because learning new facts can degrade older knowledge. The authors propose a self‑distillation‑based SFT method that regularizes output‑distribution drift so models absorb new factual information while preserving prior knowledge. 3
They examine three explanations — capacity limits, behavior cloning, and localized interference — and find the main driver is interference among overlapping semantic representations; self‑distillation mitigates this interference. 3
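In spirit, a self‑distillation regularizer pairs the usual SFT cross‑entropy with a KL term that penalizes drift from a frozen pre‑fine‑tuning copy of the model. The sketch below shows that general idea, not the paper’s exact recipe; the weight `lam` is an assumed knob.

```python
import copy
import torch
import torch.nn.functional as F

def self_distill_sft_loss(student, teacher, input_ids, labels, lam=0.5):
    s_logits = student(input_ids).logits      # fine-tuned model, trainable
    with torch.no_grad():
        t_logits = teacher(input_ids).logits  # frozen pre-SFT copy of the model
    # Standard SFT objective on the new factual data.
    ce = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    # KL(teacher || student): penalize output-distribution drift away from what
    # the model knew before fine-tuning; lam trades plasticity against drift.
    kl = F.kl_div(F.log_softmax(s_logits, dim=-1),
                  F.log_softmax(t_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return ce + lam * kl

# teacher = copy.deepcopy(model).eval().requires_grad_(False)
```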
In settings where acquiring new facts is unnecessary, the study reports that suppressing factual plasticity by freezing parameter groups preserves task performance while reducing hallucinations. 3
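A minimal sketch of what freezing a parameter group looks like in PyTorch. Which group to freeze is an assumption here, since the summary doesn’t specify the paper’s grouping; MLP/FFN weights are often implicated in factual storage, so the sketch targets those.

```python
def freeze_mlp_blocks(model):
    """Freeze FFN/MLP weights so SFT cannot overwrite them."""
    frozen = 0
    for name, param in model.named_parameters():
        if ".mlp." in name:          # common Hugging Face naming; adjust per model
            param.requires_grad_(False)
            frozen += param.numel()
    return frozen

# Build the optimizer over trainable parameters only:
# optim = torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)
```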
For production teams, the broader lesson aligns with industry guidance: reliability comes from system design, not just prompts — with retrieval grounding, confidence/abstention, and citation verification as first‑class pipeline components to contain and audit errors. 4
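As an illustration of that pipeline framing, the sketch below gates an answer on a confidence threshold and verifies claimed citations against the retrieved evidence. Every name here is a hypothetical placeholder, not a real library API.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float      # e.g. from logprobs or a separate verifier model
    cited_ids: list[str]   # source ids the model claims to rely on

def gate_answer(ans: Answer, evidence_ids: set[str], threshold: float = 0.7) -> str:
    if ans.confidence < threshold:
        return "I'm not confident enough to answer."               # abstention
    if not set(ans.cited_ids) <= evidence_ids:
        return "Answer withheld: a citation failed verification."  # audit hook
    return ans.text
```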
Sapiens2 pushes human-centric vision to 4K with stronger pretraining
Sapiens2 is a family of high‑resolution transformers for human‑centric vision that targets pose estimation, body‑part segmentation, surface normals, pointmaps, and albedo — trained on a curated dataset of 1 billion human images across 0.4B–5B parameter sizes, with native 1K resolution and hierarchical 4K variants. Its unified pretraining combines masked image reconstruction with self‑distilled contrastive objectives, and 4K models use windowed attention to reason over longer spatial context. 5
On benchmarks, Sapiens2 improves over its predecessor by +4 mAP for pose, +24.3 mIoU for body‑part segmentation, and 45.6% lower angular error for normals; for example, the 5B model reports 82.3 mAP on an 11K in‑the‑wild pose test set and 82.5 mIoU for segmentation, while even the 0.4B model reaches 79.5 mIoU. 6
Architecturally, Sapiens2 incorporates training‑stability and throughput advances such as RMSNorm, grouped‑query attention, QK‑Norm, and SwiGLU, and releases code for practitioners to adapt. 5
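For readers unfamiliar with two of those components, here are minimal sketches of RMSNorm and of QK‑Norm, which normalizes queries and keys before the dot product to keep attention logits stable at scale. Shapes and epsilon are illustrative defaults, not Sapiens2’s exact settings.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal root mean square: no mean subtraction, no bias.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def qk_norm_logits(q, k, q_norm: RMSNorm, k_norm: RMSNorm, scale: float):
    # QK-Norm: normalize queries and keys before the dot product so attention
    # logits cannot blow up as activations grow during training.
    return (q_norm(q) @ k_norm(k).transpose(-2, -1)) * scale
```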
HyLo upcycles Transformers for 32× longer context and 90% leaner caches
HyLo offers a practical recipe to convert existing Transformer LLMs into hybrid long‑context models while preserving short‑context quality. It reports up to 32× context extension via efficient post‑training and more than 90% KV‑cache memory reduction, enabling up to 2M‑token prefill and decoding in a vLLM stack where comparable Llama baselines run out of memory beyond 64K. 7
The approach combines efficient Transformer blocks, Multi‑Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), plus staged long‑context training and teacher‑guided distillation. At similar scale, HyLo‑Qwen‑1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, LM Harness commonsense reasoning, and RULER‑64K. 7
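The intuition behind MLA’s cache savings: rather than storing full per‑head keys and values, cache one small latent per token and re‑expand it to K/V at attention time. The sketch below is a simplified illustration with made‑up dimensions; it omits MLA’s RoPE handling and HyLo’s specific configuration.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    def __init__(self, d_model=2048, d_latent=256, n_heads=16, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to V

    def forward(self, h, cache):
        latent = self.down(h)                      # (batch, new_tokens, d_latent)
        cache = torch.cat([cache, latent], dim=1)  # only d_latent floats per token
        k, v = self.up_k(cache), self.up_v(cache)  # K/V rebuilt on the fly
        return k, v, cache

# At these illustrative sizes the cache shrinks from 2 * 16 * 128 = 4096 floats
# per token (full K and V) to 256, a ~94% reduction.
```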
Why the 90% cache cut matters: long‑context inference is often memory‑bound because the key‑value (KV) cache grows linearly with sequence length; for large models, KV memory can reach tens to hundreds of gigabytes at 100K tokens, making long contexts impractical without head‑sharing, compression, or architectural changes. 8
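A quick back‑of‑envelope check of that claim, using illustrative Llama‑70B‑like shapes (80 layers, 8 grouped‑query KV heads, head dim 128, fp16):

```python
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2     # fp16, grouped-query attn
per_token = 2 * layers * kv_heads * head_dim * bytes_per  # factor 2: K and V
print(per_token)                     # 327,680 bytes, about 0.31 MiB per token
print(per_token * 100_000 / 2**30)   # about 30.5 GiB at 100K tokens, per sequence
```

Without grouped-query attention (64 KV heads instead of 8), the same arithmetic lands near 244 GiB, which is where the "hundreds of gigabytes" figure comes from and why head sharing, compression, or architectural changes become necessary.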
HyLo lands alongside other attempts to relax autoregressive trade‑offs, like emerging diffusion‑style LLMs that draft and iteratively refine blocks of tokens to tune quality vs. latency at runtime — a complementary direction that points to more flexible decoding and memory footprints. 9
SIREN-RoPE: making the rotation in RoPE learnable and time-aware
SIREN‑RoPE revisits rotary positional embeddings by treating the rotation space as learnable and signal‑conditioned, injecting timestamps, cyclical temporal patterns, and categorical metadata so attention can encode how tokens relate across time and context. 10
Using a dual‑branch sinusoidal representation network, the authors report consistent gains on a production‑scale news‑feed dataset for a generative recommender, improving calibration and ranking objectives with negligible computational overhead. 10
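A hedged sketch of the core idea as described, not the paper’s exact architecture: a small sine‑activated (SIREN‑style) network maps timestamps to rotation angles, which then replace RoPE’s fixed position‑times‑frequency angles. The layer widths and `w0` frequency scale are assumed defaults.

```python
import torch
import torch.nn as nn

class SirenAngles(nn.Module):
    """Map timestamps to per-pair rotation angles via a sine-activated MLP."""
    def __init__(self, n_pairs=32, hidden=64, w0=30.0):
        super().__init__()
        self.l1, self.l2, self.w0 = nn.Linear(1, hidden), nn.Linear(hidden, n_pairs), w0

    def forward(self, t):  # t: (seq,) timestamps
        return self.l2(torch.sin(self.w0 * self.l1(t.unsqueeze(-1))))  # (seq, n_pairs)

def rotate(x, angles):
    # Apply RoPE-style rotation to consecutive feature pairs of x: (seq, 2*n_pairs).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

t = torch.linspace(0, 1, 16)          # normalized event timestamps
q = torch.randn(16, 64)               # queries for 16 tokens
q_rot = rotate(q, SirenAngles()(t))   # rotation now depends on time, not just index
```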
This fits a broader pattern in time‑series Transformers: temporal structure and continuous values require specialized embeddings and tokenization beyond language defaults, from patching to explicit temporal encodings. 11
Community Pulse
Hacker News (360↑) — Mixed views on theory’s usefulness: commenters debate asymptotic analyses versus practical sample complexity and real training advances. 12
"It's very much necessary but not sufficient. In real life the sample complexity matters a lot too, which is also asymptotics, but a more important one. E.g. how the central limit theorem is far more powerful than the law of large numbers." — Hacker News 12
"I don't think that this is true. You need an infinite number of dimensions for this (think Taylor's expansion, Fourier expansion, infinitely wide or deep NNs..)" — Hacker News 12
Why It Matters
Today’s updates converge on reliability and efficiency: a training recipe to curb hallucinations without erasing prior knowledge, a pathway to stretch context windows while shrinking memory, and a multimodal model that unifies perception to cut latency. Together, they point toward LLM systems that are easier to trust and cheaper to run. 3
For practitioners, the immediate takeaways are pragmatic: treat reliability as a pipeline problem (grounding and verification), watch long‑context memory budgets as a first‑order constraint, and consider unified multimodal perception when building agent workflows. 8