Long-context training speeds up with a training-only attention wrapper
Lighthouse Attention compresses sequences around standard attention during pretraining, then removes itself after a short recovery phase. New papers also stress-test table understanding, speed up Mixture-of-Experts routing, and replay real news to grade adaptive agents, while a Kubernetes inference stack ships a breaking upgrade.
One-Line Summary
New research speeds up long-context training and MoE inference while stress-testing models on messy tables and adaptive forecasting, and a Kubernetes inference stack ships a breaking upgrade.
Research Papers
Lighthouse Attention speeds up long-context pretraining with a removable wrapper
This paper introduces Lighthouse Attention, a training-time wrapper around standard scaled dot-product attention (SDPA) that compresses very long token sequences so long-context transformers can be trained faster, then removes itself near the end of training to recover a full-attention model. The method targets SDPA's quadratic time and memory bottlenecks by applying hierarchical selection and pooling during training only. 1
The approach is symmetric and selection-based: it pools queries, keys, and values together while preserving left-to-right causality, which improves parallelism. It adds subquadratic pre- and post-processing for adaptive compression and decompression, and uses gradient-free selection to avoid a complex backward-pass kernel. 1
Training proceeds in two stages: most pretraining uses Lighthouse Attention, followed by a short recovery phase on full attention so the final model carries no extra inference-time cost. Preliminary small-scale large language model (LLM) experiments show faster total training time and lower final loss than full-attention training under matched settings. 1
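The paper's exact selection and pooling rules aren't spelled out above, so the PyTorch sketch below only illustrates the general shape of the idea: mean-pool queries, keys, and values over fixed-size blocks, run causal SDPA on the shorter sequence, and expand the output back to full length. The function name, the block size, and the use of plain mean pooling (rather than the paper's gradient-free selection) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pooled_causal_attention(q, k, v, block: int = 4):
    """Training-only sketch: compress Q/K/V by mean-pooling fixed blocks,
    attend causally over the compressed sequence, then broadcast the output
    back to the original length ("decompression")."""
    B, H, T, D = q.shape
    assert T % block == 0, "pad the sequence so it divides evenly into blocks"

    def pool(x):  # (B, H, T, D) -> (B, H, T // block, D)
        return x.view(B, H, T // block, block, D).mean(dim=3)

    out_c = F.scaled_dot_product_attention(pool(q), pool(k), pool(v), is_causal=True)
    return out_c.repeat_interleave(block, dim=2)

# Toy usage: batch 2, 4 heads, 64 tokens, 32-dim heads.
x = torch.randn(2, 4, 64, 32)
print(pooled_causal_attention(x, x.clone(), x.clone()).shape)  # torch.Size([2, 4, 64, 32])
```

Because attention runs over T / block positions, the quadratic term shrinks by roughly a factor of block squared, which is the kind of saving the recovery phase later trades away to restore exact full attention.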
WildTableBench tests table understanding on messy real images
WildTableBench is a question-answering benchmark built from naturally occurring table images from forums and websites, designed to reflect real-world layouts and domains that demand structural perception and numerical reasoning. It contains 402 table images and 928 verified questions spanning 17 subtypes across five categories, and evaluates 21 proprietary and open-source multimodal foundation models. 2
Results show only one model exceeds 50% accuracy; the rest range from 4.1% to 49.9%. The authors diagnose persistent weaknesses in how models perceive structure and reason with numbers, positioning WildTableBench as a diagnostic tool for consumer and enterprise use cases like receipts, statements, and reports. 2
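Neither the benchmark's file format nor its scoring script is described above; as a hypothetical illustration of what a minimal harness might look like, the snippet below scores (image, question, answer) records by normalized exact match. The field names and the `answer_question` callable are placeholders, not WildTableBench's actual API.

```python
import json

def evaluate(records_path: str, answer_question) -> float:
    """Exact-match accuracy over JSONL records with hypothetical fields
    'image', 'question', 'answer'; answer_question(image, question)
    stands in for whichever multimodal model is under test."""
    correct = total = 0
    with open(records_path) as f:
        for line in f:
            rec = json.loads(line)
            pred = answer_question(rec["image"], rec["question"])
            correct += pred.strip().lower() == rec["answer"].strip().lower()
            total += 1
    return correct / max(total, 1)
```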
BEAM speeds MoE inference with binary expert masks
BEAM (Binary Expert Activation Masking) improves Mixture of Experts (MoE) efficiency by learning token-adaptive binary masks that decide which experts to activate, instead of relying on fixed Top-K routing. Using a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic sparsity end-to-end and integrates with the vLLM inference framework without architectural overhauls. 3
Experiments report retaining over 98% of the original model’s performance while cutting MoE-layer floating-point operations (FLOPs) by up to 85%, delivering up to 2.5x faster decoding and 1.4x higher throughput. The authors present it as a practical, plug-and-play path to faster MoE inference. 3
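The paper's gating network and loss weights aren't given above, so the module below is only a minimal sketch of the mechanism as described: a per-token binary mask over experts produced with a straight-through estimator, plus an auxiliary term that encourages sparsity. The class name, the sigmoid gate, and the 0.5 threshold are assumptions, not BEAM's released code.

```python
import torch
import torch.nn as nn

class BinaryExpertMask(nn.Module):
    """Token-adaptive binary expert masking with a straight-through estimator."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (tokens, d_model)
        probs = torch.sigmoid(self.gate(x))    # per-expert activation probability
        hard = (probs > 0.5).float()           # binary mask used in the forward pass
        # Straight-through: forward value is `hard`, gradient flows through `probs`.
        mask = hard + probs - probs.detach()
        aux_loss = probs.mean()                # auxiliary term pushing toward fewer active experts
        return mask, aux_loss

# Toy usage: 8 tokens, hidden size 16, 4 experts.
mask, aux = BinaryExpertMask(16, 4)(torch.randn(8, 16))
print(mask.shape, mask.sum(dim=-1))            # how many experts each token activates
```

At inference time only the experts with a 1 in the mask need to run for a given token, which is where the reported FLOP and decoding-speed savings would come from.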
FutureSim replays real news to grade adaptive agents
FutureSim evaluates how agents adapt to new information by replaying real-world events that occurred after the models' knowledge cutoff, in chronological order: agents interact with news articles and answer questions as events unfold from January to March 2026. The benchmark runs frontier agents in their native harness and measures forecasting ability in a realistic, time-ordered setting. 4
Findings indicate a clear spread in capability: the best agent reaches 25% accuracy, while many achieve a worse Brier skill score than making no prediction at all. The controlled replay enables study of long-horizon test-time adaptation, search, memory, and uncertainty reasoning. 4
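The benchmark's harness and scoring details aren't given above; the sketch below only illustrates the kind of time-ordered replay and Brier-skill scoring the summary describes. The field names, the `agent_forecast` callable, and the constant 0.5 baseline (standing in for "making no prediction") are assumptions for illustration.

```python
from datetime import datetime

def replay_and_score(events, agent_forecast, baseline_prob: float = 0.5):
    """Replay binary-outcome questions in time order and return a Brier skill
    score versus a constant-probability baseline (> 0 means the agent helps).
    `events` is a list of dicts with hypothetical fields: date, question, outcome."""
    events = sorted(events, key=lambda e: datetime.fromisoformat(e["date"]))
    agent_bs = base_bs = 0.0
    for ev in events:
        p = agent_forecast(ev["question"])   # the agent only sees news released so far
        y = float(ev["outcome"])             # 1.0 if the event happened, else 0.0
        agent_bs += (p - y) ** 2
        base_bs += (baseline_prob - y) ** 2
    return 1.0 - agent_bs / base_bs
```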
Open Source & Repos
llm-d 0.7 ships CUDA 13 images, targets SOTA inference on Kubernetes
llm-d is a distributed inference serving stack for production deployments on Kubernetes, aiming for state-of-the-art performance across accelerators. The project is open source under the Apache 2.0 license, with v0.7.0 now available. 5
Release 0.7.0 introduces a breaking change: all CUDA images move to 13.0.2, which requires NVIDIA driver 580 or later on the host. Nodes with older drivers must be upgraded before deploying v0.7.0 images—an important operational note for production clusters. 5
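As a quick pre-deployment check, the snippet below reads each GPU's driver version through nvidia-smi's standard query flags and compares the major version against the 580 floor from the release notes. The helper name and the exact comparison are illustrative; consult the v0.7.0 release notes for the authoritative requirement.

```python
import subprocess

def driver_ready_for_cuda13(min_major: int = 580) -> bool:
    """Return True if every GPU on this node reports an NVIDIA driver >= min_major."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # One line per GPU, e.g. "580.65.06"; compare only the major component.
    return all(int(line.split(".")[0]) >= min_major for line in out.strip().splitlines())

print("ok to pull llm-d v0.7.0 images" if driver_ready_for_cuda13() else "upgrade NVIDIA driver to 580+")
```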
Why It Matters
Advances like Lighthouse Attention and BEAM target the two cost centers of modern AI—training and inference—by reducing compute at long sequence lengths and activating fewer experts per token, respectively. These approaches point to efficiency gains without permanent architectural penalties at inference time. 1
At the same time, reality-first evaluations reveal gaps: only one model clears 50% on real-world table images, and adaptive forecasting under time-ordered news remains challenging. Together, they signal that efficiency must be matched with robustness on messy, evolving tasks. 2
This Week to Try
- llm-d quickstart on a test Kubernetes node: review v0.7.0 notes and ensure NVIDIA driver 580+ before pulling images. https://github.com/llm-d/llm-d
- Skim WildTableBench’s paper figures to see why “clean table” demos overstate reality: look at examples and error cases. https://arxiv.org/abs/2605.01018