ConvexTok designs near‑optimal tokenizers with certified 1% gap
A convex-optimization tokenizer replaces greedy rules with a global objective, improving bits-per-byte for language models and certifying how close the vocabulary is to optimal. Plus: live music diffusion on consumer laptops, AI’s forecasting limits, promptable 3D animals, and an incremental engine for always‑fresh agent context.
One-Line Summary
Today’s papers tighten core AI plumbing — near‑optimal tokenization, live diffusion for music, and incremental indexing — while a new benchmark shows models still struggle to forecast scientific progress.
Research Papers
ConvexTok builds tokenizers with global optimization and certificates
This paper proposes ConvexTok, a method to build tokenizer vocabularies by solving a single optimization problem rather than stitching subwords greedily. In plain terms, it chooses the whole set of tokens jointly to compress text best, rather than committing to one locally best step at a time. Compared with byte pair encoding (BPE) and Unigram, it improves intrinsic tokenization metrics and lowers bits‑per‑byte (BpB) when used by language models. 1
Technically, ConvexTok formulates tokenizer construction as a linear program (LP) and solves it with convex optimization. A notable twist is a certificate: a lower bound on the best achievable objective value, which shows how far a given tokenizer is from optimal under the chosen objective. In experiments the gap is within 1% at common vocabulary sizes. 1
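To make the certificate idea concrete, here is a minimal sketch of how a lower bound turns into a gap figure. It assumes a minimization objective (for example, the total number of tokens needed to encode a corpus). The function name `relative_gap` and the numbers are hypothetical and are not taken from the paper.

```python
def relative_gap(achieved: float, lower_bound: float) -> float:
    """Relative optimality gap for a minimization objective.

    `achieved` is the objective value of the tokenizer actually built.
    `lower_bound` is a certified value that no tokenizer can beat.
    """
    if lower_bound <= 0:
        raise ValueError("lower_bound must be positive")
    if achieved < lower_bound:
        raise ValueError("achieved value cannot be below a valid lower bound")
    return (achieved - lower_bound) / lower_bound


# Hypothetical numbers, for illustration only.
gap = relative_gap(achieved=1_008_000, lower_bound=1_000_000)
print(f"certified gap: {gap:.2%}")
```

The point of the bound is that the gap can be reported without ever knowing the true optimum: the real distance to optimal can only be smaller than the certified one.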
Why it matters: tokenization affects speed, memory, and how much text fits in a context window. Lower BpB means the model spends fewer bits to predict each byte of text, and because the measure is normalized by bytes rather than tokens it can be compared across tokenizers. The paper reports some downstream gains, but not consistently. If you maintain production models, the auditability and near‑optimality bounds are the practical takeaway; watch for replications on non‑English corpora and for how ConvexTok behaves at the vocabulary sizes modern models use. 1
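For readers who want to see what BpB measures, the following is a minimal sketch using the standard definition: total negative log-likelihood, converted to bits, divided by the number of UTF-8 bytes. The function name and the per-token losses are hypothetical; this is not the paper's evaluation code.

```python
import math


def bits_per_byte(token_nll_nats, text: str) -> float:
    """Total negative log-likelihood in bits, divided by UTF-8 byte count.

    `token_nll_nats` holds one loss value per token, in nats, as a typical
    cross-entropy loss would report it.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# Hypothetical losses for the same text under two tokenizers.
text = "hello world"
coarse = bits_per_byte([2.1, 1.4, 2.6], text)            # 3 tokens
fine = bits_per_byte([1.2, 0.9, 1.1, 1.0, 1.3, 1.4], text)  # 6 tokens
print(f"coarse: {coarse:.3f} BpB, fine: {fine:.3f} BpB")
```

Dividing by bytes rather than tokens is what allows a fair comparison: the two tokenizers above split the text into different numbers of tokens, yet both scores refer to the same 11 bytes.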
Live music diffusion runs interactively with KV caching and ARC-Forcing
This work adapts diffusion‑based music generators to live performance, keeping latency low enough to “jam” with musicians on a consumer gaming laptop. The authors propose Live Music Diffusion Models (LMDMs), which modify the generative process so streaming is possible and efficient. 2
The key engineering ideas are block‑wise key‑value (KV) caching to match — and then beat — the inference complexity of discrete autoregressive (AR) Live Music Models (LMMs), and a post‑training alignment method called ARC‑Forcing that reduces error accumulation without reinforcement learning (RL) or reward models. The paper demonstrates text‑conditioned generation, sketch‑based synthesis, and a “generative delay” effect used in a real artist–AI collaboration. 2
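The caching pattern behind the first idea can be sketched generically. The toy below is not the paper's architecture: it skips the diffusion steps entirely, uses a single attention head, and takes a hypothetical `project_kv` function in place of learned projections. It only illustrates why streaming gets cheaper when keys and values of finished blocks are stored instead of recomputed.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(dim)
    ]


class BlockKVCache:
    """Keeps keys/values of finished blocks so each frame is projected once."""

    def __init__(self, project_kv):
        self.project_kv = project_kv  # hypothetical: frame -> (key, value)
        self.keys, self.values = [], []
        self.projections = 0  # how many frames were ever projected

    def step(self, block):
        """Process one new block of frames against all cached context."""
        new_keys, new_values = [], []
        for frame in block:
            key, value = self.project_kv(frame)
            self.projections += 1
            new_keys.append(key)
            new_values.append(value)
        # Frames attend to every earlier block plus their own block.
        keys = self.keys + new_keys
        values = self.values + new_values
        # Toy simplification: the frame itself serves as the query.
        outputs = [attend(frame, keys, values) for frame in block]
        # Commit the block so later steps reuse it instead of recomputing.
        self.keys, self.values = keys, values
        return outputs


cache = BlockKVCache(project_kv=lambda frame: (frame, frame))
stream = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
for block in stream:
    out = cache.step(block)
print(cache.projections, len(cache.keys))
```

Without the cache, every new block would require projecting the full history again, so the work per block would grow with the length of the performance.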
CUSP benchmark tests whether AI can forecast scientific breakthroughs
CUSP (Cutoff‑conditioned Unseen Scientific Progress) is a benchmark that asks AI systems to forecast research outcomes under controlled knowledge cutoffs. It evaluates feasibility judgments, mechanistic reasoning, solution design, and timing across 4,760 scientific events. 3
Results show systematic, domain‑dependent limits. Models can pick plausible directions but cannot reliably predict whether an advance will occur, and they consistently misjudge when it will; timing is more predictable in AI than in biology, chemistry, or physics. Performance looks similar for events before and after the training cutoff. Additional pre‑cutoff knowledge helps, but not enough, and gaps widen for high‑citation advances. Models also show overconfidence and response biases, which points to unreliable uncertainty estimates. 3
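Overconfidence of this kind can be checked with standard calibration measures. The sketch below uses the Brier score and a simple confidence-minus-accuracy gap for binary forecasts. Whether CUSP uses these exact metrics is not stated here, so treat the functions and the sample forecasts as hypothetical illustrations.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def overconfidence_gap(probs, outcomes):
    """Mean stated confidence minus accuracy; positive means overconfident."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    # Confidence is the probability assigned to the predicted class.
    confidence = sum(max(p, 1.0 - p) for p in probs) / len(probs)
    # A forecast counts as correct if its 0.5-thresholded call matches.
    correct = sum((p >= 0.5) == (o == 1) for p, o in zip(probs, outcomes))
    return confidence - correct / len(probs)


# Hypothetical forecasts: "will this advance occur before the deadline?"
probs = [0.9, 0.9, 0.8, 0.95]
outcomes = [1, 0, 0, 1]
print(brier_score(probs, outcomes), overconfidence_gap(probs, outcomes))
```

A forecaster that always answers 0.5 scores 0.25 on the Brier score, so a confident forecaster scoring worse than that is actively misleading rather than merely uninformative.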
Promptable multi-animal 3D from a single photo
SAM 3D Animal is a framework that reconstructs multiple animals in 3D from a single in‑the‑wild image, guided by prompts such as keypoints and masks to disambiguate crowded, occluded scenes. It builds on the SMAL+ parametric animal model and jointly reconstructs multiple instances. 4
To train it, the authors introduce Herd3D, a multi‑animal dataset with over 5,000 images spanning species, interactions, and occlusions. On the Animal3D, APTv2, and Animal Kingdom datasets, the method reports state‑of‑the‑art results over model‑based and model‑free baselines, highlighting a scalable, prompt‑driven approach for wild‑scene animal reconstruction. 4
Open Source & Repos
CocoIndex: Incremental engine for long-horizon agents
CocoIndex is an “incremental sync” indexing engine designed to keep AI agents’ context fresh by reprocessing only the delta — the changes — across sources like codebases, Slack, meeting notes, and docs. Think of it as plumbing for retrieval‑augmented generation (RAG), agent memory, and enterprise search that avoids full re‑ingestion on every update. 5
The repository highlights production‑oriented ingestion and streaming extract‑transform‑load (ETL). The release notes for the latest version, v1.0.6 (May 18), include an Elm tree‑sitter splitter and improved Rust→Python error context for app operations. It targets long‑horizon agents that need live indexing with minimal lag between edits and what the agent retrieves. 5
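The delta idea can be illustrated without any particular engine. The sketch below is not CocoIndex's API; `fingerprint` and `plan_sync` are hypothetical helpers that use content hashes to decide which documents need reprocessing, assuming each source document has a stable ID.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Stable content hash used to detect changes between syncs."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_sync(previous: dict, current_docs: dict):
    """Compare stored fingerprints with the current documents.

    `previous` maps doc ID -> fingerprint from the last sync.
    `current_docs` maps doc ID -> current text.
    Returns (ids to (re)process, ids to delete, new fingerprint state).
    """
    new_state = {doc_id: fingerprint(text) for doc_id, text in current_docs.items()}
    to_process = sorted(
        doc_id for doc_id, digest in new_state.items()
        if previous.get(doc_id) != digest
    )
    to_delete = sorted(doc_id for doc_id in previous if doc_id not in new_state)
    return to_process, to_delete, new_state


# First sync: nothing is known yet, so every document is processed.
docs = {"readme.md": "v1", "notes.txt": "hello"}
first_changed, first_removed, state = plan_sync({}, docs)

# Later: one edit, one deletion, one new file.
docs["readme.md"] = "v2"
del docs["notes.txt"]
docs["api.md"] = "new"
changed, removed, state = plan_sync(state, docs)
print(changed, removed)
```

Only the IDs in `changed` would go back through chunking and embedding, and the IDs in `removed` would be dropped from the index, which is what keeps the lag between an edit and retrieval small.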
Why It Matters
Backbone choices — tokens, streaming generation, indexing — shape real‑world AI costs and behavior. ConvexTok’s near‑optimal tokenization shows how better fundamentals can yield measurable efficiency and auditable quality. 1
At the same time, the CUSP benchmark cautions that forecasting science remains hard for current models, so teams should treat predictions and uncertainty estimates with care, especially outside AI. 3