LG's EXAONE 4.5 goes multimodal with long-context doc skills
LG AI Research releases EXAONE 4.5 with native vision-language training and a 256K context window tuned for document-heavy use, while NVIDIA's Nemotron 3 Super targets agent workloads with a hybrid Mamba-Transformer MoE. Two vision papers push open-world 3D detection and parameter-efficient generation.
One-Line Summary
Models and papers converge on long-context, document-centric multimodality and efficiency—making agents and vision systems faster, cheaper, and more capable.
LLM & SOTA Models
EXAONE 4.5 Technical Report
LG AI Research’s new EXAONE 4.5 is a vision-language model designed to read and reason over documents by training on images and text together. The team adds a dedicated visual encoder to the EXAONE 4.0 framework and pretrains natively across both modalities, with curation that emphasizes document-heavy corpora. The report highlights strong gains on document understanding and Korean contextual reasoning, alongside competitive general benchmarks. It also extends the context window to 256K tokens for long-context use in enterprise workflows. 1
The “document-first” data design is the practical difference: rather than chasing general image benchmarks, the training recipe is aligned to forms, tables, and rich layouts that trip up generic models. In the reported results, EXAONE 4.5 outperforms state-of-the-art peers of similar size on document tasks while remaining broadly capable across general language tests. That matters if you need one model to handle both business documents and everyday chat. 1
Why it matters for teams: long-context + doc-centric multimodality means fewer brittle OCR/RAG chains and more direct “read this pack of PDFs and summarize discrepancies” workflows. The open-weight release also makes evaluation and on-prem deployment easier for regulated environments. 1
Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning
NVIDIA’s Nemotron 3 Super is built to power long-running agents by combining a Mixture of Experts with Mamba and Transformer layers to keep throughput high and memory steady, even over million-token contexts. It is a 120B-parameter model with 12B parameters active per token, a native 1M-token context window, and open weights, datasets, and recipes for customization. On PinchBench, it scores 85.6%, positioning it as a leading open model for multi-agent “brain” workloads. 2
Architecturally, it introduces latent MoE (consult more experts for the same cost by compressing tokens), multi-token prediction (drafts several future tokens per pass for built-in speculative decoding), and a hybrid Mamba-Transformer backbone to balance sequence efficiency with precise recall. NVIDIA also spotlights NVFP4 pretraining for 4x memory and speed gains on Blackwell-class GPUs while maintaining accuracy. 2
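To make the "draft several tokens, then verify" idea concrete, here is a minimal sketch of greedy speculative decoding. It is not NVIDIA's implementation: `speculative_step`, `target_next`, `draft_next`, and `generate` are hypothetical names, and the two "models" are toy functions over integer tokens. The point it illustrates is that the accepted output always matches what the full model would have produced on its own, while several tokens can be committed per verification step.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], target_next: NextFn,
                     draft_next: NextFn, k: int) -> List[Token]:
    """Return the tokens accepted in one draft-then-verify step (greedy variant)."""
    # 1) Draft k tokens autoregressively with the cheap predictor.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2) Verify: keep drafted tokens while the full model agrees.
    #    A real system scores all k positions in one batched pass;
    #    this loop calls the target per position for clarity.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First disagreement: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft was accepted: the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models (assumptions for illustration): the target counts up by 1;
# the drafter agrees unless the previous token is a multiple of 4.
def target_next(seq: List[Token]) -> Token:
    return seq[-1] + 1


def draft_next(seq: List[Token]) -> Token:
    return seq[-1] + (2 if seq[-1] % 4 == 0 else 1)


def generate(prefix: List[Token], n_tokens: int, k: int) -> List[Token]:
    """Generate n_tokens new tokens using repeated speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, target_next, draft_next, k))
    return out[: len(prefix) + n_tokens]
```

Calling `generate([1], 8, 3)` commits four tokens per step in this toy setup, because every drafted token is accepted and the target contributes one more.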
For builders, the takeaway is cost and reliability at agent scale: multi-agent systems can emit up to 15x more tokens than chats, and Super targets that “thinking tax” with higher throughput and long-context stability—useful for codebase-spanning tasks, cybersecurity triage, and tool-heavy plans. 2
Research Papers
WildDet3D: Scaling Promptable 3D Detection in the Wild
This work aims to identify 3D objects from a single RGB image using whatever guidance you can give—text, clicks, or boxes—so it works beyond a fixed set of categories. The authors propose a geometry-aware detector that natively accepts text, point, and box prompts and can even use depth hints at inference for extra accuracy. They also build WildDet3D-Data, a large-scale dataset with over 1M images across 13.5K categories, verified by humans for real-world diversity. 3
On their new open-world benchmark (WildDet3D-Bench), the model reaches 22.6 AP3D with text prompts and 24.8 with box prompts; on Omni3D, the corresponding figures are 34.2 and 36.4. In zero-shot tests, it records 40.3 ODS on Argoverse 2 and 48.9 on ScanNet. A notable result: adding depth signals at inference yields an average gain of +20.7 AP across settings, showing that simple geometric cues can rescue hard cases in the wild. 3
For teams building AR, robotics, or mapping, promptable 3D detection reduces data bottlenecks: you can steer the detector with natural language or light user input, then bring in depth when sensors allow, rather than collecting narrow, category-specific 3D annotations. 3
360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries
If you’ve ever needed a system to know “exactly where this camera is” inside a known place, 360Loc provides a benchmark for that task on 360° imagery, with an emphasis on cross-device generalization. The dataset combines 360° reference maps with queries from pinhole, fisheye, and 360° cameras, plus a practical pipeline to generate ground-truth 6DoF poses using 360° camera–LiDAR capture. 4
The authors introduce a “virtual camera” method to crop lower-FoV images from 360° frames, enabling fair comparisons across camera types and improving feature matching and pose regression under cross-device domain gaps. Results show omnidirectional localization is more robust in symmetric or repetitive scenes and that virtual-camera augmentation boosts SOTA baselines. 4
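The geometry behind cropping a lower-FoV view from a 360° frame can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes an equirectangular panorama, an ideal pinhole model, and a camera frame with +z forward, x right, and y up. The function names `pinhole_to_equirect` and `virtual_camera_grid` are hypothetical.

```python
import math
from typing import List, Tuple


def pinhole_to_equirect(i: int, j: int, out_w: int, out_h: int,
                        fov_deg: float, yaw_deg: float, pitch_deg: float,
                        pano_w: int, pano_h: int) -> Tuple[float, float]:
    """Map pixel (i, j) of a virtual pinhole view to (u, v) in the panorama."""
    # Focal length in pixels, derived from the horizontal field of view.
    f = (out_w / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

    # Ray through the pixel centre; camera looks along +z, x right, y up.
    x = (i + 0.5) - out_w / 2.0
    y = out_h / 2.0 - (j + 0.5)
    z = f

    # Rotate the ray by pitch (about the x axis), then yaw (about the y axis).
    p, q = math.radians(pitch_deg), math.radians(yaw_deg)
    y, z = y * math.cos(p) + z * math.sin(p), -y * math.sin(p) + z * math.cos(p)
    x, z = x * math.cos(q) + z * math.sin(q), -x * math.sin(q) + z * math.cos(q)

    # Ray direction -> longitude/latitude -> panorama pixel coordinates.
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)                            # in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y / norm)))    # in [-pi/2, pi/2]
    u = (lon / (2.0 * math.pi) + 0.5) * pano_w
    v = (0.5 - lat / math.pi) * pano_h
    return u, v


def virtual_camera_grid(out_w: int, out_h: int, fov_deg: float,
                        yaw_deg: float, pitch_deg: float,
                        pano_w: int, pano_h: int) -> List[List[Tuple[float, float]]]:
    """Sampling grid (rows of (u, v)) for one virtual pinhole view."""
    return [[pinhole_to_equirect(i, j, out_w, out_h, fov_deg,
                                 yaw_deg, pitch_deg, pano_w, pano_h)
             for i in range(out_w)]
            for j in range(out_h)]
```

Sampling the panorama at the returned coordinates (for example with bilinear interpolation) gives the cropped view; varying yaw, pitch, and FoV produces many virtual cameras from one 360° frame.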
ELT: Elastic Looped Transformers for Visual Generation
ELT proposes reusing a small set of Transformer blocks over multiple loops (with shared weights) instead of stacking dozens of unique layers, cutting parameter counts while preserving image/video quality. A training trick called Intra-Loop Self Distillation keeps intermediate loops consistent with the deepest configuration, so one training run yields a family of “any-time” models that trade speed for quality on demand. 5
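A toy sketch can show why looping shared blocks cuts parameters and how intermediate loops give "any-time" outputs. Everything here is an assumption for illustration: `ToyBlock` is a plain residual linear map standing in for a Transformer block, and the mean-squared-error form of `self_distillation_loss` is a guess at the general shape of such an objective, not the paper's actual loss.

```python
import random
from typing import List, Optional

Vector = List[float]


class ToyBlock:
    """Stand-in for a Transformer block: residual update x + W x."""

    def __init__(self, dim: int, rng: random.Random) -> None:
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                  for _ in range(dim)]

    def num_params(self) -> int:
        return len(self.w) * len(self.w[0])

    def __call__(self, x: Vector) -> Vector:
        return [xi + sum(wij * xj for wij, xj in zip(row, x))
                for xi, row in zip(x, self.w)]


class LoopedModel:
    """A small set of blocks applied repeatedly with shared weights."""

    def __init__(self, dim: int, blocks_per_loop: int, max_loops: int,
                 seed: int = 0) -> None:
        rng = random.Random(seed)
        self.blocks = [ToyBlock(dim, rng) for _ in range(blocks_per_loop)]
        self.max_loops = max_loops

    def num_params(self) -> int:
        return sum(b.num_params() for b in self.blocks)

    def stacked_equivalent_params(self) -> int:
        # A non-looped stack of the same depth needs unique weights per layer.
        return self.num_params() * self.max_loops

    def forward(self, x: Vector, loops: Optional[int] = None) -> List[Vector]:
        """Return the state after each loop; fewer loops means an earlier exit."""
        n = self.max_loops if loops is None else loops
        states: List[Vector] = []
        for _ in range(n):
            for block in self.blocks:  # the same weights are reused every loop
                x = block(x)
            states.append(x)
        return states


def self_distillation_loss(states: List[Vector]) -> float:
    """Mean squared gap between each intermediate loop and the deepest one."""
    teacher = states[-1]
    gaps = [sum((s - t) ** 2 for s, t in zip(state, teacher)) / len(teacher)
            for state in states[:-1]]
    return sum(gaps) / len(gaps) if gaps else 0.0
```

With `LoopedModel(dim=8, blocks_per_loop=3, max_loops=4)`, the looped model holds a quarter of the parameters of an equally deep stack, and calling `forward(x, loops=2)` returns the same states as the first two loops of a full pass.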
Under equal inference compute, ELT reports a 4x reduction in parameters and still reaches FID 2.0 on class-conditional ImageNet 256×256 and FVD 72.8 on class-conditional UCF-101—competitive for its budget. The broader efficiency trend echoes recent TTS and decoding work: WAND shows up to 66.2% KV-cache reduction and 1.51–1.89× speedups with windowed attention, and SpecDiff-2 reports up to +55% tokens/sec and as much as 5.5× over standard decoding via diffusion-based drafting. 6 7
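As a rough intuition for how windowed attention shrinks the KV cache, consider a plain sliding-window cache that keeps only the most recent positions. This sketch is a generic illustration, not WAND's design, and it does not reproduce the reported 66.2% figure; `WindowedKVCache` and `cache_reduction` are hypothetical names.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowedKVCache:
    """Keeps key/value entries for only the last `window` positions."""

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) evicts the oldest entry automatically.
        self.entries: Deque[Tuple[int, str]] = deque(maxlen=window)

    def append(self, position: int, kv: str) -> None:
        self.entries.append((position, kv))

    def positions(self) -> List[int]:
        return [pos for pos, _ in self.entries]


def cache_reduction(seq_len: int, window: int) -> float:
    """Fraction of KV entries saved versus caching every position."""
    kept = min(seq_len, window)
    return 1.0 - kept / seq_len


# Usage: after 10 decoding steps with a window of 4, only the last
# four positions remain available to attention.
cache = WindowedKVCache(window=4)
for t in range(10):
    cache.append(t, f"kv{t}")
```

The saving grows with sequence length because the cache size stays fixed while a full cache grows linearly; the trade-off is that tokens outside the window are no longer directly attendable.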
Open Source & Repos
fireworks-tech-graph: Technical diagrams from plain language (plus agnix for agent config hygiene)
This Claude Code “skill” turns natural-language descriptions into publication-ready SVGs (and PNG export), packing 7 visual styles and support for 14 diagram types, including full UML. It bakes in AI/Agent domain patterns like RAG and multi-agent tool flows, so you can describe a system and get a clean diagram without fiddling in a drawing app. Licensed MIT. 8
If you’re wiring up AI agents, agnix is a linter for agent configuration files that validates Skills, Hooks, Memory, Plugins, and Model Context Protocol-style configs across tools like Claude Code and Cursor. It ships 156 rules, offers auto-fix, and integrates with editors and CI via LSP/SARIF outputs—useful to keep sprawling agent setups predictable. 9 10
A practical tip from the Claude Code community: treat “Skills” as context routing, not just prompts—clear descriptions trigger the right instruction set at the right time and cut token waste in long sessions. That framing helps maintain quality as context grows. 11
Why It Matters
Long-context multimodality (EXAONE 4.5) reduces reliance on brittle pipelines for document work, while hybrid, open-weight agent models (Nemotron 3 Super) make always-on, tool-using systems more affordable and stable. On the vision side, promptable 3D detection and omnidirectional localization widen what “in-the-wild” systems can recognize and where they can operate. 1 2 3 4
The efficiency drumbeat is loud: parameter sharing (ELT), windowed attention (WAND), and diffusion-based drafting (SpecDiff-2) all point to a new baseline—do more with less compute, without giving up quality. For teams, that means faster iterations, lower bills, and models that fit real constraints. 5 6 7