Vol.01 · No.10 Daily Dispatch April 4, 2026

Latest AI News

AI · Papers · Daily Curation · Open Access
AI News · Research
7 min read

Google’s Gemma 4 pushes open-weight frontier with Apache-2.0, day‑0 runtime support, and 256K context

A dense 31B and a latency-focused 26B MoE lead the Gemma 4 family, while Microsoft drops decoder-only multilingual embeddings that set new MTEB v2 marks. DeepMind’s D4RT reframes 4D vision with up to 300× efficiency gains.


One-Line Summary

Google unveils Gemma 4 as a high-intelligence, Apache-licensed open family spanning phones to H100s, while Microsoft ships a SOTA multilingual embedding suite and fresh research advances multimodal reasoning and 4D scene understanding.

LLM & SOTA Models

Gemma 4: Size-for-size, the most capable open models

Google releases Gemma 4 in four sizes — Effective 2B (E2B), Effective 4B (E4B), a 26B Mixture of Experts (MoE), and a 31B dense model — designed for advanced reasoning and agentic workflows. The 31B ranks #3 and the 26B ranks #6 among open models on the Arena AI text leaderboard, with Google claiming Gemma 4 outcompetes models 20x its size. The family adds native function calling, structured JSON output, long context (128K on the edge models, up to 256K on the larger ones), and multimodality (image/video input across all sizes, audio on E2B/E4B). All models ship under the Apache 2.0 license for commercial use. 1 2
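As a sketch of what the function-calling and structured-JSON flow can look like in practice (the tool-schema format, field names, and model reply below are generic illustrations in the style common to open-model runtimes, not Gemma 4's actual chat template):

```python
import json

# Hypothetical tool definition in the JSON-schema style most open-model
# runtimes (vLLM, Ollama, etc.) accept; Gemma 4's exact template may differ.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A structured-JSON reply a model might emit when it decides to call the tool.
model_reply = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(model_reply)
assert call["tool"] == get_weather_tool["name"]
assert set(call["arguments"]) >= set(get_weather_tool["parameters"]["required"])
print(call["arguments"]["city"])  # → Berlin
```

The point of structured output is exactly this: the reply parses as strict JSON, so the calling application can validate it against the tool schema instead of scraping free text.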

A key hardware note: the 26B MoE activates only 3.8B parameters per token to cut latency, while the 31B dense maximizes raw quality. Unquantized bfloat16 weights fit on a single 80GB NVIDIA H100; quantized variants target consumer GPUs for local IDEs and agents. Edge-first E2B/E4B emphasize near-zero latency and battery/RAM efficiency, running offline on phones, Raspberry Pi, and NVIDIA Jetson Orin Nano in collaboration with Pixel, Qualcomm, and MediaTek. 1 2
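The single-GPU claim is easy to sanity-check: bfloat16 stores two bytes per parameter, so the dense 31B's raw weights come to about 62 GB, comfortably under an H100's 80 GB (activations and KV cache eat into the remainder):

```python
# Back-of-envelope check of the hardware claims above.
params_dense = 31e9
weights_gb = params_dense * 2 / 1e9    # bfloat16 = 2 bytes/param -> 62 GB
assert weights_gb < 80                 # fits on one 80GB H100 unquantized

# The 26B MoE routes each token through only 3.8B active parameters,
# so per-token compute scales with the active count, not the total.
active_fraction = 3.8e9 / 26e9
print(round(weights_gb), round(active_fraction, 3))
```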

This release lands with day-one ecosystem support (Hugging Face Transformers/TRL, vLLM, llama.cpp, MLX, Ollama, NVIDIA NIM/NeMo, ROCm, TPUs, and more). vLLM highlights first-day support across NVIDIA/AMD/Intel GPUs and Google TPUs, handling Gemma 4’s 128K–256K context and agentic features. As an on-ramp, Google AI Studio and Edge Gallery host the larger and edge models respectively, while weights are downloadable from Hugging Face, Kaggle, and Ollama. 3 2

Open Source & Repos

Microsoft Harrier-OSS-v1: Multilingual decoder-only embeddings hit SOTA

Microsoft introduces Harrier-OSS-v1, a multilingual embedding family in three sizes — 270M, 0.6B, and 27B parameters — that achieves state-of-the-art on Multilingual MTEB v2. Unlike traditional encoder-based approaches (e.g., BERT), Harrier uses a decoder-only backbone with last-token pooling and L2 normalization, and supports a long 32,768-token context window to embed large documents or code without aggressive chunking. 4 5

Harrier is instruction-tuned for retrieval: prepend a one-sentence task instruction to queries (documents remain instruction-free), which lets the model adapt its vector space across domains like web search, bitext mining, and domain Q&A. The smaller 270M (≈640-dim) and 0.6B (≈1,024-dim) models use knowledge distillation to approach the 27B model’s quality, creating a practical cost–latency spectrum for edge to enterprise deployment. 4 6
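A minimal sketch of the pooling recipe described above, using toy NumPy arrays in place of real decoder activations (the hidden dimension, instruction wording, and random values are illustrative, not Harrier's actual components):

```python
import numpy as np

# Decoder-only embedding recipe: take the final hidden state of the
# LAST token, then L2-normalize it to a unit vector.
def embed(hidden_states: np.ndarray) -> np.ndarray:
    # hidden_states: (seq_len, dim) from a causal decoder
    last = hidden_states[-1]             # last-token pooling
    return last / np.linalg.norm(last)   # L2 normalization

# Queries get a one-sentence task instruction prepended; documents do not.
query = ("Given a web search query, retrieve relevant passages. "
         "How do I cache DNS lookups?")
rng = np.random.default_rng(0)
toy_states = rng.normal(size=(len(query.split()), 8))  # stand-in activations
q_vec = embed(toy_states)
assert np.isclose(np.linalg.norm(q_vec), 1.0)  # cosine similarity == dot product
```

Because every embedding is unit-length, cosine similarity reduces to a plain dot product, which is what vector databases typically compute.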

The shift to decoder-only embeddings aligns retrievers with modern generator LLMs and mitigates context loss: 32K-token inputs reduce semantic degradation from chunking, improving RAG recall and clustering on long-form content. Industry commentary notes potential disruption of proprietary embedding APIs as open SOTA models expand options for local, customizable pipelines. 5 4
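Back-of-envelope arithmetic on that chunking tax, assuming an illustrative 30K-token document and a conventional 512-token embedding window (both figures are our example, not from the release):

```python
import math

# One long-context embedding vs. dozens of chunk embeddings for the same doc.
doc_tokens, small_win, long_win = 30_000, 512, 32_768
chunks_small = math.ceil(doc_tokens / small_win)  # separate vectors to index
chunks_long = math.ceil(doc_tokens / long_win)    # one vector, no split points
print(chunks_small, chunks_long)
```

Fewer split points means fewer places where a sentence's meaning is severed from its surrounding context, which is where the RAG recall gains come from.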

Research Papers

Phi-4-reasoning-vision-15B: Lessons from training a compact multimodal reasoning model

Microsoft announces Phi-4-reasoning-vision-15B, an open-weight 15B multimodal reasoning model that targets strong math/science reasoning and GUI grounding while remaining efficient. It was trained on about 200B multimodal tokens (building on Phi-4-Reasoning, itself trained on 16B tokens over a Phi-4 core trained on 400B unique tokens), and aims to match or exceed larger models’ accuracy with far less compute than many recent VLMs that rely on over 1T tokens. 7 8

Architecturally, it adopts mid-fusion with a SigLIP-2 Naflex vision encoder and the Phi-4-Reasoning backbone. An ablation across resolution-handling techniques shows dynamic-resolution encoders perform best on high-res inputs; setting a 3600 visual-token cap (≈HD 720p) lifts ScreenSpot-Pro accuracy to 17.5, outperforming standard multi-crop and dynamic-S2 on that benchmark slice. These findings guide a practical trade-off between rich joint representations and efficiency for UI-heavy and information-dense images. 7 8
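The 3600-token cap lines up exactly with 720p under a 16-pixel patch grid (the patch size is our assumption; SigLIP-family encoders commonly use 14 or 16):

```python
# Why a 3600 visual-token cap corresponds to roughly HD 720p:
# a dynamic-resolution encoder emits one token per image patch.
width, height, patch = 1280, 720, 16   # 720p frame, assumed 16x16 patches
tokens = (width // patch) * (height // patch)  # 80 * 45
print(tokens)
```

Above that budget, the encoder must downsample or crop, which is exactly the trade-off the ablation on UI-heavy, information-dense images is probing.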

Data curation emphasized quality: heavily filtered/improved open datasets, targeted acquisitions, and internal domain sets; incorrect captions/answers were regenerated with GPT-4o or o4-mini when salvageable, and images were repurposed as seeds for new VQA/caption data where questions were low-quality. Evaluation highlights a favorable accuracy–compute Pareto position versus both faster-but-less-accurate and slower-but-heavier baselines across ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2. 7 8

D4RT: Unified, fast 4D scene reconstruction and tracking

Google DeepMind presents D4RT (Dynamic 4D Reconstruction and Tracking), a unified encoder–decoder Transformer that recovers dynamic 3D structure over time from video via a parallel query mechanism. By asking “where is this source pixel in 3D at time t from a given camera?” and answering many such queries in parallel, D4RT solves point tracking, point-cloud reconstruction, and camera pose estimation within a single interface. 9
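The parallel query interface can be pictured as follows; the class name, fields, and array shapes are our own illustrative sketch of the idea, not DeepMind's API:

```python
from dataclasses import dataclass
import numpy as np

# Each query asks: "where is pixel (u, v), observed at time t_src,
# located in 3D at time t_tgt, as seen from camera cam?"
@dataclass
class SceneQuery:
    u: float      # source pixel x
    v: float      # source pixel y
    t_src: float  # timestamp the pixel was observed at
    t_tgt: float  # timestamp to answer at
    cam: int      # camera/view to answer from

# Many such queries are batched and decoded in parallel; the answer per
# query is a 3D point (plus visibility/confidence in the real model).
queries = [SceneQuery(u, v, 0.0, 0.5, 0) for u in (10, 20) for v in (30, 40)]
answers = np.zeros((len(queries), 3))  # placeholder for decoder output
assert answers.shape == (4, 3)
```

Varying `t_tgt` over a fixed pixel yields point tracking, sweeping all pixels at one time yields a point cloud, and querying across cameras recovers pose, which is how one interface covers all three tasks.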

On efficiency, D4RT runs roughly 18x–300x faster than prior SOTA: a one-minute video is processed in about five seconds on a single TPU, versus up to ten minutes before (≈120x in that comparison). Benchmarks include MPI Sintel (motion blur, non-rigid deformation), Aria Digital Twin (egocentric household scenes), and RE10k (indoor/outdoor pose), where D4RT reaches higher fidelity and AUC without costly test-time optimization. 9
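The cited ≈120x figure follows directly from the stated numbers:

```python
# Sanity check on the cited comparison: ~5 s per minute of video now,
# versus up to 10 minutes previously.
before_s, after_s = 10 * 60, 5
speedup = before_s // after_s
print(speedup)
```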

This speed–quality balance positions D4RT for real-time robotics perception, low-latency AR scene understanding, and as a building block for “world models” that disentangle camera motion, object motion, and static geometry — prerequisites for robust embodied AI. 9

Community Pulse

Hacker News (24↑) — Mixed: users report inconsistent local performance for Gemma 4 via LMStudio; commenters debate whether the tooling or the models themselves are at fault.

"That usually means the framework (LMStudio, for you) is doing something wrong, and in a couple of days an update will come out that fixes that." — Hacker News

Hacker News (93↑) — Mixed: enthusiasm for multimodal reasoning tempered by skepticism that smaller/local gains reflect true understanding vs. benchmark tuning.

"I'm very happy to read about this progress but I don't find it particularly surprising. The big labs optimize for accuracy/high scores on benchmarks first; I automatically expect that (with some research effort) a model with 100x few parameters can achieve the same scores." — Hacker News

"yeah i know lol, that’s kind of my point. impressive that it runs on your gpu, but it still can’t tell you what happens if you tilt a glass... A four year old can do this and we’re just barely on step one and a half." — Hacker News

Why It Matters

Open, high-capability models are consolidating a “local-first” path: Gemma 4’s 31B/26B options and E2B/E4B edge variants bring long context, multimodality, and agentic tooling under a permissive Apache 2.0 license, lowering costs and expanding where strong AI can run — from H100s to phones. Meanwhile, decoder-only multilingual embeddings with 32K context reduce RAG’s chunking tax, lifting recall on long-form content. 1 4

Research momentum complements this shift: compact multimodal models like Phi-4-reasoning-vision-15B and real-time 4D perception like D4RT suggest that better data/architectures — not just bigger budgets — are unlocking capability. The next competitive edge is likely to come from efficient reasoning, trustworthy perception, and the ability to run advanced agents anywhere. 7 9

Sources (9)
