Vol.01 · No.10 Daily Dispatch April 19, 2026

Latest AI News


OpenAI's GPT-5.4 puts computer-use agents and 1M-token context into work mode

The new flagship boosts spreadsheet/presentation work, ships native computer-use for agents, and posts state-of-the-art results on OSWorld and other web-agent benchmarks. Google counters with Gemma 4 under Apache 2.0 and a robotics model that reads analog gauges.

One-Line Summary

OpenAI pushes agents into everyday work with GPT-5.4’s native computer use and 1M-token context, while Google counters with Gemma 4 open models and a robotics upgrade that can read analog gauges.

LLM & SOTA Models

Introducing GPT-5.4

OpenAI’s new flagship is built to do real office work with less back-and-forth: building spreadsheets, polishing slides, and handling long tasks where an AI needs to plan, use tools, and keep context. In ChatGPT, “GPT-5.4 Thinking” shows an upfront plan so you can steer mid-response, and it improves deep web research and long-context relevance. On knowledge work, GPT-5.4 reaches 83.0% wins-or-ties on GDPval across 44 occupations (vs. 70.9% for GPT-5.2), and human raters prefer its presentations 68.0% of the time; on internal spreadsheet tasks it averages 87.3% (vs. 68.4% for GPT-5.2). It also reduces factual errors: individual claims are 33% less likely to be false, and full responses are 18% less likely to contain any error, compared to GPT-5.2. 1

For developers, the headline is native computer-use: GPT-5.4 can operate desktops and browsers through screenshots, mouse/keyboard, and code, with up to 1M tokens of context for long-horizon planning. It posts state-of-the-art 75.0% on OSWorld-Verified (beating GPT-5.2’s 47.3% and surpassing 72.4% human performance), 67.3% on WebArena-Verified, and 92.8% on Online-Mind2Web (screenshot-only). Visual understanding also improves: 81.2% on MMMU-Pro and a lower 0.109 error on OmniDocBench. New image input detail levels include “original” up to 10.24M pixels (max 6000px) and “high” up to 2.56M pixels (2048px). Early users report faster sessions with fewer tokens; one partner cites 95% first-attempt success and 100% within three attempts across ~30K portals, ~3x speedup, and ~70% fewer tokens versus prior models. 1
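The new image detail levels amount to a pixel budget plus a max-side cap. Here is a minimal sketch of the downscale math those limits imply, assuming a simple uniform-scale policy (the exact resizing rule OpenAI applies is not specified in the announcement, and the function name is illustrative):

```python
import math

# Pixel budgets and max side lengths quoted for GPT-5.4 image input.
# The uniform-scale policy below is an assumption for illustration.
DETAIL_LEVELS = {
    "original": {"max_pixels": 10_240_000, "max_side": 6000},
    "high": {"max_pixels": 2_560_000, "max_side": 2048},
}

def fit_image(width: int, height: int, detail: str) -> tuple[int, int]:
    """Return dimensions scaled uniformly to fit a detail level's limits."""
    limits = DETAIL_LEVELS[detail]
    scale = min(
        1.0,  # never upscale
        limits["max_side"] / max(width, height),
        math.sqrt(limits["max_pixels"] / (width * height)),
    )
    return (int(width * scale), int(height * scale))
```

For example, an 8000×4000 screenshot exceeds both the 6000px side cap and the 10.24M-pixel budget at “original” detail, so it gets scaled down until both constraints hold; a 1000×800 image passes through untouched.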

Coding inherits GPT-5.3‑Codex strengths, matches or exceeds it on SWE‑Bench Pro (57.7% vs. 56.8% for 5.3‑Codex), and adds lower latency with a /fast mode (up to 1.5x token velocity). Tool use scales better through tool search and parallelization, and the model’s token efficiency cuts cost and increases speed on complex multi-tool workflows. GPT‑5.4 is available in ChatGPT (Thinking and Pro), the API, and Codex, with a new ChatGPT for Excel add‑in and curated spreadsheet and slide skills to try today. 1

Gemma 4: Our most capable open models to date

Google DeepMind releases Gemma 4 as an open, production‑ready family that aims to deliver strong reasoning and agent workflows on modest hardware—available under the commercially permissive Apache 2.0 license. The lineup spans Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts, and 31B Dense. The 31B ranks #3 and the 26B ranks #6 on Arena AI’s open text leaderboard, with claims that Gemma 4 “outcompetes models 20x its size,” signaling high intelligence-per-parameter for local and cloud use. Longer contexts reach 128K on edge models and up to 256K on larger ones, with video, image, and audio (E2B/E4B) support and native function-calling and structured output for agentic workflows. 2

DeepMind emphasizes practical deployment: unquantized bfloat16 weights fit on a single 80GB NVIDIA H100 GPU, while quantized variants run on consumer GPUs. The E2B/E4B models are designed for phones and small devices in collaboration with Pixel, Qualcomm, and MediaTek, running fully offline with low latency. Tooling arrives day one across vLLM, llama.cpp, MLX, Hugging Face, NVIDIA NIM/NeMo, Ollama, and more, with weights downloadable from Hugging Face, Kaggle, and Ollama. 2
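The deployment claims follow from straightforward weight-memory arithmetic. A back-of-envelope sketch (weights only; real memory use also includes activations and the key-value cache, which this deliberately ignores):

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory needed for model weights alone, in gigabytes."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# 31B dense in bfloat16 (16 bits/param): ~62 GB, inside one 80 GB H100.
bf16_gb = weight_memory_gb(31, 16)

# The same model quantized to 4 bits/param: ~15.5 GB, consumer-GPU territory.
q4_gb = weight_memory_gb(31, 4)
```

This is why a single 4-bit quantization step moves the 31B model from datacenter hardware to a 24GB consumer card, at some quality cost.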

For background, Gemma’s open lineage started at 2B and 7B with instruction-tuned variants and broad Hugging Face support across Transformers, PEFT, bitsandbytes quantization, and Google Cloud/Inference Endpoints. Gemma 4 extends this open ecosystem while upgrading reasoning, code, and multimodal abilities to the current state of the art for open models. 3

Gemini Robotics ER 1.6: Enhanced Embodied Reasoning

Gemini Robotics‑ER 1.6 is built to help robots “think with their eyes”—improving spatial understanding, task planning, success detection, and a new skill: reading analog gauges and sight glasses encountered in real facilities. The model acts as a high-level planner that can point to relevant pixels, zoom, and even run code as part of “agentic vision,” allowing it to spot needles, units, and scales, then compute precise readings—developed with partners like Boston Dynamics for inspection workflows. 4

The upgrade advances multi-view reasoning to combine wrist and overhead cameras, making it better at knowing when a job is actually done—crucial for autonomy. Safety also improves: the model follows physical safety constraints more reliably (e.g., not handling liquids or heavy objects) and detects hazards more accurately than earlier baselines on text and video scenarios. Developers can try it now via the Gemini API and Google AI Studio, with a starter Colab that shows how to prompt the model for embodied reasoning. 4

For pointing—a foundation of spatial reasoning—ER 1.6 is more precise at counting, targeting grasp points, and following constraints like “only items that fit in the blue cup.” The instrument-reading pipeline illustrates the full loop: zoom into fine details, point at key ticks, compute proportions with code, and apply world knowledge to interpret the final value—an example of agentic perception that moves beyond static vision models. 4
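The “compute proportions with code” step reduces, for a linear gauge, to interpolating between the min and max tick angles. A minimal sketch of that final computation, assuming the pointing step has already extracted the angles (all names here are illustrative, not the model’s actual tool API):

```python
def read_gauge(needle_deg: float, min_deg: float, max_deg: float,
               min_value: float, max_value: float) -> float:
    """Map a needle angle to a reading by linear interpolation.

    Assumes a linear scale, with all angles measured in the same
    frame for the needle, the min tick, and the max tick.
    """
    fraction = (needle_deg - min_deg) / (max_deg - min_deg)
    return min_value + fraction * (max_value - min_value)

# A 0-100 psi gauge sweeping from 225 degrees down to -45 degrees:
# a needle at 90 degrees sits exactly halfway along the sweep.
halfway = read_gauge(90, 225, -45, 0, 100)  # 50.0 psi
```

Nonlinear scales (common on flow meters) would need a per-gauge calibration curve instead of a single linear map, which is exactly where the model’s world knowledge about instrument types comes in.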

Open Source & Repos

browser-use/browser-harness: Self-healing browser harness that enables LLMs to complete any task.

This lightweight Chrome DevTools Protocol (CDP) harness is designed to let an AI agent truly drive the browser—typing, clicking, uploading files—while “self-healing” by editing its own helper functions mid-task. If the agent needs an upload_file() helper and it’s missing, it writes the function into helpers.py on the fly and continues, minimizing scaffolding and maximizing autonomy. One websocket to Chrome; no thick framework in the way. 5

The setup prompt is intentionally simple: point your coding assistant (Claude Code or Codex) at the repo, have it read install.md to connect to a real browser, then SKILL.md and helpers.py to understand capabilities. From there, the agent iterates: try an action, inspect errors, patch helpers, and retry—useful for tasks that break brittle scripted automations. 5
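The self-healing loop, in pure-Python terms: if a helper is missing from helpers.py, append its source and reload the module. Here is a minimal sketch of that pattern, independent of any browser—the real harness speaks CDP over a websocket, and the upload_file body below is just a stand-in:

```python
import importlib.util
import pathlib
import tempfile

def ensure_helper(helpers_path: pathlib.Path, name: str, source: str):
    """Append a helper to helpers.py if absent, then (re)load the module."""
    text = helpers_path.read_text() if helpers_path.exists() else ""
    if f"def {name}(" not in text:
        helpers_path.write_text(text + "\n" + source)
    spec = importlib.util.spec_from_file_location("helpers", helpers_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, name)

# The agent discovers upload_file() is missing and patches it in mid-task.
helpers = pathlib.Path(tempfile.mkdtemp()) / "helpers.py"
upload_file = ensure_helper(
    helpers, "upload_file",
    "def upload_file(path):\n    return f'uploaded {path}'\n",
)
```

The appeal over scripted automation is that failures become edit-and-retry loops instead of dead ends: the agent inspects the error, patches the helper, and resumes.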

Who it’s for: builders experimenting with computer-use agents who value raw control and hackability over prebuilt recipes. Why it’s trending: native computer-use in frontier models (like GPT‑5.4) pairs naturally with a minimal CDP harness that the model can extend itself. 5

Community Pulse

Hacker News (216↑) — Discussion around Gemini Robotics‑ER 1.6 is mixed: many welcome physics-aware reasoning as a safety boost, while others debate how much explicit “world modeling” is necessary versus relying on LLMs plus agent loops to compensate. The consensus: better spatial understanding helps, but the right depth of physical knowledge remains an open design question. 4

Why It Matters

Today’s theme is “agents growing up.” GPT‑5.4 folds planning, long context, and native computer use into one stack for knowledge work and coding, while Gemma 4 puts competitive reasoning in a package you can actually run and fine‑tune locally. Robotics‑ER 1.6 shows what happens when vision reasons step‑by‑step, not just labels pixels. Together, these moves shrink the gap between chatbots and dependable digital and physical coworkers. 1 2 4

This Week, Try It

  1. ChatGPT for Excel add‑in: Build a quick model-driven spreadsheet using GPT‑5.4’s new skills; install from the Apps link in ChatGPT. 1
  2. Run Gemma 4 locally: Pull Gemma 4 weights on Hugging Face and try with Transformers or Ollama for an on‑device code assistant. 2
  3. Clone browser-harness: Point your coding assistant at the repo and let it wire up a browser automation skill end‑to‑end. 5

