Computer Use
Plain Explanation
Many business tasks live only behind a user interface, not a clean API. Copying data, filing forms, or navigating a portal still requires clicking buttons on a page. Traditional scripts often break when layouts change, and asking for ad-hoc integrations can take months. Computer Use tackles this by letting a model "see" the screen via screenshots and "act" by returning structured steps (click, type, scroll) that your app carries out. After each batch of actions, you send back a fresh screenshot so it can decide what to do next.

Mechanically, your runtime runs a loop: enable the computer tool, execute every action in the returned actions[] in order, capture an updated screenshot, and repeat. Accuracy improves when you provide high-detail images (up to 10.24M pixels, or downscale to around 1440×900 or 1600×900 with proper coordinate remapping). Safety-wise, you isolate the browser or VM, keep a human in the loop for high-impact steps, and treat page content as untrusted input.
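As a concrete sketch, here is that loop wired against the OpenAI Responses API along the lines of the official guide. The `execute_action` and `capture_screenshot` callables are hypothetical hooks into your own sandboxed runtime, and exact field names may drift across API versions, so treat this as a sketch rather than a definitive implementation:

```python
import base64
from openai import OpenAI

client = OpenAI()

# Enable the computer tool; dimensions must match your sandboxed display.
TOOLS = [{
    "type": "computer_use_preview",
    "display_width": 1440,
    "display_height": 900,
    "environment": "browser",
}]

def run_loop(task: str, execute_action, capture_screenshot) -> None:
    """See-act loop: the model proposes steps, your runtime performs them."""
    response = client.responses.create(
        model="computer-use-preview",
        tools=TOOLS,
        input=[{"role": "user", "content": task}],
        truncation="auto",
    )
    while True:
        calls = [item for item in response.output if item.type == "computer_call"]
        if not calls:
            break  # no pending actions: the model considers the task complete
        for call in calls:                 # execute every returned action, in order
            execute_action(call.action)
        # capture_screenshot() is assumed to return raw PNG bytes of the display
        png_b64 = base64.b64encode(capture_screenshot()).decode()
        response = client.responses.create(
            model="computer-use-preview",
            tools=TOOLS,                                  # reuse the same tool def
            previous_response_id=response.id,             # continue without re-priming
            input=[{
                "type": "computer_call_output",
                "call_id": calls[-1].call_id,
                "output": {
                    "type": "computer_screenshot",
                    "image_url": f"data:image/png;base64,{png_b64}",
                },
            }],
            truncation="auto",
        )
```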
Examples & Analogies
- Claims intake on a vendor portal: An operations agent submits a multi-page claim via a web form. The model suggests typing policy numbers, selecting drop-downs, and clicking Next, while the runtime executes and returns screenshots after transitions.
- Calendar booking in a web app: “Book a 30‑minute slot next Tuesday afternoon.” The model scrolls the week view, clicks an open slot, and types details; the harness acts and returns the updated view for confirmation.
- Document formatting in a desktop suite: In a VM, the agent adjusts paragraph spacing by clicking Format → Paragraph and typing values. Each dialog update yields a new screenshot for verification.
At a Glance
| | Built‑in Computer Use | Custom Harness (Playwright/Selenium/VNC) | Code‑Execution Harness |
|---|---|---|---|
| Inputs | Text + screenshots | Text + your framework’s UI state/screens | Text + screenshots + short scripts |
| Actions from model | Structured UI steps (click, type, scroll, etc.) | Tool calls mapped to your automation APIs | Mix of UI steps and small programs |
| Looping pattern | Execute actions[] → screenshot → repeat | Your loop relays tool calls and images | Alternate between scripts and UI steps |
| Safety posture | Isolated browser/VM; human checks for high‑stakes | You enforce isolation and permissions | You gate script execution and screen scope |
| Typical fit | Tasks a person can do via UI | Existing automation stacks you already run | DOM + UI workflows that need code hops |
All three paths follow the same idea—see the UI and return actions—but differ in how you wire screenshots and execute steps in your runtime.
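For the custom-harness column, the wiring usually reduces to a small dispatcher that maps the model's structured steps onto your automation API. Here is a minimal sketch using Playwright's sync API in an isolated browser context; the action field names ("click", "type", "scroll" with x/y/text keys) are assumptions about the shape of your model's output:

```python
from playwright.sync_api import Page, sync_playwright

def execute_action(page: Page, action: dict) -> None:
    """Map one structured model step onto Playwright calls."""
    kind = action["type"]
    if kind == "click":
        page.mouse.click(action["x"], action["y"])
    elif kind == "type":
        page.keyboard.type(action["text"])
    elif kind == "scroll":
        page.mouse.wheel(action.get("scroll_x", 0), action.get("scroll_y", 0))
    else:
        raise ValueError(f"unsupported action: {kind}")

with sync_playwright() as p:
    # Isolated context: fresh profile, fixed viewport, no host browser state.
    browser = p.chromium.launch(headless=True)
    page = browser.new_context(viewport={"width": 1440, "height": 900}).new_page()
    execute_action(page, {"type": "click", "x": 640, "y": 360})
    screenshot = page.screenshot()  # bytes to send back to the model
    browser.close()
```

Selenium or a VNC bridge slots in the same way; only the dispatcher body changes.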
Where and Why It Matters
- Shift to isolated execution: Running in a sandboxed browser or VM is a default guardrail; page content is treated as untrusted to reduce risk from on‑screen instructions.
- Human confirmation on risky steps: Models are instructed to pause for approval on purchases, downloads, or sending communications, aligning with safety guidance.
- Evaluation focus includes latency: OSWorld analyses report that planning/reflection calls can dominate end‑to‑end time, pushing teams to reduce step count and call frequency.
- Higher‑fidelity vision inputs in production: Keeping screenshot detail at original resolution (up to 10.24M pixels) or careful downscaling improves click grounding and reduces UI misses.
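If you do downscale screenshots (say to 1440×900), the model's coordinates refer to the scaled image, so the harness must remap them back to native pixels before clicking. A minimal sketch of that remapping; only Pillow's resize is assumed, and the resolutions are illustrative:

```python
from PIL import Image

NATIVE = (2880, 1800)   # real display resolution
SCALED = (1440, 900)    # what the model actually sees

def downscale(screenshot: Image.Image) -> Image.Image:
    # Lanczos resampling keeps small UI text legible after resizing.
    return screenshot.resize(SCALED, Image.Resampling.LANCZOS)

def remap(x: int, y: int) -> tuple[int, int]:
    """Convert model coordinates (scaled image) back to native pixels."""
    sx = NATIVE[0] / SCALED[0]
    sy = NATIVE[1] / SCALED[1]
    return round(x * sx), round(y * sy)

# A click the model places at (720, 450) lands at native (1440, 900).
assert remap(720, 450) == (1440, 900)
```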
Common Misconceptions
- ❌ Myth: The model directly controls the computer. → ✅ Reality: The host runtime executes every action; the model only proposes steps.
- ❌ Myth: If a site has an API, Computer Use will discover and call it. → ✅ Reality: It operates through the UI you expose; API calls require a separate tool you define.
- ❌ Myth: Screenshots are harmless context. → ✅ Reality: On‑screen content is untrusted and may include adversarial instructions; isolate environments and require confirmations for high‑stakes moves.
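In practice that confirmation policy can be a small gate in front of the dispatcher: classify each proposed action against a confirm list and pause for a human before anything irreversible runs. A minimal sketch; the keyword list and the `execute_action` hook are assumptions about your own runtime:

```python
# On-screen target text that suggests a high-stakes step.
CONFIRM_KEYWORDS = ("buy", "checkout", "send", "delete", "download")

def gate_action(action: dict, target_text: str) -> bool:
    """Return True if the action may run; ask a human otherwise."""
    risky = any(k in target_text.lower() for k in CONFIRM_KEYWORDS)
    if not risky:
        return True
    answer = input(f"Approve {action['type']} on '{target_text}'? [y/N] ")
    return answer.strip().lower() == "y"

def guarded_execute(action: dict, target_text: str) -> None:
    if gate_action(action, target_text):
        execute_action(action)  # your dispatcher from the harness sketch above
    else:
        raise PermissionError("human declined high-stakes action")
```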
How It Sounds in Conversation
- "Keep the computer tool in the loop but gate any checkout with a human confirmation prompt."
- "After switching to detail: 'original' screenshots, click accuracy improved on small icons."
- "Run this in the Playwright harness inside a VM—no host env vars and filesystem blocked."
- "Latency is dominated by planning calls; per OSWorld‑Human, we should cut steps and reflection turns."
- "Reuse the same tool def and pass previous_response_id so the loop continues without re‑priming."
References
- OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
Finds planning/reflection dominate latency (75–94%) and agents overstep by 1.4–2.7×.
- Computer use | OpenAI API
Official guide: loop mechanics, action types, screenshot detail, and safety setup.
- Gemini 2.5 Computer Use Model Card
Model card: browser-first scope, inputs/outputs, and high-stakes confirmation guidance.