A medical AI writes CT reports step by step — with a full reasoning trail doctors can inspect
RadAgent turns chest CT reading into a transparent, tool-using workflow and posts big gains in accuracy and robustness. Meanwhile, new agent papers and repos focus on navigable knowledge, coherent web UIs, and the "harness" around models.
One-Line Summary
Agents move beyond one-shot answers: a medical AI exposes its step-by-step reasoning, enterprise QA shifts to navigable knowledge trees, and thin "harnesses" make browser agents more reliable.
Research Papers
RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography
This system reads a chest CT scan like a careful resident physician: it plans, calls tools, records each interim decision, and then drafts the report — leaving a complete trail doctors can inspect, validate, or revise. RadAgent turns CT interpretation into an explicit, iterative workflow rather than a single black-box generation. 1
The paper reports significant gains over a 3D vision-language baseline (CT-Chat): macro-F1 improves by 6.0 points (a 36.4% relative jump), micro-F1 by 5.4 points (19.6% relative), and robustness under adversarial conditions by 24.7 points (41.9% relative). Importantly, RadAgent achieves 37.0% faithfulness — the share of report content directly supported by its own reasoning trace — a capability absent in the baseline. In plain terms, it gets more clinical facts right and shows its work. 1
Because clinicians can audit every intermediate step and tool call, RadAgent aligns with practical guidance for high-stakes agents: add state, observability, and human checkpoints rather than relying on a free-form loop. This mirrors field reports that production agents fail not for lack of model IQ, but due to orchestration gaps like infinite retries and silent failures — problems mitigated by explicit state machines and human-in-the-loop gates. 2
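The orchestration pattern described above can be made concrete. The sketch below is a minimal illustration of an explicit state machine with a bounded retry budget, a full event trace, and a human checkpoint before anything is finalized; it is not RadAgent's actual implementation (which the paper does not publish as code), and the `Step`, `Trace`, and `run_pipeline` names are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    """One explicit stage in the workflow (e.g. segment, measure, draft)."""
    name: str
    action: Callable[[dict], dict]
    max_retries: int = 2  # bounded: no infinite retry loops

@dataclass
class Trace:
    """Append-only log so every intermediate decision stays inspectable."""
    events: list = field(default_factory=list)

    def log(self, **event):
        self.events.append(event)

def run_pipeline(steps, state, trace, approve):
    for step in steps:
        for attempt in range(step.max_retries + 1):
            try:
                state = step.action(state)
                trace.log(step=step.name, attempt=attempt, ok=True)
                break
            except Exception as exc:
                # Failures are recorded, never swallowed silently.
                trace.log(step=step.name, attempt=attempt, ok=False, error=str(exc))
        else:
            raise RuntimeError(f"{step.name} exhausted its retry budget")
    # Human-in-the-loop gate: a clinician reviews state plus trace.
    if not approve(state, trace):
        raise RuntimeError("report rejected at review checkpoint")
    return state
```

The point of the pattern is that the trace, not the model's raw output, is the unit of audit: a reviewer can reject at the gate and see exactly which step and attempt produced each fact.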
What to watch next: external validation on more datasets, workflow fit in radiology PACS/RIS, and how faithfulness metrics translate to fewer addenda or callbacks. The design choice — structured, inspectable reasoning — signals a push toward safer, auditable clinical AI. 1
Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG
This work gives a question-answering agent a “map of the library” before it answers: it compiles a document corpus offline into a hierarchical directory of skills and summaries, then lets the agent drill down branches, backtrack, and pull full docs by ID. That explicit structure helps the agent decide where to look rather than passively reading top search hits. 3
Called Corpus2Skill, the pipeline clusters documents, writes model-generated summaries for each node, and materializes a tree of navigable files. On WixQA — an enterprise support benchmark — it outperforms dense retrieval, RAPTOR, and agentic RAG baselines across all reported quality metrics, suggesting that giving the agent a visible corpus topology improves multi-hop reasoning and evidence combination. 3
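A toy version of such a navigable corpus makes the serve-time interface clear: each node carries a model-written summary, child branches to drill into, and document IDs the agent can pull in full. The tree layout, field names, and helper functions below are illustrative assumptions, not the paper's actual on-disk format.

```python
# Hypothetical compiled tree: summaries at every node, doc IDs at the leaves.
TREE = {
    "summary": "All support docs",
    "children": {
        "billing": {"summary": "Invoices, refunds", "doc_ids": ["d17", "d42"]},
        "domains": {"summary": "DNS, transfers", "doc_ids": ["d3"]},
    },
}
DOCS = {"d17": "How refunds work...", "d42": "Invoice schedule...", "d3": "DNS setup..."}

def navigate(tree, path):
    """Follow a list of branch names to a node.

    Backtracking is just re-navigating with a shorter path, so the agent
    always knows where in the corpus it currently stands.
    """
    node = tree
    for branch in path:
        node = node["children"][branch]
    return node

def fetch(doc_id):
    """Pull a full document by ID once navigation has narrowed the search."""
    return DOCS[doc_id]
```

Compared with top-k retrieval, the agent here chooses where to look by reading summaries, then commits to fetching whole documents, which is the "navigate, don't retrieve" trade the paper argues for.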
For teams choosing between retrieval-augmented generation, memory, or wiki-style knowledge, this paper pushes reasoning earlier in the flow: organize at ingest, then navigate at serve time. That complements practitioner guidance that RAG is stateless by default and tends to dilute structure via chunking, while wiki or navigable layers trade more upfront synthesis for lighter, clearer queries later. 4
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
This agent builds webpages that look consistent end to end by planning the overall layout first, then generating images, videos, and components to match that plan, and iterating until the pieces fit. In other words, it coordinates all the visual parts instead of creating elements in isolation that clash in style. 5
The authors introduce both a benchmark for multimodal webpage generation and a multi-level evaluation protocol. Experiments show MM-WebAgent outperforming code-generation and prior agent baselines, especially when generating and integrating multimodal elements — a common failure point for piecemeal AIGC pipelines. Code and data are linked from the paper. 5
The hierarchical plan–act–reflect loop here echoes how practical desktop/web agents operate: perceive the full screen, choose a constrained action, observe, and adjust. Analyses of Claude’s Computer Use architecture underline why this loop, grounded in what’s visible on screen, tends to be more robust across apps than brittle API hooks — useful context for agents that must maintain design coherence. 6
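The plan–act–reflect loop described above can be sketched in a few lines. The generators here are stubs and the function names are assumptions; MM-WebAgent's real models and prompts live in the paper, not in this sketch, which only shows the control flow: fix a global layout first, generate every asset against it, then critique and regenerate until the pieces agree.

```python
def build_page(plan_fn, generate_fn, critique_fn, max_rounds=3):
    # 1. Plan: commit to a global layout and style before any asset exists.
    plan = plan_fn()
    # 2. Act: generate every asset against the shared plan, not in isolation.
    assets = {slot: generate_fn(slot, plan) for slot in plan["slots"]}
    # 3. Reflect: regenerate only the slots that clash with the plan.
    for _ in range(max_rounds):
        bad_slots = critique_fn(plan, assets)
        if not bad_slots:
            break
        for slot in bad_slots:
            assets[slot] = generate_fn(slot, plan)
    return plan, assets
```

Because every `generate_fn` call receives the same plan, style drift between components becomes a critique-and-repair problem rather than something each generator must avoid on its own.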
Open Source & Repos
browser-use/browser-harness: Self-healing browser harness that enables LLMs to complete any task.
This repo offers a minimal “harness” for browser agents: a thin layer on top of Chrome DevTools Protocol where the agent can even edit the harness mid-task — for example, adding an upload_file() helper when it realizes one is missing — then continue and finish the job. The pitch is radical simplicity: one WebSocket to Chrome, no heavy framework. 7
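To see what "the agent can edit the harness mid-task" means mechanically, here is a deliberately tiny sketch: helpers live in a plain registry, and the agent can submit new helper source code at runtime. The `Harness` class and its methods are invented for illustration and mirror only the repo's pitch (the `upload_file()` example), not the actual browser-use API or its CDP wiring.

```python
class Harness:
    """Toy agent-editable harness: a registry of callable helpers."""

    def __init__(self):
        self.helpers = {}

    def register(self, name, source):
        # The "self-healing" move: the agent writes a missing helper as
        # source code and the harness compiles it into the registry.
        # (A real system would sandbox and review this before exec.)
        scope = {}
        exec(source, scope)
        self.helpers[name] = scope[name]

    def call(self, name, *args):
        return self.helpers[name](*args)
```

Mid-task, an agent that notices there is no upload helper could register one and keep going, e.g. `harness.register("upload_file", "def upload_file(path):\n    return 'uploaded ' + path")`. The thinness is the feature: with one registry and one WebSocket to Chrome, there is very little surface the agent cannot inspect or repair.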
Who it’s for: teams who find that model choice matters less than the wrapper around it. Recent practitioner write-ups argue that the harness — prompts, skills, sub-agents, constrained decoding, and parsing — increasingly determines reliability, while frontier models converge in raw capability. A thinner, editable harness lowers the surface area where things break. 8
Ecosystem context: for users seeking self-hosted browser agents or skill packs, projects like WebBrain (free, MIT-licensed Chrome/Firefox extension with multi-provider LLM support) and curated Claude Code skill plugins show how capabilities can live outside the model. The common thread is portability and observability rather than lock-in. 9 10
Why It Matters
Transparent reasoning and navigable knowledge move agents from “smart but opaque” to “useful and auditable.” In medicine, that means clinicians can see and correct the chain behind a finding; in enterprise QA, it means the agent knows where it has and hasn’t looked. 1 3
At the same time, engineering is converging on a pragmatic lesson: the harness — how an agent plans, calls tools, tracks state, and exposes traces — often determines real-world reliability more than marginal gains in model benchmarks. Today’s releases reflect that shift from model-first to system-first design. 2 8