Vol.01 · No.10 Daily Dispatch May 6, 2026

Latest AI News

Google brings multimodal search and page citations to Gemini API

Google’s File Search now works across images and text and can cite the exact page it pulled from—pushing RAG toward audit-ready answers—while Microsoft ships an open-source toolkit to govern what agents can do.

One-Line Summary

AI search and agents tilt toward verifiability and control: Google’s Gemini API File Search adds multimodal retrieval with page-level citations, and Microsoft publishes an open-source governance toolkit for AI agents.

New Tools

Gemini File Search adds multimodal search, metadata filters, citations

Google expands the Gemini API’s File Search so apps can search across text and images together, add custom metadata to narrow results, and return page-level citations that point back to the original source, making retrieval-augmented generation easier to verify. 1

Custom metadata filters let teams tag unstructured files with key–value labels like department or status and scope queries to those slices, which can cut irrelevant hits and speed up RAG workflows when users only need a specific subset. 1
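The scoping idea can be shown with a small sketch. Everything here is illustrative: the record shape, label names, and filter function are hypothetical stand-ins, not the actual File Search API.

```python
# Hypothetical document records tagged with key-value metadata labels.
docs = [
    {"id": "doc-1", "text": "Q2 refund policy", "meta": {"department": "support", "status": "published"}},
    {"id": "doc-2", "text": "Draft onboarding guide", "meta": {"department": "hr", "status": "draft"}},
    {"id": "doc-3", "text": "Support escalation runbook", "meta": {"department": "support", "status": "published"}},
]

def filter_docs(docs, **labels):
    """Keep only documents whose metadata matches every requested label."""
    return [d for d in docs if all(d["meta"].get(k) == v for k, v in labels.items())]

# Scope the query to published support documents before any retrieval runs.
scoped = filter_docs(docs, department="support", status="published")
print([d["id"] for d in scoped])  # ['doc-1', 'doc-3']
```

Narrowing the candidate set before semantic search is what cuts the irrelevant hits: the retriever never scores documents outside the tagged slice.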

The update is powered by Gemini Embedding 2, which maps text, images, video, audio, and documents into a single semantic space. Knight Li reports per-request limits of up to 8,192 text tokens, 6 images, 120 seconds of video, 180 seconds of audio, and 6 PDF pages; 3,072‑dimensional vectors with optional truncation to 1,536 or 768 dimensions; and measured gains such as Harvey’s +3% Recall@20, Supermemory’s +40% Recall@1, and Nuuly’s Match@20 rising from 60% to nearly 87%, with overall identification above 90%. 2
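The optional truncation works by keeping a prefix of the full vector. A minimal sketch, assuming Matryoshka-style embeddings where re-normalizing the truncated prefix to unit length is the usual practice (whether Gemini Embedding 2 requires that step is an assumption here):

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length.

    Re-normalizing after truncation is common with Matryoshka-style
    embeddings; it keeps cosine similarity comparable across sizes.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.01] * 3072              # stand-in for a 3,072-dimensional embedding
short = truncate_embedding(full, 768)
print(len(short))                             # 768
print(round(sum(x * x for x in short), 6))    # 1.0 (unit length)
```

The trade-off is storage and latency versus recall: 768-dimensional vectors are a quarter the size of the full 3,072, at some cost in retrieval quality.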

Practically, this means an internal helper can point to the exact page in a long PDF and surface the matching product screenshot in one answer, which builds trust and supports rigorous fact‑checking workflows. 1

Microsoft open-sources Agent Governance Toolkit for safer AI agents

Microsoft releases an open-source Agent Governance Toolkit to keep autonomous AI agents within policy through policy enforcement, zero‑trust identity, execution sandboxing, and reliability engineering, with coverage of the OWASP Agentic Top 10 risks. 3

The repository shows a v3.4.0 release dated May 5, 2026, including a fix that reduces false‑positive HIGH‑risk flags in the contributor reputation check for established accounts by dampening certain abuse signals; documentation and a quick start are provided. 3
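The dampening mechanism can be gestured at with a toy model. The formula, weights, and threshold below are invented for illustration and are not the toolkit's actual scoring logic:

```python
# Invented sketch of "dampening" abuse signals for established accounts:
# the longer an account's history, the less weight a raw abuse signal gets,
# which reduces false-positive HIGH flags on long-standing contributors.

def risk_score(abuse_signal, account_age_days):
    """Scale a raw abuse signal (0..1) down as account history grows."""
    damping = min(account_age_days / 365, 1.0)   # 0.0 (new) .. 1.0 (1+ year)
    return abuse_signal * (1 - 0.5 * damping)    # up to a 50% reduction

def flag(score, threshold=0.7):
    """Map a score to a risk label using a single illustrative threshold."""
    return "HIGH" if score >= threshold else "LOW"

print(flag(risk_score(0.8, account_age_days=10)))    # HIGH (new account)
print(flag(risk_score(0.8, account_age_days=730)))   # LOW (established account)
```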

Context: multi‑agent development commonly builds on Microsoft’s AutoGen framework. Point of AI describes AutoGen as open source and widely used, notes that Microsoft now steers many new projects toward its newer Agent Framework, and observes that prototypes mature into production systems by layering in governance and controls. 4

What This Means for You

If you own an internal knowledge base or support bot, page‑level citations give you an audit trail—require any AI answer to include the source page so reviewers can check claims quickly and share externally with confidence. 1

If your work relies on screenshots, charts, and PDFs, indexing them alongside text can stabilize retrieval quality; Knight Li reports measurable gains from multimodal embeddings in legal, memory, and retail use cases, signaling that visual materials no longer need to be second‑class data. 2

If your team is experimenting with agents that call APIs or run code, governance should be explicit from day one; Microsoft’s toolkit packages core controls like zero‑trust identity, sandboxing, and policy gates aligned to the OWASP Agentic Top 10, and the latest release tightens risk‑scoring logic. 3

In practice, pair verifiable retrieval with governed execution: require page‑cited grounding before an agent proposes a high‑impact action, and route sensitive steps to a human; Point of AI notes teams often start with AutoGen and move to production once governance and controls are layered in. 4
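That pairing can be sketched as a simple policy gate. The action names, citation shape, and decision labels below are hypothetical, not part of Microsoft's toolkit:

```python
# Hypothetical policy gate: refuse high-impact agent actions that lack
# page-cited grounding, and route sensitive steps to a human reviewer.

HIGH_IMPACT = {"refund_customer", "delete_record"}
SENSITIVE = {"refund_customer"}

def gate(action, citations, approved_by_human=False):
    """Return 'allow', 'needs_human', or 'deny' for a proposed agent action."""
    if action in HIGH_IMPACT and not citations:
        return "deny"            # no page-cited grounding -> refuse outright
    if action in SENSITIVE and not approved_by_human:
        return "needs_human"     # sensitive step -> human in the loop
    return "allow"

cite = [{"file": "policy.pdf", "page": 12}]
print(gate("refund_customer", citations=[]))                            # deny
print(gate("refund_customer", citations=cite))                          # needs_human
print(gate("refund_customer", citations=cite, approved_by_human=True))  # allow
```

Low-impact actions pass straight through, so the gate adds friction only where a mistake would be costly.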

Action Items

  1. Pilot Gemini File Search on a real workflow: Upload a long PDF and a handful of relevant screenshots, ask 5 typical questions, and confirm answers include page numbers and the right visuals.
  2. Design a metadata scheme: Pick 3–5 labels (department, status, audience, date) and tag a small document set, then rerun the same queries with filters to compare noise and speed.
  3. Measure retrieval quality, not vibes: Build a 20‑question test and track Recall@1/20 before and after enabling multimodal indexing or task prefixes with Gemini Embedding 2; record results in a simple spreadsheet.
  4. Add guardrails to your agent pilot: Run the Agent Governance Toolkit quick start in a sandbox, configure one policy gate plus a human‑approval step for a risky tool call, and map which OWASP Agentic Top 10 risks you covered.
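The Recall@1/20 metric from item 3 takes only a few lines to compute; this helper is generic and not tied to any SDK:

```python
def recall_at_k(relevant, ranked, k):
    """Fraction of relevant items that appear in the top-k retrieved results."""
    top_k = set(ranked[:k])
    hits = sum(1 for r in relevant if r in top_k)
    return hits / len(relevant)

# Hypothetical test question: two relevant pages, five retrieved results.
relevant = {"p4", "p9"}
ranked = ["p9", "p2", "p7", "p4", "p1"]
print(recall_at_k(relevant, ranked, 1))   # 0.5
print(recall_at_k(relevant, ranked, 20))  # 1.0
```

Averaging these scores over the 20-question test set, before and after a change, gives the before/after comparison the action item calls for.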
