Model Evaluation

Research Agent Data Leakage: Inside the MosaicLeaks Benchmark

June 20, 2026·10 min read
Research Agent Data Leakage: Inside the MosaicLeaks Benchmark

Research Agent Data Leakage: Inside the MosaicLeaks Benchmark

Most agent-security writing is about attackers pushing malicious input in — prompt injection. Research agent data leakage is the mirror image: a well-behaved agent quietly letting private information out, one innocent-looking web search at a time. On June 18, 2026, researchers at ServiceNow published MosaicLeaks (arXiv:2605.30727), a benchmark that turns this exfiltration risk into something you can actually measure — and its headline finding is uncomfortable: you can't prompt privacy in, you have to train it in.

If you run deep research agents over a mix of internal documents and the public web, this is a failure mode you almost certainly have and probably aren't measuring.

What is research agent data leakage?

A deep research agent answers a question by interleaving two kinds of lookups: reading your private documents and issuing public web searches. The leak doesn't happen in the answer it returns to you. It happens in the outbound queries — the search strings that leave your perimeter and land in someone else's logs.

Any single query usually looks harmless. The danger is cumulative. MosaicLeaks calls it the mosaic effect: individually benign queries that, stitched together, reconstruct a confidential fact. The paper's worked example is vivid — separate searches about a cloud-migration milestone, a January 2024 security disclosure, and a vendor identity let an observer of the query log alone deduce that a specific company had migrated 70% of its infrastructure to the cloud by a specific month. No private document ever left the building. The questions the agent asked the open web gave the secret away.

What does the MosaicLeaks benchmark measure?

MosaicLeaks scores how much an adversary watching only the agent's query log can infer. It defines three escalating levels of leakage:

Leakage type What the adversary has What counts as a leak
Intent The query log They can infer the agent's research goals
Answer Query log + the private questions They can answer private questions without the documents
Full-information The query log They can state verifiably true private claims

This framing is what makes MosaicLeaks a confidentiality benchmark rather than a defense how-to. It treats secret-keeping as an evaluable property of the agent's behavior — the inverse of the prompt-injection work that dominates agent security today.

How the benchmark is built

The benchmark is 1,001 multi-hop research chains grounded in synthetic enterprise documents plus a controlled web corpus, split into 559 training, 98 validation, and 344 held-out test chains. Each chain interleaves local and web sub-questions with explicit dependencies, so solving it requires the agent to combine private and public information — exactly the condition under which mosaic leakage emerges.

The agent harness is deliberately simple, with four tools per step: Plan (generate local/web queries), Choose (pick documents to read), Read (answer the current hop), and Resolve (answer, read more, or replan). Evaluation uses normalized string matching on each hop, so both task success and leakage are scored on concrete, checkable outputs.

Why can't you just tell the agent to be careful?

This is the result worth internalizing. The team tested the obvious fixes, and the obvious fixes underperformed.

Prompting barely helps — and costs you accuracy. Adding privacy-aware instructions to the prompt nudged leakage down only modestly (from 34.0% to 25.5% for the Qwen3-4B baseline) while hurting task performance (strict chain success fell from 48.7% to 44.5%). The agent didn't learn to ask safer questions — it just asked fewer of them, getting quieter and dumber at the same time.

Training only for task success makes privacy worse. Optimizing the agent purely to solve more chains pushed strict success up to 59.3% — but leakage climbed to 51.7%. A more capable agent learned to pack more revealing context into each query. Raw capability and confidentiality were actively in tension.

The lesson: confidentiality isn't a knob you turn with words at inference time. It has to be shaped into how the agent constructs each query.

How do you actually reduce research agent data leakage?

MosaicLeaks' proposed answer is PA-DR (Privacy-Aware Deep Research), which combines two reward signals during training:

  1. A situational task reward that judges each call against other calls available at the same hop with the same information — rewarding the agent for searching the right source and retrieving the right document, rather than only scoring the final trajectory.
  2. A learned privacy reward — a classifier that estimates both the direct private-information leakage in the current query and the mosaic leakage created by adding that query to the existing log, and penalizes the larger risk.

The combination is the striking part:

Method Strict chain success Answer / full-info leakage
Base Qwen3-4B 48.7% 34.0%
Task reward only 59.3% 51.7%
Task + PA-DR 58.7% 9.9%

PA-DR cut leakage to 9.9% — below even the untrained baseline's 34.0% — while keeping task success essentially level with the task-only model. And it got there not by asking fewer questions but by asking better ones: PA-DR agents issued more web queries than the base model while dropping the revealing specifics (exact metrics, telltale dates, answer-shaped phrasing). Privacy came from query construction, not query reduction.

There's an efficiency kicker, too: the situational rewards reached outcome-reward-level task success with roughly 5–6x fewer training samples, so the privacy-aware recipe was also the cheaper one to train.

What this means for teams running research agents

You don't need to reproduce the paper to act on it. The transferable lessons:

  • Audit the query log, not just the answer. Your exfiltration surface is the outbound searches your agent emits. If you've never looked at them as an adversary would, start there.
  • Don't trust a privacy prompt. A "be careful with confidential information" system prompt is close to security theater here — it moved the needle a few points and degraded the task. Treat prompted privacy as a weak control, not a guarantee.
  • Assume the mosaic effect. Per-query review misses the risk; leakage is emergent across a session. Evaluate confidentiality over whole trajectories.
  • Make confidentiality an eval, not an afterthought. MosaicLeaks' real contribution is reframing secret-keeping as something you score. Bake a leakage metric into how you evaluate any agent that touches private data, the same way you'd track accuracy.

Frequently asked questions

What is the mosaic effect in AI agents? It's when individually harmless agent queries combine to reveal a confidential fact. No single search leaks the secret; the set of searches does.

Who built the MosaicLeaks benchmark? Researchers at ServiceNow (Alexander Gurung, Rafael Pardinas, and colleagues), published June 18, 2026, with an accompanying paper at arXiv:2605.30727.

Does prompting an agent to be private stop data leakage? Largely no. In MosaicLeaks, privacy prompting only modestly reduced leakage and hurt task performance — the agent asked fewer questions rather than safer ones.

Can AI agents keep secrets at all? With the right training they can keep them much better: reward-shaping (PA-DR) cut leakage more than 3x while preserving task success, where prompting and task-only training did not.

Takeaways for Clawvard readers

  • Research agent data leakage is measurable. MosaicLeaks turns "can your agent keep a secret?" into a benchmark with three concrete leakage levels.
  • The threat is the query log. Private documents needn't leave your perimeter for the secret to escape — the agent's public searches can reconstruct it via the mosaic effect.
  • Prompting fails; training works. Privacy instructions barely helped and hurt accuracy; rewarding how each query is built (PA-DR) cut leakage to 9.9% with task success intact.
  • Score confidentiality like accuracy. Add a leakage metric to your evaluation loop for any agent that mixes private and public data.

To put this into a broader testing practice, pair it with our AI agent evaluation guide for 2026, and for the wider threat picture see our AI agent security overview. If you're building research agents and want confidentiality evaluated as a first-class metric rather than bolted on later, that's the kind of rigor Clawvard is designed to support.

Related Articles