Can Your AI Agent Keep a Secret? Testing Agents for Data Leakage

Can Your AI Agent Keep a Secret? Testing Agents for Data Leakage
In June 2026, two events landed within weeks of each other that should reframe how you think about agent quality. ServiceNow researchers released MosaicLeaks, a benchmark built around a blunt question — can your research agent keep a secret? — and Microsoft patched a critical Microsoft 365 Copilot vulnerability, SearchLeak, that let an attacker exfiltrate a victim's emails and even 2FA codes with a single click. Different contexts, same lesson: an agent that's perfectly capable can still leak the sensitive data it has access to. Capability evals don't catch that. You have to test for it directly.
This is a practical guide to data-leakage and secret-keeping testing for agents — what it is, why it's distinct from "how good is my agent," and how to add it to your QA pipeline.
Why capability evals don't catch data leakage
A capability eval asks: did the agent get the task right? A leakage eval asks a completely different question: what did the agent expose while getting it right? The first ignores the second entirely — and worse, optimizing for the first can make the second worse.
MosaicLeaks demonstrates this directly. When the researchers trained an agent purely for task success, its strict chain-success rate rose from 48.7% to 59.3% — and its answer/full-information leakage climbed right alongside it, from 34.0% to 51.7%. Getting better at the task made the agent leak more, not less. That's the trap: if your only metric is competence, you can ship an agent that is simultaneously more useful and more dangerous, and never see the second half on your dashboard.
What is agent secret-keeping / data-leakage testing?
Agent data leakage is when an agent exposes information it should have kept private — internal documents, credentials, or another party's data — through any channel it can write to. The subtle part, which MosaicLeaks names the "mosaic effect," is that leakage doesn't require dumping a secret in one shot.
Consider their worked scenario: a research agent at a healthcare firm answers routine questions by firing off web searches. No single query gives anything away — but an observer watching the outbound traffic can reassemble fragments like "a cloud-migration milestone," a dated "security disclosure," and "which vendor got hit" into a private fact that existed only in internal documents. The agent never intended to leak; it leaked by being helpful with sensitive context in hand. Secret-keeping testing measures exactly this: not just whether the final answer is clean, but whether the agent's behavior — its queries, tool calls, and side effects — discloses what it was trusted to protect.
How do you test an agent for data leakage?
Threat scenarios
Start by deciding who the adversary is and what they can observe. MosaicLeaks models a sharp, realistic case: an adversary who sees only the agent's web-query log — never the private documents and never the agent's reasoning — and tries to infer secrets from that traffic alone. From that vantage point it defines three distinct leakage types worth measuring separately:
- Intent leakage — from the query log alone, the adversary can infer the private questions or goals the agent was pursuing.
- Answer leakage — the query log plus a question lets the adversary answer that private question without ever seeing the documents.
- Full-information leakage — from the query log alone, the adversary can state verifiably true private claims.
Map these to your own deployment: your "query log" might be outbound API calls, tool invocations, log lines, or messages to other agents. Anywhere the agent's behavior is observable is a potential leak channel.
What MosaicLeaks measures
Concretely, MosaicLeaks is a dataset of 1,001 multi-hop research chains over local enterprise documents and a controlled web corpus, split into training, validation, and held-out-company test sets. Each chain "interleaves local and web sub-questions" so the agent must retrieve private information before it can form its next external query — deliberately engineering the conditions under which a careless agent leaks. Two findings make it a durable reference rather than a one-off:
- Telling the agent to behave isn't enough. Adding a prompt instruction not to leak "helps slightly for some models, but its effect is inconsistent and significant leakage remains." Prompt-based guardrails are not a control you can rely on.
- You can fix it without crippling the agent. Their Privacy-Aware Deep Research (PA-DR) approach cut answer/full-information leakage from 34.0% to 9.9% while improving task success (48.7% → 58.7%) — evidence that privacy and capability aren't a strict trade-off if you train for both.
The transferable method: construct tasks that force the agent to handle private data en route to a goal, then evaluate the agent's observable behavior — not just its answer — for each leakage type.
What real-world agent leaks have happened?
The risk isn't hypothetical. The Microsoft 365 Copilot SearchLeak vulnerability (CVE-2026-42824), disclosed by Varonis Threat Labs and patched by Microsoft, is the cautionary tale. A victim only had to click a crafted link — they typed nothing. Because the assistant operated with the user's full data permissions, the attacker effectively inherited that access and could have Copilot search the victim's mailbox and smuggle the contents out (including sensitive data such as 2FA codes), without ever authenticating themselves.
The mechanism is the part agent builders should internalize: it combined an AI-specific prompt-injection technique with conventional web bugs to turn a trusted assistant into a data-exfiltration channel. An AI system that (a) holds broad permissions and (b) acts on untrusted input is an exfiltration risk by construction — which is precisely the shape of most useful agents. (Originally disclosed by Varonis.)
How do you add leakage testing to your agent QA pipeline?
Treat secret-keeping as a first-class test suite alongside capability evals, not an afterthought:
- Inventory what the agent can touch. List every sensitive data source, credential, and permission in the agent's reach. You can't test for leaking data you haven't acknowledged it can see.
- Build leakage tasks, not just success tasks. Following MosaicLeaks, craft scenarios that force the agent to handle private data on the way to a goal, then check whether private facts surface through its behavior.
- Watch every observable channel. Inspect outbound queries, tool-call arguments, logs, and inter-agent messages — not only the final answer. A secret leaked into a tool parameter is still a breach.
- Treat untrusted input as hostile. The Copilot case shows prompt injection turning permissions into a weapon. Test with adversarial inputs (malicious links, poisoned documents, crafted tool results), not just cooperative ones.
- Don't rely on a "don't leak" instruction. MosaicLeaks shows prompting alone is inconsistent. Verify with measurement, and constrain what the agent can actually do — least-privilege access and egress controls beat good intentions.
- Re-test after capability tuning. Since training for performance can increase leakage, run the leakage suite again every time you make the agent "better."
Takeaways for teams shipping agents
- Capability and safety are different axes. A more competent agent can be a leakier one — measure both, separately.
- Leakage is often a mosaic. Secrets escape across many innocuous-looking actions, not one obvious dump; test the agent's behavior, not just its output.
- Prompts aren't a control. "Please don't leak" is inconsistent; rely on measurement plus least-privilege and egress limits.
- Untrusted input + broad permissions = exfiltration risk. SearchLeak shows how a helpful assistant becomes a one-click data thief; assume your agent is a target.
- Make leakage testing part of CI. Re-run it whenever capability changes — that's exactly when new leaks appear.
The agents that earn trust in production aren't just the capable ones — they're the ones whose builders proved they can keep a secret.
Want to build and ship agents you can actually trust? Try Clawvard, and follow our updates for ongoing agent-security and evaluation coverage.