Can Your AI Agent Keep a Secret? A 2026 Guide to Agent Data Leakage and Real Evaluation

Your agent passed every benchmark on the leaderboard. It still leaked a secret. That gap — between scoring well and behaving safely on real work — is why AI agent security has stopped being a separate "safety" conversation and become an evaluation problem. In one week of June 2026, three independent signals converged: a ServiceNow benchmark showing research agents quietly leak private data through ordinary web searches, a Hugging Face study showing standard benchmarks miss how much work an agent actually does, and a reported Copilot vulnerability that let attackers steal two-factor codes. The lesson for anyone shipping agents is the same: the test that proves your model is smart is not the test that proves your agent is safe.

This guide turns those three failure modes into a single threat-and-evaluation framework — what to test, how to red-team it, and what a real agent benchmark looks like next to a vanity score.

Why agent security is now an evaluation problem, not just a safety one

A chatbot answers a question and stops. An agent plans, calls tools, reads private documents, fires off web requests, and feeds its own outputs back into the next step. Every one of those steps is a place where private information can escape — and none of them shows up in a single-shot accuracy number.

That is the core argument for treating AI agent security as evaluation: the dangerous behaviors are process behaviors. They live in the trajectory of tool calls, not in the final answer. If your evaluation only grades the answer, it is structurally blind to the most expensive failures. The rest of this article walks the three failure modes that the June 2026 research surfaced, then turns them into a checklist you can run against your own stack. (If you want the model-side companion to this, see our breakdown of how we'd pressure-test a frontier open-weights model in GLM-5.2 for agent builders.)

How do AI agents leak data? The mosaic-effect failure mode

The most quietly alarming result this week is MosaicLeaks, a benchmark from ServiceNow (huggingface.co/blog/ServiceNow/mosaicleaks). It measures the "mosaic effect": a deep-research agent combines private local documents with web search, and while no single web query gives anything away, the sequence of queries reassembles a secret for anyone watching the agent's outbound traffic.

The paper's own illustration is the clearest way to see it: "One references a cloud-migration milestone, one a January 2024 security disclosure, one narrows down which vendor got hit. No single query necessarily gives away the whole secret. But anyone watching the agent's outbound traffic can reassemble the fragments."

MosaicLeaks splits this into three measurable leakage types:

Intent leakage — an adversary infers the private research question from the query log alone.
Answer leakage — given the private question plus the query log, an adversary can answer it.
Full-information leakage — an adversary states verifiably true private claims from the query log alone, without even being told what to look for.

Why "just train it to be good" makes leakage worse

Here is the finding that should change how you think about this. On the benchmark's base agent (built on Qwen3-4B), strict task success was 48.7% and answer/full-information leakage was 34.0%. When the team trained the agent purely for performance, success rose to 59.3% — but leakage rose too, to 51.7%. As the authors put it, "a more informative query is often better for the task and worse for privacy." The model learned to pack more context into each search, which retrieved better documents and leaked more.

In other words: optimizing your agent for capability can actively increase its data-exfiltration surface. Capability and privacy are not the same axis, and improving one silently degraded the other.

You can't prompt your way out of it

The tempting fix is a system-prompt instruction: "don't reveal private details in searches." MosaicLeaks tested exactly that. A privacy-aware prompt cut leakage from 34.0% to 25.5% — but dropped task success from 48.7% to 44.5%, and never approached real safety. The authors' conclusion is blunt: "You can't prompt privacy in. You have to train it in."

Their privacy-aware training method (PA-DR) reached 9.9% leakage while keeping 58.7% task success — lower leakage than the untrained baseline, with most of the performance gains intact, and it did so without simply searching less (it issued more queries, just stripped of revealing specifics). The practical takeaway is not "use this exact method." It is: leakage is a trained-in property of the policy, and a prompt is not a control. One caveat worth keeping honest — the authors note MosaicLeaks is a controlled benchmark with synthetic enterprise documents and a fixed web corpus, not a measurement of any deployed system. Treat it as a lens, not a verdict on your stack.

Credentials are part of the leakage surface

Data leakage is not only about research summaries. Ars Technica reported a critical Copilot vulnerability that allowed attackers to steal users' 2FA codes (arstechnica.com). The specific mechanism is the publisher's to detail, but the category is the point: when an assistant has access to a user's context and can be steered by attacker-controlled input, the thing it leaks may be a one-time passcode, not a paragraph. Any data-leakage threat model that stops at "documents" is incomplete; credentials, tokens, and 2FA codes belong in the same column.

What does "agentic enough" actually mean?

The second failure mode is subtler: your benchmark says the agent succeeded, and it's technically right, but the number is hiding the cost. Hugging Face's "Is it agentic enough?" study makes this concrete (huggingface.co/blog/is-it-agentic-enough). Their framing: "Most benchmarks just look at the final answer. We wanted the whole process instead: not just whether the agent got it right, but how much work it took to get there."

Two agents can both return POSITIVE (0.9999) on the same task — one via a 40-line script that needed debugging, one via a single CLI command. Same answer, "very different profiles in cost, latency, token usage, and failures." A pass/fail benchmark scores them identically. A real agent benchmark does not.

Benchmarking on your own tooling vs. public leaderboards

The study's agent-eval harness tracks the trajectory, not the verdict: median time per task, token usage (new vs. cached vs. generated), runs-with-error rate (it flags silent failures like zero output tokens or no tool calls), and "marker adoption" — whether the agent actually used the affordance you built, e.g. your CLI.

The most important result for your own stack: an improvement can help big models and break small ones. Adding a packaged "skill" (CLI docs plus examples) helped larger models, but on Qwen3-14B the match rate on one task collapsed from 100% to 0% — the model mistook the CLI documentation for an executable tool and gave up instead of falling back to the working Python path. The authors' conclusion: "agent-facing APIs should be evaluated across model sizes, because a new affordance can reduce work for strong models while adding ambiguity for smaller ones." A change they "would likely have shipped as-is" was caught only because they measured the process.

For security, the connection is direct. The same trace-level visibility that catches a wasteful path is what catches a leaky one. If your evaluation can see token counts and tool-call sequences, it can also see a research agent issuing a suspicious cascade of narrowing queries.

How do you red-team an AI agent?

Static benchmarks describe normal behavior. Red-teaming probes adversarial behavior — and the tooling for it is maturing. Scout flagged a Hacker News launch of a model post-trained to "pen test instead of refusing" (Argus, argusred.com/cli): rather than declining offensive-security requests, it carries them out, which makes it usable as an automated test harness against your own systems. (Note: the only detail we're attributing here is that headline framing — treat the specific capabilities as unverified until you evaluate the tool yourself.)

The defensive read is the useful one. An offensive-security model is a way to generate the adversarial trajectories you need to evaluate against: prompt-injection payloads, tool-misuse attempts, and exfiltration probes, run at a volume no human red team can match. The pattern that's emerging across all three signals is consistent — you red-team an agent by running it against an adversary that is itself an agent, and grading the trace.

A practical agent eval + security checklist

Pull the three failure modes into one pre-ship pass:

Grade the trajectory, not just the answer. Log and score every tool call, token count, and outbound request. If your eval can't see the process, it can't see the leak.
Test for the mosaic effect. Don't ask "did one query leak the secret?" Ask "can the sequence of queries reconstruct it?" Inspect outbound traffic across a full task, not per-call.
Don't trust a privacy prompt as a control. Per MosaicLeaks, prompting cut leakage only marginally and cost accuracy. Treat prompt instructions as hints, and put real controls (training, egress filtering, query review) behind them.
Watch capability/privacy as two axes. Re-run your leakage eval every time you tune for performance — capability gains can raise leakage silently.
Evaluate across model sizes. An affordance that helps your big model can break a smaller one. If you serve multiple model tiers, benchmark each.
Put credentials in the threat model. 2FA codes, tokens, and API keys are leakable data. Test what your agent does when attacker-controlled text asks it to reveal or forward them.
Red-team with an adversarial agent. Generate injection and exfiltration trajectories at scale and grade how your agent handles them.

FAQ

How do I test if my AI agent leaks data?

Don't grade only the final answer — instrument the whole trajectory and inspect the agent's outbound calls across an entire task. MosaicLeaks shows leakage often comes from the combination of individually-harmless queries (the mosaic effect), so look at the sequence, not single requests, and measure intent, answer, and full-information leakage separately.

What's the difference between an agent benchmark and a model benchmark?

A model benchmark asks "did it get the right answer?" An agent benchmark asks "did it get the right answer, and at what cost in time, tokens, tool calls, and errors?" Per Hugging Face's "Is it agentic enough?", the process metrics catch failures — silent errors, wasted paths, small-model breakage — that a pass/fail score hides entirely.

Can prompt injection steal credentials like 2FA codes?

Yes — that's the category behind the reported Copilot vulnerability that allowed theft of 2FA codes (Ars Technica). When an assistant has access to user context and can be steered by attacker-controlled input, leakable data includes one-time passcodes and tokens, not just documents. Any agent threat model should treat credentials as part of the exfiltration surface.

Can I just add a privacy instruction to the system prompt?

It helps a little and not enough. MosaicLeaks found a privacy-aware prompt cut leakage from 34.0% to 25.5% while dropping task success — far from safe. Their conclusion: "You can't prompt privacy in. You have to train it in." Use prompts as a layer, not the control.

Takeaways for Clawvard readers

AI agent security is an evaluation discipline. The failures live in the trajectory of tool calls, so your eval has to see the trajectory.
Capability and privacy are different axes. Training for performance can raise data leakage; re-test leakage after every capability tune.
Prompts are hints, not controls. Privacy, like reliability, has to be built into the policy and the egress path.
Measure the process and measure across model sizes — that's what catches both the wasteful path and the leaky one.

When you're ready to evaluate a specific model under this framework, our companion piece walks the same eval lens over a frontier open-weights release: GLM-5.2: what the most powerful open-weights LLM means for agent builders. Building agents you actually need to trust in production? That's exactly what Clawvard is for — try it and bring this checklist with you, and follow our updates as the agent-eval tooling keeps moving.