AI Agent Security in 2026: How Agents Leak Data and the Defenses That Stop It

AI agent security stopped being a theoretical worry in June 2026. Within the same week, security researchers disclosed a one-click Microsoft Copilot exploit that could exfiltrate a victim's emails and two-factor codes, and a new benchmark showed that the better a research agent gets at its job, the more private data it can silently leak. If you're handing an autonomous agent access to your inbox, your documents, and a web connection, those two stories are the whole problem in miniature: agents combine privileged data, external tools, and untrusted input, and any one of those seams can become an exfiltration channel. This post maps the real risks — with concrete, sourced examples — and the defenses that actually move the needle.
What is AI agent security, and why is it different?
AI agent security is the practice of preventing autonomous, tool-using LLM systems from being manipulated into leaking data, taking unauthorized actions, or being hijacked by untrusted input. It's harder than securing a chatbot for one structural reason: an agent reads untrusted content (web pages, emails, documents), holds privileged context (your files, your credentials), and can act (call tools, send requests). That combination turns a classic injection problem into a data-exfiltration problem, because the model that gets tricked is the same model that holds the secrets and controls the outbound channel.
Two June 2026 disclosures make the threat model concrete: an external attack (the Copilot CVE) and an internal, almost accidental leak (the MosaicLeaks benchmark).
What was the Copilot 2FA vulnerability?
The flaw, dubbed SearchLeak and tracked as CVE-2026-42824, let an attacker steal emails, MFA codes, and password-reset links from any Microsoft 365 Copilot Enterprise Search user — triggered by a single click on a link pointing to a legitimate microsoft.com address. Ars Technica covered the disclosure, and Varonis Threat Labs published the technical write-up on June 15, 2026, after Microsoft quietly patched it in early June.
Per Varonis, SearchLeak chained three weaknesses:
- Parameter-to-Prompt (P2P) injection: the URL q parameter in Copilot Enterprise Search was passed straight to the model as an executable prompt.
- An HTML rendering race condition: an <img> tag in the AI's response fired before the output sanitizer could strip it.
- A CSP bypass via Bing SSRF: Bing's image-search endpoint, allowlisted in the Content Security Policy, performed a server-side fetch to an attacker-controlled URL.
Together those let the agent be instructed to find sensitive data and smuggle it out through an image request to a trusted domain. Notably, severity ratings disagreed: Varonis called it "critical," Microsoft scored it 6.5 (medium, citing the required click), and the National Vulnerability Database scored it 7.5 (high). That disagreement is itself a lesson — when an exploit depends on an agent's behavior rather than a simple memory bug, even experts disagree on how bad it is.
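One way to close the last two links of that chain is to filter the model's output before any of it reaches the renderer, so there is no window in which an image request can fire. The sketch below is a minimal illustration, not Microsoft's fix: the allowlist, the host names, and the helper names (image_allowed, ImageFilter, sanitize_images) are all assumptions made for this example. It drops any <img> tag whose source is not an https URL on an exact-match allowlist, and it also rejects query strings, since an allowlisted host that fetches or logs arbitrary parameters can still act as a relay.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist: hosts that only serve static assets.
ALLOWED_IMAGE_HOSTS = {"static.example.com"}


def image_allowed(src: str) -> bool:
    """Allow only https images from allowlisted hosts, with no query string."""
    try:
        parsed = urlparse(src)
    except ValueError:
        return False  # malformed URL: treat as hostile
    return (
        parsed.scheme == "https"
        and parsed.hostname in ALLOWED_IMAGE_HOSTS
        and not parsed.query  # query strings are an easy place to smuggle data
    )


class ImageFilter(HTMLParser):
    """Re-emits HTML while dropping <img> tags that fail image_allowed()."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and not image_allowed(dict(attrs).get("src") or ""):
            return  # drop the tag entirely
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <img ... />; the same rule applies.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped from the parser, so escape it again.
        self.parts.append(html.escape(data, quote=False))


def sanitize_images(model_output_html: str) -> str:
    """Run the filter over the full model output before rendering it."""
    parser = ImageFilter()
    parser.feed(model_output_html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

The important design choice is ordering rather than the parser itself: the renderer only ever receives the return value of sanitize_images, so a tag cannot fire while the sanitizer is still working. This sketch covers images only; a production filter would also need to handle links, stylesheets, and other elements that trigger requests.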
How do AI agents leak data without being hacked at all?
The scarier failure mode needs no attacker at all. The MosaicLeaks benchmark from ServiceNow measures privacy leakage in deep-research agents that combine private local documents with external web search. The agent never exposes its documents or reasoning directly; an observer sees only its outbound web-query log. And that's enough.
The researchers call it the "mosaic effect": no single query gives away a secret, but the queries reassemble into one. As the paper puts it, "No single query necessarily gives away the whole secret. But anyone watching the agent's outbound traffic can reassemble the fragments." In their example, an agent researching a healthcare firm searches separately for cloud-migration milestones, a security disclosure date, and vendor details — and the combined log reveals that a specific company had migrated 70% of its infrastructure to the cloud by a specific date.
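The mosaic effect is easy to see in a toy model. The sketch below is not the benchmark's scoring code: the company name, the queries, and the fact labels are all hypothetical, and in practice deciding which facts a query hints at is the hard part. It treats the secret as a set of facts that must not be linked, then checks each query alone and the log as a whole.

```python
# Hypothetical secret: three facts that are only sensitive in combination.
SECRET_FACTS = {"company", "migration_share", "date"}

# Hypothetical outbound query log. Each entry pairs a query with the
# secret facts it hints at (hand-labeled here for illustration).
QUERY_LOG = [
    ("acme health cloud vendor", {"company"}),
    ("healthcare 70% cloud migration milestone", {"migration_share"}),
    ("security disclosure march deadline", {"date"}),
]


def single_query_leaks(log, secret):
    """Return the queries that reveal the whole secret on their own."""
    return [query for query, facts in log if secret <= facts]


def combined_log_leaks(log, secret):
    """Return True if the facts across all queries cover the secret."""
    seen = set()
    for _, facts in log:
        seen |= facts
    return secret <= seen


print(single_query_leaks(QUERY_LOG, SECRET_FACTS))  # no single query is enough
print(combined_log_leaks(QUERY_LOG, SECRET_FACTS))  # but the full log is
```

A per-query filter passes every entry in this log, which is why the check has to consider the accumulated history rather than each request in isolation.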
The most counterintuitive finding: making the agent better made it leak more. Task-only training raised the benchmark's strict-chain success rate from 48.7% to 59.3% — but answer/full-information leakage climbed from 34.0% to 51.7%. Capability and privacy were, by default, in direct tension.
Can you just prompt an agent into being secure?
No — and the data is blunt about it. The MosaicLeaks authors conclude: "You can't prompt privacy in. You have to train it in. Telling an agent to be careful barely moves the needle, while rewarding how it constructs each query cuts leakage by more than 3x and leaves task success essentially intact."
Their proposed defense, Privacy-Aware Deep Research (PA-DR), combines a situational task reward with a learned privacy reward (a small classifier estimating whether a query leaks private information directly or via the mosaic effect). The result: leakage dropped to 9.9%, below both the 51.7% seen after task-only training and the untrained base model's 34.0%, while task success held at 58.7%. The lesson for builders is that guardrails baked into the agent's objective beat guardrails written into the system prompt.
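To make the idea of a combined objective concrete, here is a minimal sketch of one way a task reward and a per-query privacy penalty could be folded into a single training signal. This is an assumption-laden illustration, not the PA-DR formula: the subtraction, the averaging, the privacy_weight parameter, and the toy scorer standing in for the learned classifier are all invented for this example.

```python
from typing import Callable, Sequence

# A leak scorer sees the current query plus everything sent before it,
# so it can penalize fragments that only leak in combination.
LeakScorer = Callable[[str, Sequence[str]], float]


def combined_reward(
    task_reward: float,
    queries: Sequence[str],
    leak_probability: LeakScorer,
    privacy_weight: float = 1.0,  # hypothetical trade-off knob
) -> float:
    """Task reward minus the weighted mean leak score over all queries."""
    if not queries:
        return task_reward
    penalty = 0.0
    for i, query in enumerate(queries):
        penalty += leak_probability(query, queries[:i])
    return task_reward - privacy_weight * (penalty / len(queries))


def toy_leak_probability(query: str, history: Sequence[str]) -> float:
    """Stand-in for a learned classifier: fraction of sensitive terms
    that appear anywhere in the log so far (hypothetical terms)."""
    sensitive_terms = ("acme", "70%")
    seen = " ".join([*history, query]).lower()
    hits = sum(term in seen for term in sensitive_terms)
    return hits / len(sensitive_terms)
```

With this shape, a generic query such as "cloud migration best practices" keeps the full task reward, while a sequence that names the company and then the percentage is penalized more on the second query than on the first. For scale, the reported drop from 34.0% to 9.9% works out to roughly a 3.4x reduction in leakage.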
What is agent runtime governance?
Prompt-level instructions and trained-in behavior reduce risk but can't guarantee it. That's the gap a separate line of research targets with runtime governance — enforcing policy outside the model entirely. The paper "Deontic Policies for Runtime Governance of Agentic AI Systems" argues that today's policy engines (XACML, Rego, Cedar) only express basic permit/prohibit rules, which is too thin for agents that invoke tools and coordinate across organizational boundaries.
Its deontic approach expresses the fuller governance structure: not just what an agent is permitted or prohibited from doing, but obligations triggered after an action, conditions for waiving them, and rules for resolving conflicts when policies clash. The proposed framework (built on the Rei policy language and expressed in OWL) is "evaluated at runtime by a high-performance logic engine entirely outside the LLM," governing both tool invocations and agent-to-agent messages. The paper offers illustrative examples of expressiveness rather than a quantitative benchmark — so treat it as a design direction, not a measured result — but the architectural principle is sound: a deterministic policy engine the model cannot talk its way past.
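A toy version of that structure fits in a few lines. The sketch below is not the paper's Rei/OWL framework or its logic engine; it is a plain-Python illustration in which the Rule fields, the priority-based conflict strategy, and the example tool and domain names are all assumptions. It shows the three ingredients the paragraph names: permit and prohibit rules, obligations with a waiver condition, and a deterministic way to resolve conflicts.

```python
from dataclasses import dataclass
from typing import Callable


def never(action: dict) -> bool:
    """Default waiver condition: the obligation is never waived."""
    return False


@dataclass
class Rule:
    modality: str                      # "permit", "prohibit", or "oblige"
    applies: Callable[[dict], bool]    # does this rule match the action?
    priority: int = 0                  # higher wins when rules conflict
    obligation: str = ""               # follow-up required by "oblige" rules
    waived_if: Callable[[dict], bool] = never


def evaluate(action: dict, rules: list[Rule]) -> tuple[bool, list[str]]:
    """Decide an action outside the model: (allowed, obligations)."""
    matching = [r for r in rules if r.applies(action)]
    deciders = [r for r in matching if r.modality in ("permit", "prohibit")]
    # Assumed conflict strategy: highest priority wins, a tie that
    # includes a prohibition is denied, and no matching rule means deny.
    allowed = False
    if deciders:
        top = max(r.priority for r in deciders)
        winners = [r for r in deciders if r.priority == top]
        allowed = all(r.modality == "permit" for r in winners)
    if not allowed:
        return False, []
    obligations = [
        r.obligation
        for r in matching
        if r.modality == "oblige" and not r.waived_if(action)
    ]
    return True, obligations


# Hypothetical policy for an email tool.
RULES = [
    Rule("permit", lambda a: a["tool"] == "send_email"),
    Rule(
        "prohibit",
        lambda a: a["tool"] == "send_email"
        and not a["recipient"].endswith("@corp.example"),
        priority=10,
    ),
    Rule(
        "oblige",
        lambda a: a["tool"] == "send_email",
        obligation="write_audit_log",
        waived_if=lambda a: a.get("dry_run", False),
    ),
]
```

Calling evaluate on an internal send returns an allow decision plus the audit obligation, an external send is denied by the higher-priority prohibition, and a dry run is allowed with the obligation waived. Because the decision is ordinary code operating on a structured action, nothing in the model's output can argue with it.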
How do you defend AI agents in production?
No single control is sufficient. The three pieces of work above map onto three defensive layers, and a fourth control backstops them all:
- Treat all tool input as untrusted (against injection). SearchLeak started with a URL parameter passed straight into the prompt. Sanitize and isolate any external content — web pages, emails, URL parameters — before it reaches the model, and constrain how rendered output (images, links) can make outbound requests.
- Train and reward privacy behavior, don't just instruct it (against silent leaks). MosaicLeaks shows that prompts barely help and that added capability can make leakage worse; rewarding how the agent constructs each query cut leakage by more than 3x while leaving task success essentially intact.
- Enforce policy outside the model (against everything else). A runtime governance layer with deontic policies — permissions, obligations, and conflict resolution evaluated by a deterministic engine — gives you a guarantee the model's own behavior can't.
- Watch the outbound channel. The common thread across all three is exfiltration. Monitoring and constraining what an agent can send out — to which domains, in what form — is the backstop when the upstream layers fail.
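The outbound backstop in the last bullet can start as a small gate in front of every request the agent makes. The sketch below is a minimal illustration under stated assumptions: the allowlisted host, the length cap, and the secret-like patterns are hypothetical placeholders, and pattern matching of this kind catches only obvious cases, so it complements rather than replaces the layers above.

```python
import re
from urllib.parse import unquote, urlparse

# Hypothetical policy values; tune these for a real deployment.
ALLOWED_EGRESS_HOSTS = {"api.search.example"}
MAX_URL_LENGTH = 512
SECRET_PATTERNS = [
    re.compile(r"\b\d{6}\b"),                  # looks like a one-time code
    re.compile(r"reset[-_ ]?token", re.I),     # looks like a reset link
]


def egress_allowed(url: str) -> tuple[bool, str]:
    """Return (allowed, reason) for an outbound request URL."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False, "malformed url"
    if parsed.scheme != "https":
        return False, "non-https scheme"
    if parsed.hostname not in ALLOWED_EGRESS_HOSTS:
        return False, "host not allowlisted"
    if len(url) > MAX_URL_LENGTH:
        return False, "url too long"
    decoded = unquote(url)  # check the decoded form, not the escaped one
    for pattern in SECRET_PATTERNS:
        if pattern.search(decoded):
            return False, "secret-like content in url"
    return True, "ok"
```

A gate like this belongs in the tool or network layer rather than in the prompt, so that it still applies when an injection has already succeeded. Logging every denial with its reason also gives you the outbound record that a leak investigation needs.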
Takeaways for Clawvard readers
- The seam is data + tools + untrusted input. Every June 2026 disclosure is a variation on an agent being induced to move privileged data out through a legitimate channel.
- Capability and safety can pull against each other. A more capable research agent leaked more by default — so evaluate agents for safety, not just for task success.
- Layer your defenses. Input isolation, trained-in privacy behavior, and an out-of-model governance engine each cover what the others miss.
- Assume injection will get through. Design for the case where a prompt injection succeeds, and constrain the outbound blast radius accordingly.
Securing an agent starts with evaluating it for safety, not just capability. See our guide to evaluating AI agents for how to fold these risks into your testing. To pressure-test your own agents against leakage and injection before they ship, try Clawvard, and follow our research updates, because the agent-security field moves fast.