How to Evaluate an AI Agent: Tool-Use Capability and Data-Leakage Risk

Most teams test an AI agent the same way they test a model: feed it a task, check whether the final answer is right. That's not enough. Two agents can return the identical correct answer while one quietly burns 10× the tokens and the other quietly leaks a customer's private data to a web search. Two evaluation pieces published in June 2026, Hugging Face's "Is it agentic enough?" and ServiceNow's MosaicLeaks, make the case that agent evaluation comes down to two axes that both matter: capability on your own tooling, and security under real data. This guide combines them into one practitioner framework.
If you're shipping agents to production, this is the gap between "the demo worked" and "we can trust it with real users." Here's how to measure both.
Why evaluating agents is different from evaluating models
A model produces an answer. An agent produces a trajectory: it plans, calls tools, reads results, recovers from errors, and only then answers. As Hugging Face's "Is it agentic enough?" puts it, scoring the final answer alone misses everything that happens in between. Their example: one agent solves a task by writing a 40-line Python script with several debugging cycles; another does it in a single CLI command. Both succeed. They differ enormously in cost, latency, token usage, and failure modes, and none of those differences shows up if you grade only the answer.
So agent evaluation needs to measure how much work it took to get there, and separately, what risks the agent took along the way. Capability and security are different questions, and a passing grade on one tells you nothing about the other.
Axis 1 — Capability: is your agent "agentic enough" on your own tooling?
The most important word in that question is your. An agent's score on a public benchmark says little about how it behaves against your internal APIs, your CLIs, and your docs. Hugging Face's harness builds the task suite from the actual tooling under test.
Building a task suite from your real tools
Their agent-eval setup evaluates agents across three tiers of how much help the tool provides:
- bare — the agent gets only the package installed (e.g. pip install transformers), no extra guidance.
- clone — the full source repository sits in the working directory.
- skill — a packaged "Skill" with curated CLI docs and task examples.
Each run tests one combination of (model × revision × task) on identical hardware, so differences are attributable to the change you made — not to noise. That last point is the discipline most home-grown agent tests skip.
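To make the "one combination per run" discipline concrete, here is a minimal sketch of enumerating the run matrix up front. It is not the Hugging Face harness: `RunSpec`, `build_run_matrix`, and the model and revision names are hypothetical, and treating the tier as a fourth dimension alongside model, revision, and task is an assumption about how the pieces fit together.

```python
from dataclasses import dataclass
from itertools import product

# Tier names come from the text; every other name here is a hypothetical placeholder.
TIERS = ("bare", "clone", "skill")


@dataclass(frozen=True)
class RunSpec:
    model: str
    revision: str
    task: str
    tier: str


def build_run_matrix(models, revisions, tasks, tiers=TIERS):
    """Enumerate each (model, revision, task, tier) combination exactly once."""
    return [
        RunSpec(model, revision, task, tier)
        for model, revision, task, tier in product(models, revisions, tasks, tiers)
    ]


matrix = build_run_matrix(
    models=["large-model", "small-model"],   # hypothetical model names
    revisions=["baseline", "candidate"],     # hypothetical revision labels
    tasks=["classify-sentiment"],
)
print(len(matrix))  # 2 models x 2 revisions x 1 task x 3 tiers = 12 runs
```

Listing the matrix before running anything makes it easy to confirm that two results differ in exactly one field, which is what lets you attribute a difference to the change you made.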
What to measure (success rate, steps, recovery)
The harness tracks more than correctness:
- Match % — did the final answer contain the expected result?
- Median time and tokens — execution efficiency, splitting new vs. cached vs. generated tokens.
- Error rate % — runs that produced nothing (zero output tokens): silent failures that a pass/fail check on the answer would never surface.
- Marker adoption — which strategy the agent actually chose (e.g. calling a CLI tool vs. writing Python).
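The four metrics above can be computed from per-run records with a few lines of code. The sketch below is illustrative only: the `RunResult` fields and the sample values are assumptions, not the harness's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class RunResult:
    matched: bool        # final answer contained the expected result
    seconds: float       # wall-clock time for the run
    new_tokens: int      # uncached input tokens
    cached_tokens: int   # input tokens served from cache
    output_tokens: int   # generated tokens; 0 is treated as a silent failure
    marker: str          # strategy observed, e.g. "cli" or "python"


def summarize(runs):
    """Aggregate match %, error %, medians, and strategy counts for a list of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "match_pct": 100.0 * sum(r.matched for r in runs) / n,
        "error_pct": 100.0 * sum(r.output_tokens == 0 for r in runs) / n,
        "median_seconds": median(r.seconds for r in runs),
        "median_new_tokens": median(r.new_tokens for r in runs),
        "median_cached_tokens": median(r.cached_tokens for r in runs),
        "median_output_tokens": median(r.output_tokens for r in runs),
        # Count strategies only for runs that actually produced output.
        "marker_adoption": dict(Counter(r.marker for r in runs if r.output_tokens > 0)),
    }


# Made-up sample data for illustration.
runs = [
    RunResult(True, 10.0, 4000, 1000, 300, "cli"),
    RunResult(True, 20.0, 6000, 500, 500, "python"),
    RunResult(False, 30.0, 5000, 0, 0, "none"),      # silent failure
    RunResult(False, 40.0, 7000, 0, 200, "python"),
]
summary = summarize(runs)
print(summary["match_pct"], summary["error_pct"])
```

Keeping the error rate separate from the match rate matters here: the third run and the fourth run both count as "not matched", but only one of them produced nothing at all.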
The findings are a warning against one-size-fits-all tooling. Adding a CLI-plus-Skill affordance helped large models (Kimi, GLM-5.1, MiniMax-M2.7) by cutting median time — but raised their token use on the clone tier from roughly 4k to 6.4k new tokens as they read the new docs. For small models it backfired: Qwen3-14B's classify-sentiment accuracy collapsed from 100% on the clone tier to 0% on the Skill tier, because the model mistook the Skill's documentation for an executable tool and gave up when it couldn't "call" it. The lesson Hugging Face draws: agent-facing APIs must be evaluated across model sizes, because the same affordance that streamlines a strong model can break a weaker one — something conventional testing would miss entirely.
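One way to catch this kind of regression automatically is to compare each model's match rate between two tiers and flag large drops. This is a rough sketch: `tier_regressions` and its 5-point tolerance are hypothetical, the Qwen3-14B figures are the ones quoted above, and the "large-model" numbers are invented purely for contrast.

```python
def tier_regressions(match_pct, baseline="clone", candidate="skill", tolerance=5.0):
    """Return {model: drop} for models whose match % falls by more than `tolerance`
    percentage points when moving from the baseline tier to the candidate tier."""
    flagged = {}
    for model, by_tier in match_pct.items():
        if baseline in by_tier and candidate in by_tier:
            drop = by_tier[baseline] - by_tier[candidate]
            if drop > tolerance:
                flagged[model] = drop
    return flagged


match_pct = {
    # From the text: classify-sentiment, 100% on clone and 0% on skill.
    "Qwen3-14B": {"clone": 100.0, "skill": 0.0},
    # Hypothetical values, not reported results.
    "large-model": {"clone": 95.0, "skill": 97.0},
}
flagged = tier_regressions(match_pct)
print(flagged)
```

Running the same check per model, rather than on an average across models, is the point: a pooled score could hide a collapse in the smallest model behind gains in the larger ones.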
Axis 2 — Security: can your agent keep a secret?
Capability tells you whether the agent can do the job. It says nothing about whether the agent can be trusted with the data the job touches. ServiceNow's MosaicLeaks tackles exactly that: can a deep-research agent that mixes private local documents with external tools (like web search) avoid leaking what it knows?
Data-leakage failure modes
The danger MosaicLeaks names is the mosaic effect: individually harmless web queries that, taken together, reveal sensitive information. An agent researching a confidential question can leak that question — or its answer — through the trail of public searches it issues, even if no single query looks sensitive. MosaicLeaks grades this at three levels:
- Intent leakage — an adversary infers what the agent was investigating.
- Answer leakage — an adversary can answer the private question from the query log alone (given they know what to ask).
- Full-information leakage — an adversary states verifiably true private facts from the query log alone, without being told what to look for.
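The mosaic effect can be illustrated with a deliberately crude lexical proxy: measure how much of a set of sensitive terms any single query exposes, versus how much the whole query log exposes together. This is not how MosaicLeaks grades leakage; `mosaic_exposure`, the sensitive terms, and the queries are all hypothetical, and keyword overlap will miss paraphrases that a real adversary would catch.

```python
import re


def _terms(text):
    """Lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def mosaic_exposure(queries, sensitive_terms):
    """Compare the worst single-query exposure with the exposure of the full log.

    Exposure is the fraction of sensitive terms that appear in the text.
    """
    sensitive = {t.lower() for t in sensitive_terms}
    if not sensitive:
        raise ValueError("sensitive_terms must not be empty")
    per_query = [len(sensitive & _terms(q)) / len(sensitive) for q in queries]
    combined_terms = set().union(*map(_terms, queries))
    return {
        "max_single_query": max(per_query, default=0.0),
        "combined": len(sensitive & combined_terms) / len(sensitive),
    }


# Hypothetical private question: "Will Acme's Q3 acquisition lead to layoffs?"
sensitive_terms = ["acme", "acquisition", "q3", "layoffs"]
queries = [
    "acme corp recent news",
    "typical acquisition timeline q3",
    "how companies announce layoffs",
]
exposure = mosaic_exposure(queries, sensitive_terms)
print(exposure)
```

In this toy log no single query exposes more than half of the sensitive terms, yet the log as a whole exposes all of them, which is the pattern the three leakage levels are designed to detect.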
Testing for secret-keeping
MosaicLeaks operationalizes this with 1,001 multi-hop research chains over enterprise documents and a controlled web corpus (split 559 train / 98 validation / 344 test), run through a tool-use harness with Plan, Choose, Read, and Resolve stages. The findings are sobering:
- A base Qwen3-4B agent hit only 48.7% strict chain success while leaking sensitive information at the answer/full-information level 34.0% of the time.
- Privacy-aware prompts barely helped and hurt task performance — you cannot prompt your way to a private agent.
- Training only for task success actually increased leakage, from 34.0% to 51.7%. Optimizing capability alone actively degrades security.
- Their privacy-aware training approach (PA-DR) cut leakage to 9.9% while lifting strict chain success to 58.7% — and reached comparable results with 5–6× fewer training samples than outcome-only methods.
The core insight transfers to anyone running agents on sensitive data: privacy has to be built into how the agent chooses each action, with every tool call weighed against what it would reveal, not bolted on as an instruction at the end.
A simple combined evaluation checklist
Before an agent reaches production, score it on both axes:
- Capability: Build tasks from your tools, not public benchmarks. Measure success rate, steps/tokens, and silent-failure rate. Test across the model sizes you might actually deploy — an affordance that helps your big model can break your small one.
- Security: Test secret-keeping, not just correctness. Look for intent, answer, and full-information leakage across the agent's tool-call trail. Assume prompts won't fix it; verify with adversarial query-log analysis.
- Watch the trade-off: Optimizing capability alone can increase leakage. Track both metrics together, every iteration, so a gain on one axis doesn't quietly cost you the other.
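The checklist can be turned into a simple release gate that looks at both axes at once. The sketch below is one possible shape, not a prescribed method: `Scorecard`, `compare`, and the default thresholds are hypothetical. The base and PA-DR figures are the ones quoted earlier from MosaicLeaks; "iteration-B" is invented to show a capability gain that costs security.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    name: str
    success_pct: float   # capability axis, e.g. strict chain success
    leakage_pct: float   # security axis, e.g. answer/full-information leakage


def compare(baseline, candidate, min_success_gain=0.0, max_leakage_increase=0.0):
    """Report the change on both axes and whether the candidate passes the gate."""
    success_delta = round(candidate.success_pct - baseline.success_pct, 1)
    leakage_delta = round(candidate.leakage_pct - baseline.leakage_pct, 1)
    return {
        "success_delta": success_delta,
        "leakage_delta": leakage_delta,
        "ship": success_delta >= min_success_gain
        and leakage_delta <= max_leakage_increase,
    }


base = Scorecard("base Qwen3-4B", success_pct=48.7, leakage_pct=34.0)   # from the text
pa_dr = Scorecard("PA-DR", success_pct=58.7, leakage_pct=9.9)           # from the text
iteration_b = Scorecard("iteration-B", success_pct=55.0, leakage_pct=45.0)  # hypothetical

print(compare(base, pa_dr))
print(compare(base, iteration_b))
```

Requiring both conditions in a single check means an iteration that improves success while raising leakage fails the gate, instead of passing on the strength of the metric you happened to be watching.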
FAQ
How do I benchmark an AI agent?
Build the task suite from your own tooling rather than relying on public leaderboards, then measure the full trajectory — match rate, median time and tokens, and silent-failure (zero-output) rate — across the model sizes you'd actually deploy. Hugging Face's agent-eval harness is a concrete template for this approach.
What is agent data leakage?
It's when an agent exposes sensitive information through its external actions — most subtly via the "mosaic effect," where a series of individually benign web queries collectively reveal a private question, its answer, or verifiable private facts. MosaicLeaks grades this as intent, answer, and full-information leakage.
How do I test if an agent leaks secrets?
Give the agent tasks that mix private documents with external tools, then analyze its query/tool-call log the way an adversary would: can you infer what it was researching, answer the private question, or extract true private facts from the log alone? MosaicLeaks shows privacy-aware prompts barely help, so test behavior, not stated intentions.
Which open models are best for agentic tool use?
There's no universal answer — it depends on your tools and model size. The key finding from Hugging Face's evaluation is that the same affordance can help a large model (Kimi, GLM-5.1, MiniMax-M2.7) and break a small one (Qwen3-14B), so the only reliable answer comes from evaluating candidates on your own task suite.
Takeaways for Clawvard readers
- "The demo worked" is not an evaluation. Score agents on capability and security before production, and measure the full trajectory, not just the final answer.
- Build your capability tests from your own tooling, and evaluate across model sizes — affordances don't transfer cleanly from large models to small ones.
- Treat security as a first-class metric: optimizing for task success alone can increase data leakage, and prompts won't fix it. Test secret-keeping directly.
Capability and security are two sides of the same readiness question. If you're building agents you intend to trust with real data and real tools, try Clawvard for your agentic workflows, and follow our updates for more hands-on evaluation guides.