Agent Skills Best Practices: How to Structure Them (and the Mistakes to Avoid)

The fastest way to understand agent skills best practices is to start with what they're not: a skill is not a place to dump an LLM's answer to a problem it just failed to solve. Skills are structured, reusable packages of procedural knowledge — and most of the value (or waste) comes from how you scope and describe them, not from the model behind them. This guide covers when a skill is actually worth writing, how to structure one, how to benchmark whether it's helping, and how a new open standard lets agents discover skills at runtime.

This matters now because the practice is being actively re-litigated. This week's debate ("You're probably using Agent Skills wrong") and two Hugging Face engineering pieces, one on benchmarking agents on your own tooling and one on agentic resource discovery, converged on the same uncomfortable point: a skill that helps a strong model can actively hurt a weaker one. Structure is the variable.

What are Agent Skills (and why most people misuse them)

An Agent Skill is a Markdown-based package of procedural knowledge stored in a project folder (for example, .claude/skills/). Each skill bundles metadata, documentation, and optional supporting tools or references. A typical layout looks like this:

monitor-gitlab-ci/
├── SKILL.md
├── monitor_ci.sh
└── references/
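
To make that layout concrete, here is a minimal Python sketch that checks a skill folder before you commit it. It assumes the metadata lives in a front-matter block at the top of SKILL.md with name and description fields, which this guide does not spell out, and the description text in the demo is made up for illustration.

```python
from pathlib import Path
import tempfile


def read_front_matter(skill_md: Path) -> dict[str, str]:
    """Parse simple `key: value` pairs between the leading '---' lines."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def check_skill(skill_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the layout looks sane."""
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return ["missing SKILL.md"]
    meta = read_front_matter(skill_md)
    # Assumed required fields; adjust to whatever your agent actually reads.
    return [
        f"missing '{field}' in front matter"
        for field in ("name", "description")
        if not meta.get(field)
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        skill = Path(tmp) / "monitor-gitlab-ci"
        (skill / "references").mkdir(parents=True)
        (skill / "SKILL.md").write_text(
            "---\n"
            "name: monitor-gitlab-ci\n"
            "description: Watch a GitLab CI pipeline and report failed jobs.\n"
            "---\n\n# Steps\n",
            encoding="utf-8",
        )
        print(check_skill(skill))  # prints []
```

A check like this only catches missing pieces. It says nothing about whether the skill deserves to exist, which is the harder question.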

The most common misuse, per Anson Biggs's "You're probably using Agent Skills wrong," is asking a fresh agent to generate a skill for a problem it can't currently solve. As Biggs puts it, this is "identical to thinking blocks": the agent has no foundational knowledge to encode, so the "skill" captures nothing durable. A related bad practice is taking an LLM's response to a question and saving it verbatim as a skill, without contextual refinement.

The clarifying rule from the same piece: "there are two reasons to make a skill — remembering a novel problem, and avoiding repetition." If a skill doesn't do one of those, it shouldn't exist.

The common mistakes

Over-broad skills vs scoped skills

A skill that tries to cover everything routes poorly and crowds the context. Biggs frames the valid use cases narrowly — for context (teaching a stateless agent project-specific patterns it would otherwise rediscover by trial and error), for repetition (documenting frequently performed tasks so you stop retyping instructions), and for hard problems (capturing insights from a genuinely difficult problem after you've resolved it). Each of those is specific. A skill that doesn't map to one of them is usually too broad to help.

Poor descriptions = poor routing

A skill is only useful if the agent reaches for it at the right moment, and that depends entirely on its description and discoverability. This is the same lesson the Hugging Face team drew from building agent-facing tools, summed up in two principles: "If it isn't tested, then it doesn't work," and "If it isn't documented, then it doesn't exist." Discoverability requires clear interfaces and well-structured documentation — a vague skill description is, functionally, an invisible skill.
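
A toy router shows why the description carries so much weight. The sketch below is a deliberately naive stand-in: it scores skills by word overlap between the task and the description, whereas a real agent lets the model decide. The skill descriptions and the task are invented for illustration.

```python
import re
from typing import Optional


def tokens(text: str) -> set[str]:
    """Lowercase word set, ignoring words of one or two characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


def route(task: str, skills: dict[str, str]) -> Optional[str]:
    """Pick the skill whose description shares the most words with the task.

    Returns None when no description overlaps with the task at all.
    """
    task_words = tokens(task)
    scores = {name: len(task_words & tokens(desc)) for name, desc in skills.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


task = "The GitLab pipeline for my merge request has failed jobs, find out why"

vague = {"monitor-gitlab-ci": "Helps with CI stuff."}
precise = {
    "monitor-gitlab-ci": (
        "Watch a GitLab pipeline, wait for jobs to finish, "
        "and report failed jobs with their logs."
    )
}

print(route(task, vague))    # prints None
print(route(task, precise))  # prints monitor-gitlab-ci
```

Same skill, same task. Only the description changed, and with the vague one the skill is never selected.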

A structure that actually works

Putting the sources together, a skill that earns its place tends to share four traits:

A single, clear purpose that maps to one of the three use cases above: context, repetition, or a solved hard problem.
A precise description so the agent routes to it correctly, plus any references it needs to act.
Encoded knowledge the agent couldn't independently discover — not a re-statement of capabilities it already has.
A way to verify it helps rather than assuming it does (covered next).

That third point connects to a broader theme in our own research: agent reliability is largely an execution problem, not an intelligence one. Most failures come from how work is scoped, described, and sequenced, which is exactly what a well-structured skill addresses. We go deep on that in "We tested 45,000 AI agents: the bottleneck is execution."

How do you benchmark an agent on your own tooling?

Intuition about whether a skill helps is unreliable — you have to measure it. Hugging Face's "Is it agentic enough?" lays out a concrete harness for exactly this. Instead of only checking whether the agent gets the right answer, it tracks several dimensions: Match % (was the final answer correct), median time and tokens (what did the solution cost), error rates including silent failures (runs that produce zero output tokens), and marker adoption (did the agent actually use the tool-specific behavior you intended).
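
Those dimensions are simple to compute once you log each run. The following is a minimal sketch, not the harness's own code: the Run record and its field names are assumptions chosen to mirror the metrics named above, and the sample runs are invented.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    correct: bool        # final answer matched the expected one
    seconds: float       # wall-clock time for the run
    output_tokens: int   # tokens the model generated
    used_marker: bool    # the intended tool-specific behavior showed up


def summarize(runs: list[Run]) -> dict[str, float]:
    """Aggregate one cell of the benchmark (one model, task, and tier)."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "match_pct": 100 * sum(r.correct for r in runs) / n,
        "median_seconds": median(r.seconds for r in runs),
        "median_tokens": median(r.output_tokens for r in runs),
        # A silent failure is a run that produced zero output tokens.
        "silent_failure_pct": 100 * sum(r.output_tokens == 0 for r in runs) / n,
        "marker_adoption_pct": 100 * sum(r.used_marker for r in runs) / n,
    }


runs = [
    Run(correct=True, seconds=12.0, output_tokens=900, used_marker=True),
    Run(correct=True, seconds=20.0, output_tokens=1500, used_marker=False),
    Run(correct=False, seconds=3.0, output_tokens=0, used_marker=False),
    Run(correct=True, seconds=15.0, output_tokens=1100, used_marker=True),
]
print(summarize(runs))
```

Note how the silent failure drags the median time down: a fast run is not a good run if it produced nothing, which is why the error rate is reported alongside cost.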

The method tests each task under three tiers:

bare — only the base library installed, nothing else
clone — the full source checked out in the working directory
skill — a packaged skill (CLI docs plus task examples) loaded into context

Each run varies four things: the model driving the agent, the library revision, the task, and the tier. The harness (agent-eval) drives a coding-agent CLI and runs in parallel on identical hardware so comparisons are fair.
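
The run matrix is a plain cross product of those four variables. Here is a sketch of how you might enumerate it in Python. The model names, revisions, and task below are placeholders, and this is not how agent-eval itself is configured.

```python
from itertools import product


def build_matrix(models, revisions, tasks, tiers):
    """Return one dict per run: every combination of the four variables."""
    return [
        {"model": model, "revision": revision, "task": task, "tier": tier}
        for model, revision, task, tier in product(models, revisions, tasks, tiers)
    ]


matrix = build_matrix(
    models=["large-model", "small-model"],  # placeholder model names
    revisions=["v1", "v2"],                 # placeholder library revisions
    tasks=["sentiment"],
    tiers=["bare", "clone", "skill"],
)
print(len(matrix))  # prints 12
print(matrix[0])
```

The matrix grows multiplicatively, so start with the models you actually deploy and a handful of representative tasks.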

What "agentic enough" really measures

The point is efficiency of action, not just correctness. The article's example: two agents can both correctly label a sentiment task, "but one writes a 40-line Python script... while the other types transformers classify --model ... --text "..." and is done in one call." Agentic-enough tooling makes the second path the natural one.

The most important finding for skill authors is a warning. Adding a CLI-plus-skill affordance helped large models (faster median time, with the skill tier fastest) but hurt small ones. On Qwen3-4B, median new tokens jumped from ~2.4K to ~23K on the clone tier with no accuracy gain. Worse, Qwen3-14B on a sentiment task saw Match % collapse from 100% (clone) to 0% (skill tier): the model "mistakes the CLI for a tool it can call directly" and makes spurious tool calls instead of falling back to the working API. The takeaway, in the authors' words: "Agent-facing APIs should be evaluated across model sizes, because a new affordance can reduce work for strong models while adding ambiguity for smaller ones." Benchmark your skills on the model you'll actually ship.
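
One way to act on that warning is to compare tiers per model and flag regressions automatically. The sketch below assumes you already have per-tier summaries. The result shape, the 2x token threshold, and the numbers are all illustrative choices, not values from the Hugging Face benchmark.

```python
def flag_regressions(results, baseline="clone", candidate="skill", token_factor=2.0):
    """Compare two tiers per model and describe where the candidate is worse.

    `results` maps model -> tier -> {"match_pct": float, "median_tokens": float}.
    Models missing either tier are skipped.
    """
    warnings = []
    for model, tiers in results.items():
        if baseline not in tiers or candidate not in tiers:
            continue
        base, cand = tiers[baseline], tiers[candidate]
        if cand["match_pct"] < base["match_pct"]:
            warnings.append(
                f"{model}: Match % fell from {base['match_pct']:.0f} "
                f"to {cand['match_pct']:.0f} on the {candidate} tier"
            )
        if cand["median_tokens"] > token_factor * base["median_tokens"]:
            warnings.append(
                f"{model}: median tokens rose from {base['median_tokens']:.0f} "
                f"to {cand['median_tokens']:.0f} on the {candidate} tier"
            )
    return warnings


# Invented numbers: the skill helps the large model and breaks the small one.
results = {
    "large-model": {
        "clone": {"match_pct": 100.0, "median_tokens": 2000.0},
        "skill": {"match_pct": 100.0, "median_tokens": 1500.0},
    },
    "small-model": {
        "clone": {"match_pct": 100.0, "median_tokens": 2000.0},
        "skill": {"match_pct": 0.0, "median_tokens": 9000.0},
    },
}
for line in flag_regressions(results):
    print(line)
```

Run on this sample, only the small model is flagged, which is the pattern to watch for before shipping a skill to a smaller deployment model.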

Letting agents discover resources (agentic resource discovery)

Even a perfectly structured skill is useless if the agent never finds it. That's the problem Agentic Resource Discovery (ARD) targets — a draft open specification developed by contributors from Microsoft, Google, GoDaddy, Hugging Face, and others. It's explicitly "not a product or a marketplace... a shared standard that any company can implement independently."

ARD replaces the "install-first, use-later" model with runtime discovery. As the launch post explains, "ARD moves selection outside the LLM. A registry indexes capabilities with richer signals such as publisher identity, representative queries, compliance attestations, and tags. It exposes a REST endpoint. The client searches in natural language, and the model invokes whatever the search returns." It has two parts: a static ai-catalog.json manifest that publishers host at a well-known URL, and a dynamic registry API (POST /search) for live, ranked discovery. Hugging Face runs a reference implementation, the Discover Tool, searchable via hf discover search "...".
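
To picture the dynamic half, here is a rough Python sketch of a client building a POST /search call. Only the method and path come from the description above. The registry host and the JSON field names (query, limit) are assumptions made for illustration, so check the draft specification for the real schema. The request is built but not sent.

```python
import json
from urllib.request import Request

REGISTRY_URL = "https://registry.example.com"  # hypothetical registry host


def build_search_request(query: str, limit: int = 5) -> Request:
    """Build (but do not send) a natural-language search request."""
    # Field names are assumptions, not taken from the ARD draft.
    payload = {"query": query, "limit": limit}
    return Request(
        url=f"{REGISTRY_URL}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("monitor a GitLab CI pipeline and report failed jobs")
print(req.get_method(), req.full_url)  # prints POST https://registry.example.com/search
```

The point of the shape is that selection happens in the registry: the model never sees the full catalog, only what the search returns.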

One caution the authors stress: discovery alone isn't enough. The full flow is Publish → Crawl → Search → Verify → Connect, and "discovery without verification just industrialises trusting strangers." If you adopt runtime skill discovery, build the verify step in deliberately.
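
In code, building the verify step in deliberately means search results never reach the connect step unchecked. The sketch below is a simplified stand-in: the Candidate fields, the publisher allow-list, and the HTTPS check are assumptions for illustration, and real verification would likely rely on stronger signals such as the compliance attestations the registry indexes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    publisher: str
    endpoint: str


# Hypothetical allow-list of publishers your team has vetted.
TRUSTED_PUBLISHERS = {"example-corp"}


def verify(candidate: Candidate) -> bool:
    """Accept only vetted publishers that serve over HTTPS."""
    return (
        candidate.publisher in TRUSTED_PUBLISHERS
        and candidate.endpoint.startswith("https://")
    )


def connectable(results: list[Candidate]) -> list[Candidate]:
    """Filter search results down to those that passed verification."""
    return [c for c in results if verify(c)]


results = [
    Candidate("ci-monitor", "example-corp", "https://tools.example.com/ci"),
    Candidate("ci-helper", "unknown-publisher", "https://helper.example.net"),
    Candidate("ci-legacy", "example-corp", "http://legacy.example.com"),
]
print([c.name for c in connectable(results)])  # prints ['ci-monitor']
```

Keeping verify as its own function makes it hard to skip and easy to tighten later.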

Agent Skills FAQ

How many skills is too many?

There's no fixed number, but the discipline is qualitative: every skill should map to a real reason to exist — remembering a novel solved problem or avoiding genuine repetition. If you can't name which of those a skill serves, it's one too many. And since extra context can hurt smaller models, more skills is not automatically better.

How do I test whether a skill is helping or hurting?

Run the same task with and without the skill (the bare/clone/skill tiers above) and compare Match %, median tokens, and time — on the specific model you deploy. As the Hugging Face benchmarks showed, a skill that speeds up a large model can break a small one, so "it helped in my demo" isn't evidence. Measure it.

Takeaways for Clawvard readers

Write a skill for only two reasons: remembering a novel solved problem, or avoiding repetition. Never write one to paper over a problem the agent just failed to solve.
Structure beats volume. One clear purpose, a precise description, and knowledge the agent couldn't otherwise discover.
Benchmark on your real model. Affordances that help large models can collapse accuracy on smaller ones — test across sizes before shipping.
Plan for discovery and verification if you adopt runtime skill discovery via standards like ARD; discovery without verification is just trusting strangers at scale.

Skills are an execution lever, and execution is where most agents actually fail. If you're building toward reliable agents, try Clawvard to benchmark them on your own tooling — and share this guide with whoever owns your team's skill library.