Agent Skills and AGENTS.md: How to Organize What Your AI Agent Can Do

If you have built an AI agent recently, you have probably hit the same wall: the model is capable, but the tooling around it is a mess. A pile of prompts, a handful of shell scripts, a few API wrappers, and no clean convention for how the agent decides which one to reach for. As agents take on real work, that sprawl stops being a cosmetic problem. How you organize an agent's skills is a design decision that changes what the agent actually does at runtime — not an afterthought you can clean up later.

The good news is that the ecosystem is converging on a shared pattern. Two pieces of that pattern — the AGENTS.md convention and the rise of agent-optimized CLIs — are worth understanding now, because they are starting to show up across tools and because there is fresh evidence that getting skill organization right has measurable downstream effects.

What are agent skills?

An agent skill is a discrete capability an agent can invoke to get something done: search a codebase, query a database, call an external API, render a file, post a comment. In practice a skill bundles three things together — a name, a description of when to use it, and the underlying action (a command, a tool call, or a script).
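As a rough illustration, the three-part bundle can be modeled as a small record. This is a minimal sketch, not the schema of any particular agent framework: the `Skill` class, its field names, and the `search_codebase` example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    """Hypothetical record for one agent skill (illustrative only)."""
    name: str                     # short identifier the agent refers to
    description: str              # when to use it, and when not to
    action: Callable[[str], str]  # the underlying command, tool call, or script


def search_codebase(query: str) -> str:
    # Placeholder action: a real skill would call out to a search tool here.
    return f"searched for: {query}"


search_skill = Skill(
    name="search_codebase",
    description=(
        "Find where a symbol or string appears in the repository. "
        "Not for reading a whole file."
    ),
    action=search_codebase,
)

print(search_skill.action("parse_config"))
```

Keeping the description next to the action in one record makes it harder for the two to drift apart as the skill changes.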

The reason skills matter as a unit of design is that an agent does not run all of its capabilities at once. At each step it selects one based on the task and the descriptions in front of it. So a skill is really two artifacts: the thing it does, and the metadata that tells the agent when to choose it. Many reliability problems in agents trace back to the second half: the agent had the right capability available but picked the wrong one, or could not tell two similar skills apart.

What is AGENTS.md and why is it becoming a standard?

AGENTS.md is an emerging convention for giving an agent a single, predictable place to read about the project it is working in and the capabilities available to it. Instead of cramming everything into a system prompt or scattering instructions across files, you put agent-facing guidance in a known location the agent is expected to look for.

The appeal is the same as any well-known config file: predictability. When the convention is shared, tooling can target it, agents can rely on it, and humans know where to make changes. A recent Hugging Face walkthrough showed an agent building a small 3D Paris gallery by chaining two Hugging Face Spaces together — a concrete example of an agent reading its context and composing capabilities rather than running a single hard-coded step (Hugging Face).

The takeaway is not that one specific file format has won. It is that the field is standardizing on the idea that an agent should have a discoverable, structured description of its world and its skills — and that this description is part of the product, not documentation bolted on at the end.
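One way to make "a known location" concrete is a lookup that starts in the working directory and walks upward until it finds the guidance file. The sketch below assumes that strategy. The function name `find_agent_guidance` is hypothetical, and real tools may resolve the file differently.

```python
from pathlib import Path
from typing import Optional


def find_agent_guidance(start: Path, filename: str = "AGENTS.md") -> Optional[Path]:
    """Return the nearest `filename` in `start` or one of its parent directories.

    Assumption: "nearest file wins" is only one plausible lookup rule.
    """
    start = start.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


found = find_agent_guidance(Path.cwd())
print(found if found else "no agent guidance file found")
```

Because the lookup is deterministic, both tooling and humans can predict which file an agent will read from any subdirectory.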

How does skill organization change an agent's runtime behavior?

This is the part that is easy to underestimate. It is tempting to assume that as long as a capability exists, the agent will use it correctly. The evidence points the other way.

The SkillJuror paper, "Measuring How Agent Skill Organization Changes Runtime Behavior," studies exactly this question and finds that how skills are organized measurably changes what the agent does at runtime (arXiv). In other words, two agents with an identical underlying set of capabilities can behave differently depending on how those capabilities are grouped, named, and described.

That reframes skill design. The names and descriptions you write are not just labels for humans; they are the surface the agent reasons over when it decides what to do. Vague descriptions, overlapping skills, and inconsistent granularity can all degrade selection. If you want reliable behavior, you have to treat organization as a first-class part of the system, and ideally measure its effect, as SkillJuror does, rather than assuming it.

Designing agent-optimized CLIs and tools

What makes a CLI "agent-optimized"?

Most CLIs were designed for humans at a terminal: terse output, interactive prompts, flags you are expected to remember. Agents have different needs. They benefit from predictable, structured output (so the result can be parsed instead of scraped), explicit and discoverable subcommands, and behavior that does not depend on an interactive session.
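A small sketch of those properties, using Python's standard `argparse` module: an explicit subcommand, no interactive prompts, and a `--json` flag for parseable output. The `notes` tool, its `list` subcommand, and the sample data are hypothetical and do not describe any real CLI.

```python
import argparse
import json
from typing import List

# Stand-in data so the example is self-contained.
NOTES = [
    {"title": "Buy milk", "tags": ["todo"]},
    {"title": "Release notes draft", "tags": ["work"]},
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="notes", description="Hypothetical notes tool.")
    # Explicit subcommands are discoverable through --help.
    sub = parser.add_subparsers(dest="command", required=True)
    list_cmd = sub.add_parser("list", help="List notes that carry a tag.")
    list_cmd.add_argument("--tag", required=True, help="Tag to filter by.")
    list_cmd.add_argument("--json", action="store_true", help="Emit machine-readable JSON.")
    return parser


def run(argv: List[str]) -> str:
    """Run the CLI on explicit arguments; it never prompts for input."""
    args = build_parser().parse_args(argv)
    matches = [note for note in NOTES if args.tag in note["tags"]]
    if args.json:
        # Structured output: the caller parses it instead of scraping text.
        return json.dumps({"command": "list", "count": len(matches), "notes": matches})
    return "\n".join(note["title"] for note in matches)


print(run(["list", "--tag", "todo", "--json"]))
```

The human-readable path stays the default, while `--json` gives an agent a stable shape to parse.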

Hugging Face made this concrete when it described designing the hf CLI as an agent-optimized way to work with the Hub — building the tool around how an agent actually calls commands rather than retrofitting a human-first interface (Hugging Face). The general principle: if a tool is meant to be driven by an agent, design its interface for the agent's calling pattern from the start.

Worked example: chaining capabilities

The 3D Paris gallery demo is a useful mental model for what "good skills" enable. The agent did not have a single monolithic "build a gallery" capability. It chained two separate Hugging Face Spaces together to reach the result (Hugging Face). Composability like that only works when each capability is cleanly scoped and clearly described — so the agent can recognize that step one's output feeds step two. Well-organized skills are what make chaining reliable instead of accidental.
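The idea that step one's output feeds step two can be sketched by having each step declare what it consumes and what it produces. Everything here is hypothetical: the `Step` record, the step names, and the string placeholders are illustrations, not the actual Spaces or calls used in the demo.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Step:
    """Hypothetical chainable capability with explicit inputs and outputs."""
    name: str
    consumes: str              # key this step reads from the shared state
    produces: str              # key this step writes to the shared state
    run: Callable[[Any], Any]


def run_chain(steps: List[Step], state: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order, failing loudly if a required input is missing."""
    state = dict(state)  # copy so the caller's dict is not mutated
    for step in steps:
        if step.consumes not in state:
            raise KeyError(
                f"{step.name} needs '{step.consumes}', which no earlier step produced"
            )
        state[step.produces] = step.run(state[step.consumes])
    return state


# Placeholder actions; real steps would call external services.
steps = [
    Step("make_image", consumes="prompt", produces="image", run=lambda p: f"image({p})"),
    Step("image_to_3d", consumes="image", produces="model", run=lambda i: f"mesh({i})"),
]

result = run_chain(steps, {"prompt": "Eiffel Tower"})
print(result["model"])  # prints: mesh(image(Eiffel Tower))
```

Declaring inputs and outputs turns an accidental chain into a checked one: a misordered or missing step raises an error instead of silently producing nothing useful.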

Best practices for structuring skills

How many skills is too many?

There is no magic number, but the failure mode is predictable: the more skills you add, the harder selection gets, especially when several of them look similar. Since skill organization measurably affects runtime behavior, treat every added skill as something the agent now has to disambiguate at every step. Prefer a smaller set of well-scoped, clearly distinct skills over a large set of overlapping ones. If two skills are routinely confused, that is a signal to merge, rename, or sharpen their descriptions.
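A cheap first pass at spotting skills that are likely to be confused is to compare their descriptions for word overlap. The sketch below uses Jaccard similarity over word sets. This is a crude lexical proxy chosen for illustration, not the method used in the SkillJuror paper, and the skill names, descriptions, and 0.5 threshold are assumptions.

```python
import re
from itertools import combinations
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def overlapping_pairs(
    descriptions: Dict[str, str], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return skill pairs whose descriptions overlap at or above the threshold."""
    flagged = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(descriptions.items(), 2):
        score = jaccard(desc_a, desc_b)
        if score >= threshold:
            flagged.append((name_a, name_b, round(score, 2)))
    return flagged


# Hypothetical skill descriptions.
skills = {
    "search_code": "Search the repository for a string or symbol",
    "find_in_repo": "Search the repository for a symbol or string",
    "post_comment": "Post a comment on a pull request",
}

print(overlapping_pairs(skills))  # prints: [('search_code', 'find_in_repo', 1.0)]
```

A flagged pair is only a prompt to look closer. Two skills can share vocabulary and still be distinct, and two differently worded skills can still be confused in practice.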

Naming, descriptions, and discoverability

Because the agent selects on metadata, write names and descriptions for selection, not for documentation. A good skill description answers "when should I pick this, and when should I not?" Make boundaries explicit, avoid near-duplicate phrasing across skills, and keep granularity consistent so the agent is not choosing between one broad skill and ten narrow ones. Put this guidance somewhere discoverable — the AGENTS.md-style convention exists precisely so the agent has one predictable place to find it.

FAQ

Is AGENTS.md just a system prompt? No. A system prompt is an in-context instruction for a single run; the AGENTS.md convention is a discoverable, structured file that describes the project and available capabilities in a predictable location, so both tooling and the agent can rely on it across runs.

Where do skills "live"? In practice, in whatever your runtime exposes to the agent — commands, tool definitions, or scripts — described in an agent-facing location the agent is expected to read. The point of the convention is consistency, not a single mandated path.

How do I test that the agent selects the right skill? Treat selection as a measurable property. The SkillJuror work shows skill organization changes runtime behavior, which means you can — and should — evaluate how reliably your agent picks the intended skill under different organizations rather than assuming it (arXiv).
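One simple way to treat selection as measurable is to run a fixed set of labeled tasks through the selector and record accuracy and the specific confusions. This is a minimal sketch under stated assumptions, not the SkillJuror methodology: `evaluate_selection`, the keyword-based stand-in selector, and the test cases are all hypothetical, and a real harness would call the agent instead.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Selector = Callable[[str], str]  # maps a task description to a chosen skill name


def evaluate_selection(
    select: Selector, cases: List[Tuple[str, str]]
) -> Tuple[float, Dict[Tuple[str, str], int]]:
    """Return (accuracy, confusions) for labeled (task, expected_skill) cases.

    Confusions are keyed by (expected, chosen) and count the wrong picks.
    """
    confusions: Counter = Counter()
    correct = 0
    for task, expected in cases:
        chosen = select(task)
        if chosen == expected:
            correct += 1
        else:
            confusions[(expected, chosen)] += 1
    accuracy = correct / len(cases) if cases else 0.0
    return accuracy, dict(confusions)


def keyword_selector(task: str) -> str:
    # Stand-in for a real agent call: picks a skill from simple keywords.
    lowered = task.lower()
    if "sql" in lowered or "table" in lowered:
        return "query_database"
    return "search_codebase"


cases = [
    ("Find where parse_config is defined", "search_codebase"),
    ("Count rows in the orders table", "query_database"),
    ("Look up last week's signups", "query_database"),
]

accuracy, confusions = evaluate_selection(keyword_selector, cases)
print(f"accuracy={accuracy:.2f}", confusions)
```

Rerunning the same cases after renaming, merging, or regrouping skills gives a before-and-after comparison, and the confusion counts point at which pairs need sharper boundaries.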

Takeaways for Clawvard readers

An agent's skills are selected on their metadata, so names and descriptions are part of the runtime, not just docs.
The field is converging on AGENTS.md-style conventions and agent-optimized CLIs — predictable, structured surfaces designed for how agents actually call them.
Skill organization measurably changes behavior (SkillJuror), so structure deliberately and test selection instead of assuming capability equals correct use.
Favor a small set of clearly distinct, well-scoped skills, and make boundaries explicit so capabilities chain reliably.

If you are building agents, this is the layer that quietly determines reliability. For more on the surrounding stack, see our related explainer on DiffusionGemma and fast local AI models. And if you want a place to put these conventions into practice, try Clawvard or follow along as we keep tracking how agent tooling evolves.