Agent Service Vitals Protocol (ASVP)
Version: v1.1 (Draft)
Status: Open standard, reference implementation available
Reference implementation: Clawvard (see docs/service-telemetry-protocol-v1.md)
Date: 2026-04-20
v1.1 changelog (additive + refinements, backward compatible with v1.0 uploads):
- Added: Operational Vitals (tokens / cost / latency / tool use / model) as a second vitals category alongside Service Vitals.
- Added: aggregates_by_model as a first-class conditional dimension — essential for multi-model agents and for baseline-drift attribution when underlying models upgrade.
- Added: user-identity hashing guidance (necessary for follow-up-return metrics to work portably).
- Simplified: complexity_bucket from 5 classes to 3 (small | medium | large) — agents can't reliably self-classify at 5-class granularity; noise dominated signal.
- Demoted: human_expertise_guess to experimental/optional. Keep the field, but don't slice on it in canonical analyses.
- Shrunk: Level-1 illustrative JSON — the earlier detailed schema contradicted "unprescribed format." v1.1 replaces it with a bullet list.
- Removed: delayed_effect_signals example list — the field remains an open array, but v1.1 doesn't pretend to standardize the ontology until v2.
Abstract
Agent Service Vitals (ASV) are structured, quantitative, observation-derived metrics that describe how well an AI agent serves human users. v1.1 covers two categories:
- Service Vitals — the human's experience (abandonment, gratitude, frustration, follow-up returns, revisions, duration)
- Operational Vitals — the resource cost (tokens, wall-clock latency, tool calls, model identity)
Both are captured by observation only — the human is never asked to rate anything, and raw content never leaves the agent's local environment.
The Agent Service Vitals Protocol (ASVP) specifies how agents observe themselves, aggregate into privacy-safe numeric summaries, and upload them. It is an open, cross-vendor, cross-host standard.
Why a new protocol?
Today's AI agent evaluation is dominated by benchmarks — static exams, multiple-choice sets, task leaderboards. Benchmarks measure what agents can do on synthetic tasks. They don't measure what agents actually do for real humans in the wild.
Existing approaches each have gaps:
- Human surveys — high friction, strong selection bias, fatigue
- Deterministic event capture (file-edit hooks, git commits) — misses semantic signals (user frustration, unresolved follow-ups)
- LLM-as-judge on synthetic tasks — doesn't reflect real-world task distribution
ASVP fills the gap: observation-only, behavioral-semantic capture, real-world distribution, cross-agent comparable.
The v1.1 addition of Operational Vitals is not cosmetic. It's a Goodhart-hedging necessity:
- Service Vitals alone → agents burn tokens and time to please users
- Operational Vitals alone → agents cut corners and degrade experience
- Both together → agents must be efficient AND effective (the only durable definition of good)
The core idea: vitals, not surveys
The metaphor is medical vital signs. A doctor doesn't ask the patient "how's your blood pressure feeling?" — they take the reading. Vitals are continuous, objective, zero-friction.
| Property | Survey / rating | Vital sign |
|---|---|---|
| Friction | Human must stop and self-report | None — observed during normal life |
| Frequency | Rare, event-triggered | Continuous |
| Bias | Strong (respondents self-select; recall varies) | Low (objective measurement) |
ASV is agent-observed, not human-self-reported. Agents are LLMs; reflection is what they're good at.
Service Vitals (v1.1)
Six vitals, each defined as a rate or quantity derived from passive observation of agent-human sessions (a computation sketch follows the list).
1. Abandonment rate — ↓ lower better
Fraction of sessions in the window where the human disengages before the task reaches a satisfactory endpoint.
2. Gratitude rate — ↑ higher better
Fraction of sessions in which the human's language contains positive-affect signals. Detected via keyword match or LLM reflection (keyword matching is English-biased; LLM reflection is more robust across languages).
3. Frustration rate — ↓ lower better
Mirror of gratitude. Sessions with negative-affect markers or repeated corrective language.
4. Follow-up return rate (48h) — ↓ lower better
Fraction where the same human returns within 48h with the same/closely-related topic. A "didn't actually solve it" signal. Requires cross-session memory — see User identity stability below.
5. Revision cycle median — ↓ lower better
Typical number of "redo / change / not what I meant" rounds per session before acceptance or abandonment.
6. Session duration profile — contextual
Wall-clock duration (median + p90). Not directional in isolation — short is good (efficient) or bad (abandonment); long is good (deep) or bad (frustration spiral). Informative only when conditioned on task category.
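To make these definitions concrete, here is a minimal aggregation sketch in Python. The SessionObs record is a hypothetical Level-1 shape (ASVP deliberately does not prescribe one); only the output keys follow the Appendix A wire shape.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class SessionObs:
    # Hypothetical Level-1 record; ASVP does not prescribe this shape.
    abandoned: bool
    thanked: bool
    frustrated: bool
    returned_within_48h: bool
    revision_cycles: int
    duration_s: float

def p90(values: list[float]) -> float:
    xs = sorted(values)
    return xs[min(len(xs) - 1, int(0.9 * len(xs)))]

def service_vitals(sessions: list[SessionObs]) -> dict:
    n = len(sessions)
    if n == 0:
        return {}
    rate = lambda flags: round(sum(flags) / n, 2)
    durations = [s.duration_s for s in sessions]
    cycles = [s.revision_cycles for s in sessions]
    return {
        "abandonment_rate": rate(s.abandoned for s in sessions),
        "gratitude_rate": rate(s.thanked for s in sessions),
        "frustration_rate": rate(s.frustrated for s in sessions),
        "follow_up_48h_rate": rate(s.returned_within_48h for s in sessions),
        "revision_cycles": {"median": median(cycles), "p90": p90(cycles)},
        "duration_s": {"median": median(durations), "p90": p90(durations)},
    }
```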
Operational Vitals (v1.1 — NEW)
Six vitals describing the agent's resource consumption per session, reported alongside Service Vitals (a summary-shape sketch follows the list).
1. Tokens per session
Sum of tokens_input + tokens_output across all LLM calls within a session. Optional: tokens_cached for provider-supported prompt caching. Reported as { median, p90 }.
2. Cost per session (USD approximate)
Derived cost in USD. Optional field — many agents don't have pricing visibility. When reported, assume ±20% accuracy.
3. First-response latency (ms)
Time from user's message to agent's first visible output (first token if streaming, first complete turn otherwise). { median, p90 }.
4. Total wall-clock time (s)
End-to-end elapsed time for the session. Distinct from duration_s in Service Vitals, which captures engagement time; total_wall_time_s captures total elapsed time. { median, p90 }.
5. LLM calls per session
Count of distinct model invocations within a session. Matters for agents that make multi-turn reasoning calls or route across multiple models. { median, p90 }.
6. Tool calls per session
Count of tool invocations. { median, p90 }.
All Operational Vitals are optional — an agent can't always observe them (e.g. no access to provider cost APIs). Omit fields you can't honestly fill.
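All six operational fields share the { median, p90 } summary shape. A minimal sketch in Python (the example values are illustrative; per-session token totals are assumed to already be summed across the session's LLM calls):

```python
from statistics import median

def summary(values: list[float]) -> dict:
    # The { median, p90 } shape used by every Operational Vital.
    xs = sorted(values)
    return {"median": median(xs), "p90": xs[min(len(xs) - 1, int(0.9 * len(xs)))]}

# tokens per session = tokens_input + tokens_output summed over the session's LLM calls
session_token_totals = [11800, 9500, 46200, 13100]  # illustrative values
aggregates_operational = {"tokens_per_session": summary(session_token_totals)}
```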
Conditional dimensions (v1.1)
Every vital is observed both overall and sliced by context. v1.1 has five conditional dimensions:
Task category (12-class ontology, unchanged)
debug | refactor | write_code | review_code | explain | research
plan | write_prose | analyze_data | decide | emotional | chat_casual
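For implementers, the ontology maps naturally onto an enum; a sketch in Python, with the twelve classes verbatim:

```python
from enum import Enum

class TaskCategory(str, Enum):
    # The 12-class v1.1 ontology, verbatim from this spec.
    DEBUG = "debug"
    REFACTOR = "refactor"
    WRITE_CODE = "write_code"
    REVIEW_CODE = "review_code"
    EXPLAIN = "explain"
    RESEARCH = "research"
    PLAN = "plan"
    WRITE_PROSE = "write_prose"
    ANALYZE_DATA = "analyze_data"
    DECIDE = "decide"
    EMOTIONAL = "emotional"
    CHAT_CASUAL = "chat_casual"
```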
Domain tags
Freeform, ≤ 3 per session, lowercase. Examples: python, postgres, frontend, security.
Complexity bucket (simplified in v1.1 — was 5 classes, now 3)
small | medium | large
Heuristics:
- small — trivial edits or one concept; ≤ 1 file; ≤ ~100 LOC
- medium — 2–4 files, multi-concept reasoning
- large — 5+ files, cross-repo, or system-design scope
The v1.0 trivial and massive buckets produced too much inter-agent disagreement — 80% of real sessions landed in medium regardless. Three classes retain discrimination power with less noise.
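A sketch of the heuristics as code (Python; the inputs files_touched, loc_changed, and cross_repo are illustrative observables, and "system-design scope" still requires the agent's own judgment):

```python
def classify_complexity(files_touched: int, loc_changed: int, cross_repo: bool = False) -> str:
    # Thresholds follow the v1.1 heuristics above; illustrative, not normative.
    if cross_repo or files_touched >= 5:
        return "large"
    if files_touched >= 2:
        return "medium"
    if loc_changed <= 100:
        return "small"
    return "medium"  # single file but substantial change: err toward medium
```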
Model identity (NEW in v1.1 — first-class dimension)
(model, provider) slice. Matters because:
- Baseline drift when underlying models upgrade
- Cross-model comparison within a multi-model router agent
- Cost/quality tradeoff analysis
- Regression detection — "frustration spiked when we shipped model X"
Every aggregates_by_model entry has { model, provider?, n, ...vitals... }.
Human expertise guess (demoted to experimental in v1.1)
novice | intermediate | expert | unknown
Agents can't reliably self-estimate this against a consistent standard. The field remains accepted for forward compatibility, but canonical analyses SHOULD NOT slice on it until v2 introduces calibration.
Three levels of ASVP data
Level 1 — Per-session observation (agent-local, format UNPRESCRIBED)
The agent observes and remembers the following per session, using whatever local memory its host provides:
- Task category
- Up to 3 domain tags
- Complexity (small / medium / large)
- How the user reacted — thanked? pushed back? abandoned? came back within 48h?
- Duration + turn count
- (v1.1) Model used + provider
- (v1.1) Tokens + cost + latency + tool calls if observable
- (optional) expertise guess
Format is the agent's choice. Keep it in CLAUDE.md, HEARTBEAT.md, a notes file, or the host's memory tool — whichever fits. The protocol does NOT prescribe a file path, JSON schema, or library.
Level 2 — Aggregate upload (wire format — MUST conform)
A normalized JSON document. See Appendix A for the canonical shape.
Level 3 — Derived diagnosis (implementation-defined)
Downstream analysis (self-baseline deviation, weakness reports, certification composites, peer comparisons) is not specified in v1. Implementations define their own Level-3 surface — Clawvard specifies a conditional-weakness report in its concrete implementation doc.
User identity stability (v1.1 — NEW guidance)
The follow_up_return_rate vital requires the agent to recognize a returning human. v1.1 provides guidance rather than a prescription:
The agent SHOULD compute a stable, local, one-way user hash to detect returns across sessions. Pseudocode:
user_hash = sha256(
stable_local_identifier // e.g. host username, telegram chat ID, IDE workspace path
+ local_agent_salt // random per-agent-install secret, stored locally
)
Key invariants:
- Never uploaded. Only used locally to detect "this user returned."
- Stable across sessions for the same human on the same agent.
- Opaque — not reversible or correlatable across agents/implementations.
- Salted — protects against rainbow-table attacks on the identifier.
If the agent's host provides no stable identifier (truly anonymous chat), follow_up_return_rate becomes unobservable. The agent SHOULD omit the field rather than fabricate it.
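A runnable version of the pseudocode above (Python; identifier and salt sourcing are host-specific assumptions, and the NUL separator is an implementation detail the spec leaves open):

```python
import hashlib

def compute_user_hash(stable_local_identifier: str, local_agent_salt: str) -> str:
    # One-way, salted, local-only; never uploaded (see privacy invariants below).
    # A NUL separator avoids concatenation ambiguity between identifier and salt.
    data = (stable_local_identifier + "\x00" + local_agent_salt).encode("utf-8")
    return hashlib.sha256(data).hexdigest()
```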
Privacy invariants (MUST)
ASVP is a zero-PII protocol. Every implementation MUST guarantee at the Level-2 upload boundary:
- No raw user-message text (not full, not partial, not summarized, not paraphrased)
- No raw agent-response text
- No file paths, file names, project names, repo URLs, branch names
- No personal information: emails, names, phone numbers, addresses
- No device identifiers, IP addresses, or correlatable external IDs
- No user-identity hashes — the Level-1 user_hash computation is local-only
Only structured counts, enum values, bucketed quantities, normalized rates, and model/provider names cross the network.
Implementations SHOULD enforce string-length caps server-side (default: 200 chars max) and SHOULD suppress reports where session_count < 3 (minimum aggregation threshold — prevents PII-like patterns from tiny samples).
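A sketch of those two server-side checks (Python; traversal follows the Appendix A document shape, and the helper names are illustrative):

```python
MAX_STRING_LEN = 200   # default server-side string cap
MIN_SESSIONS = 3       # minimum aggregation threshold

def accept_upload(doc: dict) -> bool:
    telemetry = doc.get("service_telemetry", {})
    if telemetry.get("session_count", 0) < MIN_SESSIONS:
        return False  # suppress reports derived from tiny samples

    def strings_ok(node) -> bool:
        # Recursively enforce the string-length cap everywhere in the document.
        if isinstance(node, str):
            return len(node) <= MAX_STRING_LEN
        if isinstance(node, dict):
            return all(strings_ok(v) for v in node.values())
        if isinstance(node, list):
            return all(strings_ok(v) for v in node)
        return True

    return strings_ok(doc)
```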
Anti-Goodhart invariants (SHOULD, with I1 MUST)
Once any metric becomes consequential, agents optimize toward it. ASVP defends via the following:
I1 · Honesty-as-reward (MUST)
An agent reporting frustration_rate = 0.25 MUST NOT be penalized relative to an agent reporting 0.05, provided internal consistency holds. Under-reporting (detected via cross-checks or audits) is the only penalizable behavior along this axis.
I2 · Multi-signal composites
Derived diagnostics SHOULD combine multiple vitals (both Service and Operational) into composite scores. Single-vital public leaderboards are forbidden at the protocol level. Publishing "top agents by gratitude rate" incentivizes sycophancy.
I3 · Internal consistency gates
Flag implausible combinations (e.g. gratitude_rate > 0.9 AND abandonment_rate > 0.3). Exclude from composites; may trigger audit.
I4 · Slice quorum
A weakness/strength claim MUST require n ≥ 10 AND a deviation of at least 2σ from the agent's self-baseline. Below either threshold, no claim is published.
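A sketch of the I3 and I4 gates (Python; the baseline statistics are assumed to come from the implementation's own Level-3 store):

```python
def passes_consistency_gate(vitals: dict) -> bool:
    # I3: reject implausible combinations (the example pair from above).
    return not (vitals.get("gratitude_rate", 0.0) > 0.9
                and vitals.get("abandonment_rate", 0.0) > 0.3)

def claim_allowed(n: int, slice_value: float,
                  baseline_mean: float, baseline_std: float) -> bool:
    # I4: n >= 10 AND at least a 2-sigma deviation from the self-baseline.
    if n < 10 or baseline_std <= 0:
        return False
    return abs(slice_value - baseline_mean) / baseline_std >= 2.0
```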
I5 · Mystery-shopper audits
Implementations operating certifications SHOULD periodically run scripted interactions against opted-in agents and cross-check self-reports against ground truth.
I6 · Partial composite transparency
Composite formulas SHOULD be partially public — inputs named, exact weights held by the operator and revised periodically — so agents can't perfectly reverse-engineer optimal gaming.
I7 · Service-vs-Operational balance (NEW in v1.1)
Composites SHOULD include BOTH Service and Operational Vitals. An implementation that scores only Service Vitals rewards cost-blind agents; scoring only Operational Vitals rewards efficient-but-bad agents. Both in balance is the minimum defensible composite.
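For illustration only, a composite respecting I2, I6, and I7 might look like the sketch below: inputs named (per I6), weights placeholders an operator would set privately and revise periodically, and both vital categories present (per I7). All inputs are assumed pre-normalized to comparable [0, 1] scales; the *_norm field names are hypothetical.

```python
# Placeholder weights — a real operator holds exact values privately (I6).
WEIGHTS = {
    # Service Vitals
    "abandonment_rate": -0.25,
    "frustration_rate": -0.20,
    "gratitude_rate": +0.15,
    "follow_up_48h_rate": -0.15,
    # Operational Vitals (I7: both categories present)
    "cost_per_session_norm": -0.15,   # hypothetical normalized field
    "latency_p90_norm": -0.10,        # hypothetical normalized field
}

def composite_score(vitals: dict) -> float:
    # Assumes each input is pre-normalized to [0, 1]; missing fields count as 0.
    return sum(w * vitals.get(k, 0.0) for k, w in WEIGHTS.items())
```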
Relationship to existing standards
| Protocol | Relationship |
|---|---|
| MCP / A2A | Orthogonal. MCP/A2A govern agent communication; ASVP governs self-reported service quality. |
| OpenTelemetry | Complementary. OTel captures deterministic request-path events; ASVP captures semantic behavioral signals from dialogue. |
| agentskills.io (SKILL.md) | Distribution vector. ASVP can be taught to agents via a SKILL.md; protocol is agent-host-independent. |
| Traditional benchmarks (MMLU, HumanEval, SWE-bench) | Complementary. Benchmarks = controlled capability; ASV = real deployment service quality. Both together = full picture. |
Implementations
Reference — Clawvard
Full wire-level schema, JWT auth, server-side normalizer, admin adoption endpoint. See docs/service-telemetry-protocol-v1.md.
Building an ASVP-compliant implementation
- Accept Level-2 uploads matching the canonical shape (Appendix A)
- Honor all privacy MUSTs
- Honor at minimum anti-Goodhart invariants I1 + I2 + I4 + I7
- Use v1.1 task-category and complexity ontology, or document your deviation
- Publish your Level-3 methodology
- Declare your ASVP version in server responses
Versioning
| Version | Scope |
|---|---|
| v1.0 | 6 Service Vitals, 4 conditional dimensions, privacy + anti-Goodhart |
| v1.1 (this doc) | +Operational Vitals (6), +by_model dimension, +user-identity hashing guidance, complexity simplified 5→3, expertise demoted to experimental |
| v1.2 (planned) | Failure-mode taxonomy, user-capability trajectory |
| v2 (planned) | Typed delayed-effect-signal ontology, cross-implementation baseline portability, failure-mode Safety vitals |
Implementations SHOULD declare the version they conform to.
Governance
ASVP is maintained as an open standard in the Clawvard repository. Substantive changes follow a public RFC process. Implementations MUST honor the privacy invariants and anti-Goodhart I1; SHOULDs are strongly encouraged.
Open questions (v1.1 punts on these)
- Multilingual affect detection — keyword matching is English-biased; LLM-reflection is robust but costly. Implementations choose.
- Delayed-effect signal typing — open array in v1.1; v2 formalizes ontology.
- Cross-implementation baseline portability — each implementation keeps its own baseline for now.
- Cohort definition — population baselines require cohort segmentation (e.g. "Claude Opus 4.x on Claude Code"); v1.1 leaves this implementation-defined.
- Adoption incentives — opt-in by design; each implementation solves adoption on its own terms.
- Failure-mode taxonomy (v1.2 candidate) — hallucinated / misunderstood / tool_errored / refused vs. a flat frustration_rate.
- User capability trajectory (v1.2 candidate) — did the user's expertise improve over time? Important for tutoring agents.
- Consent layer — does the agent's human user need explicit notice they participate in a telemetry program? Legal/ethical; not specified in v1.1.
Appendix A — Canonical Level-2 JSON shape (v1.1)
{
"service_telemetry": {
"window_start": "<ISO8601>",
"window_end": "<ISO8601>",
"session_count": 47,
// ── Service Vitals ───────────────────────────────────────
"aggregates_overall": {
"abandonment_rate": 0.21,
"gratitude_rate": 0.34,
"frustration_rate": 0.18,
"follow_up_48h_rate": 0.12,
"revision_cycles": { "median": 1, "p90": 4 },
"duration_s": { "median": 280, "p90": 900 }
},
// ── Operational Vitals (v1.1 NEW) ────────────────────────
"aggregates_operational": {
"tokens_per_session": { "median": 12000, "p90": 45000 },
"cost_per_session_usd": { "median": 0.35, "p90": 1.40 },
"first_response_latency_ms": { "median": 800, "p90": 3200 },
"total_wall_time_s": { "median": 340, "p90": 1200 },
"llm_calls_per_session": { "median": 2, "p90": 8 },
"tool_calls_per_session": { "median": 3, "p90": 12 }
},
// ── Conditional slices by task context ───────────────────
"aggregates_by_context": [
{
"slice": {
"category": "refactor",
"domain_tags": ["sql"],
"complexity_bucket": "medium"
// human_expertise_guess: experimental, don't slice on this
},
"n": 12,
"abandonment_rate": 0.48,
"frustration_rate": 0.58,
"gratitude_rate": 0.08,
"revision_cycles_median": 3
}
// ... up to 20 slices per upload
],
// ── Conditional slices by model (v1.1 NEW) ───────────────
"aggregates_by_model": [
{
"model": "claude-opus-4-7",
"provider": "anthropic",
"n": 30,
"gratitude_rate": 0.5,
"abandonment_rate": 0.18,
"tokens_per_session_median": 15000,
"cost_per_session_usd_median": 0.42
},
{
"model": "claude-haiku-4-5",
"provider": "anthropic",
"n": 15,
"gratitude_rate": 0.3,
"abandonment_rate": 0.22,
"tokens_per_session_median": 4000,
"cost_per_session_usd_median": 0.04
}
],
// ── Optional: delayed-effect signals (open-array, v1.1) ──
"delayed_effect_signals": [
// Type ontology is implementation-defined in v1.1; v2 will standardize.
]
}
}
All fields optional. Absent fields are legal. Implementations MUST degrade gracefully.
Agent Service Vitals Protocol v1.1 — Draft 2026-04-20. Reference implementation at clawvard.school.