Shrink Your Agent's Context

You are running shrink-agent-context. Goal: take the giant blob of text the user is about to feed an agent (grep results, log bundles, a GitHub issue thread, a RAG chunk dump, anything ≥ ~5 k tokens) and compress it locally with the open-source headroom-ai package so the agent reads the same information for 60–95% fewer tokens. Then produce a single-file headroom-savings.html before/after report the user can send to their boss to justify the savings.

When you're done the user should have:

a real compressed copy of their blob (same shape, fewer tokens, ready to paste into their agent),
a real headroom-savings.html they can double-click open: original tokens / compressed tokens / % saved / estimated $ saved at current model prices, with side-by-side excerpts so a reviewer can verify the compression kept the substance, and
a one-line before vs. after they can drop in a PR or a Slack channel.

After a one-time pip install + a HuggingFace cache warmup of two Apache-2.0 repos (~158 MB total, anonymous public download — the ONNX scoring model from chopratejas/kompress-base and the ModernBERT tokenizer JSONs from answerdotai/ModernBERT-base), every compress runs fully on the user's own machine — no API key, no remote LLM, no Clawvard backend. Those install-time fetches are the only network calls in the entire skill.

Iron rules

Fully local after install + warmup. pip install "headroom-ai[all]" pulls the Python wheels (CPU + ONNX runtime); one extra warmup step (§ 1) downloads two Apache-2.0 repos from HuggingFace into the local cache:
1. chopratejas/kompress-base with allow_patterns=["onnx/*"] → ~156 MB ONNX scoring model. The unused 600 MB PyTorch model.safetensors in the same repo is skipped on purpose.
2. answerdotai/ModernBERT-base with allow_patterns=["*.json", "tokenizer*"] → ~2 MB tokenizer JSONs. headroom-ai hardcodes AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base") (headroom/transforms/kompress_compressor.py line ~370), so skipping this repo causes strict-offline compress to silently degrade to router:noop (0% saved). Make sure BOTH repos are cached before you trust the warmup.
Total slim warmup: ~158 MB. After that, every subsequent compress runs fully offline — no network calls during compress, no API keys, no remote LLM service, no Clawvard backend. If a step in this SOP would call out to a remote inference endpoint during compress, it is the wrong step.
Don't wrap the package. Drive headroom-ai directly via from headroom import compress (Python) or the bundled CLI. The user is here to learn the real upstream surface, not a Clawvard adapter.
Token counts come from a real counter, not a guess. Use tiktoken (default in the example script) or headroom's own counter; do not write a number in the report you did not measure.
The HTML report writes its own cost formula. Whatever model price you assume (Claude Sonnet $3 / 1M input, GPT-4o $2.50 / 1M input, etc.) goes into the report body so the reader can re-verify it next month when prices change. Don't bake numeric $ savings without showing the formula.
Don't compress what the agent must answer about. The default CompressConfig already protects the last 4 turns and the active user query. If you tell the agent to compress the actual question, you ruin the answer; keep protect_recent ≥ 1 whenever the input is conversational.

1. Prerequisites — pip install + one HuggingFace warmup

Python ≥ 3.10 (python3 --version).

A clean venv to keep headroom-ai's pinned deps off the system Python:

python3 -m venv .headvenv
source .headvenv/bin/activate   # Windows: .headvenv\Scripts\activate
pip install "headroom-ai[all]"

One-time HuggingFace warmup of TWO repos — headroom-ai lazy-loads its ONNX scoring model from chopratejas/kompress-base and the tokenizer from answerdotai/ModernBERT-base on first compress. Both must be cached before strict-offline mode works. Use the slim allow_patterns form the bundled run_showcase.py uses — it skips the unused 600 MB PyTorch model.safetensors from kompress-base and the ~600 MB safetensors from ModernBERT-base, and just grabs the ~156 MB onnx/kompress-int8.onnx + ~2 MB tokenizer JSONs:
```
python -c "from huggingface_hub import snapshot_download; \
  snapshot_download('chopratejas/kompress-base', allow_patterns=['onnx/*']); \
  snapshot_download('answerdotai/ModernBERT-base', allow_patterns=['*.json', 'tokenizer*'])"
```
This pulls ~158 MB of Apache-2.0 weights total into ~/.cache/huggingface/. It is an anonymous, unauthenticated download from the public HF Hub; no HF token is required. On flaky HF egress the slim form completes in a few minutes; an unpatterned snapshot_download('chopratejas/kompress-base') would pull ~720 MB and tends to stall, which is why the slim form is the SOP default. Both repos must finish: if you only warm chopratejas/kompress-base and skip ModernBERT-base, strict-offline compress silently degrades to router:noop (0% saved). This is the expected failure mode when the tokenizer cache is missing. After both fetches, set HF_HUB_OFFLINE=1 if you want strict-offline guarantees:
```
export HF_HUB_OFFLINE=1   # subsequent runs forbid any HF Hub call
```
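Because a missing tokenizer cache fails silently, it can help to check the cache before trusting a strict-offline run. The sketch below is a minimal stdlib-only check, assuming the standard Hub cache layout (hub/models--<org>--<name>/snapshots/<revision>/...) and the HF_HUB_CACHE / HF_HOME overrides. The helper names are illustrative and not part of headroom-ai or huggingface_hub. It only confirms that some file exists under each repo's snapshots folder, so it will not catch a partially downloaded model, and it ignores XDG_CACHE_HOME.

```python
import os
from pathlib import Path

# The two repos this section says must both be cached.
REQUIRED_REPOS = ("chopratejas/kompress-base", "answerdotai/ModernBERT-base")


def hub_cache_root() -> Path:
    """Resolve the Hub cache folder, honoring HF_HUB_CACHE and HF_HOME if set."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def missing_repos(cache_root: Path) -> list[str]:
    """Return the required repos that have no files under a snapshots folder."""
    missing = []
    for repo_id in REQUIRED_REPOS:
        snapshots = cache_root / ("models--" + repo_id.replace("/", "--")) / "snapshots"
        has_files = snapshots.is_dir() and any(p.is_file() for p in snapshots.rglob("*"))
        if not has_files:
            missing.append(repo_id)
    return missing


if __name__ == "__main__":
    gaps = missing_repos(hub_cache_root())
    print("warmup complete" if not gaps else f"still missing: {', '.join(gaps)}")
```

If the script reports a missing repo, rerun the snapshot_download command above with HF_HUB_OFFLINE unset.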
No commercial API key. No Clawvard credits consumed. No private repo to clone. The package lives at https://github.com/chopratejas/headroom (Apache-2.0) and on PyPI as headroom-ai; the ONNX scoring weights live at https://huggingface.co/chopratejas/kompress-base (Apache-2.0); the tokenizer lives at https://huggingface.co/answerdotai/ModernBERT-base (Apache-2.0).

On Ubuntu 24+ system Python you'll hit PEP 668 if you skip the venv (error: externally-managed-environment). Use python3 -m venv or pipx install headroom-ai.

What if I genuinely cannot reach HuggingFace even once? Either copy both cache directories, ~/.cache/huggingface/hub/models--chopratejas--kompress-base/ and ~/.cache/huggingface/hub/models--answerdotai--ModernBERT-base/, from a machine that already has them, or run with CompressConfig(kompress_model="disabled"), which falls back to SmartCrusher + CacheAligner only. SmartCrusher only finds savings in JSON arrays / structured RAG dumps, so for plain prose / code / log text the disabled path lands at router:noop and saves 0% (verified locally on this very showcase's inputs). Don't promise the user big savings on the disabled path unless their input is JSON-shaped. The bundled run_showcase.py exposes both options via --skip-warmup and --kompress-model disabled so you can see the difference yourself.

2. Decide what you're compressing

Pin three things in one sentence each before you write any code:

Input — one or more local text files. Typical shapes: a grep -rn dump from a code search (~10–20 k tokens), an SRE incident bundle (~50–70 k tokens of logs + traceback + kubectl describe), a long GitHub issue thread (~50–60 k tokens), or a RAG chunk dump.
Profile — generic (any text), log-aware (dedup repeated stack frames, collapse runs of same-level INFO, always keep ERROR/FATAL verbatim), or code-aware (preserve identifiers + signatures). The headroom-ai pipeline picks profile automatically per message; you tune via CompressConfig.
Target ratio — 0.5 keeps 50% (conservative, document/RAG work), 0.3 is the comfortable middle for tool output, 0.15 is aggressive for noisy logs. Lower number → fewer tokens kept → bigger savings → higher risk of dropping signal.
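To make the target ratio concrete, the arithmetic can be sketched as below. This is only a planning aid: it assumes compression lands exactly on target_ratio, which a real run will not do, and expected_budget is a hypothetical helper, not a headroom-ai function. The 15,000-token input is an illustrative size taken from the grep range above, not a measurement.

```python
def expected_budget(tokens_in: int, target_ratio: float) -> tuple[int, float]:
    """Return (tokens kept, percent saved) if compression hit target_ratio exactly."""
    if tokens_in <= 0:
        raise ValueError("tokens_in must be positive")
    if not 0.0 < target_ratio <= 1.0:
        raise ValueError("target_ratio must be in (0, 1]")
    kept = round(tokens_in * target_ratio)
    saved_pct = 100.0 * (1 - kept / tokens_in)
    return kept, saved_pct


# Illustrative input size only; measure your own blob with a real counter.
for ratio in (0.5, 0.3, 0.15):
    kept, saved = expected_budget(15_000, ratio)
    print(f"target_ratio={ratio}: ~{kept} tokens kept, ~{saved:.0f}% saved")
```

Treat the result as a budget to compare against the measured tokens_after, never as a number for the report.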

3. The smallest possible Python script

The full reference script ships at example/run_showcase.py; it produces the bundled headroom-savings.html showcase and runs the HF warmup automatically on first invocation. The essence fits in a couple dozen lines:

from pathlib import Path
import tiktoken
from headroom import compress, CompressConfig

raw = Path("inputs/grep-raw.txt").read_text()
enc = tiktoken.get_encoding("cl100k_base")
tokens_in = len(enc.encode(raw))

result = compress(
    messages=[{"role": "user", "content": raw}],
    model="claude-sonnet-4-5-20250929",
    config=CompressConfig(
        compress_user_messages=True,   # the whole blob IS the user message
        protect_recent=0,              # nothing here is "active conversation"
        target_ratio=0.3,              # keep ~30% of tokens
    ),
)

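# content may be a plain string or a list of text blocks
compact = result.messages[0]["content"]
if not isinstance(compact, str):
    compact = "\n".join(block.get("text", "") for block in compact)

out_path = Path("compressed/grep-compact.txt")
out_path.parent.mkdir(parents=True, exist_ok=True)  # write_text fails if the folder is missing
out_path.write_text(compact)

# Report headroom's own counts; tokens_in (tiktoken) is only a cross-check.
pct_saved = 100 * result.tokens_saved / result.tokens_before
print(f"tiktoken={tokens_in}  headroom: {result.tokens_before} → {result.tokens_after} ({pct_saved:.1f}% saved)")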

result.tokens_before / result.tokens_after / result.tokens_saved / result.compression_ratio / result.transforms_applied come from headroom's own counter — that's the number you put in the report, not a re-tokenize.
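The one-line before vs. after for a PR or Slack channel can be built from those same measured fields. A minimal sketch follows, where savings_line is a hypothetical helper and the numbers in the usage line are placeholders rather than measured results:

```python
def savings_line(name: str, tokens_before: int, tokens_after: int) -> str:
    """Format a one-line before/after summary from measured token counts."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    saved = tokens_before - tokens_after
    pct = 100.0 * saved / tokens_before
    return f"{name}: {tokens_before:,} -> {tokens_after:,} tokens ({pct:.1f}% saved)"


# Placeholder counts; in practice pass result.tokens_before and result.tokens_after.
print(savings_line("grep-raw.txt", 12_000, 3_000))
```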

4. CLI path (when you don't want to write Python)

headroom ships a compress subcommand for the same pipeline:

headroom compress \
  --model claude-sonnet-4-5-20250929 \
  --target-ratio 0.3 \
  --compress-user-messages \
  --input inputs/grep-raw.txt \
  --output compressed/grep-compact.txt \
  --metrics compressed/grep-metrics.json

Run headroom compress --help for the exact flag names in your installed version; the upstream package adds flags often. This SOP was authored against headroom-ai==0.22.4, the latest release at the time (dated 2026-06-01).

5. The HTML report — single file, self-contained

Run the bundled showcase generator to fold the metrics into a paste-to-boss report:

python example/run_showcase.py --out headroom-savings.html

The script first runs the HF warmup (no-op if you already ran step § 1), then compresses two real inputs derived from the installed headroom-ai source, and writes a single self-contained HTML report. Add --skip-warmup to skip the snapshot (useful in strict-offline pipelines once the cache is populated) or --kompress-model disabled to fall back to the no-network SmartCrusher path.

The script writes one self-contained HTML (no CDN, no external fonts, no API keys) with, per case:

tokens_before / tokens_after / % saved — measured by tiktoken + headroom's counter, not hand-typed.
$ saved, using the explicit formula written in the report body so a reviewer can re-verify it. Default formula per run: (tokens_before - tokens_after) / 1_000_000 * input_price_usd_per_million_tokens. The monthly figure multiplies that by N, the configurable number of times the user runs the agent per month.
3+ side-by-side excerpts — left = original chunk, right = compressed chunk, color-coded.
Retrieval URLs — the script bundles each input as inputs/<case>-raw.txt and writes anchor links so a reviewer can jump back to the original chunk by line number.
Which transforms ran (Kompress, SmartCrusher, CacheAligner, …) — surfaced verbatim from result.transforms_applied, so the reader can see WHAT compressed the bytes.
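As a worked instance of the cost formula in the list above, here is a short sketch. The function names are illustrative, the monthly multiplication by N is one reading of the "N times per month" assumption, and the token counts and run count are made-up inputs. Only the $3 / 1M input price is taken from the example prices quoted earlier.

```python
def usd_saved_per_run(
    tokens_before: int,
    tokens_after: int,
    input_price_usd_per_million_tokens: float,
) -> float:
    """Apply the report's formula to a single agent run."""
    return (tokens_before - tokens_after) / 1_000_000 * input_price_usd_per_million_tokens


def usd_saved_per_month(
    tokens_before: int,
    tokens_after: int,
    input_price_usd_per_million_tokens: float,
    runs_per_month: int,
) -> float:
    """Scale the per-run saving by N runs per month (assumed interpretation)."""
    per_run = usd_saved_per_run(tokens_before, tokens_after, input_price_usd_per_million_tokens)
    return per_run * runs_per_month


# Made-up example: 60,000 -> 12,000 tokens at $3 per 1M input tokens, 2,000 runs a month.
per_run = usd_saved_per_run(60_000, 12_000, 3.0)
per_month = usd_saved_per_month(60_000, 12_000, 3.0, 2_000)
print(f"${per_run:.3f} per run, ${per_month:.2f} per month")
```

Keeping the price and the run count as explicit arguments is what lets a reviewer rerun the numbers when prices change.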

The HTML is fully self-contained: it inlines CSS, escapes user content, and uses base64 only if you opted into images. Open it locally; no network required.
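To illustrate what "inlines CSS, escapes user content" means in practice, here is a minimal stdlib sketch of a side-by-side excerpt table. It is not the code in run_showcase.py; the function names, class names, and colors are assumptions made for the example.

```python
import html


def excerpt_row(original: str, compressed: str) -> str:
    """Render one side-by-side excerpt as a table row with escaped content."""
    return (
        "<tr>"
        f'<td class="before"><pre>{html.escape(original)}</pre></td>'
        f'<td class="after"><pre>{html.escape(compressed)}</pre></td>'
        "</tr>"
    )


def render_report(rows: list[str]) -> str:
    """Wrap rows in a single HTML page with inline CSS and no external assets."""
    css = (
        "td{vertical-align:top;padding:8px}"
        ".before{background:#fdecea}"
        ".after{background:#e8f5e9}"
    )
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<style>{css}</style></head><body><table>{''.join(rows)}</table></body></html>"
    )


page = render_report([excerpt_row("<b>ERROR</b> db timeout & retry", "ERROR db timeout")])
```

Escaping matters here because logs and grep dumps routinely contain angle brackets and ampersands that would otherwise be parsed as markup.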

6. Wire it back into your agent

Once the report looks good, the compressed file is the thing you feed the agent. Typical flows:

Claude Code / Cursor: paste the contents of compressed/grep-compact.txt instead of the raw blob.
Anthropic SDK in your own code: use compress(messages=..., model="claude-sonnet-4-5-20250929") inline before client.messages.create(...); the package was designed for this exact pattern, see compress.py examples.
OpenAI / LiteLLM: same shape, swap the model= string; the API is provider-agnostic because it only counts tokens locally.

You're not changing your prompt or your retrieval pipeline — you're inserting one function call between "I have a giant blob" and "I send it to the LLM."

7. Validate the compressed output is actually useful

Compression is only a win if the agent can still answer correctly. After compressing:

Re-run the original question on the compressed input.
Compare to the answer from the raw input (or to ground truth you already know).
If quality dropped, raise target_ratio (keep more), or flip protect_analysis_context=True (keeps code blocks intact when the user query says "analyze" / "review"), or pin kompress_model="disabled" to fall back to JSON/array dedup only.

The transforms_applied list in CompressResult tells you exactly which stages ran — if Kompress collapsed something important, you can disable it and rerun.
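Before spending tokens on re-running the question, a cheap local pre-check can flag obvious losses. The sketch below uses hypothetical helper names and plain string matching. It only tests whether ERROR/FATAL lines and terms you nominate survive verbatim, so it supplements the answer comparison above rather than replacing it.

```python
def severe_lines(raw: str) -> list[str]:
    """Collect the stripped ERROR/FATAL lines from the raw input."""
    return [ln.strip() for ln in raw.splitlines() if "ERROR" in ln or "FATAL" in ln]


def dropped_lines(raw: str, compressed: str) -> list[str]:
    """Return severe lines from the raw input that no longer appear verbatim."""
    return [ln for ln in severe_lines(raw) if ln not in compressed]


def missing_terms(compressed: str, must_keep: list[str]) -> list[str]:
    """Return the must-keep terms (identifiers, IDs, paths) absent from the output."""
    return [term for term in must_keep if term not in compressed]


raw = "INFO boot\nERROR db timeout\nINFO retry\nFATAL out of memory"
compressed = "ERROR db timeout"
print(dropped_lines(raw, compressed))
print(missing_terms(compressed, ["db timeout", "OOMKilled"]))
```

A non-empty result is a prompt to raise target_ratio or rerun with a different config, not proof that the answer degraded.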

8. What to NOT compress

The active user turn / the question itself (the default protect_recent=4 already handles this for chat-shaped inputs).
System prompts you depend on for behavior — set compress_system_messages=False.
Short messages — the default min_tokens_to_compress=250 skips anything under ~1 paragraph.
Already-summarized content — compressing a summary tends to drop facts, not boilerplate. Compress at the source.

9. Troubleshooting

ModuleNotFoundError: fastapi: you installed plain headroom-ai instead of headroom-ai[all]. The [all] extra pulls the optional proxy/server deps. For the compress-only path you can also pip install fastapi uvicorn manually.
Savings < 30%: your input was probably already compact. Either lower min_tokens_to_compress so shorter messages are compressed too, or accept that this input doesn't benefit. Don't lie about savings on the report.
The compressed output dropped the wrong line: keep ERROR/FATAL by switching to log-aware via CompressConfig(target_ratio=0.2, protect_analysis_context=True), or pre-split the file so ERROR lines are in their own message and won't be touched.
PEP 668 / externally-managed-environment: you're on Ubuntu 24+ system Python. Use python3 -m venv (see § 1) or pipx install headroom-ai.
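The pre-split idea from the list above can be sketched as follows. This is one possible approach, not a headroom-ai feature: split_by_severity is a hypothetical helper that groups consecutive lines into chunks and flags the runs containing ERROR or FATAL, so you can leave those chunks out of compression entirely.

```python
import re
from itertools import groupby

# Whole-word match so that e.g. "ERRORS_TOTAL" style names are not flagged.
SEVERE = re.compile(r"\b(ERROR|FATAL)\b")


def split_by_severity(text: str) -> list[tuple[bool, str]]:
    """Group consecutive lines into (is_severe, chunk) pairs, preserving order."""
    return [
        (is_severe, "\n".join(run))
        for is_severe, run in groupby(
            text.splitlines(), key=lambda ln: bool(SEVERE.search(ln))
        )
    ]


log = "INFO a\nINFO b\nERROR c\nINFO d"
for is_severe, chunk in split_by_severity(log):
    print("KEEP VERBATIM" if is_severe else "compress", repr(chunk))
```

One way to use it is to compress only the non-severe chunks and stitch the pieces back together in their original order, so ERROR/FATAL lines never pass through the compressor.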

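After learning

Tell the user:

I have learned shrink-agent-context. Point me at a local plain-text file (grep / log / RAG chunk / any input ≥ 5 k tokens) and I will compress it locally on your machine with headroom-ai, producing compressed/<name>-compact.txt plus a single-file headroom-savings.html before/after report: original tokens / compressed tokens / % saved / $ saved at the current input price you specify (formula written in the report) / at least 3 side-by-side excerpts / retrieval anchors back to the original chunks. Apart from the first pip install and a one-time HF model warmup (two public Apache-2.0 repos, ~158 MB total: the ONNX model from chopratejas/kompress-base and the tokenizer JSONs from answerdotai/ModernBERT-base; both must be warmed up for offline use), the compress stage runs entirely locally with zero API keys, zero remote LLMs, and zero Clawvard backend. After warmup you can set HF_HUB_OFFLINE=1 for strict offline operation.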