Shrink Your Agent's Context
You are running shrink-agent-context. Goal: take the giant blob of text the user is about to feed an agent (grep results, log bundles, GitHub issue thread, RAG chunk dump, anything ≥ ~5 k tokens) and compress it locally with the open-source headroom-ai package so the agent reads the same information for 60–95% fewer tokens — and produce a single-file headroom-savings.html before/after report the user can paste to their boss to justify the savings.
When you're done the user should have:
- a real compressed copy of their blob (same shape, fewer tokens, ready to paste into their agent),
- a real headroom-savings.html they can double-click open: original tokens / compressed tokens / % saved / estimated $ saved at current model prices, with side-by-side excerpts so a reviewer can verify the compression kept the substance, and
- a one-line before vs. after they can drop in a PR or a Slack channel.
After a one-time pip install + a HuggingFace cache warmup of two Apache-2.0 repos (~158 MB total, anonymous public download — the ONNX scoring model from chopratejas/kompress-base and the ModernBERT tokenizer JSONs from answerdotai/ModernBERT-base), every compress runs fully on the user's own machine — no API key, no remote LLM, no Clawvard backend. Those install-time fetches are the only network calls in the entire skill.
Iron rules
- Fully local after install + warmup. pip install "headroom-ai[all]" pulls the Python wheels (CPU + ONNX runtime); one extra warmup step (§ 1) downloads two Apache-2.0 repos from HuggingFace into the local cache:
  - chopratejas/kompress-base with allow_patterns=["onnx/*"] → ~156 MB ONNX scoring model. The unused 600 MB PyTorch model.safetensors in the same repo is skipped on purpose.
  - answerdotai/ModernBERT-base with allow_patterns=["*.json", "tokenizer*"] → ~2 MB tokenizer JSONs. headroom-ai hardcodes AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base") (headroom/transforms/kompress_compressor.py line ~370), so skipping this repo causes strict-offline compress to silently degrade to router:noop (0% saved). Make sure BOTH repos are cached before you trust the warmup.
  Total slim warmup: ~158 MB. After that, every subsequent compress runs fully offline: no network calls during compress, no API keys, no remote LLM service, no Clawvard backend. If a step in this SOP would call out to a remote inference endpoint during compress, it is the wrong step.
- Don't wrap the package. Drive headroom-ai directly via from headroom import compress (Python) or the bundled CLI. The user is here to learn the real upstream surface, not a Clawvard adapter.
- Token counts come from a real counter, not a guess. Use tiktoken (default in the example script) or headroom's own counter; do not write a number in the report you did not measure.
- The HTML report writes its own cost formula. Whatever model price you assume (Claude Sonnet $3 / 1M input, GPT-4o $2.50 / 1M input, etc.) goes into the report body so the reader can re-verify it next month when prices change. Don't bake numeric $ savings without showing the formula.
- Don't compress what the agent must answer about. The default CompressConfig already protects the last 4 turns and the active user query. If you tell the agent to compress the actual question, you ruin the answer; keep protect_recent ≥ 1 whenever the input is conversational.
1. Prerequisites — pip install + one HuggingFace warmup
- Python ≥ 3.10 (python3 --version).
- A clean venv to keep headroom-ai's pinned deps off the system Python:
      python3 -m venv .headvenv
      source .headvenv/bin/activate   # Windows: .headvenv\Scripts\activate
      pip install "headroom-ai[all]"
- One-time HuggingFace warmup of TWO repos. headroom-ai lazy-loads its ONNX scoring model from chopratejas/kompress-base and the tokenizer from answerdotai/ModernBERT-base on first compress. Both must be cached before strict-offline mode works. Use the slim allow_patterns form the bundled run_showcase.py uses: it skips the unused 600 MB PyTorch model.safetensors from kompress-base and the ~600 MB safetensors from ModernBERT-base, and just grabs the ~156 MB onnx/kompress-int8.onnx + ~2 MB tokenizer JSONs:
      python -c "from huggingface_hub import snapshot_download; \
      snapshot_download('chopratejas/kompress-base', allow_patterns=['onnx/*']); \
      snapshot_download('answerdotai/ModernBERT-base', allow_patterns=['*.json', 'tokenizer*'])"
  This pulls ~158 MB of Apache-2.0 weights total into ~/.cache/huggingface/. It is an anonymous, unauthenticated download from the public HF Hub; no HF token is required. On flaky HF egress the slim form completes in a few minutes; an unpatterned snapshot_download('chopratejas/kompress-base') would pull ~720 MB and tends to stall, which is why the slim form is the SOP default. Both repos must finish: if you only warm chopratejas/kompress-base and skip ModernBERT-base, strict-offline compress silently degrades to router:noop (0% saved). This is the expected failure mode. After both fetches, set HF_HUB_OFFLINE=1 if you want strict-offline guarantees:
      export HF_HUB_OFFLINE=1   # subsequent runs forbid any HF Hub call
- No commercial API key. No Clawvard credits consumed. No private repo to clone. The package lives at https://github.com/chopratejas/headroom (Apache-2.0) and on PyPI as headroom-ai; the ONNX scoring weights live at https://huggingface.co/chopratejas/kompress-base (Apache-2.0); the tokenizer lives at https://huggingface.co/answerdotai/ModernBERT-base (Apache-2.0).
On Ubuntu 24+ system Python you'll hit PEP 668 if you skip the venv (error: externally-managed-environment). Use python3 -m venv or pipx install headroom-ai.
What if I genuinely cannot reach HuggingFace once? Either copy ~/.cache/huggingface/hub/models--chopratejas--kompress-base/ and ~/.cache/huggingface/hub/models--answerdotai--ModernBERT-base/ from a machine that has them (both repos are needed, per the iron rules), or run with CompressConfig(kompress_model="disabled"), which falls back to SmartCrusher + CacheAligner only. SmartCrusher only finds savings in JSON arrays / structured RAG dumps, so for plain prose / code / log text the disabled path lands at router:noop and saves 0% (verified locally on this very showcase's inputs). Don't promise the user big savings on the disabled path unless their input is JSON-shaped. The bundled run_showcase.py exposes both options via --skip-warmup and --kompress-model disabled so you can see the difference yourself.
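Before switching on HF_HUB_OFFLINE=1, or after copying a cache over from another machine, it can help to confirm that both repos really are present. The sketch below is a stdlib-only check that assumes the standard Hub cache layout (hub/models--ORG--NAME/snapshots/REVISION/...); the function names are illustrative and are not part of headroom-ai or huggingface_hub.

```python
from pathlib import Path

# The two repos this SOP says must be cached before strict-offline mode works.
REQUIRED_REPOS = ("chopratejas/kompress-base", "answerdotai/ModernBERT-base")


def repo_cache_dir(hub_dir: Path, repo_id: str) -> Path:
    """Map 'org/name' to the hub cache folder 'models--org--name'."""
    return hub_dir / ("models--" + repo_id.replace("/", "--"))


def missing_repos(hub_dir: Path, repo_ids=REQUIRED_REPOS) -> list[str]:
    """Return the repo ids that have no downloaded files under hub_dir."""
    missing = []
    for repo_id in repo_ids:
        snapshots = repo_cache_dir(hub_dir, repo_id) / "snapshots"
        # A usable entry has at least one file (or symlink to a blob) in a revision folder.
        has_files = snapshots.is_dir() and any(
            p.is_file() or p.is_symlink() for p in snapshots.rglob("*")
        )
        if not has_files:
            missing.append(repo_id)
    return missing
```

A typical call would be missing_repos(Path.home() / ".cache" / "huggingface" / "hub"), assuming the default cache location has not been moved via environment variables. An empty list means both repos have files on disk; it does not prove the files are the right ones, so a real compress run remains the final check.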
2. Decide what you're compressing
Pin three things in one sentence each before you write any code:
- Input: one or more local text files. Typical shapes: a grep -rn dump from a code search (~10–20 k tokens), an SRE incident bundle (~50–70 k tokens of logs + traceback + kubectl describe), a long GitHub issue thread (~50–60 k tokens), or a RAG chunk dump.
- Profile: generic (any text), log-aware (dedup repeated stack frames, collapse runs of same-level INFO, always keep ERROR/FATAL verbatim), or code-aware (preserve identifiers + signatures). The headroom-ai pipeline picks the profile automatically per message; you tune via CompressConfig.
- Target ratio: 0.5 keeps 50% (conservative, document/RAG work), 0.3 is the comfortable middle for tool output, 0.15 is aggressive for noisy logs. Lower number → fewer tokens kept → bigger savings → higher risk of dropping signal.
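To make the target-ratio trade-off concrete, here is a small arithmetic sketch of what each ratio would mean for a given input size. It assumes the compressor lands exactly on the target, which real runs will not do; the function name and the 15k-token input are illustrative only.

```python
def project_budget(tokens_in: int, target_ratio: float) -> tuple[int, float]:
    """Return (tokens kept, percent saved) if the target ratio were hit exactly."""
    if tokens_in <= 0:
        raise ValueError("tokens_in must be positive")
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")
    kept = round(tokens_in * target_ratio)
    saved_pct = 100 * (tokens_in - kept) / tokens_in
    return kept, saved_pct


# Hypothetical 15k-token grep dump at the three ratios discussed above.
for ratio in (0.5, 0.3, 0.15):
    kept, saved = project_budget(15_000, ratio)
    print(f"target_ratio={ratio}: ~{kept} tokens kept, ~{saved:.0f}% saved")
```

This is a planning aid only: the numbers that go into the report must still come from a measured run.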
3. The smallest possible Python script
The full reference script ships at example/run_showcase.py; it produces the bundled headroom-savings.html showcase and runs the HF warmup automatically on first invocation. The essence:
from pathlib import Path

import tiktoken
from headroom import compress, CompressConfig

# Read the raw blob and count its tokens with a real tokenizer.
raw = Path("inputs/grep-raw.txt").read_text()
enc = tiktoken.get_encoding("cl100k_base")
tokens_in = len(enc.encode(raw))

result = compress(
    messages=[{"role": "user", "content": raw}],
    model="claude-sonnet-4-5-20250929",
    config=CompressConfig(
        compress_user_messages=True,  # the whole blob IS the user message
        protect_recent=0,             # nothing here is "active conversation"
        target_ratio=0.3,             # keep ~30% of tokens
    ),
)

# The content may come back as a plain string or as a list of text blocks.
compact = result.messages[0]["content"]
out_path = Path("compressed/grep-compact.txt")
out_path.parent.mkdir(parents=True, exist_ok=True)  # write_text does not create parent directories
out_path.write_text(
    compact if isinstance(compact, str) else "\n".join(b.get("text", "") for b in compact)
)
print(f"{tokens_in} → {result.tokens_after} ({100 * result.compression_ratio:.1f}% saved)")
result.tokens_before / result.tokens_after / result.tokens_saved / result.compression_ratio / result.transforms_applied come from headroom's own counter — that's the number you put in the report, not a re-tokenize.
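For the one-line before vs. after meant for a PR or Slack channel, a minimal sketch follows. It derives the percentage from the two measured counts directly, so it does not depend on how any ratio field is defined. The function names are illustrative, and the numbers in the example call are placeholders; in practice you would pass result.tokens_before and result.tokens_after.

```python
def percent_saved(tokens_before: int, tokens_after: int) -> float:
    """Percent of tokens removed, computed from two measured counts."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 100 * (tokens_before - tokens_after) / tokens_before


def summary_line(name: str, tokens_before: int, tokens_after: int) -> str:
    """One line suitable for a PR description or a chat message."""
    pct = percent_saved(tokens_before, tokens_after)
    return f"{name}: {tokens_before:,} → {tokens_after:,} tokens ({pct:.1f}% saved)"


# Placeholder counts, not measured values.
print(summary_line("grep-raw.txt", 15_000, 4_500))
```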
4. CLI path (when you don't want to write Python)
headroom ships a compress subcommand for the same pipeline:
headroom compress \
  --model claude-sonnet-4-5-20250929 \
  --target-ratio 0.3 \
  --compress-user-messages \
  --input inputs/grep-raw.txt \
  --output compressed/grep-compact.txt \
  --metrics compressed/grep-metrics.json
Run headroom compress --help for the exact flag names in your installed version; the upstream package adds flags often (this SOP was authored against headroom-ai==0.22.4, the latest release at the time, dated 2026-06-01).
5. The HTML report — single file, self-contained
Run the bundled showcase generator to fold the metrics into a paste-to-boss report:
python example/run_showcase.py --out headroom-savings.html
The script first runs the HF warmup (no-op if you already ran step § 1), then compresses two real inputs derived from the installed headroom-ai source, and writes a single self-contained HTML report. Add --skip-warmup to skip the snapshot (useful in strict-offline pipelines once the cache is populated) or --kompress-model disabled to fall back to the no-network SmartCrusher path.
The script writes one self-contained HTML (no CDN, no external fonts, no API keys) with, per case:
- tokens_before / tokens_after / % saved: measured by tiktoken + headroom's counter, not hand-typed.
- $ saved: computed with the explicit formula written in the report body, so a reviewer can re-verify it. Default formula per run: (tokens_before - tokens_after) / 1_000_000 * input_price_usd_per_million_tokens, scaled to a monthly figure by assuming the user runs the agent N times per month (configurable).
- 3+ side-by-side excerpts: left = original chunk, right = compressed chunk, color-coded.
- Retrieval URLs: the script bundles each input as inputs/<case>-raw.txt and writes anchor links so a reviewer can jump back to the original chunk by line number.
- Which transforms ran (Kompress, SmartCrusher, CacheAligner, …), surfaced verbatim from result.transforms_applied, so the reader can see WHAT compressed the bytes.
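The default cost formula above can be written out as a short function. This is a sketch, not the code in run_showcase.py: multiplying by runs_per_month is an assumption about how the configurable N is applied, and the token counts and the $3 / 1M price in the example are placeholders.

```python
def dollars_saved(
    tokens_before: int,
    tokens_after: int,
    input_price_usd_per_million_tokens: float,
    runs_per_month: int = 1,
) -> float:
    """Input-token cost avoided, per month, for a given run count."""
    per_run = (tokens_before - tokens_after) / 1_000_000 * input_price_usd_per_million_tokens
    return per_run * runs_per_month  # assumption: N scales the per-run figure linearly


# Placeholder numbers: a 60k-token bundle compressed to 18k, priced at $3 per 1M input tokens.
per_run = dollars_saved(60_000, 18_000, 3.00)
monthly = dollars_saved(60_000, 18_000, 3.00, runs_per_month=500)
print(f"${per_run:.3f} per run, ${monthly:.2f} per month at 500 runs")
```

Keeping the price and the run count as explicit arguments mirrors the iron rule: whoever reads the report can plug in next month's price and recompute.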
The HTML is fully self-contained: it inlines CSS, escapes user content, and uses base64 only if you opted into images. Open it locally; no network required.
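Escaping matters because the excerpts are raw user content: a log line or issue comment can contain markup that would otherwise run or break the page. A minimal sketch of one side-by-side row using the standard library's html.escape follows; the function name, class names, and table layout are illustrative and are not taken from run_showcase.py.

```python
import html


def excerpt_row(original: str, compressed: str) -> str:
    """Render one before/after table row with both excerpts HTML-escaped."""
    left = html.escape(original)     # escapes &, <, > and quotes
    right = html.escape(compressed)
    return (
        "<tr>"
        f'<td class="before"><pre>{left}</pre></td>'
        f'<td class="after"><pre>{right}</pre></td>'
        "</tr>"
    )


# A hostile-looking excerpt ends up as inert text in the report.
row = excerpt_row('<script>alert("x")</script>', "ERROR db timeout")
print(row)
```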
6. Wire it back into your agent
Once the report looks good, the compressed file is the thing you feed the agent. Typical flows:
- Claude Code / Cursor: paste the contents of compressed/grep-compact.txt instead of the raw blob.
- Anthropic SDK in your own code: use compress(messages=..., model="claude-sonnet-4-5-20250929") inline before client.messages.create(...); the package was designed for this exact pattern, see compress.py examples.
- OpenAI / LiteLLM: same shape, swap the model= string; the API is provider-agnostic because it only counts tokens locally.
You're not changing your prompt or your retrieval pipeline — you're inserting one function call between "I have a giant blob" and "I send it to the LLM."
7. Validate the compressed output is actually useful
Compression is only a win if the agent can still answer correctly. After compressing:
- Re-run the original question on the compressed input.
- Compare to the answer from the raw input (or to ground truth you already know).
- If quality dropped, raise target_ratio (keep more), or flip protect_analysis_context=True (keeps code blocks intact when the user query says "analyze" / "review"), or pin kompress_model="disabled" to fall back to JSON/array dedup only.
The transforms_applied list in CompressResult tells you exactly which stages ran — if Kompress collapsed something important, you can disable it and rerun.
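A cheap first check before re-running the question is to confirm that lines you consider non-negotiable survived compression verbatim. The sketch below does this for ERROR/FATAL lines; the pattern and function name are illustrative, and passing this check does not replace comparing the agent's actual answers.

```python
import re

# Lines matching this pattern are treated as must-keep (an assumption; adjust per input).
MUST_KEEP = re.compile(r"\b(ERROR|FATAL)\b")


def dropped_lines(raw: str, compact: str, pattern: re.Pattern = MUST_KEEP) -> list[str]:
    """Return must-keep lines from raw that no longer appear verbatim in compact."""
    compact_lines = {line.strip() for line in compact.splitlines()}
    return [
        line.strip()
        for line in raw.splitlines()
        if pattern.search(line) and line.strip() not in compact_lines
    ]


raw_log = "INFO start\nERROR db timeout\nINFO retry\nFATAL out of memory\n"
compact_log = "ERROR db timeout\nINFO retry"
print(dropped_lines(raw_log, compact_log))
```

If the returned list is non-empty, that is a signal to raise target_ratio or change the configuration before trusting the compressed copy.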
8. What to NOT compress
- The active user turn / the question itself (the default protect_recent=4 already handles this for chat-shaped inputs).
- System prompts you depend on for behavior: set compress_system_messages=False.
- Short messages: the default min_tokens_to_compress=250 skips anything under ~1 paragraph.
- Already-summarized content: compressing a summary tends to drop facts, not boilerplate. Compress at the source.
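The first three rules can be pictured as a simple filter over a message list. The sketch below is an illustration of those rules as stated here, not headroom-ai's actual selection logic; the function name is hypothetical, and the whitespace counter is a stand-in used only so the example runs without a tokenizer.

```python
from typing import Callable


def compressible_indexes(
    messages: list[dict],
    count_tokens: Callable[[str], int],
    protect_recent: int = 4,
    min_tokens_to_compress: int = 250,
    compress_system_messages: bool = False,
) -> list[int]:
    """Indexes of messages that the rules above would allow compressing."""
    cutoff = len(messages) - protect_recent  # messages at or after cutoff are "recent"
    picked = []
    for i, msg in enumerate(messages):
        if i >= cutoff:
            continue  # protect the most recent turns, including the active question
        if msg["role"] == "system" and not compress_system_messages:
            continue  # leave behavior-defining system prompts alone
        if count_tokens(msg["content"]) < min_tokens_to_compress:
            continue  # too short to be worth compressing
        picked.append(i)
    return picked


def whitespace_count(text: str) -> int:
    # Stand-in counter for the demo only; use a real tokenizer for real numbers.
    return len(text.split())


messages = [
    {"role": "system", "content": "You are a careful SRE assistant. " * 3},
    {"role": "user", "content": "INFO heartbeat ok " * 20},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "Which pod crashed first?"},
]
print(compressible_indexes(messages, whitespace_count, protect_recent=1, min_tokens_to_compress=10))
```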
9. Troubleshooting
- ModuleNotFoundError: fastapi. You installed plain headroom-ai instead of headroom-ai[all]. The [all] extra pulls the optional proxy/server deps. For the compress-only path you can also pip install fastapi uvicorn manually.
- Savings < 30%. Your input was probably already compact. Either lower min_tokens_to_compress so that shorter messages are also compressed, or accept that this input doesn't benefit. Don't lie about savings on the report.
- The compressed output dropped the wrong line. Keep ERROR/FATAL by switching to log-aware via CompressConfig(target_ratio=0.2, protect_analysis_context=True), or pre-split the file so ERROR lines are in their own message and won't be touched.
- PEP 668 / externally-managed-environment. You're on Ubuntu 24+ system Python. Use python3 -m venv (see §1) or pipx install headroom-ai.
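The pre-split idea from the "dropped the wrong line" item can be sketched in a few lines. This is an illustration with a hypothetical function name; the ERROR/FATAL pattern is an assumption about what counts as must-keep in your logs.

```python
import re

ERROR_LINE = re.compile(r"\b(ERROR|FATAL)\b")


def split_for_compression(log_text: str) -> tuple[str, str]:
    """Split a log into (lines to keep verbatim, remaining lines to compress)."""
    verbatim, noise = [], []
    for line in log_text.splitlines():
        (verbatim if ERROR_LINE.search(line) else noise).append(line)
    return "\n".join(verbatim), "\n".join(noise)


sample = "INFO boot\nERROR disk full\nINFO retry\nFATAL giving up"
verbatim, noise = split_for_compression(sample)
print(verbatim)
```

One way to use the result, which is a slight variation on the wording above, is to compress only the noise part and hand the verbatim lines to the agent unchanged alongside it. Note that splitting loses the original interleaving, so keep timestamps in the lines if ordering matters.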
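When you finish learning
Tell the user:
I have learned shrink-agent-context. Point me at a local plain-text file (grep / log / RAG chunk / any input ≥ 5 k tokens) and I will compress it locally on your machine with headroom-ai, producing compressed/<name>-compact.txt plus a single-file headroom-savings.html before/after report: original tokens / compressed tokens / percent saved / $ saved at the current input price you specify (the formula is written in the report) / at least 3 side-by-side excerpts / retrieval anchors back to the original chunks. Apart from the first pip install and a one-time HF model warmup (two public Apache-2.0 repos, ~158 MB total: the ONNX model from chopratejas/kompress-base + the tokenizer JSONs from answerdotai/ModernBERT-base, both of which must be warmed up for offline use), the compress stage runs entirely locally, with zero API keys, zero remote LLMs, and zero Clawvard backend; once warmup completes you can set HF_HUB_OFFLINE=1 for strict offline operation.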