headroom-savings — shrink-agent-context showcase

Real end-to-end run of the chopratejas/headroom compression pipeline (PyPI: headroom-ai v0.22.4, Apache-2.0) on two real local inputs. Token counts come straight from CompressResult.tokens_before / .tokens_after on the installed package — none of the numbers below are hand-typed. Run python public/skills/shrink-agent-context/example/run_showcase.py inside the clawvard.school/courses/shrink-agent-context repo to reproduce it on your own machine.

How $ saved is computed

Network policy (what this run actually contacted)

install : pip install "headroom-ai[all]" (Python wheels only, no model) warmup : skipped (--skip-warmup or --kompress-model disabled) slim default fetches ~158 MB total across TWO repos: · chopratejas/kompress-base allow_patterns=["onnx/*"] ~156 MB · answerdotai/ModernBERT-base allow_patterns=["*.json","tokenizer*"] ~2 MB Both repos are required because headroom-ai's ONNX scoring path is in kompress-base but it hardcodes AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base") for tokenization. Skipping ModernBERT-base = router:noop (0% saved) under HF_HUB_OFFLINE=1. The slim mode skips the unused 600 MB PyTorch model.safetensors in kompress-base; --warmup-full pulls the full ~720 MB kompress snapshot if you want it. during compress : zero network — headroom-ai loads both cached models from ~/.cache/huggingface/ and runs on-device. after warmup : safe to set HF_HUB_OFFLINE=1 + TRANSFORMERS_OFFLINE=1 + --skip-warmup for strict offline. Verified locally on a clean HOME / HF_HOME: same savings as the warm run. kompress_model : default (chopratejas/kompress-base)

Model	$ / 1M input	$ saved per call	$ saved per month (30 calls)
claude-sonnet-4-5-20250929	$3.00	$0.0993	$2.98
gpt-4o-2024-08-06	$2.50	$0.0827	$2.48

Agent code search — `grep -rn` over an unfamiliar Python repo

48.3% saved

What an agent gets when you ask it to find every entry point named `compress` / `apply` / `Pipeline` across a fresh repo. Default compression keeps the matched line + a few lines of context, drops the noisy boilerplate.

source: `grep -rn -B 1 -A 8 -E 'def compress|def apply|class Pipeline|class CompressConfig|class CompressResult|class PipelineExtension' --include='*.py' $(python -c 'import headroom, os; print(os.path.dirname(headroom.__file__))')`
profile: generic (compress_user_messages=True, protect_recent=0) · target_ratio: 0.3 · model (for counting): claude-sonnet-4-5-20250929

Tokens before

20,291

Tokens after

10,493

Tokens saved

9,798

Side-by-side excerpts (3 of the largest paragraphs)

EXCERPT 1 · before · 20,291 tok input after · 10,493 tok input

/tmp/headv4/lib/python3.12/site-packages/headroom/cache/compression_cache.py-262- /tmp/headv4/lib/python3.12/site-packages/headroom/cache/compression_cache.py:263: def apply_cached(self, messages: list[dict]) -> list[dict]: /tmp/headv4/lib/python3.12/site-packages/headroom/cache/compression_cache.py-264- """Return a new list with cached compressions swapped into tool results. /tmp/headv4/lib/python3.12/site-packages/headroom/cache/compression_cache.py-265- /tmp/headv4/lib/python3.12/site-packages/headroom/cache/compression_cache.py-266- Never mutates the input list or any message dict within it. /tmp/headv4/lib/python3.12/site-packages/headroom/cache/compression_cache.py-267- Output always has the same length as input. /tmp/headv4/lib/python3.12/site-packages/headroom/cache/compression_cache.py-268- """ /tmp/headv4/lib/python3.12/site-packages/headroom/cach

/tmp/headv4/lib/python3.12/site-packages/headroom/cache/compression_cache.py-262-

retrieval ↳ grep-raw.txt#L1 · compact source: compressed/grep-compact.txt

EXCERPT 2 · before · 20,291 tok input after · 10,493 tok input

/tmp/headv4/lib/python3.12/site-packages/headroom/transforms/smart_crusher.py:798: def apply(

retrieval ↳ grep-raw.txt#L1 · compact source: compressed/grep-compact.txt

EXCERPT 3 · before · 20,291 tok input after · 10,493 tok input

/tmp/headv4/lib/python3.12/site-packages/headroom/proxy/compression_decision.py-150- /tmp/headv4/lib/python3.12/site-packages/headroom/proxy/compression_decision.py-151- /tmp/headv4/lib/python3.12/site-packages/headroom/proxy/compression_decision.py-152- /tmp/headv4/lib/python3.12/site-packages/headroom/proxy/compression_decision.py-153- Mutates ``tags`` in place. No-op when ``should_compress=True`` /tmp/headv4/lib/python3.12/site-packages/headroom/proxy/compression_decision.py-154- /tmp/headv4/lib/python3.12/site-packages/headroom/proxy/compression_decision.py-156- /tmp/headv4/lib/python3.12/site-packages/headroom/proxy/helpers.py-2162- [48 items compressed to 14. Retrieve more: hash=082affce4f35010c4dca5d0a]

retrieval ↳ grep-raw.txt#L1 · compact source: compressed/grep-compact.txt

$ savings on this single run

Model	$ / 1M input	$ / call saved	$ / month saved (30 calls)
claude-sonnet-4-5-20250929	$3.00	$0.0294	$0.88
gpt-4o-2024-08-06	$2.50	$0.0245	$0.73

Transforms applied by headroom-ai (verbatim from CompressResult.transforms_applied)

router:mixed:0.36

Long code-context dump — concatenated source for an agent's RAG step

86.9% saved

A realistic 'agent reads everything I know about library X' blob: the entire compress pipeline + every compression handler concatenated in one user message. Compression keeps the docstrings and signatures the agent needs to reason about behavior, drops the boilerplate.

source: concat: `compress.py + pipeline.py + compression/*.py + compression/handlers/*.py` from the installed `headroom-ai` package
profile: generic (compress_user_messages=True, protect_recent=0) · target_ratio: 0.2 · model (for counting): claude-sonnet-4-5-20250929

Tokens before

26,807

Tokens after

3,510

Tokens saved

23,297

Side-by-side excerpts (3 of the largest paragraphs)

EXCERPT 1 · before · 26,807 tok input after · 3,510 tok input

# ── compress.py ── """One-function compression API for Headroom.

compress.py compression API for Headroom. compress result = compress(messages, model="claude-sonnet-4-5-20250929") result.messages result.tokens_saved Tokens result.compression_ratio # e.g., 0.35 means 65% saved any LLM client, proxy, model="claude-sonnet-4-5-20250929") compressed compress(messages, model="gpt-4o") response = client.chat.completions.create(model="gpt-4o", compress(messages, model="bedrock/claude-sonnet") compress(messages, model="claude-sonnet-4-5-20250929") httpx.post("https://api.anthropic.com/v1/messages", json={ "model": "claude-sonnet-4-5-20250929", last N messages Set 0 to compress everything.""" protect_analysis_context: bool = True target_ratio: float """Keep ratio Kompress. 0.5 = keep 50% (safe 0.7 keep 70% (conservative). affects SmartCrusher min_tokens_to_compress: int = 250 """Minimum token count Default 250. kompress_model: tokens_before: Token count before

retrieval ↳ longctx-raw.txt#L1 · compact source: compressed/longctx-compact.txt

EXCERPT 2 · before · 26,807 tok input after · 3,510 tok input

Attributes: compressed: The compressed content. original: The original content (for reference). compression_ratio: compressed_length / original_length. tokens_before: Estimated token count before compression. tokens_after: Estimated token count after compression. content_type: Detected content type. detection_confidence: Confidence of content type detection. handler_used: Name of structure handler used. preservation_ratio: Fraction of content marked as structural. ccr_key: CCR storage key (if CCR enabled). metadata: Additional metadata. """

_MARKDOWN_LABELS = frozenset(

retrieval ↳ longctx-raw.txt#L1425 · compact source: compressed/longctx-compact.txt

EXCERPT 3 · before · 26,807 tok input after · 3,510 tok input

if body_node: # Signature is from start to body start spans.append( CodeSpan( start=node.start_byte, end=body_node.start_byte, role="signature", is_structural=True, ) ) # Body is compressible spans.append( CodeSpan( start=body_node.start_byte, end=body_node.end_byte, role="body", is_structural=False, ) ) else: # No body fo

token min_token_length: return StructureMask( tokens=tokens, mask=mask, metadata={"source": "entropy", "threshold": threshold}, ) compression/universal.py ── """Universal compressor ML-based detection Detects content type using Magika (ML) Extracts structure handler Compresses non-structural content with Kompress stores original in CCR compressor = UniversalCompressor() result = compressor.compress(content) original_tokens compressed_tokens tokens_saved(self) int: max(0, self.tokens_before savings_percentage(self) self.tokens_before) * 100 class UniversalCompressor: compressor ML detection preservation. 1. Detects content type (JSON, code, logs, text) using ML 2. Extracts structure (keys, signatures, templates) Preserves structure while compressing Stores original for retrieval compressor.compress('{"users": [{"id": 1, "name": "Alice"}]}') content_type is None: detection = self._detector

retrieval ↳ longctx-raw.txt#L2389 · compact source: compressed/longctx-compact.txt

$ savings on this single run

Model	$ / 1M input	$ / call saved	$ / month saved (30 calls)
claude-sonnet-4-5-20250929	$3.00	$0.0699	$2.10
gpt-4o-2024-08-06	$2.50	$0.0582	$1.75

Transforms applied by headroom-ai (verbatim from CompressResult.transforms_applied)

router:mixed:0.14

Generated locally with headroom-ai==0.22.4, tiktoken (cl100k_base), no network calls, no API key, no remote LLM service. The HTML is self-contained — open it offline. Re-running run_showcase.py against the same installed package version reproduces the same numbers deterministically.

headroom-savings · before / after

Combined savings across all cases

$ savings (combined, all cases)

How $ saved is computed

Network policy (what this run actually contacted)

Agent code search — `grep -rn` over an unfamiliar Python repo

Side-by-side excerpts (3 of the largest paragraphs)

$ savings on this single run

Long code-context dump — concatenated source for an agent's RAG step

Side-by-side excerpts (3 of the largest paragraphs)

$ savings on this single run