recorded locally · 0 outbound

Local LLM + Private RAG

A real Ollama session captured on a sandbox laptop on 2026-05-31. Three frames: a streaming chat answer, two RAG queries against a private SQLite index, and the actual chunks.db numbers behind both. Nothing here is a mock — every token below was produced by an open chat model running on this machine, talking only to http://localhost:11434.

chat model qwen2.5:1.5b (Q4_K_M, 986 MB) embed model nomic-embed-text (768-dim, 274 MB) runtime Ollama 0.24.0 · CPU only
Why qwen2.5:1.5b instead of the SOP default qwen3:4b? The capture sandbox had ~2.7 GB free RAM — below Ollama's reported 2.8 GB minimum for the 4 GB model, so the session fell back through the documented chain qwen3:4b → deepseek-r1:1.5b → llama3.2:3b → qwen2.5:1.5b. Run the same scripts on a ≥6 GB free-RAM laptop and the default --model qwen3:4b flag takes over unchanged.

① Streaming chat in your terminal

ollama run · /api/chat · stream:true

A real POST /api/chat exchange. The whole reply streamed token by token through the local socket on your machine. The replay below uses the actual inter-token timestamps captured from the wire (compressed 3× for viewing); the token count and wall-clock seconds on the bottom bar are the un-scaled originals.

~ ollama run qwen2.5:1.5b local · 127.0.0.1:11434
scale 3× · pure JS, no CDN tokens · s · tok/s

② Private RAG with citations

python3 query.py · top-k 5 · cosine ranking

Two real questions answered against the local chunks.db built from a five-document corpus (four NASA public-domain primers + one synthetic private journal standing in for "things you would rather not paste into a cloud LLM"). The answer streams in first; the retrieval trail beneath each panel is the actual top-5 cosine scores the script produced.

~ python3 query.py … saturn-v thrust local-only
scale 4× · honest token-by-token tokens · s
~ python3 query.py … private journal local-only
scale 4× · honest token-by-token tokens · s

③ What's actually in chunks.db

sqlite3 chunks.db — pulled live

Numbers come straight from the SQLite file shipped alongside this showcase (public/skills/local-llm-private-rag/chunks.db). QA can reproduce every value with sqlite3 chunks.db "SELECT COUNT(*) FROM chunks;".

sourcechunkschars