A real Ollama session captured on a sandbox laptop on
2026-05-31. Three frames: a streaming chat answer, two RAG queries
against a private SQLite index, and the actual chunks.db
numbers behind both. Nothing here is a mock — every token below was
produced by an open chat model running on this machine, talking only
to http://localhost:11434.
qwen2.5:1.5b instead of the SOP default
qwen3:4b? The capture sandbox had ~2.7 GB free RAM —
below Ollama's reported 2.8 GB minimum for the 4 GB model, so the
session fell back through the documented chain
qwen3:4b → deepseek-r1:1.5b → llama3.2:3b → qwen2.5:1.5b.
Run the same scripts on a ≥6 GB free-RAM laptop and the default
--model qwen3:4b flag takes over unchanged.
A real POST /api/chat exchange. The
whole reply streamed token by token through the local socket on your
machine. The replay below uses the actual inter-token
timestamps captured from the wire (compressed 3× for viewing); the
token count and wall-clock seconds on the bottom bar are the
un-scaled originals.
Two real questions answered against the local
chunks.db built from a five-document corpus (four NASA
public-domain primers + one synthetic private journal standing
in for "things you would rather not paste into a cloud LLM"). The
answer streams in first; the retrieval trail
beneath each panel is the actual top-5 cosine scores the script
produced.
chunks.dbNumbers come straight from the SQLite file
shipped alongside this showcase
(public/skills/local-llm-private-rag/chunks.db). QA can
reproduce every value with
sqlite3 chunks.db "SELECT COUNT(*) FROM chunks;".
| source | chunks | chars |
|---|