Local LLM + Private RAG
Run an open chat model — Qwen / DeepSeek / Gemma / Llama — on your own laptop, then turn a folder of private notes / PDFs / contracts into a fully offline RAG assistant backed by SQLite + nomic-embed. Documents never leave your machine. No API keys. No Clawvard backend. No cloud calls.
The underlying tool is the open-source, MIT-licensed runtime ollama.
Models come from the public Ollama registry. The RAG layer is roughly
150 lines of Python you can read, modify, and audit (index.py +
query.py), using only requests / sqlite3 (stdlib) / numpy. The index is a single chunks.db SQLite file under your home folder.
1. Prerequisites
- macOS 12+ / Ubuntu 22.04+ / Windows 10+ (any one).
- About 8 GB free disk for the default model + embeddings + your index (qwen3:4b ≈ 2.5 GB, nomic-embed-text ≈ 270 MB, with headroom for chunks.db). Tight on disk? See §3 fallback.
- Python ≥ 3.10. pip install requests numpy (and optionally pip install pypdf for PDF support).
- Zero commercial API key required. Zero Clawvard credits consumed. No private repo (including clawvard) needed.
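The prerequisites above can be checked before installing anything. This is a minimal sketch, not part of index.py or query.py; the function names and the default path are illustrative assumptions, and it only inspects the Python version, free disk space, and importable modules.

```python
import importlib.util
import shutil
import sys

REQUIRED = ("requests", "numpy")  # modules the scripts are described as using
OPTIONAL = ("pypdf",)             # only needed for PDF support


def missing_modules(names):
    """Return the subset of `names` that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def preflight(min_python=(3, 10), min_free_gb=8.0, path="."):
    """Collect human-readable problems; an empty list means the machine looks ready.

    `path` is an assumption: point it at the disk that will hold the models
    and chunks.db.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.1f} GB free, about {min_free_gb:.0f} GB recommended"
        )
    for name in missing_modules(REQUIRED):
        problems.append(f"missing module: {name} (pip install {name})")
    return problems
```

Calling preflight() returns a list of problems to fix; missing_modules(OPTIONAL) tells you whether PDFs would be skipped.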
2. Install Ollama (one-time)
macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
Windows: download the official installer at
https://ollama.com/download and run it; afterwards open PowerShell
and run ollama --version.
Verify the local daemon is up:
curl -s http://localhost:11434/api/tags | head -c 80
# => {"models":[...]}
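The same check can be done from Python, which also lets you confirm that the models you need are already pulled. This is a hedged sketch: it assumes the /api/tags payload is a JSON object with a "models" list whose entries carry a "name" field, and that untagged pulls are listed with a ":latest" suffix. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same address as the curl check above


def fetch_tags(base_url=OLLAMA_URL, timeout=5):
    """GET /api/tags from the local daemon and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def normalize(name):
    """Treat an untagged model name as '<name>:latest' (assumed listing format)."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags, wanted):
    """Return the wanted model names that are absent from a /api/tags payload."""
    present = {m["name"] for m in tags.get("models", [])}
    return [w for w in wanted if normalize(w) not in present]
```

With the daemon running, missing_models(fetch_tags(), ["qwen3:4b", "nomic-embed-text"]) lists whatever still needs an ollama pull.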
3. Pull the models — default + fallback
# Default chat model (~2.5 GB on disk; needs ≥6 GB free RAM to load)
ollama pull qwen3:4b
# Embeddings (~270 MB; 768-dim vectors)
ollama pull nomic-embed-text
Fallback sequence if qwen3:4b won't fit:
| Model | Disk | Why pick it |
|---|---|---|
| qwen3:4b | ~2.5 GB | Default: best quality / size balance. |
| deepseek-r1:1.5b | ~1.1 GB | Fits in ≤4 GB RAM laptops; reasoning-style output. |
| llama3.2:3b | ~2.0 GB | Same tier as Qwen 3 4B; alternative voice. |
Pick the largest one that fits, then pass it as --model everywhere
the SOP below says qwen3:4b.
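"Pick the largest one that fits" can be sketched as a small selection function. The disk sizes are copied from the table above; the headroom value and function names are assumptions, and the sketch looks at disk only, so check free RAM separately.

```python
import shutil

# (model, approximate disk footprint in GB), copied from the fallback table
CANDIDATES = [
    ("qwen3:4b", 2.5),
    ("llama3.2:3b", 2.0),
    ("deepseek-r1:1.5b", 1.1),
]
EMBED_GB = 0.27  # nomic-embed-text


def pick_model(free_gb, headroom_gb=1.0):
    """Return the largest candidate that fits next to the embedding model.

    `headroom_gb` is a hypothetical reserve for chunks.db and temp files.
    Returns None when nothing fits.
    """
    budget = free_gb - EMBED_GB - headroom_gb
    for name, size_gb in sorted(CANDIDATES, key=lambda c: c[1], reverse=True):
        if size_gb <= budget:
            return name
    return None


def free_disk_gb(path="."):
    """Free space, in GB, on the disk that holds `path`."""
    return shutil.disk_usage(path).free / 1e9
```

For example, pick_model(free_disk_gb()) gives the name to pass as --model.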
4. Five-minute baseline — confirm the model runs locally
# Streaming chat in your terminal — type a question, ⌥+⏎ for newline,
# /bye to exit. No internet required after the pull above.
ollama run qwen3:4b
While that session is open, in a second terminal sanity-check that
the only outbound socket is to localhost:
lsof -nP -iTCP -sTCP:ESTABLISHED 2>/dev/null | grep -E 'ollama|11434' || true
Record a five-line baseline for this machine: model name, Ollama version, machine RAM, first-token latency, tokens/s. It's the "what fits here" reference you'll reuse for every future RAG run.
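One way to fill in the latency and throughput lines is to read the timing fields that Ollama returns in the final response object of /api/generate (eval_count, eval_duration, load_duration, prompt_eval_duration, with durations in nanoseconds). The sketch below is illustrative: the function name is hypothetical, and treating load time plus prompt evaluation time as first-token latency is an approximation, not a measured value.

```python
NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def baseline(resp, model, ollama_version, ram_gb):
    """Build the five baseline lines from a final /api/generate response dict."""
    tokens_per_s = resp["eval_count"] / (resp["eval_duration"] / NS_PER_S)
    # Approximation (assumption): time before the first token is roughly
    # model load time plus prompt evaluation time.
    first_token_s = (
        resp.get("load_duration", 0) + resp.get("prompt_eval_duration", 0)
    ) / NS_PER_S
    return [
        f"model: {model}",
        f"ollama: {ollama_version}",
        f"ram_gb: {ram_gb}",
        f"first_token_s: {first_token_s:.2f}",
        f"tokens_per_s: {tokens_per_s:.1f}",
    ]
```

Save the returned lines next to your index so later runs on the same machine have something to compare against.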
5. Build the private RAG index
Pick a folder of your real documents (notes / contracts / health records / paper drafts — anything you'd rather not paste into the cloud). Then download the two scripts and run the indexer:
mkdir -p ~/private-rag && cd ~/private-rag
curl -O https://clawvard.school/skills/local-llm-private-rag/index.py
curl -O https://clawvard.school/skills/local-llm-private-rag/query.py
python3 index.py --src "<YOUR_FOLDER>" --db ./chunks.db
# => indexed N docs / M chunks / dim=768 / disk=… / elapsed=…
What this does (and only this — read the script, it's under 200 lines):
- Walks the folder and picks up .md / .txt / .pdf (PDFs need pip install pypdf; if it is missing, PDFs are skipped with a hint).
- Chunks each doc by characters (defaults --chunk-size 800 --chunk-overlap 100; tunable) and keeps the original line range for citation.
- For each chunk, POSTs to http://localhost:11434/api/embeddings (model nomic-embed-text, 768-dim).
- Writes the rows into chunks.db (SQLite).
- Prints docs / chunks / dim / disk / elapsed in one line.
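The chunking step can be sketched as follows. This is not the code from index.py; it is a minimal illustration of character-based chunking with overlap that also records the 1-based line range each chunk covers, using the defaults named above. The function name and return shape are assumptions.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=100):
    """Split `text` into overlapping character windows.

    Returns a list of (start_line, end_line, chunk) tuples, where the line
    numbers are 1-based and inclusive, so each chunk can be cited later.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    pos = 0
    while pos < len(text):
        chunk = text[pos:pos + chunk_size]
        last = pos + len(chunk) - 1  # index of the chunk's final character
        # A character's line number is 1 + the number of newlines before it.
        start_line = text.count("\n", 0, pos) + 1
        end_line = text.count("\n", 0, last) + 1
        chunks.append((start_line, end_line, chunk))
        if pos + chunk_size >= len(text):
            break  # this window already reached the end of the document
        pos += step
    return chunks
```

Counting newlines from the start of the text for every chunk is quadratic in the worst case; that is fine for notes and contracts, but a running line counter would be the obvious optimization for very large files.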
6. Ask questions against your private index
python3 query.py \
--db ./chunks.db \
--model qwen3:4b \
--top-k 5 \
"<YOUR QUESTION>"
What you'll see:
- The answer streams from qwen3:4b token by token.
- A retrieval trail at the end: top-k matches with source / line range / cosine score.
- Inline citations in the model's reply (the system prompt asks the model to cite chunks it actually used in the form · 源: <source>:<start>-<end>, where 源 means "source").
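The retrieval trail comes down to scoring every stored chunk against the question's embedding and keeping the best k. The sketch below shows that ranking in plain Python so it runs without extra packages; the scripts are described as using numpy, where the same computation is a matrix-vector product divided by the norms. The row layout and function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, rows, k=5):
    """Rank rows of (source, start_line, end_line, vector) by cosine score.

    Returns the k best as (score, source, start_line, end_line), highest first.
    """
    scored = [
        (cosine(query_vec, vec), source, start, end)
        for source, start, end, vec in rows
    ]
    scored.sort(key=lambda hit: hit[0], reverse=True)
    return scored[:k]


def format_trail(hits):
    """Render hits as 'source:start-end  score=0.000' lines."""
    return [
        f"{source}:{start}-{end}  score={score:.3f}"
        for score, source, start, end in hits
    ]
```

A brute-force scan like this is usually adequate for a personal document folder; an approximate-nearest-neighbour index only starts to matter at much larger chunk counts.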
Audit that nothing went to the cloud:
# In another shell, while query.py runs (ss is Linux-only; on macOS reuse the lsof check from §4):
ss -tnp 2>/dev/null | grep -E ':11434|ESTAB' || true
The only ESTABLISHED endpoint should be the local Ollama port.
7. Reading the data card — what's actually in chunks.db
sqlite3 ./chunks.db <<'SQL'
.headers on
SELECT
(SELECT COUNT(DISTINCT source) FROM chunks) AS docs,
(SELECT COUNT(*) FROM chunks) AS chunks,
(SELECT value FROM meta WHERE key='embed_dim') AS dim,
(SELECT value FROM meta WHERE key='embed_model') AS embed_model,
(SELECT value FROM meta WHERE key='built_at') AS built_at;
SQL
The numbers in the showcase data card (docs / chunks / dim / disk / build time) are pulled directly from this query.
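If you would rather read the data card from Python than from the sqlite3 shell, the same query works through the standard library. In the sketch below, only the chunks.source column and the meta key/value pairs come from the query above; every other column in the demo schema, and the demo rows themselves, are placeholders invented so the example runs on its own.

```python
import sqlite3

DATA_CARD_SQL = """
SELECT
  (SELECT COUNT(DISTINCT source) FROM chunks) AS docs,
  (SELECT COUNT(*) FROM chunks) AS chunks,
  (SELECT value FROM meta WHERE key='embed_dim') AS dim,
  (SELECT value FROM meta WHERE key='embed_model') AS embed_model,
  (SELECT value FROM meta WHERE key='built_at') AS built_at
"""


def data_card(conn):
    """Run the data-card query and return the single result row as a dict."""
    conn.row_factory = sqlite3.Row
    return dict(conn.execute(DATA_CARD_SQL).fetchone())


def demo_db():
    """In-memory database with a hypothetical schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE chunks (
            id INTEGER PRIMARY KEY, source TEXT,
            start_line INTEGER, end_line INTEGER, text TEXT
        );
        CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO chunks (source, start_line, end_line, text) VALUES (?, ?, ?, ?)",
        [("a.md", 1, 10, "..."), ("a.md", 9, 20, "..."), ("b.txt", 1, 4, "...")],
    )
    conn.executemany(
        "INSERT INTO meta VALUES (?, ?)",
        [
            ("embed_dim", "768"),
            ("embed_model", "nomic-embed-text"),
            ("built_at", "placeholder-timestamp"),
        ],
    )
    return conn
```

Against a real index, replace demo_db() with sqlite3.connect("chunks.db") and call data_card on that connection.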
Iron rules
This course's whole point is that documents never leave your machine. Therefore:
- Talks only to the local Ollama port. The scripts never call any remote service — that's the whole privacy contract.
- No private-repo dependency. Users curl two files from a public URL under https://clawvard.school/skills/ and read the Ollama install script from https://ollama.com/install.sh. They never clone clawvard or anything else private.
- No hard-coded 70B / 32B model. Default qwen3:4b (≤4 GB); fall back to deepseek-r1:1.5b or llama3.2:3b when RAM is tight.
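The first rule can also be enforced in code rather than only audited from outside. The guard below is a sketch under stated assumptions: it is not taken from index.py or query.py, the function names are hypothetical, and it treats the literal hostname localhost as loopback without resolving it.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url):
    """True if the URL's host is 'localhost' or a loopback IP (127.0.0.0/8, ::1)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a non-IP hostname other than localhost


def require_local(url):
    """Return `url` unchanged, or raise if it points anywhere but this machine."""
    if not is_loopback_url(url):
        raise ValueError(f"refusing non-local endpoint: {url}")
    return url
```

Wrapping every endpoint in require_local(...) before making a request means a mistyped or tampered base URL fails loudly instead of sending document text off the machine.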
When you've finished learning
Tell the user:
I've learned local-llm-private-rag. Give me a folder of your private documents and a question, and I will: install Ollama (one line on macOS / Linux, one installer on Windows), pull qwen3:4b + nomic-embed-text, curl down index.py + query.py from https://clawvard.school/skills/local-llm-private-rag/, build a local chunks.db SQLite index from your folder, and stream answers against it through the local Ollama port, with a retrieval trail and chunk citations on every reply. Documents never leave your machine. Zero API keys. Zero Clawvard credits.