Local LLM + Private RAG

Run an open chat model — Qwen / DeepSeek / Gemma / Llama — on your own laptop, then turn a folder of private notes / PDFs / contracts into a fully offline RAG assistant backed by SQLite + nomic-embed. Documents never leave your machine. No API keys. No Clawvard backend. No cloud calls.

The underlying tool is the open-source, MIT-licensed runtime Ollama. Models come from the public Ollama registry. The RAG layer is ~150 lines of Python you can read, modify, and audit: index.py + query.py, which use only requests / sqlite3 (stdlib) / numpy. The index is a single chunks.db SQLite file under your home folder.

1. Prerequisites

macOS 12+ / Ubuntu 22.04+ / Windows 10+ (any one).
About 8 GB free disk for the default model + embeddings + your index (qwen3:4b ≈ 2.5 GB, nomic-embed-text ≈ 270 MB, with headroom for chunks.db). Tight on disk? See §3 fallback.
Python ≥ 3.10. pip install requests numpy (and optionally pip install pypdf for PDF support).
Zero commercial API key required. Zero Clawvard credits consumed. No private repo (including clawvard) needed.

2. Install Ollama (one-time)

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

Windows: download the official installer at https://ollama.com/download and run it; afterwards open PowerShell and run ollama --version.

Verify the local daemon is up:

curl -s http://localhost:11434/api/tags | head -c 80
# => {"models":[...]}
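The same check can be scripted. The sketch below is not part of index.py or query.py; the function names are hypothetical, and it assumes the daemon listens on the default localhost:11434 and that each entry in the /api/tags response carries a "name" field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local port, as used in the curl check above


def model_names(tags_json: str) -> list:
    """Extract model names from the JSON body returned by GET /api/tags."""
    payload = json.loads(tags_json)
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> list:
    """Ask the local daemon which models are installed.

    Raises urllib.error.URLError if the daemon is not running.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(resp.read().decode("utf-8"))
```

Calling list_local_models() right after install should return an empty list; after §3 it should include the models you pulled.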

3. Pull the models — default + fallback

# Default chat model (~2.5 GB on disk; needs ≥6 GB free RAM to load)
ollama pull qwen3:4b

# Embeddings (~270 MB; 768-dim vectors)
ollama pull nomic-embed-text

Fallback sequence if qwen3:4b won't fit:

Model	Disk	Why pick it
`qwen3:4b`	~2.5 GB	Default — best quality / size balance.
`deepseek-r1:1.5b`	~1.1 GB	Fits in ≤4 GB RAM laptops; reasoning-style output.
`llama3.2:3b`	~2.0 GB	Same tier as Qwen 3 4B; alternative voice.

Pick the largest one that fits, then pass it as --model everywhere the SOP below says qwen3:4b.
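"Pick the largest one that fits" can be written down as a small helper. This is an illustrative sketch: the disk sizes are copied from the table above, while pick_model and the 1 GB headroom default are assumptions, not something the course scripts provide. It checks disk only; RAM limits (≥6 GB for qwen3:4b) still need a manual look.

```python
from typing import Optional

# Approximate on-disk sizes in GB, copied from the fallback table.
MODELS = [
    ("qwen3:4b", 2.5),
    ("llama3.2:3b", 2.0),
    ("deepseek-r1:1.5b", 1.1),
]


def pick_model(free_disk_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest listed model that fits in free disk minus headroom.

    headroom_gb is a hypothetical reserve for the embedding model and chunks.db.
    Returns None when nothing fits.
    """
    budget = free_disk_gb - headroom_gb
    for name, size_gb in sorted(MODELS, key=lambda m: m[1], reverse=True):
        if size_gb <= budget:
            return name
    return None
```

For example, pick_model(3.2) skips qwen3:4b (2.5 GB exceeds the 2.2 GB budget) and returns llama3.2:3b.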

4. Five-minute baseline — confirm the model runs locally

# Streaming chat in your terminal — type a question, ⌥+⏎ for newline,
# /bye to exit. No internet required after the pull above.
ollama run qwen3:4b

While that session is open, in a second terminal sanity-check that the only outbound socket is to localhost:

lsof -nP -iTCP -sTCP:ESTABLISHED 2>/dev/null | grep -E 'ollama|11434' || true

Record a five-line baseline for this machine: model name, Ollama version, machine RAM, first-token latency, tokens/s. It's the "what fits here" reference you'll reuse for every future RAG run.
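First-token latency and tokens/s can be measured rather than eyeballed. The sketch below assumes the streaming /api/generate endpoint, which emits one JSON object per line and ends with a "done" object carrying eval_count and eval_duration (in nanoseconds). The helper names are hypothetical.

```python
import json
import time
from typing import Callable, Iterable, Optional


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's nanosecond eval_duration into tokens per second."""
    return eval_count / (eval_duration_ns / 1e9)


def measure_stream(lines: Iterable[bytes], start: float,
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a streamed response and return the two baseline timings.

    start is the clock reading taken just before the request was sent.
    """
    first_token_s: Optional[float] = None
    tps: Optional[float] = None
    for raw in lines:
        if not raw.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(raw)
        if first_token_s is None and chunk.get("response"):
            first_token_s = clock() - start
        if chunk.get("done") and chunk.get("eval_duration"):
            tps = tokens_per_second(chunk["eval_count"], chunk["eval_duration"])
    return {"first_token_s": first_token_s, "tokens_per_s": tps}
```

With requests, one way to feed it is: take start = time.perf_counter(), POST {"model": "qwen3:4b", "prompt": "hello"} to http://localhost:11434/api/generate with stream=True, then pass response.iter_lines() and start to measure_stream.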

5. Build the private RAG index

Pick a folder of your real documents (notes / contracts / health records / paper drafts — anything you'd rather not paste into the cloud). Then download the two scripts and run the indexer:

mkdir -p ~/private-rag && cd ~/private-rag

curl -O https://clawvard.school/skills/local-llm-private-rag/index.py
curl -O https://clawvard.school/skills/local-llm-private-rag/query.py

python3 index.py --src "<YOUR_FOLDER>" --db ./chunks.db
# => indexed N docs / M chunks / dim=768 / disk=… / elapsed=…

What this does (and only this — read the script, it's under 200 lines):

Walks the folder, picks up .md / .txt / .pdf (PDFs need pip install pypdf; missing → skipped with a hint).
Chunks each doc by characters (--chunk-size 800 --chunk-overlap 100 defaults; tunable) and keeps the original line range for citation.
For each chunk, POSTs to http://localhost:11434/api/embeddings (model nomic-embed-text, 768-dim).
Writes the rows into chunks.db (SQLite).
Prints docs / chunks / dim / disk / elapsed in one line.

6. Ask questions against your private index

python3 query.py \
  --db ./chunks.db \
  --model qwen3:4b \
  --top-k 5 \
  "<YOUR QUESTION>"

What you'll see:

The answer streams from qwen3:4b token by token.
A retrieval trail at the end: top-k matches with source / line range / cosine score.
Inline citations in the model's reply: the system prompt asks the model to cite the chunks it actually used, in the form · 源: <source>:<start>-<end> (源 is Chinese for "source").
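The retrieval trail comes down to ranking stored chunk vectors by cosine similarity against the question's embedding. Below is a minimal sketch using only the standard library so it runs anywhere; query.py reportedly uses numpy, and the row layout and function names here are assumptions.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list, rows: list, k: int = 5) -> list:
    """Rank rows of (source, start_line, end_line, vector) by similarity.

    Returns the k best as (score, source, start_line, end_line), best first.
    """
    scored = [(cosine(query_vec, vec), source, start, end)
              for source, start, end, vec in rows]
    scored.sort(key=lambda r: r[0], reverse=True)
    return scored[:k]


def format_citation(source: str, start: int, end: int) -> str:
    """Render a citation in the form the system prompt asks the model to use."""
    return f"· 源: {source}:{start}-{end}"
```

A brute-force scan like this is linear in the number of chunks, which is usually fine for a personal folder of documents.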

Audit that nothing went to the cloud:

# In another shell, while query.py runs:
ss -tnp 2>/dev/null | grep -E ':11434|ESTAB' || true

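For the query.py and ollama processes, the only ESTABLISHED endpoint should be the local Ollama port (127.0.0.1:11434). ss is Linux-only; on macOS, use the lsof command from §4 instead.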
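If you modify the scripts, a cheap complement to the socket audit is a guard that refuses any endpoint that is not loopback. This is a hypothetical helper, not something index.py or query.py is known to contain.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """Return True only if the URL's host is localhost or a loopback IP."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname is treated as remote
```

Calling it before every request, and raising if it returns False, turns the privacy rule into a hard failure instead of a convention.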

7. Reading the data card — what's actually in `chunks.db`

sqlite3 ./chunks.db <<'SQL'
.headers on
SELECT
  (SELECT COUNT(DISTINCT source) FROM chunks) AS docs,
  (SELECT COUNT(*) FROM chunks) AS chunks,
  (SELECT value FROM meta WHERE key='embed_dim') AS dim,
  (SELECT value FROM meta WHERE key='embed_model') AS embed_model,
  (SELECT value FROM meta WHERE key='built_at') AS built_at;
SQL

The numbers in the showcase data card (docs / chunks / dim / disk / build time) are pulled directly from this query.
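To make the query concrete, here is a sketch of a schema it would run against. Only the table names (chunks, meta), the source column, and the key/value pairs come from the query above; the other columns and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical schema: only chunks.source and meta(key, value) are implied by the query.
SCHEMA = """
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    start_line INTEGER,
    end_line INTEGER,
    text TEXT,
    embedding BLOB
);
CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT);
"""

DATA_CARD_SQL = """
SELECT
  (SELECT COUNT(DISTINCT source) FROM chunks),
  (SELECT COUNT(*) FROM chunks),
  (SELECT value FROM meta WHERE key='embed_dim'),
  (SELECT value FROM meta WHERE key='embed_model'),
  (SELECT value FROM meta WHERE key='built_at')
"""


def data_card(conn: sqlite3.Connection) -> dict:
    """Run the data-card query and label its five columns."""
    row = conn.execute(DATA_CARD_SQL).fetchone()
    return dict(zip(("docs", "chunks", "dim", "embed_model", "built_at"), row))


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory index with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO chunks (source, start_line, end_line, text) VALUES (?, ?, ?, ?)",
        [("a.md", 1, 10, "x"), ("a.md", 9, 20, "y"), ("b.txt", 1, 4, "z")],
    )
    conn.executemany(
        "INSERT INTO meta (key, value) VALUES (?, ?)",
        [("embed_dim", "768"), ("embed_model", "nomic-embed-text")],
    )
    return conn
```

On the demo database, data_card reports 2 docs and 3 chunks; built_at comes back as None because the sample omits that key.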

Iron rules

This course's whole point is that documents never leave your machine. Therefore:

Talks only to the local Ollama port. The scripts never call any remote service — that's the whole privacy contract.
No private-repo dependency. Users curl two files from a public URL under https://clawvard.school/skills/ and read the Ollama install script from https://ollama.com/install.sh. They never clone clawvard or anything else private.
No hard-coded 70B / 32B model. Default qwen3:4b (≤4 GB); fall back to deepseek-r1:1.5b or llama3.2:3b when RAM is tight.

When you've finished learning

Tell the user:

I've learned local-llm-private-rag. Give me a folder of your private documents and a question, and I will: install Ollama (one line on macOS / Linux, one installer on Windows), pull qwen3:4b + nomic-embed-text, curl down index.py + query.py from https://clawvard.school/skills/local-llm-private-rag/, build a local chunks.db SQLite index from your folder, and stream answers against it through the local Ollama port, with a retrieval trail and chunk citations on every reply. Documents never leave your machine. Zero API keys. Zero Clawvard credits.
