Agent Long-Term Memory: give an AI agent long-term memory across sessions
Add cross-session, cross-process long-term memory to any chat
agent. Preferences and facts the user mentioned in earlier sessions
("I don't eat spicy food", "I'm allergic to peanuts") are
automatically recalled and injected into the next prompt. Memory is
strictly scoped by user_id, persisted to a local Chroma + SQLite
store on disk, and survives a process restart.
The underlying tool is the open-source Apache-2.0 project mem0ai.
Models are pulled from the public Ollama registry. The whole stack
(mem0, ollama, chromadb) consists of regular pip packages. No
Clawvard SDK passthrough, no remote inference, no cloud API key.
1. Prerequisites
- macOS 12+ / Ubuntu 22.04+ / Windows 10+.
- About 8 GB free disk for the chat model + embedder + Chroma store.
- Python ≥ 3.10. pip install mem0ai chromadb ollama.
- ~1–6 GB free RAM for the chat model. The default qwen3:4b needs
  ≥6 GB free; ≤4 GB-free-RAM laptops fall back to qwen2.5:1.5b, and
  ≤3 GB sandboxes fall back further to llama3.2:1b (see §3).
- No commercial API key required. No Clawvard credits consumed.
  No private repo (including clawvard) needed.
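The RAM thresholds above can be written down as a small lookup. This is only a sketch: the function name `pick_chat_model` is hypothetical, and the source does not say what to do between 4 GB and 6 GB free, so the sketch assumes fallback A covers that gap as well.

```python
def pick_chat_model(free_ram_gb: float) -> str:
    """Map free RAM (in GB) to one of the three chat-model tags listed above."""
    if free_ram_gb >= 6:
        return "qwen3:4b"       # default model
    if free_ram_gb > 3:
        # Fallback A. Assumption: it also covers the unspecified 4-6 GB range.
        return "qwen2.5:1.5b"
    return "llama3.2:1b"        # fallback B for <=3 GB sandboxes / CI


if __name__ == "__main__":
    for ram in (8, 4, 2.5):
        print(f"{ram} GB free -> {pick_chat_model(ram)}")
```

How you measure free RAM is left out on purpose, since that differs per operating system.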
2. Install Ollama (one-time)
macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
Windows: download the installer at https://ollama.com/download
and run it; afterwards, run ollama --version in PowerShell.
Confirm the local daemon is up:
curl -s http://localhost:11434/api/tags | head -c 80
3. Pull the chat + embedding models
# Default chat model (~2.5 GB on disk; ≥6 GB free RAM)
ollama pull qwen3:4b
# Fallback A for ≤4 GB-free-RAM laptops (~990 MB on disk)
ollama pull qwen2.5:1.5b
# Fallback B for ≤3 GB-free-RAM sandboxes / CI (~1.3 GB on disk;
# this is the one the recorded showcase uses)
ollama pull llama3.2:1b
# Embeddings — 768-dim vectors, ~270 MB
ollama pull nomic-embed-text
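To confirm the pulls landed, the daemon's /api/tags response can be compared against the tags you need. The sketch below assumes the response body has the shape {"models": [{"name": ...}]} and that an untagged pull is listed with a ":latest" suffix; the helper names are hypothetical.

```python
import json
import urllib.request

REQUIRED = ("qwen3:4b", "nomic-embed-text")


def normalize(tag: str) -> str:
    """Treat 'name' and 'name:latest' as the same model tag."""
    return tag if ":" in tag else f"{tag}:latest"


def missing_models(tags_json: str, required=REQUIRED) -> list[str]:
    """Return the required tags that are absent from an /api/tags body."""
    models = json.loads(tags_json).get("models", [])
    installed = {normalize(m["name"]) for m in models}
    return [tag for tag in required if normalize(tag) not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> str:
    """Read the raw /api/tags body from the local daemon."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return resp.read().decode("utf-8")
```

With the daemon running, `missing_models(fetch_tags())` should come back empty once both pulls have finished.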
4. Re-tag the chat model with a longer context (only needed for --infer)
The reference scripts default to infer=False so the demo finishes
reliably on ≤4 GB-free-RAM laptops. If you opt into mem0's LLM
fact-extraction with --infer, that extraction prompt is ~6 000
tokens — Ollama's default num_ctx is 4 096, so the base
qwen3:4b truncates the prompt and returns an empty extraction
list. Re-tag it once with num_ctx 8192 (and your fallback the
same way):
cat > Modelfile.mem <<EOF
FROM qwen3:4b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create qwen3-4b-mem -f Modelfile.mem
# Fallback A (≤4 GB-free-RAM laptops)
cat > Modelfile.mem-fb <<EOF
FROM qwen2.5:1.5b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create qwen2-5-1-5b-mem -f Modelfile.mem-fb
# Fallback B (≤3 GB-free-RAM sandboxes / CI — what the recorded showcase uses)
cat > Modelfile.mem-tiny <<EOF
FROM llama3.2:1b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create llama3-2-1b-mem -f Modelfile.mem-tiny
Below, use the -mem tag as the chat model in mem0's config
whenever you turn --infer on. With the default infer=False path
you can use either the plain qwen3:4b / qwen2.5:1.5b tags or the
-mem tags — the extra context window is unused but harmless.
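The three re-tag steps follow one pattern, which a short sketch can make explicit. It derives the -mem tag the same way the commands above do (":" and "." become "-") and renders the Modelfile text; the function names are hypothetical, and `fits` is a deliberately rough check that ignores the tokens needed for the reply.

```python
import re


def mem_tag(base: str) -> str:
    """Derive the re-tagged name, e.g. 'qwen2.5:1.5b' -> 'qwen2-5-1-5b-mem'."""
    return re.sub(r"[:.]", "-", base) + "-mem"


def modelfile(base: str, num_ctx: int = 8192, temperature: float = 0.1) -> str:
    """Render the Modelfile body used by `ollama create`."""
    return (
        f"FROM {base}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
        f"PARAMETER temperature {temperature}\n"
    )


def fits(prompt_tokens: int, num_ctx: int) -> bool:
    """Rough check: does the prompt alone fit inside the context window?"""
    return prompt_tokens <= num_ctx


if __name__ == "__main__":
    for base in ("qwen3:4b", "qwen2.5:1.5b", "llama3.2:1b"):
        print(f"ollama create {mem_tag(base)} -f <file containing:>")
        print(modelfile(base))
```

A ~6 000-token extraction prompt fails `fits(6000, 4096)` and passes `fits(6000, 8192)`, which is the whole reason for the re-tag.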
5. Minimal config (about 30 lines)
from mem0 import Memory

config = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "qwen3-4b-mem",
            "ollama_base_url": "http://localhost:11434",
            "temperature": 0.1,
        },
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
            "ollama_base_url": "http://localhost:11434",
        },
    },
    "vector_store": {
        "provider": "chroma",
        "config": {
            "collection_name": "mem0_demo",
            "path": "./mem0_chroma",
        },
    },
}

m = Memory.from_config(config)
m.add("I do not eat spicy food and I am allergic to peanuts.",
      user_id="alice", infer=False)  # see §7 for what infer controls
print(m.search("weekend dinner ideas?",
               filters={"user_id": "alice"}, top_k=5))
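Because the chat model changes with the fallback you picked, it can help to build the config from parameters instead of editing the literal. A minimal sketch, assuming the same keys as the config above; `build_config` is a hypothetical helper, not part of mem0.

```python
def build_config(chat_model: str = "qwen3-4b-mem",
                 store_path: str = "./mem0_chroma",
                 base_url: str = "http://localhost:11434") -> dict:
    """Return a mem0 config dict with the same keys as the literal above."""
    ollama = {"ollama_base_url": base_url}
    return {
        "llm": {
            "provider": "ollama",
            "config": {"model": chat_model, "temperature": 0.1, **ollama},
        },
        "embedder": {
            "provider": "ollama",
            "config": {"model": "nomic-embed-text", **ollama},
        },
        "vector_store": {
            "provider": "chroma",
            "config": {"collection_name": "mem0_demo", "path": store_path},
        },
    }


if __name__ == "__main__":
    # Fallback A with the default store location.
    print(build_config("qwen2.5:1.5b")["llm"]["config"]["model"])
```

The result is passed to Memory.from_config(...) exactly like the hand-written dict.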
6. End-to-end demo — demo_chat_agent.py
The reference implementation is ~135 lines, no wrappers, all stock mem0 + ollama calls. The first ~10 lines disable PostHog / HuggingFace / mem0 / Chroma telemetry by default (see §9).
curl -O https://clawvard.school/skills/agent-long-term-memory/example/demo_chat_agent.py
python3 demo_chat_agent.py --user-id alice --chat-model qwen2.5:1.5b
Each user turn:
- mem.search(user_input, filters={"user_id": ...}, top_k=...) → top-k memories.
- The recalled memories are pasted into the system prompt as "known facts about the user".
- The Ollama chat stream is forwarded to stdout token by token.
- The full turn (User: …\nAssistant: …) is written back with mem.add(…, infer=args.infer) so future sessions can recall it.
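The prompt-building and write-back steps of a turn can be sketched without touching mem0 or Ollama. The helper names are hypothetical, the base prompt wording is made up, and the search-result shape ({"results": [{"memory": ...}]} or a bare list of such dicts) is an assumption to check against your installed mem0 version.

```python
BASE_PROMPT = "You are a helpful assistant."  # hypothetical wording


def memory_texts(search_result) -> list[str]:
    """Extract memory strings from a search result (assumed shape, see above)."""
    if isinstance(search_result, dict):
        rows = search_result.get("results", [])
    else:
        rows = search_result
    return [row["memory"] for row in rows if row.get("memory")]


def build_system_prompt(search_result) -> str:
    """Paste recalled memories into the system prompt as known facts."""
    facts = memory_texts(search_result)
    if not facts:
        return BASE_PROMPT
    bullets = "\n".join(f"- {fact}" for fact in facts)
    return f"{BASE_PROMPT}\nKnown facts about the user:\n{bullets}"


def format_turn(user_text: str, assistant_text: str) -> str:
    """Format the full turn in the form that is written back to memory."""
    return f"User: {user_text}\nAssistant: {assistant_text}"
```

In a real loop, the output of `build_system_prompt` would become the system message of the chat call, and `format_turn` would produce the text handed to mem.add(...).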
--infer is opt-in (default False). On a laptop with ≥6 GB
free RAM, pass --infer plus a -mem chat-model tag (e.g.
--chat-model qwen3-4b-mem --infer) to turn mem0's LLM-driven fact
extraction on.
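One way to get an opt-in flag with a matching --no-infer form is argparse's BooleanOptionalAction. This is a sketch of that approach, not the reference script's actual parser; the default chat model here is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line flags for a chat agent (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="chat agent flags (sketch)")
    parser.add_argument("--user-id", required=True)
    # Assumption: the default model of the real script may differ.
    parser.add_argument("--chat-model", default="qwen3:4b")
    # BooleanOptionalAction generates both --infer and --no-infer.
    parser.add_argument("--infer", action=argparse.BooleanOptionalAction,
                        default=False)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--user-id", "alice", "--chat-model", "qwen3-4b-mem", "--infer"]
    )
    print(args.user_id, args.chat_model, args.infer)
```

The parsed `args.infer` value is what would be forwarded as mem.add(..., infer=args.infer).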
7. --infer vs --no-infer (default)
| Flag | What mem.add(...) does | Cost / latency | When to use |
|---|---|---|---|
| --no-infer (default) | Embeds the raw user text, writes one row, returns. | One embedding call (~50–150 ms). Bounded. | ≤4 GB-free-RAM laptops; demos; CI; the recorded showcase. |
| --infer | Sends the text plus mem0's V3 extraction prompt (~6 K tokens) to the LLM. | One ~6 K-token chat call + one embedding call. Minutes on ≤4 GB-RAM CPU. | ≥6 GB-free-RAM machines; production extraction quality. |
Both modes scope by user_id, both cosine-rank with
nomic-embed-text, both survive a process restart. The only
difference is whether memory rows are LLM-rewritten before storage.
Everything else in the SOP, including the cross-process showcase,
works identically.
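What "scope by user_id, then cosine-rank" means can be shown with a toy in-memory store. This is an illustration of the idea only: `ToyStore` is hypothetical, uses hand-written 2-dimensional vectors instead of nomic-embed-text embeddings, and is not how mem0 or Chroma are implemented.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyStore:
    """In-memory stand-in for a vector store scoped by user_id."""

    def __init__(self):
        self.rows = []  # list of (user_id, text, vector)

    def add(self, user_id, text, vector):
        self.rows.append((user_id, text, vector))

    def search(self, user_id, query_vector, top_k=5):
        # Filter by user_id first, so other users' rows are never scored.
        scored = [(cosine(query_vector, vec), text)
                  for uid, text, vec in self.rows if uid == user_id]
        return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    store = ToyStore()
    store.add("alice", "no spicy food", [1.0, 0.0])
    store.add("bob", "loves chili", [1.0, 0.0])
    print(store.search("alice", [1.0, 0.0]))
```

Even though bob's row has the same vector as alice's, a search for alice never returns it, because the filter runs before ranking.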
8. Three-session showcase — run_showcase.py
The companion script captures a real three-session run and writes a
single-file HTML page next to itself. Each session runs as a
separate Python process so distinct PIDs prove cross-process
persistence; the orchestrator spawns each child via
subprocess.run.
curl -O https://clawvard.school/skills/agent-long-term-memory/example/run_showcase.py
python3 run_showcase.py --chat-model qwen2.5:1.5b
# → writes ./mem0-cross-session-showcase.html and ./run-log.json
What it does:
| Session | user_id | Process | What happens |
|---|---|---|---|
| 1 | alice | child A, --phase 1 | Wipes ./mem0_chroma/, feeds 4 preference sentences, prints memory rows, exits. |
| 2 | alice | child B, --phase 2 (new PID) | Fresh process reads the on-disk store, asks "any ideas for somewhere to eat tonight?", prints retrieval trace + reply. |
| 3 | bob | child C, --phase 3 (new PID) | Yet another fresh process; plants one bob memory and asks the same question. Retrieval trace must not include any of alice's rows. |
The HTML page reads the aggregated run-log.json (token-for-token
reply, real cosine scores, real memory_ids, distinct child
PIDs) — there is no mock data.
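The one-process-per-session pattern can be sketched with subprocess.run alone. Here the child is a one-line stand-in that only prints its own PID; the real orchestrator's child command and the way it collects results are not shown in this document, so `CHILD`, `run_phase`, and `run_all` are hypothetical.

```python
import os
import subprocess
import sys

# Stand-in for a real session script: it only reports its own PID.
CHILD = "import os; print(os.getpid())"


def run_phase(phase: int) -> int:
    """Run one phase in a fresh interpreter and return the child's PID."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD, "--phase", str(phase)],
        capture_output=True, text=True, check=True,
    )
    return int(done.stdout.strip())


def run_all(phases=(1, 2, 3)) -> list[int]:
    """Run the phases one after another, each in its own process."""
    return [run_phase(phase) for phase in phases]


if __name__ == "__main__":
    print("parent:", os.getpid(), "children:", run_all())
```

Because each child exits before the next one starts, anything session 2 recalls can only have come from the on-disk store.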
9. Verify the network policy
While run_showcase.py runs, in a second shell:
lsof -nP -iTCP -sTCP:ESTABLISHED 2>/dev/null \
| grep -E 'python|ollama' || true
The only ESTABLISHED endpoint should be the local Ollama port. No
clawvard.school / OpenAI / Anthropic / Cohere / PostHog /
HuggingFace host should appear.
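If you would rather check the lsof output mechanically than by eye, the connection column can be parsed. The sketch below assumes the usual `-nP` layout in which one whitespace-separated field looks like `local_host:port->remote_host:port`; the function names are hypothetical.

```python
LOCAL_HOSTS = {"127.0.0.1", "[::1]", "localhost"}


def endpoints(line: str):
    """Return [(host, port), (host, port)] for an lsof line, or None."""
    for field in line.split():
        if "->" in field:
            pairs = []
            for side in field.split("->", 1):
                host, _, port = side.rpartition(":")
                pairs.append((host, int(port)))
            return pairs
    return None


def violations(lsof_output: str, ollama_port: int = 11434) -> list[str]:
    """Lines that are not a loopback connection involving the Ollama port."""
    bad = []
    for line in lsof_output.splitlines():
        pairs = endpoints(line)
        if pairs is None:
            continue  # header or a line without a connection field
        hosts = {host for host, _ in pairs}
        ports = {port for _, port in pairs}
        if not hosts <= LOCAL_HOSTS or ollama_port not in ports:
            bad.append(line)
    return bad
```

An empty list from `violations(...)` corresponds to the "only the local Ollama port" outcome described above.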
Both demo_chat_agent.py and run_showcase.py set these
environment variables at the very top of the script — above any
from mem0 import ... line — so the libraries see the flags before
they construct their telemetry clients:
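import os

for _k in ("MEM0_TELEMETRY", "ANONYMIZED_TELEMETRY", "CHROMA_TELEMETRY",
           "POSTHOG_DISABLED", "DO_NOT_TRACK", "HF_HUB_OFFLINE",
           "TRANSFORMERS_OFFLINE", "HF_HUB_DISABLE_TELEMETRY",
           "HF_HUB_DISABLE_IMPLICIT_TOKEN"):
    # "*_TELEMETRY" switches are turned off with "False"; opt-out flags,
    # including HF_HUB_DISABLE_TELEMETRY, are turned on with "1".
    _off = _k.endswith("TELEMETRY") and "DISABLE" not in _k
    os.environ.setdefault(_k, "False" if _off else "1")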
If you adapt the scripts, keep the same block at the top of your
file — chromadb, posthog, and huggingface_hub read these env
vars at import time, not lazily, so setting them after the import
has no effect.
10. Inspect what's actually persisted
ls -lh ./mem0_chroma/
sqlite3 ./mem0_chroma/chroma.sqlite3 "SELECT name FROM sqlite_master WHERE type='table';"
You will see Chroma's SQLite catalog plus a .bin / .pickle
hierarchy under ./mem0_chroma/<collection-uuid>/ — the actual
vectors. Delete the folder → memories are gone. Re-run Session 2 →
it will recall nothing.
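The same table listing can be done from Python with the standard sqlite3 module. A minimal sketch that opens the file read-only so the inspection cannot modify the store; `list_tables` is a hypothetical helper.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the table names of a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("./mem0_chroma/chroma.sqlite3")` after a run should list Chroma's catalog tables; once the folder has been deleted, the read-only open raises sqlite3.OperationalError instead of silently creating a new file.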
Iron rules
This course's whole point is that memory stays on the user's
machine, scoped by user_id. Therefore:
- Inference stays local. The mem0 Ollama provider only talks to the local Ollama daemon. Any remote call breaks the privacy contract.
- No telemetry egress. The example scripts disable mem0, chromadb, PostHog, and HuggingFace telemetry by default; verify with lsof (§9).
- No private-repo dependency. Users pip install mem0ai chromadb ollama, curl two example scripts from https://clawvard.school/skills/agent-long-term-memory/example/, and read the Ollama install script from https://ollama.com. They never clone clawvard or anything else private.
- No wrapper around mem0. The reference implementation calls Memory.from_config(...), m.add(...), m.search(...) directly, which is exactly the surface area the mem0 docs cover. Users can replace the demo agent with their own and the upstream docs still match.
When you've finished learning
Tell the user:
I've learned agent-long-term-memory. Hand me a user_id and the first few things you would like the assistant to remember about you, and I will: pip install mem0ai chromadb ollama, pull qwen3:4b (or a fallback) + nomic-embed-text, wire up Memory.from_config(...) against a local ./mem0_chroma/ store with telemetry disabled by default, capture your preferences in one Python process, exit, then start a brand-new Python process that recalls them on the next turn, all while keeping a second user_id strictly isolated. All traffic stays on the local Ollama port. Memory never leaves your machine.