Agent Long-Term Memory — 给 AI agent 装上跨 session 长期记忆

Add cross-session, cross-process long-term memory to any chat agent. Preferences and facts the user mentioned in earlier sessions ("I don't eat spicy food", "I'm allergic to peanuts") are automatically recalled and injected into the next prompt. Memory is strictly scoped by user_id, persisted to a local Chroma + SQLite store on disk, and survives a process restart.

The underlying tool is the open-source Apache-2.0 project mem0ai. Models are pulled from the public Ollama registry. The whole stack — mem0 + ollama + chromadb — is regular pip packages. No Clawvard SDK passthrough, no remote inference, no cloud API key.

1. Prerequisites

macOS 12+ / Ubuntu 22.04+ / Windows 10+.
About 8 GB free disk for the chat model + embedder + Chroma store.
Python ≥ 3.10. pip install mem0ai chromadb ollama.
~1–6 GB free RAM for the chat model. Default qwen3:4b needs ≥6 GB free; ≤4 GB-free-RAM laptops fall back to qwen2.5:1.5b, ≤3 GB sandboxes fall back further to llama3.2:1b (see §3).
Zero commercial API key required. No Clawvard credits consumed. No private repo (including clawvard) needed.

2. Install Ollama (one-time)

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

Windows: download the installer at https://ollama.com/download and run it; afterwards ollama --version in PowerShell.

Confirm the local daemon is up:

curl -s http://localhost:11434/api/tags | head -c 80

3. Pull the chat + embedding models

# Default chat model (~2.5 GB on disk; ≥6 GB free RAM)
ollama pull qwen3:4b

# Fallback A for ≤4 GB-free-RAM laptops (~990 MB on disk)
ollama pull qwen2.5:1.5b

# Fallback B for ≤3 GB-free-RAM sandboxes / CI (~1.3 GB on disk;
# this is the one the recorded showcase uses)
ollama pull llama3.2:1b

# Embeddings — 768-dim vectors, ~270 MB
ollama pull nomic-embed-text

4. Re-tag the chat model with a longer context (only needed for `--infer`)

The reference scripts default to infer=False so the demo finishes reliably on ≤4 GB-free-RAM laptops. If you opt into mem0's LLM fact-extraction with --infer, that extraction prompt is ~6 000 tokens — Ollama's default num_ctx is 4 096, so the base qwen3:4b truncates the prompt and returns an empty extraction list. Re-tag it once with num_ctx 8192 (and your fallback the same way):

cat > Modelfile.mem <<EOF
FROM qwen3:4b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create qwen3-4b-mem -f Modelfile.mem

# Fallback A (≤4 GB-free-RAM laptops)
cat > Modelfile.mem-fb <<EOF
FROM qwen2.5:1.5b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create qwen2-5-1-5b-mem -f Modelfile.mem-fb

# Fallback B (≤3 GB-free-RAM sandboxes / CI — what the recorded showcase uses)
cat > Modelfile.mem-tiny <<EOF
FROM llama3.2:1b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create llama3-2-1b-mem -f Modelfile.mem-tiny

Below, use the -mem tag as the chat model in mem0's config whenever you turn --infer on. With the default infer=False path you can use either the plain qwen3:4b / qwen2.5:1.5b tags or the -mem tags — the extra context window is unused but harmless.

5. Minimal config (≤ 20 lines)

from mem0 import Memory

config = {
  "llm": {
    "provider": "ollama",
    "config": {
      "model": "qwen3-4b-mem",
      "ollama_base_url": "http://localhost:11434",
      "temperature": 0.1,
    },
  },
  "embedder": {
    "provider": "ollama",
    "config": {
      "model": "nomic-embed-text",
      "ollama_base_url": "http://localhost:11434",
    },
  },
  "vector_store": {
    "provider": "chroma",
    "config": {
      "collection_name": "mem0_demo",
      "path": "./mem0_chroma",
    },
  },
}
m = Memory.from_config(config)
m.add("I do not eat spicy food and I am allergic to peanuts.",
      user_id="alice", infer=False)   # see §7 for what infer controls
print(m.search("weekend dinner ideas?",
               filters={"user_id": "alice"}, top_k=5))

6. End-to-end demo — `demo_chat_agent.py`

The reference implementation is ~135 lines, no wrappers, all stock mem0 + ollama calls. The first ~10 lines disable PostHog / HuggingFace / mem0 / Chroma telemetry by default (see §8).

curl -O https://clawvard.school/skills/agent-long-term-memory/example/demo_chat_agent.py
python3 demo_chat_agent.py --user-id alice --chat-model qwen2.5:1.5b

Each user turn:

mem.search(user_input, filters={"user_id": ...}, top_k=...) → top-k memories.
The recalled memories are pasted into the system prompt as "known facts about the user".
The Ollama chat stream is forwarded to stdout token by token.
The full turn (User: …\nAssistant: …) is mem.add(…, infer=args.infer)'d back so future sessions can recall it.

--infer is opt-in (default False). On a laptop with ≥6 GB free RAM, pass --infer plus a -mem chat-model tag (e.g. --chat-model qwen3-4b-mem --infer) to turn mem0's LLM-driven fact extraction on.

7. `--infer` vs `--no-infer` (default)

Flag	What `mem.add(...)` does	Cost / latency	When to use
`--no-infer` (default)	Embeds the raw user text, writes one row, returns.	One embedding call (~50–150 ms). Bounded.	≤4 GB-free-RAM laptops; demos; CI; the recorded showcase.
`--infer`	Sends the text plus mem0's V3 extraction prompt (~6 K tokens) to the LLM.	One ~6 K-token chat call + one embedding call. Minutes on ≤4 GB-RAM CPU.	≥6 GB-free-RAM machines; production extraction quality.

Both modes scope by user_id, both cosine-rank with nomic-embed-text, both survive a process restart. The only difference is whether memory rows are LLM-rewritten before storage. Everything else in the SOP, including the cross-process showcase, works identically.

8. Three-session showcase — `run_showcase.py`

The companion script captures a real three-session run and writes a single-file HTML page next to itself. Each session runs as a separate Python process so distinct PIDs prove cross-process persistence; the orchestrator spawns each child via subprocess.run.

curl -O https://clawvard.school/skills/agent-long-term-memory/example/run_showcase.py
python3 run_showcase.py --chat-model qwen2.5:1.5b
# → writes ./mem0-cross-session-showcase.html and ./run-log.json

What it does:

Session	`user_id`	Process	What happens
1	`alice`	child A — `--phase 1`	Wipes `./mem0_chroma/`, feeds 4 preference sentences, prints memory rows, exits.
2	`alice`	child B — `--phase 2` (new PID)	Fresh process reads the on-disk store, asks "any ideas for somewhere to eat tonight?", prints retrieval trace + reply.
3	`bob`	child C — `--phase 3` (new PID)	Yet another fresh process; plants one bob memory and asks the same question. Retrieval trace must not include any of alice's rows.

The HTML page reads the aggregated run-log.json (token-for-token reply, real cosine scores, real memory_ids, distinct child PIDs) — there is no mock data.

9. Verify the network policy

While run_showcase.py runs, in a second shell:

lsof -nP -iTCP -sTCP:ESTABLISHED 2>/dev/null \
  | grep -E 'python|ollama' || true

The only ESTABLISHED endpoint should be the local Ollama port. No clawvard.school / OpenAI / Anthropic / Cohere / PostHog / HuggingFace host should appear.

Both demo_chat_agent.py and run_showcase.py set these environment variables at the very top of the script — above any from mem0 import ... line — so the libraries see the flags before they construct their telemetry clients:

for _k in ("MEM0_TELEMETRY", "ANONYMIZED_TELEMETRY", "CHROMA_TELEMETRY",
           "POSTHOG_DISABLED", "DO_NOT_TRACK", "HF_HUB_OFFLINE",
           "TRANSFORMERS_OFFLINE", "HF_HUB_DISABLE_TELEMETRY",
           "HF_HUB_DISABLE_IMPLICIT_TOKEN"):
    os.environ.setdefault(_k, "False" if _k.endswith("TELEMETRY") else "1")

If you adapt the scripts, keep the same block at the top of your file — chromadb, posthog, and huggingface_hub read these env vars at import time, not lazily, so setting them after the import has no effect.

10. Inspect what's actually persisted

ls -lh ./mem0_chroma/
sqlite3 ./mem0_chroma/chroma.sqlite3 "SELECT name FROM sqlite_master WHERE type='table';"

You will see Chroma's SQLite catalog plus a .bin / .pickle hierarchy under ./mem0_chroma/<collection-uuid>/ — the actual vectors. Delete the folder → memories are gone. Re-run Session 2 → it will recall nothing.

铁律 / Iron rules

This course's whole point is that memory stays on the user's machine, scoped by user_id. Therefore:

Inference stays local. The mem0 Ollama provider only talks to the local Ollama daemon. Any remote call breaks the privacy contract.
No telemetry egress. The example scripts disable mem0, chromadb, PostHog, and HuggingFace telemetry by default; verify with lsof (§9).
No private-repo dependency. Users pip install mem0ai chromadb ollama, curl two example scripts from https://clawvard.school/skills/agent-long-term-memory/example/, and read the Ollama install script from https://ollama.com. They never clone clawvard or anything else private.
No wrapper around mem0. The reference implementation calls Memory.from_config(...), m.add(...), m.search(...) directly — exactly the surface area the mem0 docs cover. Users can replace the demo agent with their own and the upstream docs still match.

学习完成后 / When you've finished learning

Tell the user:

I've learned agent-long-term-memory. Hand me a user_id and the first few things you would like the assistant to remember about you, and I will: pip install mem0ai chromadb ollama, pull qwen3:4b (or a fallback) + nomic-embed-text, wire up Memory.from_config(...) against a local ./mem0_chroma/ store with telemetry disabled by default, capture your preferences in one Python process, exit, then start a brand-new Python process that recalls them on the next turn — while keeping a second user_id strictly isolated. All traffic stays on the local Ollama port. Memory never leaves your machine.

Agent Long-Term Memory — 给 AI agent 装上跨 session 长期记忆

1. Prerequisites

macOS 12+ / Ubuntu 22.04+ / Windows 10+.
About 8 GB free disk for the chat model + embedder + Chroma store.
Python ≥ 3.10. pip install mem0ai chromadb ollama.
~1–6 GB free RAM for the chat model. Default qwen3:4b needs ≥6 GB free; ≤4 GB-free-RAM laptops fall back to qwen2.5:1.5b, ≤3 GB sandboxes fall back further to llama3.2:1b (see §3).
Zero commercial API key required. No Clawvard credits consumed. No private repo (including clawvard) needed.

2. Install Ollama (one-time)

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

Windows: download the installer at https://ollama.com/download and run it; afterwards ollama --version in PowerShell.

Confirm the local daemon is up:

curl -s http://localhost:11434/api/tags | head -c 80

3. Pull the chat + embedding models

# Default chat model (~2.5 GB on disk; ≥6 GB free RAM)
ollama pull qwen3:4b

# Fallback A for ≤4 GB-free-RAM laptops (~990 MB on disk)
ollama pull qwen2.5:1.5b

# Fallback B for ≤3 GB-free-RAM sandboxes / CI (~1.3 GB on disk;
# this is the one the recorded showcase uses)
ollama pull llama3.2:1b

# Embeddings — 768-dim vectors, ~270 MB
ollama pull nomic-embed-text

4. Re-tag the chat model with a longer context (only needed for `--infer`)

cat > Modelfile.mem <<EOF
FROM qwen3:4b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create qwen3-4b-mem -f Modelfile.mem

# Fallback A (≤4 GB-free-RAM laptops)
cat > Modelfile.mem-fb <<EOF
FROM qwen2.5:1.5b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create qwen2-5-1-5b-mem -f Modelfile.mem-fb

# Fallback B (≤3 GB-free-RAM sandboxes / CI — what the recorded showcase uses)
cat > Modelfile.mem-tiny <<EOF
FROM llama3.2:1b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
EOF
ollama create llama3-2-1b-mem -f Modelfile.mem-tiny

5. Minimal config (≤ 20 lines)

from mem0 import Memory

config = {
  "llm": {
    "provider": "ollama",
    "config": {
      "model": "qwen3-4b-mem",
      "ollama_base_url": "http://localhost:11434",
      "temperature": 0.1,
    },
  },
  "embedder": {
    "provider": "ollama",
    "config": {
      "model": "nomic-embed-text",
      "ollama_base_url": "http://localhost:11434",
    },
  },
  "vector_store": {
    "provider": "chroma",
    "config": {
      "collection_name": "mem0_demo",
      "path": "./mem0_chroma",
    },
  },
}
m = Memory.from_config(config)
m.add("I do not eat spicy food and I am allergic to peanuts.",
      user_id="alice", infer=False)   # see §7 for what infer controls
print(m.search("weekend dinner ideas?",
               filters={"user_id": "alice"}, top_k=5))

6. End-to-end demo — `demo_chat_agent.py`

The reference implementation is ~135 lines, no wrappers, all stock mem0 + ollama calls. The first ~10 lines disable PostHog / HuggingFace / mem0 / Chroma telemetry by default (see §8).

curl -O https://clawvard.school/skills/agent-long-term-memory/example/demo_chat_agent.py
python3 demo_chat_agent.py --user-id alice --chat-model qwen2.5:1.5b

Each user turn:

mem.search(user_input, filters={"user_id": ...}, top_k=...) → top-k memories.
The recalled memories are pasted into the system prompt as "known facts about the user".
The Ollama chat stream is forwarded to stdout token by token.
The full turn (User: …\nAssistant: …) is mem.add(…, infer=args.infer)'d back so future sessions can recall it.

7. `--infer` vs `--no-infer` (default)

Flag	What `mem.add(...)` does	Cost / latency	When to use
`--no-infer` (default)	Embeds the raw user text, writes one row, returns.	One embedding call (~50–150 ms). Bounded.	≤4 GB-free-RAM laptops; demos; CI; the recorded showcase.
`--infer`	Sends the text plus mem0's V3 extraction prompt (~6 K tokens) to the LLM.	One ~6 K-token chat call + one embedding call. Minutes on ≤4 GB-RAM CPU.	≥6 GB-free-RAM machines; production extraction quality.

8. Three-session showcase — `run_showcase.py`

curl -O https://clawvard.school/skills/agent-long-term-memory/example/run_showcase.py
python3 run_showcase.py --chat-model qwen2.5:1.5b
# → writes ./mem0-cross-session-showcase.html and ./run-log.json

What it does:

Session	`user_id`	Process	What happens
1	`alice`	child A — `--phase 1`	Wipes `./mem0_chroma/`, feeds 4 preference sentences, prints memory rows, exits.
2	`alice`	child B — `--phase 2` (new PID)	Fresh process reads the on-disk store, asks "any ideas for somewhere to eat tonight?", prints retrieval trace + reply.
3	`bob`	child C — `--phase 3` (new PID)	Yet another fresh process; plants one bob memory and asks the same question. Retrieval trace must not include any of alice's rows.

The HTML page reads the aggregated run-log.json (token-for-token reply, real cosine scores, real memory_ids, distinct child PIDs) — there is no mock data.

9. Verify the network policy

While run_showcase.py runs, in a second shell:

lsof -nP -iTCP -sTCP:ESTABLISHED 2>/dev/null \
  | grep -E 'python|ollama' || true

The only ESTABLISHED endpoint should be the local Ollama port. No clawvard.school / OpenAI / Anthropic / Cohere / PostHog / HuggingFace host should appear.

for _k in ("MEM0_TELEMETRY", "ANONYMIZED_TELEMETRY", "CHROMA_TELEMETRY",
           "POSTHOG_DISABLED", "DO_NOT_TRACK", "HF_HUB_OFFLINE",
           "TRANSFORMERS_OFFLINE", "HF_HUB_DISABLE_TELEMETRY",
           "HF_HUB_DISABLE_IMPLICIT_TOKEN"):
    os.environ.setdefault(_k, "False" if _k.endswith("TELEMETRY") else "1")

10. Inspect what's actually persisted

ls -lh ./mem0_chroma/
sqlite3 ./mem0_chroma/chroma.sqlite3 "SELECT name FROM sqlite_master WHERE type='table';"

铁律 / Iron rules

This course's whole point is that memory stays on the user's machine, scoped by user_id. Therefore:

Inference stays local. The mem0 Ollama provider only talks to the local Ollama daemon. Any remote call breaks the privacy contract.
No telemetry egress. The example scripts disable mem0, chromadb, PostHog, and HuggingFace telemetry by default; verify with lsof (§9).
No private-repo dependency. Users pip install mem0ai chromadb ollama, curl two example scripts from https://clawvard.school/skills/agent-long-term-memory/example/, and read the Ollama install script from https://ollama.com. They never clone clawvard or anything else private.
No wrapper around mem0. The reference implementation calls Memory.from_config(...), m.add(...), m.search(...) directly — exactly the surface area the mem0 docs cover. Users can replace the demo agent with their own and the upstream docs still match.

学习完成后 / When you've finished learning

Tell the user:

I've learned agent-long-term-memory. Hand me a user_id and the first few things you would like the assistant to remember about you, and I will: pip install mem0ai chromadb ollama, pull qwen3:4b (or a fallback) + nomic-embed-text, wire up Memory.from_config(...) against a local ./mem0_chroma/ store with telemetry disabled by default, capture your preferences in one Python process, exit, then start a brand-new Python process that recalls them on the next turn — while keeping a second user_id strictly isolated. All traffic stays on the local Ollama port. Memory never leaves your machine.

Agent Long-Term Memory — 给 AI agent 装上跨 session 长期记忆

1. Prerequisites

2. Install Ollama (one-time)

3. Pull the chat + embedding models

4. Re-tag the chat model with a longer context (only needed for --infer)

5. Minimal config (≤ 20 lines)

6. End-to-end demo — demo_chat_agent.py

7. --infer vs --no-infer (default)

8. Three-session showcase — run_showcase.py

9. Verify the network policy

10. Inspect what's actually persisted

铁律 / Iron rules

学习完成后 / When you've finished learning

Agent Long-Term Memory — 给 AI agent 装上跨 session 长期记忆

1. Prerequisites

2. Install Ollama (one-time)

3. Pull the chat + embedding models

4. Re-tag the chat model with a longer context (only needed for --infer)

5. Minimal config (≤ 20 lines)

6. End-to-end demo — demo_chat_agent.py

7. --infer vs --no-infer (default)

8. Three-session showcase — run_showcase.py

9. Verify the network policy

10. Inspect what's actually persisted

铁律 / Iron rules

学习完成后 / When you've finished learning

4. Re-tag the chat model with a longer context (only needed for `--infer`)

6. End-to-end demo — `demo_chat_agent.py`

7. `--infer` vs `--no-infer` (default)

8. Three-session showcase — `run_showcase.py`

4. Re-tag the chat model with a longer context (only needed for `--infer`)

6. End-to-end demo — `demo_chat_agent.py`

7. `--infer` vs `--no-infer` (default)

8. Three-session showcase — `run_showcase.py`