Long-Term Memory — RAG#

CH06 Medium-Term Memory solves “resuming a specific conversation,” but facts across conversations are still lost:

session A: "My name is Alex"
session B: "What's my name?"  → the model has no idea

The fix is RAG (Retrieval-Augmented Generation): extract facts that stay stable across conversations, store them in a memory store, and pull them back when needed. With RAG added, the same pair of sessions looks like this:

session A: "My name is Alex"  → model calls remember → saved to memory store

... any session later ...

session B: "What's my name?" → model calls recall  → "Your name is Alex"

7.1 Overall Design#

Two Tools#

In minimal-agent, RAG is just two tools:

ToolWhenWhat it does
rememberThe user says something worth rememberingSave the fact to the memory store
recallThe model wants to fetch a cross-session factRetrieve the semantically closest memories

Implementation-wise, this is just two more branches in the tool list from CH02. The model decides which tool to call based on each tool’s description:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
{
    "name": "remember",
    "description": "Save a fact, preference, or note to long-term memory "
                   "that persists across sessions. Use this when the user "
                   "shares something worth remembering — their name, "
                   "preferences, project context.",
    "input_schema": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
},
{
    "name": "recall",
    "description": "Search long-term memory via semantic similarity. "
                   "Use this whenever the user might be referencing "
                   "something stored from a previous session.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
},

RAG Workflow#

The whole pipeline involves four pieces: Embedding + Voyage AI handle string → vector; JSONL handles persistence; Cosine computes similarity. Sections 7.2 through 7.5 unpack each.

A full round involves four actors (User, Executor, Model, Voyage AI):

%%{init: {'sequence': {'noteAlign': 'left'}}}%%
sequenceDiagram
    participant User
    box AI Agent
        participant Executor
    end
    participant Model
    participant Voyage as Voyage AI

    User->>Executor: message
    Executor->>Model: messages + tools (remember / recall)
    Note over Model: read descriptions<br/>decide which to call
    alt remember
        Model-->>Executor: tool_use: remember(text)
        Executor->>Voyage: embed(text, input_type="document")
        Voyage-->>Executor: vector
        Note over Executor: append one line to store.jsonl
        Executor->>Model: tool_result: "Remembered"
    else recall
        Model-->>Executor: tool_use: recall(text)
        Executor->>Voyage: embed(text, input_type="query")
        Voyage-->>Executor: query_vec
        Note over Executor: cosine compare + take top-K
        Executor->>Model: tool_result: relevant memories
    end
    Model-->>User: answer

7.2 Why You Need Embeddings#

Keyword search has limits:

stored:    "User name is Alex"
query:     "what does the user call themselves"
keyword hits: 0  ← not even "user" overlaps? It gets worse in practice.

Embedding maps a sentence into a vector — coordinates in a high-dimensional semantic space. Sentences with similar meaning (“user’s name”, “what’s their name”, “what do they go by”) land close to each other in that space, even if they share no surface words.

Text                        Action       Vector
"User name is Alex"         ──embed──▶  [0.12, -0.45, 0.81, ...]
"What's the user's name?"   ──embed──▶  [0.10, -0.43, 0.79, ...]   ← semantically close
"How's the weather today?"  ──embed──▶  [0.85,  0.02, -0.31,...]   ← worlds apart

7.3 The Role of Voyage AI#

Voyage AI is a pure text → vector translator: you feed it text, it returns a vector for each piece of text via embedding — Voyage stores nothing on their side. You store the returned vector locally (this chapter uses JSONL, see 7.4).

text  ──Voyage embed──▶  vector  ──▶  stored locally (7.4 JSONL)

Anthropic doesn’t have its own embedding API and officially recommends Voyage:

  • Quick signup, free 50M tokens / month
  • voyage-3-lite is fast with sufficient quality
  • Light package (pip install voyageai)

Why Store Vectors Instead of Computing on the Fly#

ApproachAPI calls per recall
Store vectors (this chapter)1 (embed query only)
Compute on the flyN+1 (embed query + re-embed every memory)

A few hundred memories make “compute on the fly” hundreds of times more expensive and slower. Embedding once and reusing it is the standard RAG pattern — and the reason the recall branch in the 7.1 sequence diagram only embeds the query, never re-embeds the stored memories.

Why Embedding Inputs and Queries Separately#

Voyage (and most modern embedding models) support asymmetric retrieval: use input_type="document" when storing, input_type="query" when querying. The model optimizes these differently internally (queries are usually shorter and more open-ended; documents are longer and more stable in structure). That’s why the two branches in the 7.1 sequence diagram pass different input_type values.


7.4 Storage Format: JSONL#

The vector returned by Voyage (held in the embedding field of the JSON — the conventional name) is packed together with the original text, an id, and a timestamp into one JSON line, appended to memory/store.jsonl — one memory per line:

memory/store.jsonl
{"id": "abc12345", "text": "User name is Alex", "embedding": [...], "tags": [], "created": ...}
{"id": "def67890", "text": "Prefers Python over...", "embedding": [...], ...}
{"id": "ghi54321", "text": "Working on minimal-agent project", ...}
  • append-only: every remember adds a line, never modifies old data
  • human-inspectable: plain text, works with grep, cat, git diff
  • zero dependencies: no database, no vector index, no ANN library
  • fast load: read the whole file into an in-memory list at startup

A few hundred memories with a linear scan + cosine over everything finishes in under 10ms. You only need FAISS / Chroma / pgvector when you hit millions.


7.5 Why Cosine#

After recall sends the query text to Voyage, it gets back a query vector (the first step of the recall branch in the 7.1 sequence diagram), but a vector by itself is just a high-dimensional coordinate — it only becomes useful when you compare it against every memory’s vector in store.jsonl and pick the top K closest ones (top-K). “Comparing similarity” is where cosine comes in.

The “similarity” between two vectors is measured by the angle between them:

AngleMeaningcos
identical1.0
90°unrelated0.0
180°opposite-1.0

cos is monotonically decreasing over [0°, 180°]: larger means more similar. A perfect ranking signal — and it only needs a dot product plus norms, so it’s cheap to compute and fast to compare across high-dimensional vectors.

Why not Euclidean distance? For normalized vectors (which most embedding models produce), cosine and Euclidean give equivalent rankings (larger cos ⟺ smaller distance). But Euclidean needs a square root and cosine doesn’t — at millions of comparisons, that constant factor matters.

Why Return Only Top-K Instead of Everything#

After computing cosines you have a similarity score for every memory, but you don’t shove them all into the model — you sort by similarity and return only the top K (K is a preset value, e.g. 3 or 5). Two reasons:

  • Save tokens — stuffing dozens or hundreds of memories into context crowds out the actually useful information
  • Avoid noise dilution — memories ranked low may be irrelevant, and showing them only confuses the model’s judgment

Some implementations use a “similarity threshold” instead of a fixed K (e.g. only return memories with cosine ≥ 0.7) — the trade-off is that a low-relevance query may return nothing, and the model has to handle an empty array itself. minimal-agent picks a fixed K, which is plenty for the few-hundred-memory scale.


7.6 Design Trade-off: Tool-Driven vs Auto-Injection#

Two RAG modes:

ModeHowPros / cons
Auto-InjectionBefore every user message, auto-recall and inject into system contextNever misses; but spends tokens every time, and the model has no choice
Tool-DrivenExpose remember / recall as tools, let the model decide when to callObservable, token-efficient, model has judgment; but occasionally misses

We pick Tool-Driven. Reasons:

  • Save tokens — don’t waste them when unused
  • Observable — every decision is an explicit tool_use, the trace is visible
  • Retain control — the model judges “is this worth saving” / “should I query”

The key to making tool-driven actually work is writing precise descriptions (example in 7.1) — that’s what the model uses to decide when to call.


7.7 Cross-Session Scenario#

Two sessions showing how remember / recall keep a fact alive across conversations:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
session A:
  $ python minimal_agent.py
  you> My name is Alex, building minimal-agent in Python
    [model calls remember("Alex, building minimal-agent in Python")]
              embed → one line written to memory/store.jsonl
  claude> Got it.

  ... conversation ends / program closes ...

session B (a month later, brand-new conversation, messages list is empty):
  $ python minimal_agent.py
  you> What language was I writing the agent in?
    [model calls recall("user's programming language")]
              embed query → cosine compare against store.jsonl
              top-K: [{"text": "Alex, building minimal-agent in Python", ...}]
  claude> You used Python.

The key: session B is a brand-new conversation, the messages list has no trace of session A’s history, but the model uses recall to pull the cross-session fact out of JSONL and can still answer — that’s what long-term memory does.


Recap#

By this point you should understand:

  • Long-term memory = facts across conversations — names, preferences, decisions… not bound to any specific session
  • Embed once and reuse — memory vectors are never recomputed; retrieval only embeds the query
  • JSONL append-only is enough — plain text, inspectable, zero dependencies; you only need FAISS / Chroma at millions of entries
  • Tool-driven, not auto-injection — let the model decide when to call, saving tokens and staying observable

All Three Memory Layers Together#

From CH05 Short-Term Memory to this chapter, the full picture of three memory layers:

DimensionShort-Term (CH05)Medium-Term (CH06)Long-Term RAG (CH07)
Storage locationself.messages (in-memory)sessions/<name>.jsonmemory/store.jsonl + embedding
LifetimeWhile the program is runningAcross sessions (manual save / load)Permanent
CapacityBounded by context window~ context windowPractically unlimited (semantic retrieval)
PurposeContext of the current conversationResume an unfinished conversationFacts, preferences, knowledge across conversations

Two mechanisms move data between layers:

  • Short → Medium: the user types /save <name> to write the current messages list to disk
  • Anywhere → Long: the model calls the remember(text) tool to push a fact into the RAG store

The internals of each layer:

Short (CH05): context window management
  ├── token budget tracking + auto trim
  └── /compact compresses history

Medium (CH06): conversation persistence
  ├── /save → sessions/<name>.json
  ├── /load → restore messages list
  └── --resume → CLI shortcut

Long (CH07): RAG
  ├── remember → embed + JSONL append
  ├── recall   → embed + cosine + top-K
  └── Voyage AI + JSONL append-only store

Each layer has its own job, none steals work from the others:

  • Short — handles a current conversation exceeding the context window
  • Medium — handles a single conversation resuming across sessions
  • Long — handles knowledge accumulating across any conversations

Add them up: the agent simultaneously knows “where we are now, where we left off last time, and who the user is over the long haul.”


Extending the RAG#

This chapter built minimum viable RAG with Voyage + JSONL + linear scan. To go bigger or more refined:

  • Auto-RAG — instead of letting the model decide, auto-recall and inject into system context before every user message (the inverse of 7.6’s tool-driven choice)
  • Swap the embedding model — replace voyage-3-lite with voyage-3 for higher quality, or switch to a local embedding model to cut API costs
  • ANN library — linear scan is too slow at millions of entries; switch to FAISS / Chroma / pgvector with an index
  • Memory housekeeping — merge duplicates, expire old memories, classify permanent vs temporary
  • Multi-user isolation — when several users share the agent, store memories partitioned by user_id

References#