Short-Term Memory: Context Window Management

After CH04 wired up the multi-turn REPL, the agent can hold an ongoing conversation — but as CH01 1.4 explained, the Model itself has no memory: every chat turn re-sends the full accumulated messages list. The longer the conversation, the more tokens — and that creates two real problems: hitting the context window limit (the API rejects the request outright) and every turn getting slower and more expensive.

The context window is the input-token ceiling the Model can read in a single API call (Claude Sonnet 4.6 is 200k). The longer the conversation, the larger the messages list, and the closer you get to that ceiling.

This chapter solves it with Rolling Summary: keep the most recent N turns verbatim and compress older turns into a single summary message — a balance between “preserving conversational context” and “controlling token usage”. CH06 / CH07 deal with the other two memory problems: losing the conversation when the program exits, and not remembering facts across sessions.


5.1 Rolling Summary

To keep messages from growing without bound, the approach is to compress the oldest turns into a single summary message and keep the newest turns verbatim — that is Rolling Summary.

The whole pipeline has four steps:

flowchart TD
    A1[1a. Auto: messages list tokens exceed budget]
    A2[1b. Manual: user runs /compact]
    B[2. Keep the latest N turns of the messages list,<br/>the rest are pending summarization]
    C[3. Send the pending turns to the Model<br/>to compress into one summary message]
    D["4. New messages list =<br/>summary message + latest N turns"]
    A1 -->|N=5 default| B
    A2 -->|N=0| B
    B --> C
    C --> D

It has two entry points:

| Entry | When it triggers | What it does |
| --- | --- | --- |
| Auto-compaction | Fires automatically when count_tokens() > budget | Keep the latest N turns verbatim, compress the rest into a summary |
| /compact | User runs the command manually | Compress the entire conversation history into a single summary (= “keep 0 turns”) |

Both entry points run the same pipeline (steps 2 to 3 to 4 in the diagram), differing only in the N marked in the diagram — auto is 5, manual is 0. Sections 5.3 through 5.5 unpack the mechanism piece by piece.


5.2 Counting Tokens in the Messages List

The trigger condition for 5.1 step 1a is “messages list tokens exceed budget” — so at the start of every chat turn, the executor has to compute “how many tokens will this messages list be when I send it” before it can decide whether to summarize first.

The Anthropic SDK offers two measurement methods:

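Method A: client.messages.count_tokens(model, messages, tools)
   - Standalone API, can be called before sending
   - Returns the input-token count for the request as it would be sent (the API documents it as an estimate)
   - Not billed per token, but it is an extra network request with its own rate limit

Method B: response.usage.input_tokens
   - Returned by every normal messages.create() call
   - Costs nothing extra: no additional request is needed
   - Reflects "the actual token count of the previous turn"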

Each has its use:

  • Method A — called once at the start of every chat turn, compare against max_input_tokens (the budget) — if it exceeds, trigger 5.1 step 1a and start compacting. This is the implementation of the step 1a check.
  • Method B — a byproduct; after every messages.create() you grab a value from response.usage as a bonus. Use it to print “how many tokens the previous turn used” to the user as monitoring.

Where these two methods are called inside the full chat() pipeline is shown in the sequence diagram in 5.4.
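
The two measurements can be sketched as small helpers. This is a minimal sketch, not the minimal-agent code: the helper names (`count_input_tokens`, `needs_compaction`, `report_usage`) are hypothetical, `client` is assumed to be an Anthropic SDK client (or anything with the same `messages.count_tokens` method), and messages are assumed to be plain dicts.

```python
from typing import Any, Optional


def count_input_tokens(client: Any, model: str, messages: list,
                       tools: Optional[list] = None) -> int:
    """Method A: ask the API how many input tokens this request would use."""
    kwargs = {"model": model, "messages": messages}
    if tools:  # only pass tools when there are any
        kwargs["tools"] = tools
    return client.messages.count_tokens(**kwargs).input_tokens


def needs_compaction(client: Any, model: str, messages: list,
                     tools: Optional[list], budget: int) -> bool:
    """Step 1a check: True when the pending request would exceed the budget."""
    return count_input_tokens(client, model, messages, tools) > budget


def report_usage(response: Any) -> str:
    """Method B: format the usage that came back with a normal response."""
    return f"input tokens used last turn: {response.usage.input_tokens}"
```

Passing the client in as a parameter keeps the helpers easy to exercise with a stub, so the budget check can be tested without a network call.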


5.3 The Minimum Cut Unit: A Turn

5.1 mentioned that Rolling Summary “draws a line” on the messages list to split it in two — everything before the line gets compressed into a summary, everything after stays verbatim. But this line cannot be drawn arbitrarily: the minimum cut unit is one complete turn:

One complete turn = from one string-content user message up to (but not including) the next string-content user message.

turn 1                               turn 2
─────────────────────────────────    ─────────
[0] user      "read README"
[1] assistant text + tool_use
[2] user      tool_result
[3] assistant "this project is ..."
                                     [4] user      "now read src/"
                                     [5] assistant text + tool_use
                                     [6] user      tool_result
                                     [7] assistant "..."

The check: messages[i]["role"] == "user" and content is a string (not a tool_result list).
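
As a minimal sketch, the check can be written as a predicate plus a helper that lists every safe cut index. The function names and the `read_file` tool name are hypothetical, and messages are assumed to be stored as plain dicts.

```python
def is_turn_start(message: dict) -> bool:
    """A turn starts at a user message whose content is a plain string."""
    return message["role"] == "user" and isinstance(message["content"], str)


def turn_starts(messages: list) -> list:
    """Indices where a cut is safe: one per turn."""
    return [i for i, m in enumerate(messages) if is_turn_start(m)]


messages = [
    {"role": "user", "content": "read README"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "A", "name": "read_file", "input": {"path": "README"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "A", "content": "..."},
    ]},
    {"role": "assistant", "content": "this project is ..."},
    {"role": "user", "content": "now read src/"},
]

print(turn_starts(messages))  # [0, 4]
```

Index 2 is also a user message, but its content is a list of tool_result blocks, so it is not treated as a boundary.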

Why the Cut Must Land on a Turn Boundary

If the cut lands in the middle of a turn, it breaks the tool_use / tool_result pairing:

| Original messages | Cut at [2], keeping [2:] gives |
| --- | --- |
| [0] user "read README" | (compressed away) |
| [1] assistant tool_use(id=A) | (compressed away) |
| [2] user tool_result(id=A) | [0] user tool_result(id=A) (the matching tool_use is gone!) |
| [3] assistant "this project is ..." | [1] assistant "this project is ..." |

The new messages list starts with tool_result(id=A) at index [0], but the matching tool_use(id=A) was compressed away with [1] — the API rejects it outright.

API rule: every tool_result must have a matching tool_use. As long as the cut lands on a turn boundary (i.e., on a string-content user message), the pairing is naturally never broken.
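
A local sanity check makes the failure visible before the request is sent. This is a sketch under assumptions: `orphan_tool_results` is a hypothetical helper, messages are plain dicts, and the check is looser than the API's own validation (it only looks for a tool_result whose tool_use never appeared earlier in the list).

```python
def orphan_tool_results(messages: list) -> list:
    """Return tool_use_ids of tool_result blocks with no earlier tool_use."""
    seen = set()
    orphans = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            continue  # plain text carries no tool blocks
        for block in content:
            if block.get("type") == "tool_use":
                seen.add(block["id"])
            elif block.get("type") == "tool_result" and block["tool_use_id"] not in seen:
                orphans.append(block["tool_use_id"])
    return orphans


messages = [
    {"role": "user", "content": "read README"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "A", "name": "read_file", "input": {"path": "README"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "A", "content": "..."},
    ]},
    {"role": "assistant", "content": "this project is ..."},
    {"role": "user", "content": "now read src/"},
]

print(orphan_tool_results(messages[2:]))  # ['A']  cut inside turn 1
print(orphan_tool_results(messages[4:]))  # []     cut on the turn boundary
```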


5.4 Auto-Compaction Workflow

5.1 / 5.2 / 5.3 laid out the parts; this section assembles them:

  • Trigger — use method A (5.2) to compute the current token count; trigger when it exceeds the budget
  • Cut point — lands on a turn boundary (5.3)
  • Goal — compress the turns before the cut into a single summary message, keep the turns after the cut verbatim

Looking at a full chat() turn from the three roles user / AI Agent / Model:

%%{init: {'sequence': {'noteAlign': 'left'}}}%%
sequenceDiagram
    actor User
    box AI Agent
        participant Executor
    end
    participant Model
    User->>Executor: chat(user_msg) comes in
    loop until messages list tokens are within budget
        Note over Executor: Method A: count_tokens#40;#41;<br/>compute current messages list token count
        alt messages list tokens > budget
            Note over Executor: Keep latest N turns of messages list, the rest pending summarization
            Executor->>Model: remaining turns
            Model-->>Executor: summary message
            Note over Executor: new messages = summary message + latest N turns
        else messages list tokens within budget
            Note over Executor: break loop
        end
    end
    Note over Executor: append user_msg<br/>to messages list
    Executor->>Model: messages.create(messages, tools)
    Model-->>Executor: response (with usage.input_tokens)
    Note over Executor: Method B: grab this turn's token usage<br/>from response.usage as a bonus
    Executor-->>User: print token usage monitoring

A few important properties:

  • Keep the latest 5 turns verbatim — nothing recently done is lost; the model won’t forget a file it read one minute ago.
  • Old content isn’t dropped, just compressed — file paths, decisions, and conclusions from an hour ago are all still in the summary, just compressed into shorter text.
  • Summary updates cumulatively: when compaction triggers a second time, the existing summary is treated as “content pending compression” and sent in alongside the older turns; the LLM returns a merged new summary. After the first compaction, the conversation history always has exactly one summary message, sitting at the front.
  • Cut always lands on a turn boundary — as discussed in 5.3, the tool_use / tool_result pairing is naturally never broken.

Concretely it looks like this:

before compaction (10 turns, over budget):
  [user "question 1"] [assistant  tool_use] [user tool_result] [assistant "answer 1"]
  [user "question 2"] [assistant "answer 2"]
  ...
  [user "question 10"] [assistant "answer 10"]

after compaction (keep latest 5, first 5 compressed away):
  [user "[Earlier conversation summary]\nquestions 1-5 ..."]
  [assistant "Understood. Continuing from the summary."]
  [user "question 6"]  [assistant "answer 6"]
  ...
  [user "question 10"] [assistant "answer 10"]
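
The before/after picture above can be reproduced with a small pure function. This is a minimal sketch rather than the minimal-agent implementation: `compact`, `fake_summarize`, and the two constants are hypothetical names, messages are assumed to be plain dicts, and the summarizer is a stand-in for the real Model call.

```python
from typing import Callable

SUMMARY_PREFIX = "[Earlier conversation summary]\n"
SUMMARY_ACK = "Understood. Continuing from the summary."


def is_turn_start(message: dict) -> bool:
    return message["role"] == "user" and isinstance(message["content"], str)


def compact(messages: list, keep_recent: int,
            summarize: Callable[[list], str]) -> list:
    """Compress everything before the latest `keep_recent` turns into one summary."""
    starts = [i for i, m in enumerate(messages) if is_turn_start(m)]
    if len(starts) <= keep_recent:
        return list(messages)  # nothing older than the kept turns
    # Cut on a turn boundary; keep_recent == 0 means "summarize everything".
    cut = starts[-keep_recent] if keep_recent > 0 else len(messages)
    old, recent = messages[:cut], messages[cut:]
    # `old` may begin with a previous summary, so summaries accumulate.
    return [
        {"role": "user", "content": SUMMARY_PREFIX + summarize(old)},
        {"role": "assistant", "content": SUMMARY_ACK},
        *recent,
    ]


def fake_summarize(old: list) -> str:
    # Stand-in for the Model call that writes the real summary.
    return f"condensed {len(old)} earlier messages"


history = []
for n in range(1, 11):
    history.append({"role": "user", "content": f"question {n}"})
    history.append({"role": "assistant", "content": f"answer {n}"})

compacted = compact(history, keep_recent=5, summarize=fake_summarize)
print(len(compacted))           # 12
print(compacted[2]["content"])  # question 6
```

Because the summary message is itself a string-content user message, it counts as a turn start on the next pass and ends up in the `old` slice, which is how the summary gets merged rather than stacked.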

Sample Code

The minimal-agent implementation skeleton looks roughly like this:

class Agent:
    def chat(self, user_msg: str) -> str:
        self._trim_if_needed()              # 1. method A counts tokens, compact if over

        self.messages.append({"role": "user", "content": user_msg})
        response = self.client.messages.create(
            model="claude-...",
            max_tokens=1024,                # required by the API; the value is illustrative
            messages=self.messages,
            tools=self.tools,
        )
        self.messages.append({"role": "assistant", "content": response.content})

        self.last_input_tokens = response.usage.input_tokens   # 2. method B
        # Join the text blocks of the reply (tool_use handling is omitted in this skeleton)
        return "".join(b.text for b in response.content if b.type == "text")

    def _trim_if_needed(self):
        while self.count_tokens() > self.max_input_tokens:
            self._summarize_oldest_turns()  # cut point -> summarize -> reassemble

The full implementation including has_summary checks, cumulative summary, turn-boundary detection, and other details is in minimal_agent.py @ ch05.


5.5 /compact Is Also Just Rolling Summary

The /compact command runs the exact same Rolling Summary pipeline as the auto-compaction in 5.4; the only difference is N:

| | Auto-compaction | /compact |
| --- | --- | --- |
| Trigger | tokens exceed budget (automatic) | user runs /compact manually |
| Part compressed into summary | old turns before the cut point | the entire conversation history |
| Part kept verbatim | latest 5 turns (N=5) | nothing kept (N=0) |
| Result | summary message + latest 5 turns | a single summary message |

In other words /compact is just Rolling Summary with “N=0” — everything compressed into a summary, not a single turn kept. Same pipeline, callable from either entry point.


5.6 When to Auto-Compact vs Manual /compact

under budget (e.g. 70%) -> do nothing, keep going
budget exceeded      -> auto rolling summary (latest 5 turns kept)
information scattered/messy -> manual /compact, compress it all clean
switching topic       -> manual /reset, start over
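
The decision table maps onto two small entry points that share one compaction function. This is a sketch under assumptions: `auto_compact`, `handle_command`, and the `compact` callable they receive are hypothetical names, and treating /reset as "return an empty messages list" is a guess at what "start over" means.

```python
from typing import Callable, Optional

# (messages, keep_recent) -> new messages; the shared Rolling Summary pipeline
CompactFn = Callable[[list, int], list]

AUTO_KEEP_RECENT = 5    # auto-compaction keeps the latest 5 turns
MANUAL_KEEP_RECENT = 0  # /compact keeps nothing verbatim


def auto_compact(messages: list, token_count: int, budget: int,
                 compact: CompactFn) -> list:
    """Entry point 1: compact only when the budget is exceeded."""
    if token_count > budget:
        return compact(messages, AUTO_KEEP_RECENT)
    return messages


def handle_command(line: str, messages: list,
                   compact: CompactFn) -> Optional[list]:
    """Entry point 2: return the new messages list, or None if not a command."""
    command = line.strip()
    if command == "/compact":
        return compact(messages, MANUAL_KEEP_RECENT)
    if command == "/reset":
        return []  # assumption: reset simply clears the history
    return None
```

Both paths call the same `compact` function and differ only in the `keep_recent` value they pass.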

The two CLI flags --max-input-tokens and --keep-recent-turns control the trigger point and the retention count without touching code.
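
A minimal sketch of how the two flags could be declared with argparse. The flag names come from this chapter and the retention default of 5 matches the auto-compaction default; the 100,000-token budget default and the `build_parser` name are assumptions for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="minimal agent (sketch)")
    parser.add_argument(
        "--max-input-tokens", type=int, default=100_000,
        help="token budget; auto-compaction fires above this (default is an assumption)",
    )
    parser.add_argument(
        "--keep-recent-turns", type=int, default=5,
        help="number of latest turns auto-compaction keeps verbatim",
    )
    return parser


args = build_parser().parse_args(["--max-input-tokens", "50000"])
print(args.max_input_tokens, args.keep_recent_turns)  # 50000 5
```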


Recap

By this point you should understand:

  • Why context management is needed — every turn re-sends the entire messages list; the longer the conversation, the more tokens (hitting the context window limit, or just getting more and more expensive)
  • The Rolling Summary concept — compress the oldest turns into a single summary message, keep the latest N turns verbatim
  • How to know the current token count: method A count_tokens() measures the request before sending; method B reads the previous turn’s usage from response.usage with no extra request
  • The minimum cut unit is a turn — the cut must land on a turn boundary, otherwise it breaks the tool_use / tool_result pairing and the API rejects it
  • Auto-compaction vs /compact — same Rolling Summary pipeline, differing only in N — auto is 5, manual /compact is 0

The next chapter CH06 Medium-Term Memory handles “the conversation disappears when the program exits” — persisting the messages list to disk so you can --resume next time.


References