Short-Term Memory — Context Window Management#
After CH04 wired up the multi-turn REPL, the agent can hold an ongoing conversation — but as CH01 1.4 explained, the Model itself has no memory: every chat turn re-sends the full accumulated messages list. The longer the conversation, the more tokens — and that creates two real problems: hitting the context window limit (the API rejects the request outright) and every turn getting slower and more expensive.
Context window is the input-token ceiling the Model can read in a single API call (Claude Sonnet 4.6 is 200k). The longer the conversation, the larger the messages list, and the closer you get to that ceiling.
This chapter solves it with Rolling Summary: keep the most recent N turns verbatim and compress older turns into a single summary message — a balance between “preserving conversational context” and “controlling token usage”. CH06 / CH07 deal with the other two memory problems: losing the conversation when the program exits, and not remembering facts across sessions.
5.1 Rolling Summary#
To keep messages from growing without bound, the approach is to compress the oldest turns into a single summary message and keep the newest turns verbatim — that is Rolling Summary.
The whole pipeline has four steps:
flowchart TD
A1[1a. Auto: messages list tokens exceed budget]
A2[1b. Manual: user runs /compact]
B[2. Keep the latest N turns of the messages list,<br/>the rest are pending summarization]
C[3. Send the pending turns to the Model<br/>to compress into one summary message]
D["4. New messages list =<br/>summary message + latest N turns"]
A1 -->|N=5 default| B
A2 -->|N=0| B
B --> C
C --> DIt has two entry points:
| Entry | When it triggers | What it does |
|---|---|---|
| Auto-compaction | Fires automatically when count_tokens() > budget | Keep the latest N turns verbatim, compress the rest into a summary |
/compact | User runs the command manually | Compress the entire conversation history into a single summary (= “keep 0 turns”) |
Both entry points run the same pipeline (steps 2 to 3 to 4 in the diagram), differing only in the N marked in the diagram — auto is 5, manual is 0. Sections 5.3 through 5.5 unpack the mechanism piece by piece.
5.2 Counting Tokens in the Messages List#
The trigger condition for 5.1 step 1a is “messages list tokens exceed budget” — so at the start of every chat turn, the executor has to compute “how many tokens will this messages list be when I send it” before it can decide whether to summarize first.
The Anthropic SDK offers two measurement methods:
Method A: client.messages.count_tokens(messages, tools)
- Standalone API, can be called before sending
- Returns the exact value; costs money (separately billed API)
Method B: response.usage.input_tokens
- Returned by every normal messages.create() call
- Completely free
- Reflects "the actual token count of the previous turn"Each has its use:
- Method A — called once at the start of every chat turn, compare against
max_input_tokens(the budget) — if it exceeds, trigger 5.1 step 1a and start compacting. This is the implementation of the step 1a check. - Method B — a byproduct; after every
messages.create()you grab a value fromresponse.usageas a bonus. Use it to print “how many tokens the previous turn used” to the user as monitoring.
When these two methods get called inside the full chat() pipeline is shown back in the sequence diagram in 5.4.
5.3 The Minimum Cut Unit: A Turn#
5.1 mentioned that Rolling Summary “draws a line” on the messages list to split it in two — everything before the line gets compressed into a summary, everything after stays verbatim. But this line cannot be drawn arbitrarily: the minimum cut unit is one complete turn:
One complete turn = from one string-content user message up to (but not including) the next string-content user message.
turn 1 turn 2
───────────────────────────────── ─────────
[0] user "read README"
[1] assistant text + tool_use
[2] user tool_result
[3] assistant "this project is ..."
[4] user "now read src/"
[5] assistant text + tool_use
[6] user tool_result
[7] assistant "..."The check: messages[i]["role"] == "user" and content is a string (not a tool_result list).
Why the Cut Must Land on a Turn Boundary#
If the cut lands in the middle of a turn, it breaks the tool_use / tool_result pairing:
| Original messages | Cut at [2], keeping [2:] gives |
|---|---|
[0] user "read README" | — (compressed away) |
[1] assistant tool_use(id=A) | — (compressed away) |
[2] user tool_result(id=A) | [0] user tool_result(id=A) ← the matching tool_use is gone! |
[3] assistant "this project is ..." | [1] assistant "this project is ..." |
The new messages list starts with tool_result(id=A) at index [0], but the matching tool_use(id=A) was compressed away with [1] — the API rejects it outright.
API rule: every tool_result must have a matching tool_use. As long as the cut lands on a turn boundary (i.e., on a string-content user message), the pairing is naturally never broken.
5.4 Auto-Compaction Workflow#
5.1 / 5.2 / 5.3 laid out the parts; this section assembles them:
- Trigger — use method A (5.2) to compute the current token count; trigger when it exceeds the budget
- Cut point — lands on a turn boundary (5.3)
- Goal — compress the turns before the cut into a single summary message, keep the turns after the cut verbatim
Looking at a full chat() turn from the three roles user / AI Agent / Model:
%%{init: {'sequence': {'noteAlign': 'left'}}}%%
sequenceDiagram
actor User
box AI Agent
participant Executor
end
participant Model
User->>Executor: chat(user_msg) comes in
loop until messages list tokens are within budget
Note over Executor: Method A: count_tokens#40;#41;<br/>compute current messages list token count
alt messages list tokens > budget
Note over Executor: Keep latest N turns of messages list, the rest pending summarization
Executor->>Model: remaining turns
Model-->>Executor: summary message
Note over Executor: new messages = summary message + latest N turns
else messages list tokens within budget
Note over Executor: break loop
end
end
Note over Executor: append user_msg<br/>to messages list
Executor->>Model: messages.create(messages, tools)
Model-->>Executor: response (with usage.input_tokens)
Note over Executor: Method B: grab this turn's token usage<br/>from response.usage as a bonus
Executor-->>User: print token usage monitoringA few important properties:
- Keep the latest 5 turns verbatim — nothing recently done is lost; the model won’t forget a file it read one minute ago.
- Old content isn’t dropped, just compressed — file paths, decisions, and conclusions from an hour ago are all still in the summary, just compressed into shorter text.
- Summary updates cumulatively — when compaction triggers a second time, the existing old summary is treated as “content pending compression” and sent in alongside even older turns; the LLM returns a merged new summary. The entire conversation history always has exactly one summary message floating at the front.
- Cut always lands on a turn boundary — as discussed in 5.3, the
tool_use/tool_resultpairing is naturally never broken.
Concretely it looks like this:
before compaction (10 turns, over budget):
[user "question 1"] [assistant tool_use] [user tool_result] [assistant "answer 1"]
[user "question 2"] [assistant "answer 2"]
...
[user "question 10"] [assistant "answer 10"]
after compaction (keep latest 5, first 5 compressed away):
[user "[Earlier conversation summary]\nquestions 1-5 ..."]
[assistant "Understood. Continuing from the summary."]
[user "question 6"] [assistant "answer 6"]
...
[user "question 10"] [assistant "answer 10"]Sample Code#
The minimal-agent implementation skeleton looks roughly like this:
| |
The full implementation including has_summary checks, cumulative summary, turn-boundary detection, and other details is in minimal_agent.py @ ch05.
5.5 /compact Is Also Just Rolling Summary#
The /compact command runs the exact same Rolling Summary pipeline as the auto-compaction in 5.4; the only difference is N:
| Auto-compaction | /compact | |
|---|---|---|
| Trigger | tokens exceed budget (automatic) | user runs /compact manually |
| Part compressed into summary | old turns before the cut point | the entire conversation history |
| Part kept verbatim | latest 5 turns (N=5) | nothing kept (N=0) |
| Result | summary message + latest 5 turns | a single summary message |
In other words /compact is just Rolling Summary with “N=0” — everything compressed into a summary, not a single turn kept. Same pipeline, callable from either entry point.
5.6 When to Auto-Compact vs Manual /compact#
70% of budget -> do nothing, keep going
budget exceeded -> auto rolling summary (latest 5 turns kept)
information scattered/messy -> manual /compact, compress it all clean
switching topic -> manual /reset, start overThe two CLI flags --max-input-tokens and --keep-recent-turns control the trigger point and the retention count without touching code.
Recap#
By this point you should understand:
- Why context management is needed — every turn re-sends the entire messages list; the longer the conversation, the more tokens (hitting the context window limit, or just getting more and more expensive)
- The Rolling Summary concept — compress the oldest turns into a single summary message, keep the latest N turns verbatim
- How to know the current token count — method A
count_tokens()for the exact value before sending; method B grabs the previous turn’s usage fromresponse.usagefor free - The minimum cut unit is a turn — the cut must land on a turn boundary, otherwise it breaks the
tool_use/tool_resultpairing and the API rejects it - Auto-compaction vs
/compact— same Rolling Summary pipeline, differing only in N — auto is 5, manual/compactis 0
The next chapter CH06 Medium-Term Memory handles “the conversation disappears when the program exits” — persisting the messages list to disk so you can --resume next time.
References#
- Anthropic Tool Use docs
- Full source code: github.com/codereindeer-dev/minimal-agent