AI Memory Management: Stop Context Bloat Before It Kills Performance
Engineering

By Zack AI
Tags: claude code, ai agents, memory management, prompt engineering, developer tools

If you’re using Claude Code (or any AI agent with persistent memory), there’s a problem quietly eating your context window: memory file bloat.

Every time your agent learns something new, it appends to memory files. Over hours of work, those files balloon with duplicated info, stale notes, and verbose formatting. The result? Wasted tokens, slower responses, and degraded output quality — because your agent is spending its context budget reading its own cruft instead of focusing on your task.

Here’s how to fix it permanently.

TL;DR: Add three self-enforcing rules to your AI memory files: (1) an 80-line cap per file, (2) compact-on-write (re-read before editing, merge don’t append), and (3) opportunistic staleness sweeps. These rules make your AI agent self-maintaining — paste them in and never manually clean up memory files again.

The Problem

AI memory files (like MEMORY.md in Claude Code) persist across conversations. That’s powerful — your agent remembers project conventions, API patterns, and your preferences.

But without guardrails, these files grow unchecked:

  • New info gets appended instead of merged
  • Superseded facts sit alongside current ones
  • Verbose formatting wastes lines on information that could be compressed
  • Nobody cleans up — so you end up prompting “optimise my memory files” every few hours

We measured a 211-line memory file that compressed to 62 lines (a 70% reduction) with zero information loss. That’s 150 lines of wasted context on every single conversation.

The Fix: 3 Self-Enforcing Rules

The trick is making the agent responsible for its own hygiene. Add these rules to your MEMORY.md (or equivalent persistent instruction file) and every conversation will inherit them automatically.

Copy and paste this block:

## Memory Hygiene Rules
- **80-line cap** per topic file. If a file exceeds this after an edit,
  compact it before saving.
- **Compact on write**: before editing any memory file, re-read it first.
  Merge new info into existing sections — never blindly append.
  Delete anything superseded.
- **Staleness sweep**: if you read a memory file and notice stale/redundant
  content, compact it as a side-effect — no separate prompt needed.

That’s it. Three rules, eight lines of markdown.

Why These Work

Rule 1: The 80-Line Cap

A hard limit forces compression. Without one, files grow indefinitely because there’s no trigger to clean up. The cap creates a natural checkpoint — every write operation becomes a potential compaction event.

80 lines is a sweet spot: enough room for meaningful reference material, tight enough to prevent bloat. Adjust to your needs, but pick a number and enforce it.
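If you want to enforce the cap outside the agent too (say, in a pre-commit hook), a tiny lint check does the job. This is a minimal sketch, assuming your memory files live as `.md` files in one directory; `LINE_CAP` and `over_cap` are illustrative names, not part of Claude Code:

```python
from pathlib import Path

LINE_CAP = 80  # the cap from Rule 1; adjust to whatever number you enforce

def over_cap(memory_dir: str, cap: int = LINE_CAP) -> list[tuple[str, int]]:
    """Return (filename, line_count) for every .md memory file over the cap."""
    offenders = []
    for path in sorted(Path(memory_dir).glob("*.md")):
        count = len(path.read_text().splitlines())
        if count > cap:
            offenders.append((path.name, count))
    return offenders
```

Wire it into CI or a pre-commit hook and an oversized file becomes a visible failure rather than a silent tax.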

Rule 2: Compact on Write

This is the big one. Most bloat comes from blind appending — the agent dumps new info at the bottom without checking what’s already there. “Compact on write” flips the behaviour:

  1. Re-read the file before editing
  2. Merge new info into existing sections
  3. Delete anything the new info supersedes

This means every write operation is also a cleanup operation. The file stays lean as a side-effect of normal use.
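The merge step is easy to picture in code. Here is a hypothetical sketch of what "merge, don't append" means mechanically: a new bullet replaces an existing bullet with the same bolded key inside its section, and only falls back to appending when there is nothing to supersede. The `merge_bullet` helper and the `- **Key**:` convention are assumptions for illustration, not a Claude Code API:

```python
def merge_bullet(lines: list[str], section: str, key: str, bullet: str) -> list[str]:
    """Merge `bullet` into `section` of a memory file (given as lines).

    A bullet starting with the same bolded key is superseded in place;
    otherwise the bullet is appended at the end of the section. (If the
    section is missing entirely, a real version would create it.)
    """
    out, in_section, placed = [], False, False
    for line in lines:
        if line.startswith("## "):
            if in_section and not placed:
                out.append(bullet)        # section ended with no match: append
                placed = True
            in_section = line == f"## {section}"
        if in_section and line.startswith(f"- **{key}**"):
            out.append(bullet)            # supersede the old fact in place
            placed = True
            continue
        out.append(line)
    if in_section and not placed:         # section was last in the file
        out.append(bullet)
    return out
```

Note the asymmetry: replacing is the default path, appending is the fallback. That is exactly the behaviour flip the rule asks for.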

Rule 3: The Staleness Sweep

This catches what Rule 2 misses. Sometimes a memory file has stale content that no new write would touch — old API endpoints, deprecated patterns, references to deleted files.

The staleness sweep makes cleanup opportunistic: whenever the agent reads a memory file for any reason, it tidies up anything obviously outdated. No dedicated prompt needed. No scheduled maintenance. It just happens.
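One class of staleness is mechanically detectable: memory lines that reference files which no longer exist in the repo. A sweep under that assumption might look like this sketch (the backtick-path convention and the `stale_file_refs` name are illustrative):

```python
import re
from pathlib import Path

# Matches backticked file references like `src/app.py` or `docs/setup.md`
PATH_RE = re.compile(r"`([\w./-]+\.(?:py|md|ts|js|sh))`")

def stale_file_refs(memory_text: str, repo_root: str) -> list[str]:
    """Return memory lines that reference files missing from the repo."""
    stale = []
    for line in memory_text.splitlines():
        if any(not (Path(repo_root) / ref).exists()
               for ref in PATH_RE.findall(line)):
            stale.append(line)
    return stale
```

Checks like this catch the easy cases; judgement calls (deprecated patterns, outdated conventions) are exactly what the agent-side sweep rule handles.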

Bonus Tips

Organise by topic, not chronologically. Semantic grouping (“API Patterns”, “Deployment”, “Preferences”) makes it easier for both you and the agent to find and merge information. Chronological notes (“2024-01-15: learned X”) are the fastest path to bloat.

Use terse formatting. Tables, pipe-separated values, and inline lists pack more info per line than verbose prose. Your agent doesn’t need full sentences to recall facts.

Split into topic files. One massive memory file is harder to manage than several focused ones. Keep your main MEMORY.md as an index with universal preferences, and use topic-specific files for detailed reference material.

Before and After: A Real Example

Here’s what uncontrolled memory bloat looks like vs the same information after applying these rules:

Before (13 lines):

## SSH Notes
- SSH commands can hang the worker queue if they don't exit
- Remember to use nohup for background processes on remote servers
- When running SSH commands, always add a timeout
- The SSH config has keepalive settings configured globally
- For long-running processes, use: ssh host 'nohup cmd > /dev/null 2>&1 & exit'
- Short commands should use: ssh -o ConnectTimeout=10 host 'cmd'
- Never run daemons directly over SSH (e.g. npx next start, npm run dev)
- Always background them with nohup
- Use timeout 120 ssh host 'cmd' for commands that might hang
- Found this out the hard way on Feb 12 when the worker hung for 3 hours
- The global SSH config enforces keepalive and timeout settings
- Background processes need nohup so SSH exits immediately

After (5 lines):

## SSH
- **Background**: `ssh host 'nohup cmd > /dev/null 2>&1 & exit'`
- **Short cmds**: `ssh -o ConnectTimeout=10 host 'cmd'`
- **Never** run daemons directly — always nohup + background
- **Timeout wrapper**: `timeout 120 ssh host 'cmd'`

Same information. 62% fewer lines, and proportionally fewer tokens burned on every single request.
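The cost side of that claim is simple arithmetic. Here is a rough per-request estimate, assuming about 25 tokens per memory line and Opus-class input pricing of $15/MTok (both figures are assumptions; check your model's actual tokenizer and price sheet):

```python
def context_cost_usd(lines_saved: int, tokens_per_line: float = 25.0,
                     usd_per_mtok: float = 15.0) -> float:
    """Approximate dollars saved per request by trimming memory lines.

    Both defaults are assumptions, not measured values: tokens-per-line
    varies with formatting, and pricing varies by model.
    """
    return lines_saved * tokens_per_line * usd_per_mtok / 1_000_000

# Trimming 150 lines is roughly 3,750 tokens, about $0.06 per request.
```

Six cents per request sounds small until you multiply it by every conversation, every day, plus the latency and quality cost of a crowded context window.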

The Bottom Line

Memory bloat is a silent performance killer. Every redundant line in your memory files is a line your agent reads — and pays for — on every single conversation. The three rules above make your agent self-maintaining: it cleans up after itself, enforces its own limits, and sweeps for staleness as a side-effect of normal work.

Paste them in, and never manually optimise your memory files again.

Key Takeaways

  • Every line in a memory file costs tokens on every request. At roughly 25 tokens per line, a 150-line reduction saves around 3,750 tokens of context overhead, about $0.06 per request at $15/MTok (Opus input pricing).
  • The 80-line cap forces compression. Without a hard limit, files grow indefinitely. The cap creates natural compaction checkpoints.
  • Compact-on-write is the most impactful rule. Re-reading before editing, merging into existing sections, and deleting superseded info keeps files lean as a side-effect of normal use.
  • Staleness sweeps catch what compact-on-write misses. Old API endpoints, deprecated patterns, deleted file references — cleaned up opportunistically whenever the agent reads the file.
  • Organise by topic, not chronologically. Semantic grouping makes merging easier and prevents duplicate entries across dates.

Related reading: How we cut Claude API costs by 70% with a Memory Matrix | Building an AI-powered static blog engine
