The Most Expensive Thing Your AI Agent Does Is Forget

Siddharth Khurana · 8 min read

I asked an AI agent to build four features on an existing codebase. It spent the first ten minutes reading 24 files — backend routes, type definitions, config files, package manifests — just to understand what it was looking at. By the time it was ready to write code, it had ingested 120KB of context.

Then I ran the same task on the same codebase, but this time the agent had access to structured session logs from the previous build sessions. It read 13 files. Ingested 33KB. Got to the same place — knowing what to build and how — with a fraction of the overhead.

But the real difference wasn't the token count. It was what the second agent knew that the first one couldn't.

Code Tells You What. Not Why.

When an AI agent reads your codebase, it sees everything that exists. Every file, every function, every route. What it can't see is the reasoning behind it.

It doesn't know you chose local disk storage on purpose because cloud storage is planned for later. It doesn't know the chat system uses both REST and WebSocket deliberately — REST for paginated history, sockets for live messages. It doesn't know the matching algorithm uses a placeholder metric because you haven't built the real one yet.

So the agent makes assumptions. Reasonable ones. And sometimes those assumptions are wrong. It might add S3 integration when your team explicitly deferred it. It might refactor chat to be socket-only and break the pagination. It might build a sophisticated scoring system for a metric that was always meant to be temporary.

Each of these costs far more to fix than to prevent. The code shows the current state. The reasoning that produced it is gone — lost in a chat session that closed hours ago.

What If the Agent Could Remember?

That's what I tested. I built a demo project — Cuff, a dating app concept — across several Claude Code sessions, with structured logging enabled throughout. Every time the agent made a decision, produced an output, hit a blocker, or reached a milestone, it logged what happened and why to an external memory vault via Woxpas.

Seven session logs captured the full build history. Not just what was created, but the reasoning: why npm workspaces over Turborepo, why Prisma over raw SQL, why a hybrid REST + Socket.IO architecture for chat, how the matching algorithm works and what it's substituting for.
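A single entry in that vault might look something like this. The schema, field names, and sample values here are invented for illustration; Woxpas's actual log format may differ:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SessionLogEntry:
    kind: str       # "decision" | "output" | "blocker" | "milestone"
    summary: str    # what happened
    reasoning: str  # why: the part the code alone can't carry forward
    project: str
    at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Example entry from the monorepo decision described above:
entry = SessionLogEntry(
    kind="decision",
    summary="Chose npm workspaces over Turborepo for the monorepo.",
    reasoning="No need for remote caching yet; fewer moving parts.",
    project="cuff",
)
```

The `reasoning` field is the whole point: it stores the one thing a later agent cannot recover by reading source files.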

Then I set up the real test. Same codebase, same task — build four new mobile features. Two sessions:

  • Path A had access to the session logs via MCP
  • Path B had only the filesystem

Both arrived at the same conclusion about what to build. The path to get there was different.

The Numbers

                                       With Memory   Without Memory
Tool calls to "ready to code"          22            26
Data ingested                          ~33KB         ~120KB
Source files read                      13            24
Knows why things were built that way   Yes           No

The agent with session logs skipped 11 source files — API routes, shared type definitions, the API client. The logs had already documented what those files contain: endpoints, request/response shapes, interfaces. It only read files it was about to modify.

3.6x less data ingested. But more importantly — the agent with memory carried context that no amount of file reading could have provided.

The Three Things Only Logs Know

The agent without memory understood the codebase perfectly. It could trace every function call, map every dependency. But it couldn't answer:

Why is storage local? The code has multer writing to disk. A competent agent sees this and thinks "this should be S3." The logs say: local storage is deliberate. Cloud storage is on the roadmap but deferred. Don't touch it.

Why does chat use two protocols? The code has both a REST endpoint and a Socket.IO connection for messaging. A competent agent sees redundancy and might consolidate. The logs say: REST handles paginated history that pre-dates the socket layer. Both are needed. Don't merge them.

Why is the matching algorithm so simple? The code has a basic scoring function. A competent agent might think it's an MVP placeholder that needs sophistication. The logs say: profile completeness is a deliberate stand-in because behavioral data doesn't exist yet. The simplicity is the design, not a shortcut.

In each case, the agent without memory would make a reasonable choice that happens to be wrong. The cost of that wrong choice — building something, reviewing it, reverting it, rebuilding — is orders of magnitude more than the cost of loading a few kilobytes of session logs.

What Broke Along the Way

This didn't work on the first try.

The biggest challenge was getting agents to actually log consistently. Instructions alone weren't enough — the agent would get deep into a coding problem and forget it was supposed to be documenting its work. We ended up combining custom project instructions with enforcement hooks that remind the agent to log before it writes code. Once the enforcement was in place, logging became reliable.
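The enforcement idea can be sketched as a small gate that intercepts code writes. This is a minimal stand-in, assuming a hook system that can call into it; the class and method names are hypothetical, not part of any real tool's API:

```python
class LoggingGate:
    """Hypothetical enforcement hook: a completed unit of work must be
    logged to the vault before the next code write is allowed through."""

    def __init__(self):
        self.unlogged_units = 0

    def complete_unit(self):
        # Called when the agent finishes a feature, fix, or decision.
        self.unlogged_units += 1

    def log_written(self):
        # Called when the agent saves a session log entry.
        self.unlogged_units = 0

    def allow_code_write(self):
        # Block further edits until the backlog of unlogged work is zero.
        return self.unlogged_units == 0

gate = LoggingGate()
gate.complete_unit()
assert not gate.allow_code_write()  # blocked: log first
gate.log_written()
assert gate.allow_code_write()      # unblocked
```

The design choice is that the gate tracks units of work, not file writes, which matches the "one log entry per intent" rule described below.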

The second problem was retrieval efficiency. The initial implementation returned too much data at session start — hundreds of kilobytes of metadata when the agent just needed a lightweight list of what logs exist. We added filtering by project and document type, a summary-only mode for browsing, and batch retrieval for the cases where full content is needed. The retrieval layer had to be as lean as the logs themselves, or the overhead cancelled out the savings.
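Those three fixes map onto a retrieval interface roughly like this. The function signatures and entry fields are illustrative, not Woxpas's actual API:

```python
def list_logs(vault, project=None, doc_type=None, summaries_only=True):
    """Browse cheaply: filter first, return summaries by default."""
    hits = [e for e in vault
            if (project is None or e["project"] == project)
            and (doc_type is None or e["type"] == doc_type)]
    if summaries_only:
        return [{"id": e["id"], "summary": e["summary"]} for e in hits]
    return hits

def get_logs(vault, ids):
    """Batch retrieval for the cases where full content is needed."""
    wanted = set(ids)
    return [e for e in vault if e["id"] in wanted]

vault = [
    {"id": 1, "project": "cuff", "type": "session_log",
     "summary": "Set up npm workspaces monorepo.", "body": "..."},
    {"id": 2, "project": "cuff", "type": "decision",
     "summary": "Hybrid REST + Socket.IO chat.", "body": "..."},
]
assert len(list_logs(vault, project="cuff", doc_type="decision")) == 1
assert "body" in get_logs(vault, [2])[0]
```

The summary-only default is what keeps session start cheap: the agent pays for full content only on the entries it actually needs.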

The third problem was log quality. The first round of logs said things like "Created API server with all routes." Useless for the next session — it still had to read every route file to understand the interfaces. The logs needed to capture enough detail that another agent could continue without re-reading source files, but not so much that they became source code dumps. We landed on 3-5 sentences per entry, focused on interfaces — what it accepts, what it returns, what it exports.
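Concretely, the difference between a useless and a useful entry looks something like this. The endpoint, types, and details below are invented for illustration, not taken from the actual Cuff logs:

```python
# Too vague: the next session still has to read every route file.
bad = "Created API server with all routes."

# Interface-focused, a few sentences: enough to skip the source entirely.
good = (
    "Added POST /api/matches. Accepts {userId, limit?}; returns Match[] "
    "sorted by score. The shared Match interface is exported from the "
    "shared types package. Pagination is offset-based to stay consistent "
    "with the chat history endpoint."
)
```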

How It Works

The protocol has three phases.

Session start. The agent checks if a project scope exists in the memory vault, loads previous session logs, and reviews decisions, blockers, and checkpoints before starting any work. If this is a new project, it creates a scope and skips the review — there's nothing to load yet.

During the session. After each logical unit of work — a feature, a fix, a research pass, an architecture choice — the agent logs what happened and why before moving on. The unit is the intent, not the file count. Twenty files scaffolded for one feature is one log entry. The logs are saved to the vault in real time, not batched at the end.

Session end. The agent writes a final summary: what was accomplished, key decisions, and what to build next. This is what the next session reads first.
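Put together, the three phases can be sketched as a loop over units of work. This uses a plain in-memory dict as a stand-in for the real vault, so the shape is illustrative only:

```python
def run_session(vault, project, units):
    """Sketch of the three-phase protocol against an in-memory vault."""
    # Phase 1, session start: create the scope if new, else load prior
    # logs (decisions, blockers, checkpoints) for review.
    prior = vault.setdefault(project, [])
    context = list(prior)

    # Phase 2, during: one entry per logical unit of work, saved in
    # real time rather than batched at the end.
    for do_unit in units:
        outcome = do_unit(context)
        vault[project].append({"type": "log", "entry": outcome})

    # Phase 3, session end: the summary the next session reads first.
    vault[project].append(
        {"type": "summary",
         "entry": "what was accomplished, key decisions, what to build next"})

vault = {}
run_session(vault, "cuff", [lambda ctx: "Built profile photo upload."])
assert vault["cuff"][-1]["type"] == "summary"
```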

The logs live in Woxpas, which means they're accessible from any MCP-compatible tool. Claude Code writes the logs today; Cursor picks them up tomorrow. One memory, every tool.

What's Still Missing

The logs don't yet include an explicit continuation plan. Our last session log said "all features complete" but gave no direction for what comes next. The next agent had to figure out the roadmap from scratch. A dedicated next-steps section in the session summary would close this gap.

There's also work to do on teaching agents to trust the logs. Even with good session logs loaded, agents sometimes re-read source files out of an abundance of caution. The logs reduce file reads significantly — 13 versus 24 in our test — but the ideal is that an agent only reads files it's about to modify. We're not there yet.

Who Is This For?

Anyone building with AI agents across multiple sessions. If you're a developer using Claude Code or Cursor on a project that takes more than one sitting, your agent is rediscovering context every time it starts. If you're a founder running agents across code, marketing, and operations, decisions are being made and forgotten daily. If you manage a team where AI makes choices, you need an audit trail.

The session logs are queryable. "What decisions were made about authentication?" returns exactly that. "What blocked the last session?" returns the blockers. "What was the reasoning behind the current architecture?" returns the tradeoffs. Three months from now, when someone asks "why was this built this way?" — the answer is in the logs, not lost in a closed chat window.
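A simple keyword filter over the entries is enough to show the shape of those queries. Real retrieval would go through the MCP tools rather than a local list, and the fields are illustrative:

```python
def query_logs(logs, kind=None, keyword=None):
    """Return entries matching a kind ('decision', 'blocker', ...)
    and a keyword in their summary or reasoning."""
    hits = logs
    if kind is not None:
        hits = [e for e in hits if e["kind"] == kind]
    if keyword is not None:
        kw = keyword.lower()
        hits = [e for e in hits
                if kw in (e["summary"] + " " + e["reasoning"]).lower()]
    return hits

logs = [
    {"kind": "decision", "summary": "JWT sessions for authentication.",
     "reasoning": "Stateless auth keeps API servers interchangeable."},
    {"kind": "blocker", "summary": "Prisma migration failed on CI.",
     "reasoning": "Shadow database permissions were missing."},
]
assert len(query_logs(logs, kind="decision", keyword="auth")) == 1
assert query_logs(logs, kind="blocker")[0]["summary"].startswith("Prisma")
```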

Every AI session starts at zero. It doesn't have to.

Get Started with Woxpas →
