Content Deduplication

The agents had a habit. It was an understandable habit, even a logical one, but it was also enormously wasteful: they repeated themselves.

Every agent session started with the same system prompt. Every tool call included the same function signatures in its schema. Every error recovery sequence began with the same diagnostic preamble. The agents didn't know they were being redundant—each session was a fresh context, and the prompts and schemas were injected by the framework, not chosen by the agent. But the result was the same: identical blocks of text appearing in hundreds, then thousands, of turns across the database.

Jay ran the analysis on a Friday afternoon. He pulled every blob hash from the database and counted references. The results were staggering.

"The system prompt blob has 4,217 references," he said. "Four thousand two hundred seventeen turns point to the same blob. It's stored once. Eighty-two kilobytes. Without deduplication, we'd be storing 346 megabytes of identical system prompts."

Navan pulled up the next most-referenced blob. It was the function schema for the coding agent's tool set, a structured MessagePack payload listing every tool the agent could call, with parameter types and descriptions. Forty-one kilobytes. Referenced by 3,891 turns. Without dedup: 160 megabytes of identical tool schemas.

"And it gets worse," Jay said, scrolling through the ranked list. "The error diagnostic template. The test runner output format. The Cedar policy schema. Every piece of boilerplate that the framework injects into every conversation exists in thousands of turns."

He ran the total numbers. The database contained 312,000 turns with a gross payload size of 4.8 terabytes if every blob were stored independently. The actual storage, after content-addressed deduplication: 1.1 terabytes. A 77 percent reduction.

"Seventy-seven percent," Navan repeated. "Three quarters of our storage was duplicated content."

"Is duplicated content," Jay corrected. "The dedup is ongoing. Every new turn that reuses existing content pays zero marginal storage cost. The savings compound over time. As we add more agents, more conversations, more sessions, the dedup ratio will only go up."

Justin looked at the numbers and made a quiet observation. "The agents are repetitive because the framework makes them repetitive. Every session starts fresh, so every session reinjects the same context. CXDB doesn't change that behavior. It just makes the cost of that behavior negligible."

"We're not optimizing the agents," Navan said. "We're optimizing the infrastructure around them."

"We're making their inefficiency affordable," Justin said. "Which is different from making them efficient. But it solves the same problem."

Jay saved the analysis report. He titled it "The Redundancy Dividend." The more repetitive the agents were, the more CXDB saved. The system's greatest strength was parasitic on the agents' greatest weakness.

It was, Jay thought, almost beautiful.

Software Factory Archive
