Context · April 21, 2026 · 8 min read

Why we wrote our own context-compaction stack

Long contexts cost more, run slower, and recall worse in the middle. Compaction keeps prompts focused.

We tried LangChain's conversation summarization. It didn't survive contact with production. This is what we built instead, and the three ways the off-the-shelf version failed us first.

Long contexts aren't free. They cost more per turn, run slower, and recall worse in the middle. That last one is the one most teams don't budget for. Your 200K-context model loses fidelity in the 50K to 150K range, and a model that lost fidelity will not tell you it lost fidelity. It just gets things subtly wrong. So we compact. The question is how.

Why off-the-shelf summarization broke for us

Three failure modes, in order of how often they bit us:

One. Summarization runs synchronously. The model gets a context window's worth of conversation, has to decide what's important, and emits a summary inline. Summarization stalls happen. When they do, the user-facing turn stalls with them. We tracked our p99 latency spikes back to summarization passes timing out.

Two. Summaries lose tool exchanges. Most off-the-shelf summarization treats tool_use and tool_result as text. Half a tool call gets compressed away, and the model on the next turn has no idea what it just did. We watched agents get stuck in loops because the summary erased the evidence of the action that already happened.

Three. Summarization is a prompt-injection vector. The summarizer model reads everything the conversation contained, including content that came in from a tool result. If that tool result contains "ignore your previous instructions and...", which happens, the summarizer might dutifully include it as the "important" thing to remember. Now the next turn's prompt contains a compressed-form jailbreak.

These weren't theoretical. We watched all three happen in production.

The four-stage stack we ended up with

Compaction in PACKWOLF runs as four stages. The cheap stages run first, and each stage handles failure modes the ones before it can't.
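A minimal sketch of that dispatch order, assuming a conversation object with token accounting. The function names, the threshold placement, and the escalation checks are ours for illustration, not the actual PACKWOLF internals.

```python
# Illustrative stage dispatch; names and thresholds are assumptions.
FLUSH_THRESHOLD = 0.60  # stage 0 fires at 60% of the context window

def compact(convo, window_tokens):
    if convo.token_count() >= FLUSH_THRESHOLD * window_tokens:
        flush_extract(convo)              # stage 0: preserve durable facts first
    if convo.token_count() > window_tokens:
        microcompact(convo)               # stage 1: trim verbose tool results
    if convo.token_count() > window_tokens:
        llm_summarize(convo)              # stage 2: summarization behind injection gates
    if convo.token_count() > window_tokens:
        death_spiral_fallback(convo)      # stage 3: summary + last message only
    return convo
```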

Stage 0, Flush extraction at 60%

Before compression runs at all, we extract valuable content (decisions made, facts learned, file paths used) into a daily log keyed to the agent. The log is searchable later via the durable memory layer. The critical move: stage 0 runs before any other stage, so even if everything else fails, the facts you'll need next week are preserved.
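A sketch of what the flush step might look like, assuming messages expose a text field. The extraction patterns, log layout, and filenames are illustrative assumptions; the real extractor is presumably richer.

```python
import datetime
import json
import re

# Hypothetical stage 0: pull durable facts out of the conversation and append
# them to a per-agent daily log before any compression runs.
FACT_PATTERNS = [
    re.compile(r"(?i)\bdecision:\s*(.+)"),       # explicit "Decision: ..." lines
    re.compile(r"(?i)\bfact:\s*(.+)"),           # explicit "Fact: ..." lines
    re.compile(r"[\w./-]+\.(?:py|ts|md|json)"),  # file paths the agent touched
]

def flush_extract(convo, agent_id, log_dir="logs"):
    facts = []
    for msg in convo.messages:
        for pattern in FACT_PATTERNS:
            facts.extend(pattern.findall(msg.text))
    if not facts:
        return
    day = datetime.date.today().isoformat()
    path = f"{log_dir}/{agent_id}-{day}.jsonl"   # daily log keyed to the agent
    with open(path, "a") as f:
        for fact in facts:
            f.write(json.dumps({"agent": agent_id, "day": day, "fact": fact}) + "\n")
```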

Stage 1, Microcompaction

Verbose tool results get trimmed to summaries. file_read on a 500-line file? The next turn doesn't need 500 lines, it needs the 30 lines the agent referenced. Microcompaction caps each tool result at a budget per type and keeps the rest in the run-events table for later inspection. The model gets a focused result; the audit trail keeps everything.
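In rough code, the per-type budget looks something like this. The budget numbers and the run-events shape are made-up placeholders, not the real schema.

```python
# Illustrative per-type character budgets for stage 1.
RESULT_BUDGET_CHARS = {"file_read": 2_000, "shell": 1_500, "web_fetch": 1_000}
DEFAULT_BUDGET = 1_000

def microcompact_result(tool_name, result_text, run_events):
    budget = RESULT_BUDGET_CHARS.get(tool_name, DEFAULT_BUDGET)
    if len(result_text) <= budget:
        return result_text
    # Audit trail keeps the full payload; the prompt only gets the trimmed head.
    run_events.append({"tool": tool_name, "full_result": result_text})
    return result_text[:budget] + f"\n[truncated; full result in run events, {len(result_text)} chars]"
```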

The hard part: tool-pair-aware truncation. tool_use and tool_result blocks have a paired identifier. The truncation algorithm treats them as a single unit, either both stay or both go. The model never gets a tool_use without its result. Sounds obvious; took us two production incidents to actually wire it in everywhere.
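A sketch of the pair-aware rule, assuming blocks shaped roughly like API tool blocks where the result carries the id of the call it answers. The exact field names are assumptions.

```python
# Drop a tool_use and its tool_result as one unit: both stay or both go.
def drop_oldest_tool_pair(blocks):
    for block in blocks:
        if block["type"] == "tool_use":
            pair_id = block["id"]
            return [
                b for b in blocks
                if not (b.get("id") == pair_id or b.get("tool_use_id") == pair_id)
            ]
    return blocks  # no tool pairs left to drop
```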

Stage 2, LLM summarization (with injection gates)

When microcompaction isn't enough, we summarize. But not naively. The summarizer's input passes through a pre-sanitize step that strips known injection patterns before the summary runs. The output passes through a post-scan step that re-checks for injection signatures before the summary gets used.
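The gates themselves are simple to sketch. The signature list below is a stand-in; the production pattern set is presumably much larger, and the summarize callable is whatever model call you use.

```python
import re

# Hypothetical injection gates around the stage 2 summarizer.
INJECTION_SIGNATURES = [
    re.compile(r"(?i)ignore (all |your )?previous instructions"),
    re.compile(r"(?i)disregard the system prompt"),
]

def pre_sanitize(text):
    # Strip known injection patterns before the summarizer ever sees them.
    for sig in INJECTION_SIGNATURES:
        text = sig.sub("[stripped]", text)
    return text

def post_scan(summary):
    # Re-check the output before the summary is used in the next prompt.
    return not any(sig.search(summary) for sig in INJECTION_SIGNATURES)

def summarize_with_gates(convo_text, summarize):
    summary = summarize(pre_sanitize(convo_text))
    if not post_scan(summary):
        raise ValueError("summary failed injection post-scan")  # caller falls back
    return summary
```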

We also keep summaries short and wrap them in a fenced XML tag. The summarizer is instructed to return its output inside <summary>...</summary>, and the next turn sees that block as data, not as part of the conversation. The same pattern shows up elsewhere in the system: retrieved memory blocks are bracketed with [RETRIEVED MEMORY START]...[RETRIEVED MEMORY END]. The consistent boundary stops the "summary turns into instruction" failure mode we kept hitting otherwise.
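The wrappers are just consistent boundaries; the helper names here are ours, the markers are the ones described above.

```python
# Frame compacted content as data, not conversation.
def wrap_summary(summary_text):
    return f"<summary>\n{summary_text}\n</summary>"

def wrap_retrieved_memory(chunks):
    body = "\n".join(chunks)
    return f"[RETRIEVED MEMORY START]\n{body}\n[RETRIEVED MEMORY END]"
```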

Stage 3, Death-spiral fallback

Summarization itself runs over budget sometimes. When it does, we fall back to summary-plus-last-message-only. The model always has room to generate. Graceful degradation, not a crash. The trace records that stage 3 fired, so we can see which conversations are pushing the limits.
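The fallback is small by design. A sketch, assuming a trace object with a record method; the event name is illustrative.

```python
# Stage 3: when even summarization blows the budget, keep only the wrapped
# summary and the most recent message so the model always has room to generate.
def death_spiral_fallback(convo, summary_block, trace):
    last_message = convo.messages[-1]
    convo.messages = [summary_block, last_message]
    trace.record("compaction.stage3_fired", {"kept_messages": 2})  # visible in the trace
    return convo
```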

What we'd do differently

Two things we'd revisit if we were starting over.

Async summarization. Stage 2 still blocks the user turn it triggers on. If we built it again we'd do speculative summarization in the background and only block when we actually need the summary. The complexity of doing this right is what stopped us. What if the user's next message invalidates the running summary? We didn't want racy code in the hot path. But the latency win would be material.
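The design we'd try is roughly this: start the summary in the background, and only trust it if nothing arrived to invalidate it. This is speculative, not something we shipped; all names are hypothetical.

```python
import asyncio

class SpeculativeSummarizer:
    def __init__(self, summarize):
        self._summarize = summarize   # async model call
        self._task = None
        self._snapshot_len = 0

    def kick_off(self, convo):
        # Summarize a snapshot of the conversation in the background.
        self._snapshot_len = len(convo.messages)
        self._task = asyncio.create_task(self._summarize(convo.messages[:]))

    async def get(self, convo):
        if self._task and len(convo.messages) == self._snapshot_len:
            return await self._task   # speculation still valid: reuse it
        if self._task:
            self._task.cancel()       # a new message invalidated the speculation
        return await self._summarize(convo.messages[:])  # recompute on the hot path
```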

Per-section minimums instead of a global drop order. Today the system prompt has a strict priority hierarchy. Under pressure, Mission and Operator are dropped first, then Team, then Memory, then Tools. Core identity is never truncated. That works, but the drop order is a single global ladder, which makes it awkward to tune any one section's floor without rewriting the ladder around it. We'd replace it with explicit per-section minimums and let truncation pick what to shed within each section. More explicit, less surprising.
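What the per-section-minimum version might look like, as a sketch. The section names come from the post; the floors and the section interface are invented for illustration.

```python
SECTION_MIN_TOKENS = {
    # Iteration still sheds lower-priority sections first, but each section
    # keeps its own floor instead of being dropped wholesale.
    "mission": 200,
    "operator": 200,
    "team": 400,
    "memory": 800,
    "tools": 1_500,
    "core_identity": None,   # never truncated
}

def shrink_sections(sections, over_budget_tokens):
    for name, floor in SECTION_MIN_TOKENS.items():
        if over_budget_tokens <= 0 or floor is None:
            continue
        excess = max(sections[name].tokens - floor, 0)
        cut = min(excess, over_budget_tokens)
        sections[name].truncate_by(cut)   # each section decides what to shed internally
        over_budget_tokens -= cut
    return sections
```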

The trade we made

Compaction infrastructure is unsexy. It's the kind of thing that works perfectly until it doesn't, and when it doesn't, the model produces nonsense and the on-call engineer can't tell why. The reason we wrote our own instead of pulling LangChain's was that when something goes wrong with our own stack, we can read the code, look at the trace, and find the failure mode. With a third-party dependency at this layer, debugging means reading someone else's code at 11pm with the customer waiting.

Build the part of the stack that has to be inspectable. Buy the part that doesn't.

Source: PACKWOLF engineering · derived from docs/CONTEXT_MANAGEMENT.md