I Cut Coding Agent Context Usage by 22–45% by Killing Context Bloat
A lot of AI coding workflows degrade the exact same way. At first, everything feels incredible. The agent:

- understands the project
- moves insanely fast
- eliminates boilerplate
- compounds your momentum

Then a few weeks later:

- AGENTS.md turns into a novel
- prompts get bloated
- the model starts missing obvious things
- responses become inconsistent
- token usage quietly becomes absurd

I kept running into this while building Empirical. Eventually I realized the problem wasn't "the model needs more context." The problem was "the model needs the right context at the right time." That distinction changed everything.

Most teams solve AI memory like this: "Just add it to the prompt." And over time the context fills up with:

- architecture decisions
- coding standards
- deployment notes
- UI preferences
- old implementation details
- temporary fixes
- abandoned experiments
- half-finished thoughts

Eventually every request drags all of it around forever, even when most of it has absolutely nothing to do with the current task. That creates a brutal signal-to-noise problem: the model starts treating temporary junk and critical architecture decisions with equal importance.

You can actually feel the degradation happen:

- the agent gets fuzzier
- architecture drift increases
- outputs become inconsistent
- you spend more time correcting than building

I think the industry is optimizing the wrong thing right now. Everyone keeps pushing toward:

- million-token windows
- infinite memory
- larger context sizes
- stuffing more into prompts

But humans don't work like that. Good engineering teams don't bring every document into every meeting. Most information is situational. Most memory should stay dormant until it becomes relevant.

That was the shift for me. The question stopped being "How do I fit more into context?" and became "How do I load only what matters right now?" I started treating AI memory like layered working memory instead of permanent prompt stuffing.

Keep permanent instructions extremely small:

- architecture principles
- coding philosophy
- project identity
- non-negotiables

That layer should stay lean on purpose.

Pull implementation knowledge dynamically based on:

- semantic similarity
- the current task
- related code paths
- previous work in the same area

Only relevant context enters the active prompt.

Use temporary working memory for:

- bugs
- in-progress features
- short-lived implementation decisions

Then let it expire naturally instead of polluting long-term memory forever. (A rough code sketch of these three layers is at the end of the post.)

The biggest surprise wasn't even the token savings. It was how much sharper the agents became once the noise disappeared:

- responses became more focused
- architecture stayed more consistent
- prompt babysitting dropped significantly
- outputs drifted less between sessions

The token reduction was just the measurable side effect.

| Workflow | Context reduction |
| --- | --- |
| Smaller, focused tasks | ~22% |
| Larger iterative workflows | up to ~45% |

That compounds fast once agents start looping.

I think a lot of AI tooling is accidentally recreating bad human organizational habits. We already know what happens when people dump everything into:

- giant docs
- giant meetings
- giant Slack threads
- giant Notion pages

Clarity collapses.

Coding agents seem to behave better when memory works more like human working memory:

- small active focus
- relevant recall
- long-term memory separated from immediate attention

That mattered far more than raw context size.

I wrote the complete breakdown, covering:

- the retrieval architecture
- the layered memory strategy
- implementation lessons
- where the 22–45% savings actually came from

→ Reducing Coding Agent Context Usage by 22–45% with Retrieval-Based Memory Systems
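To make the layered setup concrete, here is a minimal sketch of the two dynamic layers. This is not the implementation from the post: the class names (`LongTermStore`, `WorkingMemory`) are hypothetical, and the word-count cosine similarity is a deliberately crude stand-in for the embedding-based semantic similarity a real retrieval layer would use.

```python
# A minimal sketch of the two dynamic layers. Names are hypothetical,
# and the word-count cosine similarity is a crude stand-in for the
# embedding model a real semantic-similarity layer would use.
import math
import time
from collections import Counter
from dataclasses import dataclass, field


def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (embedding stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values()))
    norm *= math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


@dataclass
class LongTermStore:
    """Layer 2: implementation knowledge, pulled in only when relevant."""
    entries: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.entries.append(text)

    def retrieve(self, query: str, top_k: int = 3, min_score: float = 0.05) -> list[str]:
        # Score everything, drop clearly irrelevant entries, keep the best few.
        scored = [(similarity(query, e), e) for e in self.entries]
        ranked = sorted((p for p in scored if p[0] >= min_score), reverse=True)
        return [e for _, e in ranked[:top_k]]


@dataclass
class WorkingMemory:
    """Layer 3: short-lived notes that expire instead of accumulating."""
    ttl_seconds: float = 3600.0
    notes: list[tuple[float, str]] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.notes.append((time.time(), text))

    def live(self) -> list[str]:
        # Expired notes are silently dropped; nothing leaks into long-term memory.
        cutoff = time.time() - self.ttl_seconds
        self.notes = [(t, n) for t, n in self.notes if t >= cutoff]
        return [n for _, n in self.notes]
```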

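Continuing the sketch above, a hypothetical prompt assembly shows the three layers working together: only the tiny permanent layer ships with every request, while the other two contribute per task. All contents here (the store entries, the task string) are invented for illustration.

```python
# Layer 1 is deliberately tiny: identity and non-negotiables only.
PERMANENT = [
    "Architecture: hexagonal; domain logic never imports adapters.",
    "Philosophy: small PRs, tests first, no speculative abstractions.",
]

store = LongTermStore()
store.add("Auth tokens are rotated by the session middleware.")
store.add("The billing service batches Stripe webhook retries.")

scratch = WorkingMemory(ttl_seconds=3600)
scratch.add("Currently bisecting the flaky login test.")


def build_prompt(task: str) -> str:
    # Permanent layer always ships; the other layers contribute per task.
    sections = PERMANENT + store.retrieve(task) + scratch.live()
    return "\n".join([f"Task: {task}", *sections])


# Only the auth note is retrieved; the billing note stays dormant.
print(build_prompt("fix token rotation bug in auth middleware"))
```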