
# Why Your AI Agent Loses the Plot: Reasoning Decay and Attention Loss in Long-Running Tasks

*Frank Brsrk, DEV Community*

*A reference on why long-running agents fail at depth, the math behind why errors compound, and the architectural patterns that respond to it.*

If you've built anything with an LLM agent (Claude Code, a custom LangGraph workflow, an AutoGPT-style loop), you've probably seen this movie: the first ten minutes are magic. The agent reasons clearly, picks the right tools, makes steady progress. Then, somewhere around the thirty-minute mark, things get weird. The agent starts repeating itself. It forgets a constraint it acknowledged twenty steps ago. It tries an approach that already failed. It "fixes" something by reverting an earlier fix. The reasoning that looked crisp now looks confused.

This piece is about the two overlapping failure modes responsible for that drift, the structural reasons they happen, and the architectural patterns that respond to them. It is intended as a reference rather than a hot take, so it leans heavily on cited work and avoids prescriptions that aren't grounded in either practice or measurement.

## Attention loss vs. reasoning decay

The terms get used loosely, so it is worth pulling them apart.

**Attention loss** sits at the substrate level. Transformer attention spreads softmax weight across every token in context, so as conversation, scratchpad, tool outputs, and prior decisions accumulate, the share of attention any single token gets becomes thinner. The constraint set at step 3 doesn't disappear from memory; the model is just less likely to surface it cleanly when it matters again at step 40. This sits in the same family as the lost-in-the-middle effect documented by Liu et al. (2023): facts buried mid-context are recalled less reliably than the same facts placed near the start or end of the window.
The effect is task-dependent and softens in newer long-context-trained models, but the qualitative pattern is robust enough that production systems should not rely on attention to surface what matters in a long, undifferentiated blob.

**Reasoning decay** sits at the behavioral level. The chain of thought stops being crisp: it loops, it drifts, it forgets the goal, it doubles back on solved subproblems. Attention loss is one cause, but not the only one. Even with perfect retrieval and a fresh context, multi-step reasoning has a mathematical floor that worsens with horizon length. Fixing the context alone does not save you from the math; fixing the math alone does not save you from a polluted context.

## The math of compounding errors

If each step in an agent's plan is independently 95% reliable (which is very good), a 20-step plan succeeds at:

    0.95^20 ≈ 0.36

A 100-step plan succeeds at 0.95^100 ≈ 0.006. Six in a thousand. The independence assumption is a simplification: agent errors are correlated, because a model that misunderstands the task at step 2 tends to misunderstand it at step 12. That worsens the picture rather than improving it. And unlike pure reasoning, agents cannot always undo their actions. A non-refundable booking, a deleted file, a sent email do not roll back when tokens regenerate.

This is why long-horizon agent benchmarks show steep failure curves past a few hundred dependent steps. METR's work on long-horizon task completion, for instance, has found that doubling task duration roughly quadruples failure rate, with a noticeable cliff in the 30-to-40-minute range for current-generation agents. The cliff moves outward as base models improve, but the curve shape is robust enough to design against.

## Two layers of structural cause

The structural cause has two distinct layers, and a serious response engages both. The first layer is the architecture *around* each reasoning step: where information flows, how state is preserved, how subgoals are decomposed, how steps connect.
Most documented patterns for long-running agents operate here. They shape the agent system around the model. The second layer is the structure *inside* each reasoning step: what shape the model's reasoning takes when it fires, what failure modes it actively blocks, what scaffold its conclusion is built against. By default, all of that is implicit. The model improvises a reasoning path each time. Improvisation is fine in shallow tasks; it is where the wheels come off in long ones.

The sections below describe five established patterns at the first layer and an emerging pattern at the second. They compose. Each addresses a different surface of the same underlying problem.

Under the hood, several mechanisms feed into the spiral:

- **Context pollution.** Failed tool calls, dead-end reasoning, retry chatter, and stale state all stay in the window unless explicitly evicted. They keep competing for attention forever.
- **Goal drift.** Without periodic re-grounding, the agent optimizes against a slowly mutating version of the original task. By step 50 it is solving a problem that is subtly not the one asked.
- **Confidence miscalibration.** The model often cannot tell its own earlier reasoning was wrong, so it builds on top of bad assumptions instead of backtracking. Hallucinated tool parameters become "facts" by step 15.
- **Loop traps.** Agents get stuck in cycles (try X, fail, try Y, fail, try X again) because the failure signal is not structured strongly enough to break the pattern.
- **State/world mismatch.** The agent's internal model of the file system, the database, or the API state diverges from reality and never gets corrected.

Better models help with all of these (confidence calibration in particular tracks capability), but they do not make the problems disappear. The shape of the failure is structural: information accumulates inside a finite-attention process and errors propagate through dependent steps. Architecture is the higher-leverage axis, and it compounds with whatever the model gives.
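Before moving on to the patterns, the compounding arithmetic from the earlier section is easy to verify numerically. A minimal sketch (the function name is mine, not from any framework, and independence is the same optimistic simplification the text flags):

```python
def plan_success_probability(step_reliability: float, steps: int) -> float:
    """P(every step succeeds), assuming independent per-step errors.
    Correlated errors, which real agents exhibit, only make this worse."""
    return step_reliability ** steps

for n in (20, 100):
    p = plan_success_probability(0.95, n)
    print(f"{n:>3} steps at 95% per-step reliability -> {p:.3f}")
# prints roughly 0.358 for 20 steps and 0.006 for 100 steps
```

The exponential shape, not the exact numbers, is the point: no realistic per-step reliability survives a long enough dependent chain on its own.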
## The five architectural patterns

These are patterns that have emerged in practice. They were largely discovered by people whose agents kept breaking and have since been documented in engineering reports and research literature.

### 1. Compile the context instead of accumulating it

The default agent loop appends everything: every prompt, every tool call, every result, every reflection. At each step, build the context deliberately from a smaller, structured store instead.

```python
def build_step_context(task, state):
    return {
        "system": SYSTEM_PROMPT,
        "task": task.goal,                    # always present, never edited
        "constraints": task.constraints,      # always present
        "current_subgoal": state.current_subgoal,
        "recent_steps": state.history[-3:],   # last few only
        "relevant_artifacts": retrieve(       # pulled in by relevance
            query=state.current_subgoal,
            store=state.artifact_store,
            k=5,
        ),
        "scratchpad": state.scratchpad,       # explicitly managed
    }
```

The agent does not see "everything that has happened." It sees a compiled view relevant to right now. The full history lives in an external store, and only what is needed gets surfaced.

---

### 2. Planner-worker decomposition

This architecture has become the default for serious long-running agents and is documented at length in Anthropic's *Building Effective Agents* (2024), which describes orchestrator-worker variants used in Claude Code and similar systems. Cursor, AWS Strands, and Google's ADK use closely related patterns.

```
┌─────────────────────────────┐
│  Planner (frontier model)   │
│  - Holds the high-level     │
│    goal and strategy        │
│  - Decomposes into tasks    │
│  - Reviews results          │
└──────────────┬──────────────┘
               │
        ┌──────▼──────┐
        │ Task queue  │
        └──────┬──────┘
               │
   ┌───────────┼───────────┐
   ▼           ▼           ▼
┌────────┐ ┌────────┐ ┌────────┐
│Worker 1│ │Worker 2│ │Worker 3│
│ (short │ │ (short │ │ (short │
│  loop) │ │  loop) │ │  loop) │
└────────┘ └────────┘ └────────┘
```

The planner stays uncluttered because it never touches per-task tool-call noise. The workers stay uncluttered because each is a short-lived loop with a narrow goal.
No single context window has to carry the whole task. This pushes the cliff outward by shortening the dependency chains any single reasoning loop has to maintain.

---

### 3. Externalize state, then re-read it deliberately

Don't trust attention to surface what matters. Write key decisions, constraints, and progress to durable artifacts (files, a structured scratchpad, a small database) and have the agent re-read them at decision points.

```python
# Bad: hope the model remembers
agent.run(task)

# Better: explicit re-grounding
done = False
while not done:
    plan = agent.plan(
        task=task,
        constraints=read_file("constraints.md"),
        progress=read_file("progress.md"),
    )
    result = execute(plan)
    update_file("progress.md", result)
    done = check_done(task, result)
```

The agent's "memory" becomes a thing one can inspect, version, and edit. Debugging gets dramatically easier as a side effect.

---

### 4. Critic loops and self-reflection

If per-step reliability has a hard ceiling, the way out is making errors catchable rather than rarer. Shinn et al. (2023) formalized this in Reflexion, where an agent receives verbal feedback on its own outputs and refines them iteratively. The simpler form is a separate critic agent reviewing each step before it commits.

```python
def step_with_critic(state, revisions=0, max_revisions=3):
    proposal = actor.propose(state)
    critique = critic.review(proposal, state)
    if critique.approves:
        return execute(proposal)
    if revisions >= max_revisions:
        # don't recurse forever on a step the critic keeps rejecting
        return escalate(state, critique)
    return step_with_critic(state.with_feedback(critique),
                            revisions + 1, max_revisions)
```

This is the insight behind frameworks that have pushed reliable agent execution to long horizons: stop chasing lower individual error rates, design for error correction.

---

### 5. Bounded retries and explicit loop detection

Detect cycles and break out programmatically.
A simple hash of recent (action, result) pairs catches a lot of loops the model cannot see itself in:

```python
from collections import deque

# a bounded window, so stale signatures age out of "recent"
recent_signatures = deque(maxlen=20)

def take_step(state):
    proposal = agent.propose(state)
    sig = hash((proposal.action, proposal.target))
    if recent_signatures.count(sig) >= 2:
        return escalate_to_planner(state, reason="loop_detected")
    recent_signatures.append(sig)
    return execute(proposal)
```

The agent often cannot notice it is in a loop. The architecture has to.

## The second layer: structuring the reasoning step itself

The five patterns above all operate *around* the reasoning step. They shape what information the model receives, what other models check its work, and what happens between thoughts. Inside the thought itself, the model is still improvising.

There is a complementary pattern that addresses the inside of the step: provide the reasoning structure itself, retrieved at runtime, matched to the task type, injected before the model reasons. The model still does the reasoning. It does it against a scaffold that names the path, blocks the shortcut, and identifies the failure mode to actively avoid.

Conceptually, the artifact looks like this:

- **Negative gate:** the failure mode to actively block, named explicitly
- **Procedure:** ordered steps with backtrack-if conditions
- **Topology:** a small DAG of S (steps), G (gates), N (failure traps), and M (reflection nodes that let the model abandon the current path and re-enter at a named step)
- **Target pattern:** what correct reasoning looks like for this task type
- **Suppression:** signals biasing the model away from the shortcut and toward the structural check

In code, the integration point is shallow: the topology is fetched at the start of the reasoning step, prepended to context, and the model proceeds.

```python
# Conventional: implicit reasoning
result = agent.reason(task)

# With injected reasoning structure
topology = topology_library.match(task)   # task-matched scaffold
result = agent.reason(task, scaffold=topology)
```

Different task types want different topologies.
A coding task wants an engineering procedure with explicit backtrack conditions. A long-horizon analytical task wants a metacognitive loop that re-grounds against the goal at each gate. An advice or judgment task wants something closer to a directness enforcer, not a deliberative scaffold; applying a deliberative reasoning structure to advice tasks introduces hedging where directness was the right answer. Selecting the right topology for the task is the engineering problem most naive implementations underestimate.

This pattern shares lineage with programmatic-prompting frameworks like DSPy (Khattab et al.), which compiles prompt programs at design time. The runtime-injection variant differs in that the structures are retrieved per task rather than compiled once, which lets the topology track task type at inference rather than at deployment.

What this addresses is the part of the failure surface the architectural patterns leave untouched. Context engineering ensures the right information reaches the model; it does not constrain how the model reasons over it. Critic loops catch errors after the fact; they do not prevent the shortcut at its source. Loop detection catches behavioral cycles; it does not address the reasoning shape that produced the cycle. Runtime injection acts before the model commits, which is structurally earlier than any of the architectural patterns can intervene.

It is not a substitute for the first-layer patterns. It composes with them. The two layers address two different surfaces of the same problem: the path between reasoning steps and the structure inside each step.

## When not to bother

These mitigations are not free. Planner-worker layers the planner's tokens on top of every worker's, with overhead ranging from modest to roughly doubling total inference cost depending on how the split lands. Critic loops add another model pass per step. Curated context retrieval adds latency and infra overhead.
Logging state to disk between steps slows everything down. Runtime topology injection adds one extra call per agent invocation.

A useful rule of thumb: if the task completes in under five minutes of agent runtime and under twenty dependent tool calls, none of these patterns are necessary. Reach for them when the task cannot fit that envelope.

There is a measurement question hiding here as well. "My agent gets worse over time" and "my agent cannot do this task at all" look identical from the outside but require different fixes. Before architecting around decay, confirm decay is what is actually being seen. Log per-step success against horizon length and look for a curve. A flat-and-high failure rate is a capability problem, and these patterns will not help with it.

## The takeaway

The pattern shows up across model families and sizes because the cause is structural:

- Attention is finite, so unbounded context accumulation drowns the signal that needs to be heard.
- Per-step errors compound badly with horizon length, so individual step accuracy alone cannot carry a long task.
- The agent cannot reliably detect its own decay, so the correction has to come from the system around it.
- The reasoning step itself has a default shape that breaks at depth, so making the reasoning structure explicit and task-matched is a leverage point separate from the architectural patterns.

The teams getting the most out of long-running agents are not the ones leaning on the biggest context windows. They are the ones treating the agent as a system with multiple distinct layers and engineering each one rather than hoping for it: context compiled rather than accumulated, horizons decomposed rather than bulldozed, state externalized rather than implicit, and reasoning structure provisioned rather than improvised.

The deeper shift behind all of this is that the next era of agents will not be defined by how big the context window gets or how smart the next base model is.
It will be defined by the cognitive infrastructure that wraps the model: the reasoning structure injected at the right moment, the context compiled at the right granularity, the failure modes blocked before the model commits, the route between thoughts engineered rather than left to chance. The model is one component. The reliable agent is the model plus the architecture that keeps it crisp under load.

Build for decay. The future maintainer, debugging an agent that spent four hours politely reverting its own work, will be glad of it.

If you've hit your own variant of the 35-minute cliff, the comments are open. Failure modes are useful; the more of them get cataloged, the less guesswork goes into the next system that has to survive past hour two.

## References

- Liu, Nelson F. et al. (2023). *Lost in the Middle: How Language Models Use Long Contexts.* Transactions of the Association for Computational Linguistics.
- Anthropic (2024). *Building Effective Agents.* Engineering blog.
- Shinn, Noah et al. (2023). *Reflexion: Language Agents with Verbal Reinforcement Learning.* NeurIPS.
- Khattab, Omar et al. *DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.* Stanford NLP.
- METR. *Measuring AI Ability to Complete Long Tasks.* Long-horizon agent benchmarking.