Harness Engineering: The Most Important Part of AI Agents
TL;DR: LLMs don’t become agents because they’re more intelligent, but because we place them inside a system that makes them usable.

In recent years, we’ve seen an impressive acceleration in the world of language models. Every month (every day!) something more powerful, more efficient, more “intelligent” comes out. And inevitably, the conversation always focuses there: which model to use, how many parameters it has, how well it performs on benchmarks.

But when you try to build something real, something interesting happens: the model stops being the main problem. When you move from a demo to a system that actually has to work (with real users, messy data, unpredictable edge cases), you realize that the LLM alone isn’t enough. Not because it isn’t powerful enough, but because it isn’t designed to be reliable. This is where what’s called harness engineering comes into play.

There’s a concept that comes up often lately: agent = model + harness. It sounds like a simplification, but it’s actually a very accurate description of what happens in practice. The model generates text. The harness decides what that text means, what to do with it, when to trust it, and when not to. It’s a subtle distinction, but it completely changes the way you design a system. Because the moment you start building an agent, you are implicitly also building a way to manage context, call external tools, verify that the output makes sense, and recover when something goes wrong. And none of that lives inside the model. It lives around the model.

Anyone who has worked even a little with these systems has already seen the problem: same prompt, same input, two different outputs. That’s not a bug; it’s the nature of the model. LLMs are not deterministic systems designed to be 100% reliable. They are excellent at generalizing, less so at guaranteeing consistency. And this is where the developer’s role changes. You’re no longer writing code that does things; you’re building the system that makes a probabilistic model behave reliably. And that system is the harness.
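To make the agent = model + harness split concrete, here is a minimal sketch of a harness loop. Everything in it is hypothetical: `call_model` stands in for a real LLM call, and `TOOLS` is an invented registry. The point is only to show where the harness's responsibilities live: parsing the model's text into a structured action, trusting only registered tools, and retrying when the output is unusable.

```python
import json

# Hypothetical stand-in for a real LLM call. In practice this would hit an
# API and could return anything: valid JSON, prose, or garbage.
def call_model(prompt: str) -> str:
    return '{"tool": "search", "query": "harness engineering"}'

# The harness side: an explicit registry of tools the agent is allowed to use.
TOOLS = {
    "search": lambda query: f"results for: {query}",
}

def run_agent(prompt: str, max_attempts: int = 3) -> str:
    """Turn the model's text into an action, dispatch it, retry on failure."""
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            action = json.loads(raw)        # soft text -> structured action
            tool = TOOLS[action["tool"]]    # trust only registered tools
        except (json.JSONDecodeError, KeyError):
            continue                        # recover: ask the model again
        return tool(action["query"])
    raise RuntimeError("model never produced a usable action")
```

Note that none of the interesting logic here is in the model: the parsing, the allowlist of tools, and the retry policy are all harness decisions.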
For a long time, we treated these problems as an extension of prompt engineering. “Let’s write a better prompt.” That works, up to a point. Then you start adding automatic retries, structured parsing, output validation, memory between steps. And without realizing it, you’re no longer working on a prompt: you’re designing a system. This is probably the most important transition: moving from thinking in terms of input/output to thinking in terms of flows, states, and controls.

A useful way to think about the harness is as a translation layer. On one side, the model: operating in natural language, probabilistic, flexible. On the other side, the real world: APIs that break, incomplete data, rigid formats, irreversible actions. The harness sits in between and acts as a mediator. It takes something “soft” (the model’s text) and turns it into something “hard” (concrete actions).

Let’s say we have two applications using the exact same LLM and getting completely different results. At first, it seems strange. But looking closer, you realize the difference isn’t in the model. It’s in everything around it:

- How context is managed
- When tools are called
- What happens when something fails

In other words: the harness.
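The retries, structured parsing, and output validation mentioned above compose into exactly the kind of small system that outgrows a prompt. A minimal sketch, with an invented flaky `call_model` and an assumed required-keys schema:

```python
import json

# Hypothetical flaky model: returns prose once, incomplete JSON once,
# then a valid response. Simulates the "same prompt, different output" problem.
_responses = iter([
    "Sure! Here is the data you asked for.",    # not parseable at all
    '{"city": "Rome"}',                         # parseable but incomplete
    '{"city": "Rome", "temp_c": 21}',           # valid
])

def call_model(prompt: str) -> str:
    return next(_responses)

# Assumed schema for this example: the fields the rest of the system needs.
REQUIRED_KEYS = {"city", "temp_c"}

def validated_call(prompt: str, retries: int = 3) -> dict:
    """Retry until the output both parses and passes the schema check."""
    for _ in range(retries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)          # structured parsing
        except json.JSONDecodeError:
            continue                        # automatic retry
        if REQUIRED_KEYS <= data.keys():    # output validation
            return data
    raise ValueError("no valid output after retries")
```

Each guard here is trivial on its own; the shift is that together they form a flow with states and controls, not a prompt.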
