The AI audit rep-curve: why 1 run gives you 67 percent reliability

For most of 2025, the standard AI-search audit I saw from peer agencies looked the same: run a list of prompts once each, screenshot the outputs, code the citations, write the report. Sometimes the prompt list was thoughtful. Sometimes the engine coverage was comprehensive. The methodology, though, almost always assumed that one run per prompt was enough. It isn't. We learned this slowly, then quickly, then expensively.

Our first GEO audit, back in mid-2025, ran 30 prompts once each on four engines and shipped the report. The client made a budget decision based on it. A month later, doing a follow-up before any work had actually been implemented, we re-ran the same prompts and got materially different citation results on a notable share of them. The variance was bigger than the trend we'd been claiming. The report we'd shipped was, in retrospect, an artifact of a single-day snapshot of these engines' behavior. We hadn't lied; we'd just overstated our certainty.

So we ran the structured experiment that produced the 800-run baseline. The point of the baseline wasn't to find a tier rate. It was to find out how many reps you needed before the tier rate stabilized. We ran each of our 40 baseline prompts on each of 4 engines, 5 times each (the 800 runs). For each prompt-engine pair, we asked: how does the modal tier code change as we add more reps? After 1 rep, the tier code agrees with the 5-rep mode about 67% of the time. After 2 reps (taking the mode of the first two), about 78%. After 3 reps, about 88%. After 4 reps, about 95%. After 5 reps, 100% by definition, since the 5-rep mode is the reference.

A third of single-run audits, by this measure, return a tier code that doesn't match the underlying signal once you sample more deeply. That's the noise floor. Audits that don't account for it are presenting noise as if it were signal. We've since pre-registered 5 reps as our minimum for client-facing audits. The agency I work with has burned the report templates that used 1-shot data, partly to remove the temptation to fall back to them under deadline pressure.
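To make the convergence check concrete, here's roughly how we compute a rep-curve like the one above. The data shape and tier labels are illustrative, not our real runs, and the tie-breaking rule (earliest rep wins) is a choice you'd want to pre-register too:

```python
from collections import Counter

def modal_tier(codes):
    """Mode of a list of tier codes; ties broken by earliest run order."""
    counts = Counter(codes)
    best = max(counts.values())
    for code in codes:
        if counts[code] == best:
            return code

def rep_curve(runs, max_reps=5):
    """Share of prompt-engine pairs whose k-rep mode matches the full-run mode.

    runs: dict mapping (prompt, engine) -> list of tier codes in run order.
    """
    agreement = {}
    for k in range(1, max_reps + 1):
        hits = sum(
            modal_tier(codes[:k]) == modal_tier(codes)
            for codes in runs.values()
        )
        agreement[k] = hits / len(runs)
    return agreement

# Illustrative data only: two prompt-engine pairs, 5 reps each.
runs = {
    ("best crm for smb", "perplexity"): ["A", "B", "B", "C", "B"],
    ("what is geo", "chatgpt"): ["A", "A", "A", "A", "A"],
}
print(rep_curve(runs))  # {1: 0.5, 2: 0.5, 3: 1.0, 4: 1.0, 5: 1.0}
```

On real data you'd run this over all 160 prompt-engine pairs from the 40-prompt, 4-engine baseline; the shape of the output is the 67/78/88/95/100 curve described above.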
Why is a single run so unstable? A few reasons, none of them surprising once you see them.

First, the engines are non-deterministic by design. Temperature, sampling, and routing decisions vary run to run. Even if the underlying retrieval is stable, the synthesized answer isn't.

Second, the retrieval surface itself is volatile. Perplexity in particular re-queries the live web, and what gets surfaced on a Tuesday morning may not be what gets surfaced Thursday afternoon. Crawl freshness, server response times, and CDN caching all influence what's available to cite.

Third, prompt phrasing has subtle effects. The same intent expressed two days apart by the same human can end up phrased slightly differently, and small phrasing changes can route to different sub-systems inside an engine. We've tried to control for this by holding prompt phrasing constant across reps; even then, output variance is meaningful.

Running 5 reps instead of 1 is 5x the data-collection effort. That's real. In our process, we've automated screenshot capture and citation extraction enough that the marginal cost per rep is mostly engine response time, not human time. Coding is still human. We've added a second coder on a subset of runs to measure inter-rater reliability, which adds further overhead. For clients, this affects pricing and timelines. A "fast audit" that promises results in three days using single-rep methodology is, in our view, selling a partial product.

We've lost some prospective engagements where speed was the deciding factor. We've kept the engagements where the buyer cared about whether the audit told them something true.

If we were starting over, we'd begin with a small replication study before any client work. Even a 10-prompt rep-curve study takes maybe a day and would have saved us the credibility cost of the early single-run reports. We didn't do that. We assumed the engines were more stable than they are.

We'd also be more aggressive about reporting confidence ranges, not point estimates. The "23% A+B tier" number from our baseline has a meaningful confidence interval around it. We've started reporting that interval in client work. It's harder to communicate than a clean point estimate. It's also more honest.

Our standard audit deliverable has changed in three ways since we adopted the 5-rep minimum.

First, every tier-rate number comes with a confidence range. "23% A+B tier, with a 95% confidence interval of roughly 19-27% given our sample size" is what we write now (a sketch of the interval math follows below). The interval is wider than clients sometimes expect. We've found that the clients who push back on the interval are usually the ones we end up disappointing later; the clients who accept the interval as honest tend to be the ones we work with productively over the long run.

Second, we explicitly call out tier shifts that occurred between reps. "On 14 of 40 prompts we observed at least one tier shift across the 5 reps, which means a single-run audit would have given a misleading code on those prompts" is the kind of sentence we now include. This makes the report longer and the reader's job harder. We think it's worth it.

Third, we include a methodology section that names what we did and didn't control for. Pre-registration status. Whether the coder was blind. Whether prompts were paraphrased between reps. Whether the audit was run across time of day, time zone, account state, or other variables that might affect engine routing. Most of those answers are still "no, we didn't fully control for that," but writing them down keeps us honest about what we know.
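One standard way to compute an interval like that is the Wilson score interval for a binomial proportion; the counts below are illustrative, not our actual coded-run totals, but with 92 A+B codes out of 400 they land close to the range quoted above:

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Illustrative counts only: 92 A+B-tier codes out of 400 coded runs = 23%.
lo, hi = wilson_interval(92, 400)
print(f"23% A+B tier, 95% CI roughly {lo:.0%}-{hi:.0%}")  # roughly 19%-27%
```

The point of showing the math is that the interval width is driven by sample size: halve the number of coded runs and the range widens noticeably, which is exactly why single-run audits can't honestly report tight numbers.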
Five reps means more data to capture and code. We've leaned on automation for the capture side (screenshots, citation rail extraction, prompt logging) and kept humans on the coding side. We've experimented with using an LLM to do first-pass tier coding, and the results have been promising but not yet reliable: the LLM agrees with human coders on about 84% of records in our internal tests, which is good enough to be useful as a first pass but not good enough to ship unchecked. Our current workflow is: automated capture, LLM first-pass coding, human review with the LLM's coding visible as a prompt, and a second human coder on a 20% sample for inter-rater reliability. This roughly doubles per-audit throughput compared to all-human coding without measurably degrading reliability in our spot checks. The agency I work with is still iterating on this stack, and we've ruled out fully automated reporting for the foreseeable future. The cost of a confident-sounding wrong audit is too high.

Five reps is the minimum that worked in our setup. It's not necessarily the right minimum for everyone. If your prompt set has higher intrinsic variance (very ambiguous prompts, very volatile topics, very fresh news cycles), you may need more. If your prompts are tightly scoped factual questions about stable topics, you might get away with fewer, but I'd want to measure that before claiming it.

The open question I haven't answered yet: does the rep-curve shape vary by engine? My intuition is that Perplexity needs more reps than ChatGPT, but I haven't seen the breakdown cleanly in our data. If anyone has run that comparison rigorously, I'd want to read the methodology.

There's also a meta-question I keep coming back to. Five reps stabilizes the modal tier code, but the variance itself is information. A prompt where five reps return five different tiers tells you something different from a prompt where five reps all return the same tier. We've started reporting both the modal tier and a stability score per prompt (a minimal version of that score is sketched at the end of this post). Whether clients find that useful is still an open question; some have, some haven't.

If you're auditing AI search performance for a client right now using single-run data, what would it take to get you to add a second pass? In our experience the answer was an embarrassing client follow-up. There's a cheaper way to learn this lesson.
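And for anyone who wants to try the stability score mentioned above: the minimal version we started from is just the share of reps that land on the modal tier. This definition is our working assumption, not a standard metric:

```python
from collections import Counter

def stability_score(codes):
    """Share of reps matching the modal tier.

    1.0 means every rep returned the same tier;
    1/len(codes) means every rep returned a different tier.
    """
    counts = Counter(codes)
    return max(counts.values()) / len(codes)

print(stability_score(["B", "B", "B", "B", "B"]))  # 1.0 -- perfectly stable
print(stability_score(["A", "B", "C", "D", "F"]))  # 0.2 -- pure noise
```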