I built 14,000 lines of code before talking to a single user. Here's what I learned.
Ten weeks ago I was convinced I was building something nobody else had.

The idea: a GitHub App that reads every changed Python function in a PR, infers what it should always do ("the total should never be negative", "the result should never be None"), then attacks it with adversarial inputs in a hardened Docker sandbox. Only posts a comment when it has concrete proof something broke. No guessing. No noise. Just:

    LogoMesh found 1 issue
    Negative quantity bypasses checkout validation
    Property: Order total should always be ≥ 0
    I called: checkout(item_id=1, qty=-5)
    Got: Order created with total -$49.95

207 unit tests passing. Docker sandbox with nobody user, --network=none, memory caps, randomized filenames. Time-Travel Trace that captures exact variable state at the crash frame.

14,000 lines of code. Zero customer conversations.

Then I looked at the real numbers.

90% silence rate on real PRs. Of the findings it did post, most were sandbox artifacts: the tool generating tests that hit its own scaffolding, not real bugs. A validator that drops 86% of raw findings just to keep the noise low. p95 latency of 394 seconds against a 60 second target.

Classic mistake. Built first, validated never.

So I stopped and started talking to people.

What I kept hearing wasn't "I need adversarial testing on my PRs." It was something simpler:

"When a prod bug hits I spend 30-45 minutes just writing the repro test before I can even start fixing."

Read the Sentry trace. Reconstruct the state. Write a test. Run it. Wrong inputs. Fix the test. Run it again. By the time it reproduces you've lost the whole debugging flow.

That's the actual pain. Not catching future bugs, but dealing with the ones already in production.

So now I'm building something much simpler. Paste a Sentry URL, get a failing pytest that reproduces the exact crash, runs against your current branch, tells you "still reproduces" or "your branch fixed it." One command. Under a minute.
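
To make the "still reproduces" / "your branch fixed it" verdict concrete, here's a minimal sketch of the idea. It's not the actual tool. The names `verdict`, `average_price_buggy`, `average_price_fixed`, and `captured` are all made up for illustration, and it assumes the crash arguments have already been recovered from the trace as a plain dict.

```python
from typing import Any, Callable, Dict, Type


def verdict(
    target: Callable[..., Any],
    captured_args: Dict[str, Any],
    expected_exc: Type[BaseException],
) -> str:
    """Replay a captured call and say whether the original crash still happens.

    Hypothetical helper: `captured_args` stands in for whatever state was
    recovered from the crash frame. Any exception other than `expected_exc`
    is left to propagate so it isn't mistaken for the original bug.
    """
    try:
        target(**captured_args)
    except expected_exc:
        return "still reproduces"
    return "your branch fixed it"


# Hypothetical function as it looked when the crash was reported.
def average_price_buggy(prices):
    return sum(prices) / len(prices)  # ZeroDivisionError on an empty list


# Hypothetical function on the current branch, with the fix applied.
def average_price_fixed(prices):
    if not prices:
        return 0.0
    return sum(prices) / len(prices)


# Assumed shape of the arguments captured at crash time.
captured = {"prices": []}

print(verdict(average_price_buggy, captured, ZeroDivisionError))
print(verdict(average_price_fixed, captured, ZeroDivisionError))
```

The first call prints "still reproduces" and the second prints "your branch fixed it". The hard part in practice is everything this sketch skips: turning serialized frame locals back into real objects and running the replay somewhere safe.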

The engine I spent four weeks on isn't wasted. The Docker sandbox, the frame locals injection, the binary verdict reliability: those are exactly what make this work. The positioning just changed completely.

The question I'm still trying to answer: Is the 30-45 minute manual repro step a universal pain or just my workflow being slow? Do you write a reproducing test before fixing a prod bug, or do you just read the trace, push the fix, and monitor?

Genuinely trying to figure out if I'm solving a real problem this time before writing another 14,000 lines nobody asked for.
