AI Is Very Good at Implementing Bad Plans

DEV Community
Hector Haung

Most AI coding posts focus on the code: which model writes cleaner functions, which one needs less prompting, which one hallucinates less. But the code isn't usually where my projects break. The plan is.

A few weeks ago I asked Claude Code to plan a BigQuery dedup pipeline. Routine stuff: pull events from Postgres into GCS, load into BigQuery, dedup by event ID, impute some missing checkout rows. The plan came back in maybe 90 seconds. Six steps, clean SQL, sensible-looking error handling. I almost just told it to start coding.

Then I tried something. I sent the same plan to Codex and Gemini and asked each one separately to break it. Three models. Same plan. No shared context. None of them knew what the others wrote. Here's what came back.

All three caught the same dedup bug: the `INSERT INTO order_events_dedup` step wasn't idempotent, so any retry doubled yesterday's rows. The existing alert ("less than 50% of expected") is one-sided and would never fire on an over-count. That's the easy one. The interesting findings were the ones only one model caught.

Only Claude caught this: `WHERE m2.user_id = user_id` doesn't bind the way the writer intended under BigQuery's scoping rules. The imputation step would silently do nothing after day one, and the pipeline's whole purpose (filling in missing checkout events) would fail invisibly for 2–8 weeks before anyone noticed. Codex and Gemini both quoted this exact SQL block in their reviews; neither tested whether the join actually binds.

Only Gemini caught this: GROUP BY only sees rows within a partition, so a duplicate pair that straddles two partitions never gets deduped.

And only Codex caught a third one.

Three different blind spots. Three different models. If I'd gone with any one model's review, I'd have shipped two of these bugs.

The actual loop is small enough to fit in a bash script.
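An aside before the script: the consensus finding is easy to demonstrate. Here's a toy shell sketch, with plain files standing in for BigQuery tables (nothing below is the pipeline's real SQL), of why an append-style insert doubles rows on retry while a replace-the-partition load stays idempotent:

```shell
#!/usr/bin/env bash
# Toy illustration of the retry-doubling bug. Plain files stand in for
# BigQuery tables; the load functions are hypothetical stand-ins.
set -euo pipefail
tmp=$(mktemp -d)

# Append-only load, like a bare `INSERT INTO order_events_dedup ...`:
# running it twice keeps both copies.
naive_load() { cat "$1" >> "$tmp/naive.out"; }

# Idempotent load: replace the day's "partition" before writing, like a
# `DELETE ... WHERE event_date = @run_date` followed by the INSERT.
idempotent_load() { cp "$1" "$tmp/idem.out"; }

printf 'e1\ne2\ne3\n' > "$tmp/day1.csv"

naive_load "$tmp/day1.csv"
naive_load "$tmp/day1.csv"        # retry after a flaky failure
idempotent_load "$tmp/day1.csv"
idempotent_load "$tmp/day1.csv"   # same retry

echo "naive: $(wc -l < "$tmp/naive.out" | tr -d ' ') rows"       # 6 — doubled
echo "idempotent: $(wc -l < "$tmp/idem.out" | tr -d ' ') rows"   # 3
```

And note why the one-sided alert misses this: a "less than 50% of expected" check only fires on under-counts, so the doubled day sails through; catching it needs a two-sided bound.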
```bash
PROMPT=$(cat prompts/system-prompt.md plan.md)

echo "$PROMPT" | claude --print > out/claude.md &
echo "$PROMPT" | codex exec --skip-git-repo-check > out/codex.md &
echo "$PROMPT" | gemini --skip-trust > out/gemini.md &
wait

# 4th call merges and ranks the three reviews
{ cat prompts/consolidation-prompt.md \
      out/claude.md out/codex.md out/gemini.md; } \
  | claude --print > out/ranked.md
```

Three CLIs in parallel. Same prompt. No shared context. A fourth call to merge. Wall time: 5–15 minutes (the merge step dominates). Cost: about $0.10–0.20 for a sample plan, $0.50–2.00 for a production-size one.

The prompt sent to all three models has one job: force concrete failure scenarios and reject abstract advice. Five dimensions:

1. HIDDEN ASSUMPTIONS — ordering, uniqueness, atomicity, data freshness, caller behavior. What does this design implicitly depend on?
2. DEPENDENCY FAILURES — upstream/downstream services, external APIs, databases, messaging. What breaks if a dependency degrades?
3. BOUNDARY INPUTS — empty, single, huge batch, malicious, malformed.
4. MISUSE PATHS — caller misbehavior, users skipping steps, out-of-order operations.
5. ROLLBACK & BLAST RADIUS — how to recover, and the scope of the damage. Five-minute detection or five-day detection?

For each scenario:

- TRIGGER: what causes it
- IMPACT: who is affected, and how badly
- DETECTABILITY: how long until someone notices

Reject abstract advice like "add monitoring": specify which metric, which threshold, which alert.

That last instruction is doing most of the work. Without it you get "consider rate limiting" and "ensure proper error handling." With it you get the midnight-boundary race.

Three models in parallel isn't impressive. Anyone can run three CLIs. The thing that surprised me is how rarely the unique findings overlap. Claude tends to over-warn: it flags five defensive checks that aren't really bugs, but it actually reads the SQL. Codex is concise.
It skips integration details, but it notices file-format and infra failure modes the others gloss over. Gemini stays surface-level a lot of the time, but when it does dig in, it's often a concurrency or partition issue the others missed.

You don't get this from ensemble averaging. The consensus findings are the obvious ones; the unique findings are the ones a single-model review would have quietly missed. That's the whole point.

This is a workflow, not a system. No orchestrator, no shared scratchpad, no consensus protocol, no agent class hierarchy. Three CLIs in parallel, a fourth call to merge. If you want an installed framework with marketplace plugins, there are several. This is the opposite shape: ~30 lines you paste into your CLAUDE.md, and the next time you ask Claude Code to review a plan, it fans out to Codex and Gemini in parallel and brings back a merged report.

The full method, both case studies (the BigQuery pipeline above plus a Cloud Run + Workflows deploy), and the 100-line redteam.sh are in a small repo: https://github.com/permoon/multi-model-redteam

Three install tiers, depending on what you have set up:

- Tier 0: paste 30 lines into CLAUDE.md. No install.
- Tier 1: git clone and run the bash script.
- Tier 2: copy the prompt into the Claude / ChatGPT / Gemini chat UI. One model only, but better than no frame.

It's also a teaching repo: seven chapters, from "why one LLM isn't enough" to the parallel script.

The reason I'm posting this: I want to know if other people are doing something similar. Are you red-teaming AI-generated plans before letting the model implement them? With one model? Multiple? Or are you mostly trusting the plan and reviewing the code afterward? If you've tried this and it didn't work for you, I'd especially like to hear that.
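One last sketch for anyone who wants to kick the tires on the control flow before installing all three CLIs: the fan-out/merge skeleton with a placeholder `review` function standing in for the real model commands. The stand-in function is mine, not from the repo; swap in `claude --print`, `codex exec`, and `gemini` once they're set up.

```shell
#!/usr/bin/env bash
# Fan-out/merge skeleton. `review` is a hypothetical stand-in for the real
# model CLIs so the control flow runs anywhere.
set -euo pipefail
outdir=$(mktemp -d)
PROMPT="red-team this plan"

# Each stand-in reads the prompt on stdin and writes a tagged "review"
# to its own file, exactly like the per-model redirects in the real loop.
review() { sed "s/^/[$1] /" > "$outdir/$1.md"; }

echo "$PROMPT" | review claude &
echo "$PROMPT" | review codex &
echo "$PROMPT" | review gemini &
wait   # the merge step must not start until all three finish

# The fourth call would pipe this into the consolidation model;
# here we just concatenate the three reviews.
cat "$outdir"/claude.md "$outdir"/codex.md "$outdir"/gemini.md \
  > "$outdir/merged.md"
cat "$outdir/merged.md"   # prints the three tagged reviews in order
```

The only structural points that matter carry over unchanged: each model writes to its own file (no shared context), `wait` gates the merge, and the merge input is a straight concatenation.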