AI News Hub

3 Things I Learned Benchmarking Claude, GPT-4o, and Gemini on Real Dev Work

DEV Community
Nate Voss

If you're still picking LLM providers by gut feeling, you're leaving money on the table. I ran 5 developer use cases through Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 Flash, using PromptFuel to measure token usage and cost. The results? More interesting than "fastest wins." Here's what I found.

I took 5 tasks I actually do in PromptFuel development:

1. JSON schema validation prompt: catch malformed API responses
2. Code review feedback: multi-file analysis with context
3. Refactoring suggestion: optimize a chunky utility function
4. Bug diagnosis: trace through a stack trace with logs
5. Documentation generation: write API docs from code comments

Each task was run through all three models with identical input. I used PromptFuel's CLI to count tokens and calculate costs, because doing this manually is chaos. Output quality was rated by me (subjectively, but honestly).

Task 1: JSON schema validation

Input: Schema definition + malformed JSON sample + expected error message format

Token usage (input → output):
- Claude Sonnet: 1,847 → 512 (cost: $0.0043)
- GPT-4o: 2,156 → 487 (cost: $0.0082)
- Gemini Flash: 1,923 → 501 (cost: $0.0001)

Quality: All three nailed it. Claude was most concise in its explanation. GPT-4o over-explained. Gemini was crisp and useful.

Token efficiency win: Gemini, by cost. Claude, by clarity per token.

Task 2: Code review feedback

Input: Three TypeScript modules + review instructions + examples of good feedback

Token usage (input → output):
- Claude Sonnet: 4,231 → 891 (cost: $0.0147)
- GPT-4o: 4,782 → 856 (cost: $0.0208)
- Gemini Flash: 4,456 → 823 (cost: $0.0003)

Quality: Claude caught subtle issues I actually cared about. GPT-4o was thorough but verbose. Gemini gave surface-level feedback.

Token efficiency win: Gemini cheapest. Claude best output per token.

Task 3: Refactoring suggestion

Input: 80-line utility function + performance requirements + current bottleneck description

Token usage (input → output):
- Claude Sonnet: 2,134 → 618 (cost: $0.0054)
- GPT-4o: 2,445 → 602 (cost: $0.0110)
- Gemini Flash: 2,287 → 587 (cost: $0.0002)

Quality: Claude's refactor was production-ready. GPT-4o suggested good ideas but with syntax issues. Gemini's suggestion worked but wasn't elegant.

Token efficiency win: Gemini on cost, Claude on quality.

Task 4: Bug diagnosis

Input: Stack trace (15 lines) + error logs (20 lines) + code snippet (40 lines) + fixes already attempted

Token usage (input → output):
- Claude Sonnet: 2,856 → 445 (cost: $0.0071)
- GPT-4o: 3,102 → 421 (cost: $0.0127)
- Gemini Flash: 2,934 → 438 (cost: $0.0002)

Quality: Claude nailed it immediately. GPT-4o circled around the issue. Gemini flagged the right file but not the root cause.

Token efficiency win: Gemini on cost, Claude on accuracy.

Task 5: Documentation generation

Input: 12 functions with JSDoc comments + expected markdown format + examples

Token usage (input → output):
- Claude Sonnet: 3,445 → 734 (cost: $0.0118)
- GPT-4o: 3,821 → 689 (cost: $0.0182)
- Gemini Flash: 3,567 → 712 (cost: $0.0004)

Quality: Claude's docs were complete and well-structured. GPT-4o was good but needed some minor cleanup. Gemini's docs were functional but missing details.

Token efficiency win: Gemini on cost, Claude on completeness.

1. Cost-per-task != best value. Gemini Flash is comically cheap (about 98% less than GPT-4o on these five tasks), but you get what you pay for. When I needed high-stakes work (code review, bug diagnosis), Claude was worth the extra cents because I didn't have to iterate. For throwaway tasks (generating examples, formatting), Gemini's cost made its mediocrity acceptable.

2. Token count is not predictive of quality. All three models produced similar token counts for the same input, but output quality varied wildly. GPT-4o consistently used more input tokens for the same prompt (roughly 9 to 17% more than Claude) and wasn't proportionally better. Claude packed more useful signal into output of about the same length. This matters: if you optimize for token count or cost alone, you'll pick the wrong model.

3. Real-world testing beats benchmarks. The model rankings flip depending on what you're actually doing. For documentation, Claude wins. For cheap validation in a throwaway check, Gemini wins. Generic "fastest model" articles don't capture this. You need to test your actual tasks.
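To sanity-check lessons 1 and 2 against the raw numbers, here is a minimal Python sketch. It uses only the per-task costs and input token counts reported in the results above; the dictionary and function names are made up for illustration and are not part of PromptFuel.

```python
# Figures copied from the five task results above, in task order.
# Names below are illustrative only; nothing here comes from PromptFuel.
COST_USD = {
    "Claude Sonnet": [0.0043, 0.0147, 0.0054, 0.0071, 0.0118],
    "GPT-4o": [0.0082, 0.0208, 0.0110, 0.0127, 0.0182],
    "Gemini Flash": [0.0001, 0.0003, 0.0002, 0.0002, 0.0004],
}
INPUT_TOKENS = {
    "Claude Sonnet": [1847, 4231, 2134, 2856, 3445],
    "GPT-4o": [2156, 4782, 2445, 3102, 3821],
    "Gemini Flash": [1923, 4456, 2287, 2934, 3567],
}


def total_cost(model: str) -> float:
    """Sum of the reported costs for one model across all five tasks."""
    return sum(COST_USD[model])


def savings_vs(model: str, baseline: str) -> float:
    """Fraction of total cost saved by using `model` instead of `baseline`."""
    return 1 - total_cost(model) / total_cost(baseline)


def input_overhead(model: str, baseline: str) -> list[float]:
    """Per-task extra input tokens of `model` relative to `baseline`, as a fraction."""
    return [m / b - 1 for m, b in zip(INPUT_TOKENS[model], INPUT_TOKENS[baseline])]


if __name__ == "__main__":
    for name in COST_USD:
        print(f"{name}: ${total_cost(name):.4f} for all five tasks")
    print(f"Gemini Flash vs GPT-4o: {savings_vs('Gemini Flash', 'GPT-4o'):.1%} cheaper")
    overhead = input_overhead("GPT-4o", "Claude Sonnet")
    print(f"GPT-4o input overhead vs Claude: {min(overhead):.1%} to {max(overhead):.1%}")
```

On the reported figures this works out to about 98% lower total cost for Gemini Flash than GPT-4o. None of that says anything about output quality, which is the whole point of lesson 1.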
Here's the thing: this comparison is data, not law. Your tasks might weight differently. Let me show you how I tested this using PromptFuel.

    # Install PromptFuel (if you haven't)
    npm install -g promptfuel

    # Create a test file with your prompt
    cat > test-prompt.txt << 'EOF'
    [your prompt here]
    EOF

    # Count tokens across models
    pf count test-prompt.txt --model claude-3-5-sonnet
    pf count test-prompt.txt --model gpt-4o
    pf count test-prompt.txt --model gemini-2.0-flash

    # Compare costs
    pf count test-prompt.txt --compare

That --compare flag gives you a cost matrix. Takes 30 seconds. Beats guessing.

The real insight: run this for your specific use cases. A document summarizer might favor Claude. A high-throughput classification pipeline might favor Gemini. The only way to know is to test.

After picking your model, there's still money left on the table. Here's a before/after from actual PromptFuel code:

Before (unoptimized prompt):

    You are an expert code reviewer. Review the following code for quality, security, and performance issues. Check for common bugs, suggest improvements, and rate the code from 1-10. Consider edge cases, error handling, and best practices. Be thorough and detailed in your feedback.
    [400 tokens of instructions]
    [200 tokens of examples]
    [150 tokens of code to review]
    Total: ~750 input tokens

After (optimized with PromptFuel):

    Review code for quality, security, performance. Rate 1-10.
    [Stripped redundant instructions]
    [Examples reduced to 1 exemplar instead of 3]
    [Code reformatted to remove whitespace]
    Total: ~420 input tokens

Cost saved: ~$0.0012 per review on Claude. Run that 100 times a day and you're saving $0.12/day, or about $44/year. Small? Yes. Multiplied by 50 internal tools? Now you're talking real money.

Pick the model that gives you the output you need, then optimize the prompt. Stop optimizing for the wrong metric. Benchmarks are fun, but production bills are real.
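The savings arithmetic from the before/after example can be sketched in a few lines of Python. The per-review saving, the 100 runs a day, and the 50 tools come from the example above; running every day of a 365-day year is an assumption, and the function name is hypothetical.

```python
# Inputs from the before/after example; DAYS_PER_YEAR is an assumption
# (the tool runs every calendar day). Names are illustrative only.
SAVED_PER_REVIEW_USD = 0.0012
REVIEWS_PER_DAY = 100
INTERNAL_TOOLS = 50
DAYS_PER_YEAR = 365


def yearly_savings(saved_per_call: float, calls_per_day: int,
                   days: int = DAYS_PER_YEAR, tools: int = 1) -> float:
    """Yearly savings in USD for `tools` tools, each making `calls_per_day` calls."""
    return saved_per_call * calls_per_day * days * tools


if __name__ == "__main__":
    one_tool = yearly_savings(SAVED_PER_REVIEW_USD, REVIEWS_PER_DAY)
    all_tools = yearly_savings(SAVED_PER_REVIEW_USD, REVIEWS_PER_DAY, tools=INTERNAL_TOOLS)
    print(f"One tool: ${one_tool:.2f}/year")
    print(f"{INTERNAL_TOOLS} tools: ${all_tools:,.2f}/year")
```

Under those assumptions one tool saves roughly $44 a year and 50 tools roughly $2,190. Swap in your own call volume; that number moves the result far more than the per-call saving does.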
If you're running this analysis for your own stuff, PromptFuel makes it stupidly easy. It's free, no API keys needed, runs locally. Just npm install -g promptfuel and compare.

If you want the actual numbers from your prompts, run the test. Don't inherit my data; build your own.

What's your highest-volume LLM task? Test it. You might be surprised which model wins.

Tags: #ai #tutorial #javascript #optimization