
Benchmark Results: SmolLM3 3B, Phi-4-mini, DeepSeek V4, Grok 4.20 — Agent Coding Tested

DEV Community
Vilius

The second round of the Works With Agents agent coding benchmark is in — 32 models tested this time, up from 10. And the results are not what anyone expected.

| Rank | Model | Score |
|------|-------|-------|
| 🄇 | SmolLM3 3B | 93.3 |
| 🄈 | Phi-4-mini | 90.0 |
| 🄉 | Claude Sonnet 4 | 85.0 |
| 4 | Qwen2.5 1.5B | 85.0 |
| 5 | Qwen2.5 3B | 85.0 |
| 6 | Granite 3.2 2B | 82.5 |
| 7 | Ministral 3B | 81.7 |
| 8 | Mistral Large 3 | 79.6 |
| 9 | Gemma 4 31B | 78.3 |
| 10 | Gemma 4 26B A4B | 78.3 |

A 3-billion-parameter model from Hugging Face scored 93.3 — eight points ahead of Claude Sonnet 4. Phi-4-mini (also a tiny model) took second at 90.0. Qwen2.5's 1.5B and 3B variants tied Claude at 85.0.

| Model | Score |
|-------|-------|
| Claude Sonnet 4 | 85.0 |
| Gemini 2.5 Flash | 76.4 |
| GPT-5.4 | 76.6 |
| Kimi K2.6 | 75.0 |
| Grok 4.20 | 75.0 |
| MiniMax M2.7 | 69.9 |
| DeepSeek V4 Flash | 60.0 |
| GPT-5.5 | 60.0 |
| GPT-5.4 Pro | 51.6 |
| GPT-5.5 Pro | 43.3 |
| DeepSeek V4 Pro | 38.3 |

Grok 4.20 debuted at 75.0 — tied with Kimi K2.6, ahead of its Fast sibling (74.9). DeepSeek V4 Pro scored 38.3, well below its Flash variant. GPT-5.5 Pro and GPT-5.4 Pro both underperformed their base models substantially.

The task evaluates real agent coding over 12 rounds:

- Multi-file edits (Python, shell scripts)
- Git operations (clone, branch, commit)
- Shell command execution
- Bash scripting with pipes and redirects
- Recovering from errors

Score = weighted average of correctness (70%) and efficiency (30%). Models lose points for failed tool calls, wrong commands, and unnecessary steps. (A rough sketch of this scoring is included at the end of the post.)

| Model | Score |
|-------|-------|
| DeepSeek-R1 1.5B | 27.5 |
| Qwen3.5 0.8B | 26.0 |
| Google Lyria 3 Pro | 8.3 |
| Google Lyria 3 Clip | 0.0 |

The smallest models (sub-2B reasoning models) couldn't complete basic tool sequences. Google's Lyria models in particular struggled — Lyria 3 Clip scored zero, unable to produce any working output.

Small models are getting dangerously good at agentic coding. SmolLM3 3B — a model you can run on a laptop — outperformed every frontier model by a wide margin. The benchmark suggests model size isn't the bottleneck for agent coding ability.

Full results and methodology: benchmarks.workswithagents.dev

The benchmark runs continuously — new models are added as they become available. If you're building a model that should be tested, the API is open.
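To make the weighted scoring described above concrete, here is a minimal Python sketch. Only the 70/30 correctness/efficiency split, the 12-round structure, and the kinds of deductions (failed tool calls, wrong commands, unnecessary steps) come from the benchmark description; the function names, penalty values, and efficiency formula are illustrative assumptions, not the published methodology.

```python
# Hypothetical sketch of how a Works-With-Agents-style score could be computed.
# The penalty weights and efficiency formula below are assumptions for
# illustration; only the 70/30 weighting is stated by the benchmark.

def round_score(correct: bool,
                failed_tool_calls: int,
                wrong_commands: int,
                unnecessary_steps: int,
                min_steps: int,
                actual_steps: int) -> float:
    """Return a 0-100 score for a single benchmark round."""
    # Correctness: did the round end in the required state?
    correctness = 100.0 if correct else 0.0

    # Efficiency: how close the agent stayed to the minimal step count,
    # minus illustrative per-mistake deductions.
    efficiency = 100.0 * min_steps / max(actual_steps, min_steps)
    efficiency -= 5.0 * failed_tool_calls
    efficiency -= 5.0 * wrong_commands
    efficiency -= 2.0 * unnecessary_steps
    efficiency = max(efficiency, 0.0)

    # Weighted average: 70% correctness, 30% efficiency.
    return 0.7 * correctness + 0.3 * efficiency


def benchmark_score(rounds: list[dict]) -> float:
    """Average the per-round scores over all rounds (12 in the benchmark)."""
    return sum(round_score(**r) for r in rounds) / len(rounds)


if __name__ == "__main__":
    example = [
        {"correct": True, "failed_tool_calls": 1, "wrong_commands": 0,
         "unnecessary_steps": 2, "min_steps": 8, "actual_steps": 10},
    ]
    # Prints 91.3: a correct round with one failed tool call and two extra steps.
    print(f"{benchmark_score(example):.1f}")
```

For the actual deduction rules, refer to the methodology page linked above.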