

# Hybrid LLM Routing: Ollama + Claude API Without Quality Degradation

Ravil Minigulov · DEV Community

## The bill arrives at the end of the month

## Why "just use Ollama" doesn't work

## Architecture: one interface, two tiers

## The router: asymmetry of error cost

```python
target = (
    ModelTarget.CLOUD
    if signal.score > 0.35 or signal.confidence < 0.6
    else ModelTarget.LOCAL
)
```

`confidence < 0.6`: if the router isn't confident enough in its own classification, the request goes to Claude. This is an explicit codification of the asymmetry of error cost: misrouting an easy request to the cloud wastes a few cents, while misrouting a hard request to the local model degrades the answer. (A fuller sketch of this router appears at the end of the post.)

## Three things that break in production

## Observability

## The numbers

Real case: a Telegram bot for a café, one month of observation after rolling out the router.

| Request type | Traffic share | Model |
| --- | --- | --- |
| FAQ, hours, address, prices | 61% | Ollama |
| Menu clarifications, ingredients | 18% | Ollama |
| Edge cases, complaints | 12% | Claude |
| RAG over documents, generation | 9% | Claude |

Cost before: $234/month. After: $47/month. Quality, measured by client complaints, is unchanged: the scenarios that used to go to Claude still go to Claude.

The 80% cost reduction isn't the goal of the architecture. It's a side effect of making request cost a function of complexity rather than a constant. The real gain: the system became legible. You can now see what each interaction type costs and know exactly what to do when traffic grows.
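The post quotes only the routing expression, so the surrounding types are not shown. Below is a minimal, self-contained sketch of what they might look like: `ModelTarget` and `signal` come from the original snippet, but the enum values, the `Signal` field semantics, and the `route` wrapper are assumptions for illustration, not the author's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class ModelTarget(Enum):
    LOCAL = "ollama"   # small local model served by Ollama (assumed value)
    CLOUD = "claude"   # Claude via the Anthropic API (assumed value)


@dataclass
class Signal:
    """Classifier output for one request (field semantics assumed)."""
    score: float       # estimated complexity: 0.0 trivial .. 1.0 hard
    confidence: float  # the classifier's certainty about its own score


def route(signal: Signal) -> ModelTarget:
    # Asymmetry of error cost: escalating an easy request to the cloud
    # wastes cents; keeping a hard request local risks a bad answer.
    # So high complexity OR low classifier confidence both escalate.
    return (
        ModelTarget.CLOUD
        if signal.score > 0.35 or signal.confidence < 0.6
        else ModelTarget.LOCAL
    )


# route(Signal(score=0.10, confidence=0.92))  # -> ModelTarget.LOCAL
# route(Signal(score=0.50, confidence=0.92))  # -> ModelTarget.CLOUD (hard)
# route(Signal(score=0.10, confidence=0.40))  # -> ModelTarget.CLOUD (unsure)
```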
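The body of the "Observability" section did not survive extraction, but the per-request-type cost table implies structured per-request logging. Building on the sketch above, a minimal version might look like the following; `call_model`, `handle`, and the log fields are hypothetical names, not from the post.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_router")


def call_model(target: ModelTarget, prompt: str) -> str:
    """Hypothetical dispatch: send the prompt to Ollama or the Claude API."""
    ...


def handle(request_type: str, signal: Signal, prompt: str) -> str:
    target = route(signal)
    start = time.monotonic()
    answer = call_model(target, prompt)
    # One structured log line per request is enough to reproduce the table
    # above: group by request_type and target, then attach per-token prices.
    logger.info(json.dumps({
        "request_type": request_type,   # e.g. "faq", "menu", "complaint", "rag"
        "target": target.value,
        "score": signal.score,
        "confidence": signal.confidence,
        "latency_s": round(time.monotonic() - start, 3),
    }))
    return answer
```

Logging the score and confidence alongside the chosen target also makes threshold tuning empirical: if complaints cluster at scores just under 0.35, the cutoff is too aggressive.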