# How I cut AI calls by 95% without losing quality

*Anupam Kushwaha · DEV Community*

## The Hidden Cost of Calling AI Too Early

I stopped calling AI on every request, and everything got better.

In one of my projects, I was generating AI-based insights from user activity. The initial design was simple: every request for today's insight called the AI model and returned a fresh response.

```
GET /api/insights/today
```

At first, this felt clean and correct. But in practice, it created serious problems:

- 429 rate limit errors within hours
- Daily quota exhausted before noon
- Random failures affecting users
- Costs scaling linearly with traffic

The system was working, but it wasn't sustainable.

## The Real Bug Was the Trigger Model

The problem wasn't the AI provider. It was the trigger model. The system never asked basic questions before making an expensive call:

- Has anything actually changed?
- Did I already generate a response recently?
- Is the user even active today?

Without these checks, every request was treated as "generate a new insight now." That assumption was the real bug.

## Redesigning Around Events

Instead of adding caching on top, I redesigned the system into an event-driven pipeline. AI became the last step, not the default. Here's the simplified request flow:

```mermaid
flowchart TD
    A[Request for today's insight] --> B{Activity today?}
    B -- No --> C[Reuse latest insight or fallback]
    B -- Yes --> D{Meaningful change?}
    D -- No --> C
    D -- Yes --> E{Cooldown passed?}
    E -- No --> C
    E -- Yes --> F{Daily cap reached?}
    F -- Yes --> C
    F -- No --> G{Global AI limit reached?}
    G -- Yes --> H[Use deterministic fallback]
    G -- No --> I[Call AI model]
    I --> J[Persist insight]
    H --> J
    C --> J
```

Most requests now end at a simple database read, not an AI call.

## Check 1: Any Activity Today?

Start with the cheapest check:

```java
boolean hasActivity = activityService.hasActivityToday(userId, context);
if (!hasActivity) {
    return getLatestOrFallback(userId, today);
}
```

If nothing happened, don't call AI. AI should only run when something meaningful changes.
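The individual checks from the flowchart can be collected into a single gate that runs them in order, cheapest first. This is only a sketch of the idea: the class, record, and method names below (`InsightGate`, `GateInput`, `shouldCallAi`) are invented for illustration and are not from the project's actual code.

```java
import java.time.Duration;

// Illustrative gate combining the flowchart's checks, cheapest first.
// Every early return corresponds to the "reuse latest insight" branch.
public class InsightGate {

    // Hypothetical snapshot of everything the gate needs to decide.
    public record GateInput(
            boolean activityToday,
            boolean meaningfulChange,
            Duration sinceLastGeneration,
            int userCallsToday,
            int globalCallsToday) {}

    static final Duration COOLDOWN = Duration.ofMinutes(30);
    static final int DAILY_CAP_PER_USER = 10;
    static final int MAX_AI_CALLS_PER_DAY = 50;

    public static boolean shouldCallAi(GateInput in) {
        if (!in.activityToday()) return false;        // no activity -> reuse
        if (!in.meaningfulChange()) return false;     // nothing changed -> reuse
        if (in.sinceLastGeneration().compareTo(COOLDOWN) < 0)
            return false;                             // cooldown not passed
        if (in.userCallsToday() >= DAILY_CAP_PER_USER)
            return false;                             // per-user cap reached
        if (in.globalCallsToday() >= MAX_AI_CALLS_PER_DAY)
            return false;                             // global cap -> fallback
        return true;                                  // only now call the model
    }

    // Convenience overload for quick checks (minutes instead of Duration).
    public static boolean shouldCallAi(boolean activity, boolean change,
                                       long minutesSinceLast, int userCalls,
                                       int globalCalls) {
        return shouldCallAi(new GateInput(activity, change,
                Duration.ofMinutes(minutesSinceLast), userCalls, globalCalls));
    }
}
```

The ordering matters: the boolean checks cost nothing, so the expensive model call is only ever reached after every cheap filter has passed.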
## Check 2: Did Something Meaningful Change?

Examples of meaningful changes:

- user updates intent
- significant behavior change
- threshold crossed

No change means reuse the previous insight.

## Check 3: Cooldown

Avoid frequent re-generation:

```java
Duration cooldown = Duration.ofMinutes(30);
// elapsed: time since the last generated insight for this user
if (elapsed.compareTo(cooldown) < 0) {
    return getLatestOrFallback(userId, today);
}
```

## Check 4: Per-User Daily Cap

Even active users shouldn't trigger unlimited AI calls:

```java
if (userDailyCalls >= 10) {
    return getLatestOrFallback(userId, today);
}
```

## Check 5: Global Limit

```java
if (dailyAiCalls.get() >= 50) {
    useFallback = true;
}
```

This acts as a system-wide circuit breaker.

## Configuration

All thresholds are configurable:

```yaml
insight:
  activity-delta: 30
  cooldown-minutes: 30
  daily-cap-per-user: 10
  max-ai-calls-per-day: 50
  freshness-window-hours: 8
```

This allows tuning without redeploying code.

## Results

After this redesign:

- AI calls dropped from ~100/day to ~5–10/day
- Rate limit errors disappeared
- Most requests became fast database reads
- Free-tier usage became sustainable
- System behavior became more predictable

## Takeaway

AI should be the exception, not the rule. A well-designed backend should first decide: "Is this request even worth sending to the model?" That decision layer of gating, triggers, and cooldowns is where the real engineering happens.

If most requests can be handled using deterministic logic or cached state, do that first. Use AI only when it actually adds value.

That single shift can make your system:

- cheaper
- faster
- more reliable
- and much easier to scale

## Blog link

https://anupamkushwaha.me/blog/stopped-calling-ai-on-every-request
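To close, here's a rough sketch of the system-wide circuit breaker described above: a shared counter that resets each day and tells callers when to fall back to deterministic output. The class and method names (`AiCallBudget`, `tryAcquire`) are my own invention, not the project's code.

```java
import java.time.LocalDate;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative global AI-call budget: at most maxCallsPerDay calls,
// reset at the start of each new day.
public class AiCallBudget {

    private final int maxCallsPerDay;
    private final AtomicInteger callsToday = new AtomicInteger();
    private LocalDate window = LocalDate.now();

    public AiCallBudget(int maxCallsPerDay) {
        this.maxCallsPerDay = maxCallsPerDay;
    }

    // Try to reserve one AI call. A false return means
    // "use the deterministic fallback instead".
    public synchronized boolean tryAcquire() {
        LocalDate today = LocalDate.now();
        if (!today.equals(window)) {   // new day: reset the budget
            window = today;
            callsToday.set(0);
        }
        return callsToday.incrementAndGet() <= maxCallsPerDay;
    }
}
```

Because the counter reserves a slot before the model call, concurrent requests can never push the system past its quota, which is exactly what the 429 errors in the original design came from.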