
I Replaced $800/mo in API Costs with a Local Llama 4 Setup for E-Commerce

DEV Community
doltter

My team runs an e-commerce operation that pushes around 80,000 product descriptions through LLMs every month. We were spending $800+ on GPT-4o API calls. Last month we moved the bulk generation pipeline to Llama 4 Maverick running locally via Ollama. Monthly cost dropped to about $40 in electricity. Here's the full setup, what worked, what didn't, and where we still use cloud APIs.

## Why we moved off the API-only approach

Three reasons pushed us off the API-only approach:

1. **Cost at scale.** At 80K descriptions of ~500 tokens each, GPT-4o was billing us $600-800/month. Fine for a startup burning cash, not great when you're trying to run profitably.
2. **Data privacy.** We process competitor pricing data and customer purchase history for segmentation. Sending that to a third-party API means it leaves your infrastructure. With GDPR customers in our mix, local processing removes an entire category of compliance headaches.
3. **Rate limits and latency.** During product launch weeks we'd hit rate limits and queue up requests. A local model doesn't throttle you; it runs as fast as your GPU allows.

## Hardware

We tested on three setups:

| Machine | VRAM | Speed | Notes |
|---|---|---|---|
| Mac M3 Max 64GB | Unified | ~18 tok/s | Fine for dev/testing, too slow for batch |
| RTX 4090 | 24GB | ~35 tok/s | Our production choice. Handles 800-1,200 descriptions/hr |
| 2x RTX 4090 | 48GB | ~55 tok/s | Overkill for our volume, but nice for parallel jobs |

If your VRAM is too small, Ollama silently falls back to CPU. You'll see 3-5 tokens/second and wonder what went wrong. Check `ollama ps` to verify the model loaded onto the GPU.
## Setup

Install Ollama:

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh
```

Pull the Hermes fine-tune of Maverick (this is the version you want; base Maverick has flaky JSON output):

```bash
ollama pull hermes3:maverick   # ~25GB download, grab a coffee
```

Start the server and test:

```bash
ollama serve &
curl http://localhost:11434/api/generate -d '{
  "model": "hermes3:maverick",
  "prompt": "Generate a product title for: wireless bluetooth earbuds, IPX7 waterproof, 30hr battery, noise cancelling.",
  "stream": false
}'
```

## Why the Hermes fine-tune

We wasted two days on base Maverick before switching. The difference:

- **JSON output:** Base Maverick returns valid JSON about 88% of the time. Hermes hits 97%+. When you're generating 80K items, that 9% gap means thousands of failed parses and retries.
- **Function calling:** We use tool calls to pull inventory data mid-generation. Base model: 78% accuracy. Hermes: 93%.
- **System prompt adherence:** Tell base Maverick "always respond in German" and it drifts back to English after ~20 turns. Hermes stays consistent.

## The production pipeline

Our production script is a Python worker that reads from a job queue and writes to a database. Here's the core of it:

```python
import httpx
import json

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def generate_description(product: dict, lang: str = "en") -> dict:
    prompt = f"""Write a product description for an e-commerce listing.

Product: {json.dumps(product)}
Language: {lang}

Output JSON: {{"title": "...", "description": "...", "bullet_points": [...]}}
Only output the JSON object, nothing else."""

    resp = httpx.post(OLLAMA_URL, json={
        "model": "hermes3:maverick",
        "messages": [
            {"role": "system", "content": "You are a product copywriter. Output valid JSON only."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
    }, timeout=60)
    text = resp.json()["choices"][0]["message"]["content"]
    # Strip the markdown code fences the model sometimes wraps around JSON
    text = text.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(text)

product = {
    "name": "Wireless Earbuds Pro",
    "material": "ABS plastic, silicone tips",
    "features": ["IPX7 waterproof", "30hr battery", "ANC"],
    "price_range": "$25-35"
}
result = generate_description(product, lang="de")
print(json.dumps(result, indent=2, ensure_ascii=False))
```

Ollama's API is OpenAI-compatible, so if you have existing code that calls `openai.ChatCompletion`, change the base URL and model name. That's it.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="hermes3:maverick",
    messages=[{"role": "user", "content": "your prompt here"}]
)
```

## What stays on cloud APIs

We didn't move everything off cloud APIs. Three tasks still run through Claude or GPT-4o:

- **Brand voice copy.** When the output needs to sound like a specific brand, cloud models are noticeably better. Maverick writes competent descriptions, but they read a bit flat compared to Claude's output.
- **Anything under 10K requests/month.** The break-even point is somewhere around 50K monthly requests. Below that, GPT-4o-mini at $150/mo beats the hassle of maintaining local hardware.
- **One-off creative tasks.** Ad headlines, email subject lines, anything where you want to iterate on quality. Cloud models with their bigger parameter counts just produce more varied and interesting options.

## Cost breakdown

| Task | Before (API only) | After (hybrid) |
|---|---|---|
| Bulk descriptions (80K) | $620 (GPT-4o) | $40 electricity |
| Creative copy (5K) | $180 (Claude Sonnet) | $180 (Claude Sonnet) |
| Ad headlines (2K) | $30 (GPT-4o-mini) | $30 (GPT-4o-mini) |
| **Total** | **$830/mo** | **$250/mo** |

Hardware was a one-time $1,800 for the RTX 4090 rig. It paid for itself in three months.
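Since even the Hermes fine-tune misses valid JSON a few percent of the time, a thin retry wrapper pays for itself at this volume. Here's a minimal sketch; the attempt count, backoff, and `generate_with_retry` name are our assumptions, not part of the pipeline code above:

```python
import json
import time

def generate_with_retry(generate, product, lang="en", attempts=3):
    """Retry wrapper for JSON-failing generations. 'generate' is any callable
    with the shape generate_description(product, lang) that may raise
    json.JSONDecodeError (bad JSON) or KeyError (malformed API response)."""
    last_err = None
    for attempt in range(attempts):
        try:
            return generate(product, lang)
        except (json.JSONDecodeError, KeyError) as err:
            last_err = err
            time.sleep(2 ** attempt)  # simple exponential backoff: 1s, 2s, 4s...
    raise RuntimeError(f"no valid JSON after {attempts} attempts") from last_err
```

At a 97% valid-JSON rate, three attempts drop the per-item failure probability from roughly 3% to well under 0.01%, so almost nothing falls through to the dead-letter queue.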
## Resources

If you're building something similar, I put together a few things that might help:

- **awesome-ai-ecommerce-tools**: curated list of 42+ AI tools for e-commerce, including local deployment options
- **Detailed Llama 4 vs Claude vs GPT cost breakdown**: hardware configs, benchmark numbers, and use case recommendations
- **Ollama docs**: the official setup and API reference
- **Hermes on HuggingFace**: model weights and fine-tune details

The local LLM space moves fast. Six months ago, running a model this capable on a single consumer GPU wasn't realistic. Now it costs less than a Netflix subscription to process volumes that used to run up four-figure API bills. If you're hitting scale where API costs hurt, it's worth an afternoon to test.