Why Your Recommendation Engine Passes Every Test and Fails in Production
Offline metrics look clean. The problem isn't the model. It's what the model is ranking against.

This pattern shows up across recommendation engine audits: the team ships a new model version, the offline retrieval score improves, the A/B test shows neutral-to-positive CTR, and three months later conversion is still flat.

Asked to describe their recommendation engine, 18 of the 20 teams in those audits answered with a model name. Two answered with a pipeline diagram. The two that answered with a pipeline diagram were the ones that had fixed the problem.

Offline retrieval metrics measure how well your model ranks items against a historical sample of behavior. They cannot measure two things:

- whether the behavioral signals feeding the model are fresh
- whether those signals belong to the right user

Both failures are silent. The model scores look correct. The production output is fiction.

```
signal_age     = current_timestamp - last_event_timestamp
relevance_loss ≈ f(signal_age, behavior_drift_rate)
```

Customer behavior shifts daily, and sometimes hourly around promotions, seasonal events, or price changes. If your embeddings refresh weekly, you are ranking users against a week-old snapshot of their preferences. The model is ranking a ghost.

From a real rebuild:

| Metric | Value |
| --- | --- |
| Embedding refresh cadence | Weekly |
| Customer behavior shift | Daily (sometimes hourly) |
| Real retrieval accuracy | 38% (not the 0.91 offline score) |
| After pipeline rebuild (same model) | 87% |
| CAC reduction, next quarter | −34.7% |

No model changes. Same architecture. Zero new training data. The pipeline feeding the model was wrong.

The staleness problem is the first layer. The identity problem is the second. How many device keys does your pipeline assign to one customer on a cross-device journey?

```
customer_A (mobile)  → profile_1 → ranking_1
customer_A (desktop) → profile_2 → ranking_2
customer_A (app)     → profile_3 → ranking_3
```

Three different recommendation strategies for one converting customer. The engine ranks each in isolation, because the identity layer never merged them. The fix is upstream of the model: session stitching, device graph resolution, cross-channel event merging. None of it requires retraining.

Run these checks before scheduling the next training run (minimal sketches of 01 through 03 follow at the end of this post):

01 — Signal freshness. If signal_age > behavior_drift_window, your ranking is stale by definition.

02 — Identity coverage. How many device keys does your pipeline map to each converting customer?

03 — Offline-online metric gap. Compare the offline retrieval score against retrieval accuracy measured on production traffic.

04 — Conversion window alignment.

A recommendation engine ranks signals. If the signals are stale or fragmented across unresolved identities, the ranking is precise noise. Fix the pipeline. The model is the last thing to change.

How fresh are the behavioral signals entering your recommendation model at serving time? That single number explains most of the offline-online gap I've seen; curious what it looks like in your stack.

vf-insights.com
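The sketches below show what checks 01 through 03 can look like in practice. All names, timestamps, and data are illustrative, not taken from the audits above.

For check 01, a minimal freshness audit, assuming your event store can return the timestamp of each customer's latest behavioral event. The one-day drift window follows the post's claim that behavior shifts daily.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical input: timestamp of each customer's latest behavioral event.
# In a real stack this comes from your event store or feature platform.
last_event_at = {
    "customer_A": datetime(2024, 5, 20, 9, 0, tzinfo=timezone.utc),
    "customer_B": datetime(2024, 5, 27, 14, 30, tzinfo=timezone.utc),
}

# How fast behavior drifts in your domain; one day here, tighter around promotions.
BEHAVIOR_DRIFT_WINDOW = timedelta(days=1)


def stale_profiles(last_event_at, now=None, window=BEHAVIOR_DRIFT_WINDOW):
    """Return customers whose signal_age exceeds the drift window."""
    now = now or datetime.now(timezone.utc)
    stale = {}
    for customer, last_event_timestamp in last_event_at.items():
        # signal_age = current_timestamp - last_event_timestamp
        signal_age = now - last_event_timestamp
        if signal_age > window:
            stale[customer] = signal_age
    return stale


for customer, age in stale_profiles(last_event_at).items():
    print(f"{customer}: signals {age} old -> ranking against a stale snapshot")
```

Track the share of serving requests that fall into the stale bucket over time; that single percentage is usually more actionable than another offline eval run.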
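For check 02, a toy sketch of device graph resolution: a small union-find over device keys that are known to belong to the same person, so fragmented event streams collapse onto one profile before ranking. The device keys and linkage pairs are made up; a production device graph would be built from logins, email hashes, and stitched sessions.

```python
from collections import defaultdict

# Hypothetical linkage evidence: pairs of device keys observed to belong to the
# same person (shared login, shared email hash, stitched session).
linked_device_keys = [
    ("mobile_123", "desktop_456"),   # same login on both devices
    ("desktop_456", "app_789"),      # same email hash
]


class DeviceGraph:
    """Tiny union-find: connected device keys resolve to one canonical customer."""

    def __init__(self):
        self.parent = {}

    def find(self, key):
        self.parent.setdefault(key, key)
        while self.parent[key] != key:
            self.parent[key] = self.parent[self.parent[key]]  # path compression
            key = self.parent[key]
        return key

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


graph = DeviceGraph()
for a, b in linked_device_keys:
    graph.union(a, b)

# Events keyed by device; after resolution they collapse onto one canonical id,
# so the ranker sees a single behavioral history instead of three fragments.
events_by_device = {
    "mobile_123": ["viewed_sneakers"],
    "desktop_456": ["added_to_cart"],
    "app_789": ["searched_running_shoes"],
}

merged = defaultdict(list)
for device, events in events_by_device.items():
    merged[graph.find(device)].extend(events)

print(dict(merged))  # one profile, one event stream, one ranking input
```

Identity coverage is then the fraction of converting customers whose device keys resolve to a single profile; the three-profiles-for-one-customer pattern above shows up as low coverage here.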
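For check 03, one rough way to put a number on the offline-online gap, assuming you log which items were served and which items the customer actually interacted with afterwards. The field names are illustrative, and hit rate is only a stand-in; use the same retrieval accuracy definition as your offline harness so the two numbers are comparable.

```python
# Hypothetical serving log: items served to a customer and items they then
# interacted with inside your conversion window.
served_log = [
    {"customer": "customer_A",
     "served": ["shoe_1", "shoe_2", "bag_3"],
     "interacted": ["shoe_2"]},
    {"customer": "customer_B",
     "served": ["hat_4", "coat_5", "coat_6"],
     "interacted": []},
]


def online_hit_rate(log):
    """Share of serving events where at least one served item was interacted with."""
    if not log:
        return 0.0
    hits = sum(1 for row in log if set(row["served"]) & set(row["interacted"]))
    return hits / len(log)


offline_score = 0.91  # the score your offline eval harness reports
online_score = online_hit_rate(served_log)
print(f"offline {offline_score:.2f} vs online {online_score:.2f} "
      f"-> gap {offline_score - online_score:.2f}")
```

A large, persistent gap between the two numbers is the signature of the staleness and identity problems described above, not of a weak model.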
