LLM-as-Judge: Automated Quality Gate for LLM Outputs in Production

DEV Community
Roman Belov

LLM-as-Judge is a pattern where one language model evaluates another model's outputs against defined criteria. It acts as an automated quality gate: every response gets checked before reaching the user, or afterwards for monitoring.

Standard production monitoring metrics (200 OK, latency 340 ms, rate limits within bounds) say nothing about quality: the model can hallucinate in 15% of responses while HTTP status codes stay green. Manual review doesn't scale. One person can handle 100 requests a day; at 10,000, nobody can. And quality degradation usually hits at scale: after a prompt update, a model swap, or a silent change on the provider side.

This article covers how LLM-as-Judge works, which metrics to evaluate, and how to plug it into a production pipeline.

## How It Works

The judge model receives a prompt with instructions plus the text being evaluated, then returns a score: a number, a category, or structured JSON. The judge doesn't generate content. It classifies and scores, and models handle classification more consistently than generation.

```
User: "Recommend cafes in downtown Moscow"
         |
         v
+--------------------+
|   LLM Generator    | -> "Here are 5 cafes: Coffemania near Patriarshiye..."
|   (GPT-4o-mini)    |
+--------------------+
         |
         v
+--------------------+
|     LLM Judge      | -> { relevance: 0.9, factuality: 0.7,
|  (Claude Sonnet)   |      toxicity: 0.0, completeness: 0.8 }
+--------------------+
         |
         v
     Score -> Alert / Block / Log
```

Research by Zheng et al. (2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena") showed that GPT-4 as a judge agreed with human ratings in 80%+ of cases. Two human annotators agreed with each other about 81% of the time. The gap between an LLM judge and a human is roughly the same as the gap between two humans.

## Which Metrics to Evaluate

Metric choice depends on the task. The main categories are below.

**RAG metrics**

| Metric | What it checks | When you need it |
|---|---|---|
| Faithfulness | Response is grounded in context, no fabricated facts | Always for RAG |
| Answer Relevance | Response matches the question | Always |
| Context Relevance | Retriever returned relevant documents | Debugging retrieval |

**General quality metrics**

| Metric | What it checks | When you need it |
|---|---|---|
| Correctness | Factual accuracy | When a reference answer exists |
| Completeness | Response covers all aspects of the query | Complex queries |
| Toxicity | No insults, harmful content | User-facing products |
| Hallucination | Model doesn't fabricate facts | Always |

**Agent metrics**

| Metric | What it checks | When you need it |
|---|---|---|
| Tool Use Correctness | Right tool with right arguments | Agent pipelines |
| Task Completion | End result solves the task | Always for agents |

In practice, start with two or three metrics. For RAG: faithfulness + answer relevance. For a chatbot: relevance + toxicity. For an agent: task completion. Add more as you find specific problems.
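The last step in the diagram, routing a score to an alert, a block, or a log entry, is just a threshold check. Below is a minimal sketch of what that routing could look like; the metric names, the thresholds, and the `decide()` helper are illustrative, not part of any library or of the code later in this article.

```python
# Illustrative thresholds; tune them against your own golden dataset.
JUDGE_THRESHOLDS = {
    "relevance": 0.7,
    "factuality": 0.7,
    "toxicity": 0.1,  # for toxicity, lower is better
}


def decide(scores: dict[str, float]) -> str:
    """Route judge scores to an action: 'block', 'alert', or 'log'."""
    if scores.get("toxicity", 0.0) > JUDGE_THRESHOLDS["toxicity"]:
        return "block"  # toxic output never reaches the user
    failing = [
        name for name, threshold in JUDGE_THRESHOLDS.items()
        if name != "toxicity" and scores.get(name, 1.0) < threshold
    ]
    if failing:
        return "alert"  # quality below target: notify, keep serving
    return "log"  # within bounds: just record the scores


print(decide({"relevance": 0.9, "factuality": 0.7, "toxicity": 0.0}))  # -> log
```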
## The Judge Prompt

Evaluation quality comes down to the prompt. A working template for faithfulness:

```python
FAITHFULNESS_JUDGE_PROMPT = """You are an impartial judge evaluating the faithfulness of an AI assistant's response.

Faithfulness means: every claim in the response is supported by the provided context.
Claims not found in context = unfaithful.

## Input

**User Question:** {question}

**Retrieved Context:** {context}

**AI Response:** {response}

## Task

1. Extract each factual claim from the AI Response
2. For each claim, check if it is supported by the Retrieved Context
3. A claim is SUPPORTED if the context contains evidence for it
4. A claim is UNSUPPORTED if the context does not mention it or contradicts it

## Output (JSON only)

{{
  "claims": [
    {{"claim": "...", "supported": true/false, "evidence": "..."}}
  ],
  "score": <float from 0 to 1>,
  "reasoning": "<one sentence>"
}}"""
```

What makes this work:

- **Specific criteria.** "Rate the response quality" doesn't work. "Check that every fact is backed by context" works. The more specific the instruction, the more stable the scores.
- **Chain-of-thought.** The model first extracts claims, checks each one, then assigns a score. Without intermediate steps, scores are unstable.
- **Structured output.** JSON with a fixed schema, a score from 0 to 1, reasoning in one sentence. This makes parsing and aggregation straightforward.

## Minimal Implementation: litellm

A minimal implementation, no frameworks:

```python
import json

from litellm import completion


def evaluate_faithfulness(question: str, context: str, response: str) -> dict:
    judge_response = completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=[{
            "role": "user",
            "content": FAITHFULNESS_JUDGE_PROMPT.format(
                question=question,
                context=context,
                response=response,
            )
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    result = json.loads(judge_response.choices[0].message.content)
    return result


eval_result = evaluate_faithfulness(
    question="Which cafes are in downtown Moscow?",
    context="Coffemania: Patriarshiye Prudy. Syostry: Pokrovka 6.",
    response="I recommend Coffemania on Patriarshiye and Pushkin on Tverskoy Boulevard.",
)
# score: 0.5 (Coffemania is confirmed by the context, Pushkin is not)
```

Pros: full control, minimal dependencies. Cons: you write every metric yourself, no batch processing. If you work with multiple LLM providers, litellm lets you switch between them through a single interface; more on this in the article about multi-provider LLM architecture.

## DeepEval

DeepEval is an open-source framework with built-in metrics. It works like pytest for LLM outputs.

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric,
)

faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

test_case = LLMTestCase(
    input="Which cafes are in downtown Moscow?",
    actual_output="I recommend Coffemania on Patriarshiye...",
    retrieval_context=["Coffemania: Patriarshiye Prudy. Syostry: Pokrovka 6."],
)

results = evaluate([test_case], [faithfulness, relevancy, hallucination])
```

There are 14+ built-in metrics plus pytest integration, so LLM quality tests run alongside unit tests:

```python
# test_llm_quality.py
from deepeval import assert_test


def test_travel_recommendations():
    test_case = LLMTestCase(
        input="Cafes in Moscow",
        actual_output=run_my_pipeline("Cafes in Moscow"),
        retrieval_context=get_retrieved_docs("Cafes in Moscow"),
    )
    assert_test(test_case, [faithfulness, relevancy])
```

## Langfuse

If you already use Langfuse for tracing, evaluations plug in on top. The judge model runs against each trace and attaches a score to it. Scores can be attached to an entire trace or to individual observations. If you haven't set up an observability stack yet, start with the practical guide to LLM observability with Langfuse.

```python
langfuse.score(
    trace_id="trace-abc-123",
    name="faithfulness",
    value=0.85,
    comment="1 of 7 claims not supported by context",
)
```

For production monitoring, Langfuse fits better than DeepEval: scores are tied to real traces, visible in the dashboard, with day-over-day quality degradation charts.
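In production the judge usually runs asynchronously over a sample of traces rather than on every request (the pipeline diagram below shows this as "Judge evaluation (sample)"). Here is a minimal sketch of that scoring step; it reuses `evaluate_faithfulness()` and the `langfuse.score()` call shown above, while `SAMPLE_RATE` and the `score_trace()` wrapper are illustrative names, not part of the Langfuse API.

```python
import random

# Assumes: a langfuse client initialised earlier and the
# evaluate_faithfulness() helper from the litellm section above.
SAMPLE_RATE = 0.1  # judge roughly 10% of traffic to keep judge costs bounded


def score_trace(trace_id: str, question: str, context: str, response: str) -> None:
    """Evaluate one production trace and attach the verdict to it in Langfuse."""
    if random.random() > SAMPLE_RATE:
        return  # unsampled traces are only logged, not judged

    eval_result = evaluate_faithfulness(question, context, response)

    # Same call as above: the score shows up on the trace in the dashboard.
    langfuse.score(
        trace_id=trace_id,
        name="faithfulness",
        value=eval_result["score"],
        comment=eval_result["reasoning"],
    )
```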
## Quality Gate in CI

Prompt changed? Run a dataset through the judge model before deploying. Score below threshold: deploy blocked.

```yaml
# .github/workflows/llm-quality.yml
name: LLM Quality Gate

on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: deepeval test run test_llm_quality.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

## Quality Gate Before the Response

For high-stakes tasks, evaluate before sending the response:

```python
async def generate_with_quality_gate(question: str) -> str:
    response = await generate_response(question)

    eval_result = await evaluate_faithfulness(
        question=question,
        context=retrieved_context,
        response=response,
    )

    if eval_result["score"] < 0.7:
        # Below threshold: don't ship the unverified answer.
        # FALLBACK_RESPONSE is a placeholder; regenerating or escalating
        # to a human are common alternatives.
        return FALLBACK_RESPONSE

    return response
```

## The Full Pipeline

How the two gates fit together:

```
+----------------------------------------------+
| CI (on prompt change)                        |
|                                              |
|   Golden dataset -> Judge -> scores          |
|                                |             |
|                                v             |
|   Score < threshold  ->  Block merge         |
+----------------------------------------------+

+----------------------------------------------+
| Production                                   |
|                                              |
|   User request -> LLM -> Response -> User    |
|                             |                |
|                             v (async)        |
|                       Langfuse trace         |
|                             |                |
|                             v (cron, hourly) |
|                Judge evaluation (sample)     |
|                             |                |
|                             v                |
|                Score dashboard + alerts      |
+----------------------------------------------+
```

## Where to Start

1. Pick one metric. For RAG: faithfulness. For a chatbot: answer relevance.
2. Collect 20-30 examples by hand: questions, answers, ratings (good/bad). This is your golden dataset for calibration.
3. Write a judge prompt and run it against the golden dataset. Agreement with human ratings below 70%? Revise the prompt (a minimal agreement check is sketched at the end of this article).
4. Add DeepEval to CI for tests on prompt changes.
5. Set up Langfuse evaluations for production monitoring.

From zero to a working quality gate: two to three days. The golden dataset plus judge prompt: a couple of hours.
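A quick way to measure the agreement in step 3 is to run the judge over the golden dataset and compare its verdict with each hand-assigned label. The sketch below assumes the `evaluate_faithfulness()` helper from the litellm section; the dataset format, the 0.7 cut-off for "good", and the `judge_agreement()` helper are illustrative.

```python
# Assumes evaluate_faithfulness() from the litellm section above.
# Dataset format and the 0.7 "good" cut-off are illustrative.
golden_dataset = [
    {
        "question": "Which cafes are in downtown Moscow?",
        "context": "Coffemania: Patriarshiye Prudy. Syostry: Pokrovka 6.",
        "response": "I recommend Coffemania on Patriarshiye.",
        "human_label": "good",
    },
    # ... 20-30 hand-labelled examples in total
]


def judge_agreement(dataset: list[dict], threshold: float = 0.7) -> float:
    """Share of examples where the judge's verdict matches the human label."""
    matches = 0
    for example in dataset:
        result = evaluate_faithfulness(
            example["question"], example["context"], example["response"]
        )
        judge_label = "good" if result["score"] >= threshold else "bad"
        matches += judge_label == example["human_label"]
    return matches / len(dataset)


# Agreement below 70%? Revise the judge prompt and rerun.
print(f"judge/human agreement: {judge_agreement(golden_dataset):.0%}")
```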