Why every data quality tool tells you what broke — but leaves you alone to figure out why

Shiva

Most data quality tools describe what the error is. None of them describes why. Last year I found that out the hard way: I opened a notebook, ran some queries, dug through pipeline logs, and eventually traced a failure back to a test account that had been deleted without cleaning up its associated orders. The fix took ten minutes. Finding the cause took three hours.

What bothered me wasn't that the tool missed the problem; it caught it. What bothered me was that the tool handed me a one-line error and expected me to do all the detective work myself.

I've used Great Expectations, Soda Core, and dbt tests across different teams. They're all good tools, but they all answer the same question: did this check pass or fail? That's genuinely useful. You know something is wrong. But knowing something is wrong is only the first step, and in my experience it's the easy step. The hard part is what comes after the alert fires:

1. You open a notebook and query the failing table.
2. You look at the bad rows and figure out what they have in common.
3. You check pipeline logs to find when the bad data arrived.
4. You trace it upstream to find which source or transform wrote it.
5. You decide what to do: fix the data, alert a source team, add a guard.

None of the tooling helps with any of that. You get FAILED and then you're on your own. For the engineer who built the pipeline, this is annoying but manageable. For whoever is on call at 2am rebuilding context from scratch, it's genuinely painful.

Here's the same foreign key failure, described the way a senior engineer would explain it to you:

```
Rule: orders_customer_fk (critical)
Table: orders
Failed: 3 rows

Explanation: 3 orders reference customer_id=99, which does not exist in the
customers table. Downstream revenue attribution for these orders will be
silently dropped in any JOIN-based report.

Likely cause: customer_id=99 appears to be a test account that was deleted
from the customers table without cleaning up associated orders.

Recommended action:
  1. SELECT * FROM orders WHERE customer_id = 99
  2. Check customers table: was id=99 recently deleted?
  3. If test data: DELETE FROM orders WHERE customer_id = 99
  4. Add a cleanup job or FK constraint to prevent recurrence

Proposed SQL:
  DELETE FROM orders WHERE customer_id NOT IN (SELECT id FROM customers);
```

Same failure. Completely different experience. The first output stops you. The second gives you a path forward.

The idea is straightforward: keep the deterministic rule engine (fast, free, reliable) and add an LLM layer that only runs when rules fail. The LLM doesn't replace the validator. It interprets the failure. It receives the rule definition, the failing rows, the table schema, and any common causes you've documented, and produces the explanation you'd normally piece together manually.

The pipeline runs like this:

validate → classify → diagnose → root cause → SQL fix → report

Tables run in parallel. For each failure, a classifier decides whether it's worth an LLM call based on severity and failure rate. Only failures that cross the threshold get diagnosed. If everything passes, nothing is charged. Every LLM call is logged to a local SQLite audit trail: the exact prompt, response, cost, and latency. Nothing is a black box.

I built this as Aegis DQ, an open-source Python framework.
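To make the gating concrete, here's a minimal sketch of how a severity-and-failure-rate classifier plus a SQLite audit trail could fit together. The function names, thresholds, and table schema here are my own illustration, not Aegis's actual internals:

```python
import sqlite3
import time

# Hypothetical gating thresholds: critical failures always get diagnosed;
# lower severities must affect enough rows to justify the LLM cost.
SEVERITY_GATE = {"critical": 0.0, "high": 0.01, "low": 0.05}

def should_diagnose(severity: str, failed_rows: int, total_rows: int) -> bool:
    """Decide whether a rule failure is worth spending an LLM call on."""
    threshold = SEVERITY_GATE.get(severity, 1.0)  # unknown severity: never call
    failure_rate = failed_rows / max(total_rows, 1)
    return failed_rows > 0 and failure_rate >= threshold

def log_llm_call(db_path, rule_name, prompt, response, cost_usd, latency_s):
    """Append one diagnosis call to a local SQLite audit trail."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS llm_audit (
                   ts REAL, rule TEXT, prompt TEXT, response TEXT,
                   cost_usd REAL, latency_s REAL)"""
        )
        conn.execute(
            "INSERT INTO llm_audit VALUES (?, ?, ?, ?, ?, ?)",
            (time.time(), rule_name, prompt, response, cost_usd, latency_s),
        )
        conn.commit()
    finally:
        conn.close()
```

With gating like this, a fully passing run makes zero LLM calls, which is what keeps the "if everything passes, nothing is charged" property.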
Here's what a full run looks like against a demo database with intentional dirty data:

```
╭──────── Validation Summary ────────╮
│ Rules checked │ 12                 │
│ Passed        │ 1                  │
│ Failed        │ 11                 │
│ Pass rate     │ 8%                 │
│ Cost          │ $0.005576          │
╰────────────────────────────────────╯

LLM Diagnoses
orders_customer_fk  → Order placed with customer_id=99 that does not exist.
                      Likely cause: customer deleted or test record not
                      cleaned up. Action: Verify customer_id=99; check
                      recent deletions.
products_sku_unique → Duplicate SKU-001: two products share the same
                      identifier. Likely cause: duplicate import from
                      supplier feed.

Remediation SQL
orders_status_valid      UPDATE orders SET status = 'SHIPPED'
                         WHERE status = 'DISPATCHED';
products_price_positive  UPDATE products SET price = ABS(price)
                         WHERE price < 0;
```

Rules themselves are plain YAML, and the LLM can draft them for you. Generated rule logic looks like this:

```yaml
- logic:
    type: between
    min_value: 0
    max_value: 0.5
- logic:
    type: regex_match
    pattern: "^[^@]+@[^@]+\\.[^@]+$"
```

Generated rules come out stamped status: draft. You review them, promote the ones you want to enforce to active, and commit them to version control. The LLM handles the boilerplate. You handle the judgement.

If you have an existing Great Expectations suite or dbt tests, there's no reason to replace them. Aegis has a dbt manifest parser that converts your existing tests to Aegis rules if you want to layer diagnosis on top (there's a rough sketch of that conversion at the end of this post). That's additive, not a migration.

If you need a business-user UI or an enterprise support contract, Aegis isn't the right fit today. It's a CLI and Python framework, not a SaaS platform.

The gap between "what failed" and "why it failed" is where a lot of data engineering time quietly disappears. Tooling has gotten very good at detection. Diagnosis is still mostly manual. That feels like the wrong place to spend engineering hours.

GitHub: https://github.com/aegis-dq/aegis-dq
Install: pip install aegis-dq
Docs: https://aegis-dq.dev

Happy to answer questions in the comments. I'm especially curious what people are using for DQ today and what pain points I might have missed.
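P.S. For anyone curious what the dbt conversion mentioned above roughly involves, here's a minimal sketch of pulling generic tests out of a dbt manifest.json and turning them into rule dicts. The mapping table, function name, and output shape are my own assumptions for illustration, not Aegis's real parser API:

```python
import json

# Map dbt's built-in generic tests to equivalent rule types.
# These rule-type names are illustrative, not Aegis's actual schema.
DBT_TEST_MAP = {
    "unique": "unique",
    "not_null": "not_null",
    "accepted_values": "in_set",
    "relationships": "foreign_key",
}

def rules_from_manifest(path):
    """Yield one draft rule dict per generic test in a dbt manifest.json."""
    with open(path) as f:
        manifest = json.load(f)
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "test":
            continue
        meta = node.get("test_metadata") or {}
        rule_type = DBT_TEST_MAP.get(meta.get("name"))
        if rule_type is None:
            continue  # custom/singular tests carry arbitrary SQL; translate by hand
        kwargs = meta.get("kwargs", {})
        extra = {k: v for k, v in kwargs.items()
                 if k not in ("model", "column_name")}
        yield {
            "name": node.get("name"),
            # kwargs["model"] is a Jinja ref string; real code would resolve it
            "table": kwargs.get("model"),
            "column": kwargs.get("column_name"),
            "logic": {"type": rule_type, **extra},
            "status": "draft",  # same review-then-promote flow as generated rules
        }
```

The built-in generic tests map cleanly; custom tests are where the manual judgement comes back in.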