We Ran 18,000 Test Calls Across 297 Agent Data Capabilities. 49% Failed.
If you're building AI agents that call external data sources, you probably assume the API will return what it promises. We tested that assumption.

Over 6 days (April 9–14, 2026), we ran 18,449 automated test calls across all 297 capabilities on Strale — our platform for verified agent data. Company registries, compliance checks, financial validation, web scraping, document extraction, developer tools. 49% of those calls hit some kind of failure. Not timeouts. Not rate limits. Real failures — wrong data, broken schemas, empty responses, upstream outages.

Here's the breakdown.

| Category | Calls | Failed | Failure rate |
| --- | ---: | ---: | ---: |
| File conversion | 275 | 190 | 69% |
| Competitive intelligence | 328 | 213 | 65% |
| Utility | 270 | 170 | 63% |
| Data extraction | 6,665 | 4,104 | 62% |
| Company data | 236 | 146 | 62% |
| Data processing | 1,143 | 670 | 59% |
| Compliance | 439 | 256 | 58% |
| Document extraction | 99 | 55 | 56% |
| Web intelligence | 416 | 215 | 52% |
| Monitoring | 387 | 191 | 49% |
| Developer tools | 1,705 | 816 | 48% |
| Agent tooling | 201 | 96 | 48% |
| Web scraping | 604 | 269 | 45% |
| Security | 365 | 163 | 45% |
| Validation | 2,332 | 1,033 | 44% |
| Financial | 427 | 169 | 40% |
| Trade | 29 | 11 | 38% |
| Web3 | 434 | 119 | 27% |
| Text processing | 110 | 22 | 20% |
| Content writing | 80 | 16 | 20% |
| Finance (algorithmic) | 1,404 | 1 | 0% |

The pattern is clear: the further you are from structured, deterministic computation, the more things break. Pure algorithmic capabilities — IBAN validation, tax ID checks, SWIFT message parsing — failed at essentially 0%. They don't depend on external services, don't scrape websites, and don't call third-party APIs. Scraping-based capabilities (company registries, competitive intelligence, web extraction) failed 45–69% of the time.

Most scraping-based capabilities depend on HTML structure that the source controls. When a government registry redesigns its results page, our scraper returns empty data or misparses fields. This happened during the test sweep with at least 3 EU company registries. The capability didn't throw an error — it returned a valid-looking but wrong response. The only thing that caught it was our schema validation layer.

Several capabilities call third-party APIs that rotate auth tokens on schedules we don't control. When a token expires, the upstream returns a 401. The capability wraps this as a generic failure, and the agent has no way to distinguish "this data doesn't exist" from "the auth is broken." We now detect this pattern in our reliability scoring — repeated 401s from the same upstream trigger a circuit breaker and a score downgrade, so agents can see the degradation before they rely on the capability.

At least 4 external APIs in our stack have rate limits that aren't in their documentation. You hit a wall at 10 requests per minute, but the docs say the endpoint is unlimited. The failure mode is a 429 with no Retry-After header. We discovered these limits during the test sweep because we were calling at higher-than-normal frequency. In normal usage they might never trigger — but an agent running a batch job absolutely will.

Every capability on Strale gets two scores, computed from real test runs:

- Quality Profile (QP) — correctness, schema stability, error handling, edge case coverage. This excludes upstream failures. It measures: does our code handle the data properly?
- Reliability Profile (RP) — current availability, rolling success rate, upstream health, latency. This includes upstream failures. It measures: will this capability work right now?

These combine into a Strale Quality Score (SQS) from 0–100.
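The split matters because the two profiles call for different agent behavior: a low QP means the capability itself mishandles data, while a low RP means it's temporarily degraded and may recover. Here is a minimal TypeScript sketch of how an agent might act on that distinction. The field names and thresholds are illustrative assumptions, not Strale's published schema.

```typescript
// Illustrative only: the field names (qp, rp, sqs) and thresholds are
// assumptions for this sketch, not Strale's published response schema.
interface CapabilityScore {
  qp: number;  // Quality Profile: correctness, schema stability (upstream failures excluded)
  rp: number;  // Reliability Profile: availability, rolling success rate (upstream failures included)
  sqs: number; // Combined Strale Quality Score, 0-100
}

type Decision = "use" | "defer" | "avoid";

// Low QP: the capability mishandles data even when the upstream is healthy, so avoid it.
// Low RP with a healthy QP: the data handling is sound but the upstream is degraded,
// so defer, retry later, or fall back to another capability.
function gateCapability(score: CapabilityScore, minQp = 60, minRp = 60): Decision {
  if (score.qp < minQp) return "avoid";
  if (score.rp < minRp) return "defer";
  return "use";
}
```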
Current distribution across our 297 capabilities:

- 86% score A (SQS ≥ 80)
- 7% score B (60–79)
- 6% score C (40–59)
- 1 capability at D (below 40)

The ones scoring C or D aren't hidden. They're published with full detail on what's degraded and why. Every capability has a public quality page at GET /v1/quality/{slug} — free, no auth required.

If you're building agents that call external data sources, three things matter:

1. Architecture determines reliability more than retry logic. An algorithmic IBAN validator will always beat a scraping-based company lookup. Choose capabilities that match your reliability requirements, not just your feature requirements.
2. Quality signals should be part of the data contract. Your agent shouldn't have to guess whether the data it received is trustworthy. The quality score should come with the response — not as a separate lookup, not as documentation you read once.
3. Test continuously, not just at integration time. The capability that worked perfectly when you built the integration will break when the upstream changes. If you're not testing continuously, you'll find out from your users. We run 1,805 test suites continuously across 297 capabilities specifically because we learned this the hard way.

Five capabilities are free with no signup — iban-validate, email-validate, dns-lookup, json-repair, url-to-markdown. Just POST to /v1/do. For everything else: strale.dev/docs.

The quality endpoint is public: GET /v1/quality/{slug} — check any capability's score before you use it.
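A pre-flight check against that endpoint might look like the sketch below. Only the GET /v1/quality/{slug} path comes from this post; the base URL, response shape, and threshold are assumptions made for illustration.

```typescript
// Sketch of a pre-flight quality check. The GET /v1/quality/{slug} path is
// public per the post; the base URL and the response field are assumed.
const BASE_URL = "https://api.strale.dev"; // assumed; see strale.dev/docs for the real host

async function isHealthy(slug: string, minSqs = 60): Promise<boolean> {
  const res = await fetch(`${BASE_URL}/v1/quality/${slug}`); // no auth required
  if (!res.ok) return false;
  const quality = await res.json();
  // "sqs" is an assumed field name for the 0-100 Strale Quality Score.
  return typeof quality.sqs === "number" && quality.sqs >= minSqs;
}
```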

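Calling one of the free capabilities is then a single POST to /v1/do. The request body shape below is a guess for illustration; strale.dev/docs has the actual contract.

```typescript
// Sketch of calling a free capability via POST /v1/do.
// The endpoint and the capability slug come from the post;
// the request body shape ("capability" / "input") is an assumption.
const API_URL = "https://api.strale.dev"; // assumed base URL

async function validateIban(iban: string) {
  const res = await fetch(`${API_URL}/v1/do`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ capability: "iban-validate", input: { iban } }),
  });
  if (!res.ok) throw new Error(`iban-validate failed: ${res.status}`);
  return res.json();
}
```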