
How to Test Your UCP Implementation with AI Agents

Benji Fisher · DEV Community

You ship a UCP manifest. The validator returns green. The schema parses cleanly. Every required field is present, every URL resolves, every transport responds. You declare the work done and move on.

Three weeks later, you find out your store has been quietly failing every agent shopping session. The cart endpoint accepts adds but rejects checkouts. A specific variant ID throws a 400 on update_cart. The agent reaches ready_for_complete and stalls because your payment handler doesn't recognise the token format. None of these issues showed up in static validation. All of them block real users on agent-mediated flows.

This post is about how to actually test your UCP implementation: not as a schema document, but as a runtime surface that real frontier agents have to operate against. The short version: schema validation is necessary but not sufficient. The long version is the rest of this post.

A UCP validator (including ours, at ucpchecker.com/ucp-validator) checks structural things:

- The manifest is valid JSON
- Required fields are present (spec, services, signing_keys, etc.)
- The declared spec version is one we recognise
- Transport endpoints return non-error responses
- Schema URLs resolve
- Capability namespaces match the spec catalogue

Those are the things you can verify without actually running an agent flow against the store. They're table stakes, and the UCP Score bakes them into the structural-conformance dimension of its grade.

What static validation doesn't catch:

- Whether update_cart rejects valid variant IDs intermittently
- Whether the cart endpoint's success response contains the line items it claims to contain
- Whether the checkout flow surfaces the buyer-specific payment instruments your customer can actually use
- Whether your search_catalog returns more than 8 KB of HTML in a description field that crashes Claude's tool-calling layer
- Whether two different models pick the same variant ID for "Medium" against your product (the variant-data problem we cover separately)
- Whether the agent can recover when one of your tool calls returns a 500 mid-flow

These are runtime properties. They only surface when you run an actual agent against an actual checkout. And they're where the gap between "store passes validation" and "agent can buy" lives. The April State of Agentic Commerce report sized that gap concretely: of 4,014 verified UCP stores, only 9 delivered a flawless end-to-end agent experience. That's a 0.2% flawless rate against a 98%+ conformance rate. The runtime gap is the gap.

The right way to test UCP is not "validator or no validator". It's three layers, each catching a different class of problem, in increasing order of cost and fidelity:

| Layer | Tool | Catches | Cost |
| --- | --- | --- | --- |
| 1. Schema validation | /ucp-validator | Manifest parse errors, missing required fields, malformed URLs | seconds, free |
| 2. Capability score | /score | Surface signals, declared capabilities, transport reachability, robots/sitemap hygiene | seconds, free |
| 3. Live agent eval | UCP Playground | Variant resolution, cart/checkout shape, error recovery, multi-model behaviour, attribution flow | dollars per session, paid |

Each layer feeds the next. If layer 1 fails, layer 2 has nothing to score. If layer 2 reports gaps, layer 3 will find them magnified in real agent runs. Skipping layers wastes layer 3's time on bugs the cheaper layers would have caught; that's the case for running them in order rather than going straight to live agents. Most teams stop at layer 2, and stopping at layer 2 is exactly what produces the near-total-conformance, 0.2%-flawless gap.
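Layer 1 is cheap enough to run locally before anything else touches the store. As a rough sketch of what that layer covers, here is a minimal manifest check in Python. The required field names come from the list above; the `services` / `endpoint` layout and everything else about the manifest shape are assumptions for illustration, and the hosted validator checks considerably more.

```python
# Minimal local approximation of the layer-1 checks described above.
# Assumptions: the manifest is a local JSON file, "services" is a list of
# objects, and each service declares its transport URL under "endpoint".
# Adjust the field names to match your actual manifest.
import json
import sys
import urllib.request

REQUIRED_FIELDS = ("spec", "services", "signing_keys")  # from the list above

def check_manifest(path: str) -> list[str]:
    problems: list[str] = []
    try:
        with open(path) as f:
            manifest = json.load(f)                      # valid JSON at all?
    except (OSError, json.JSONDecodeError) as exc:
        return [f"manifest does not parse: {exc}"]

    for field in REQUIRED_FIELDS:                        # required fields present?
        if field not in manifest:
            problems.append(f"missing required field: {field}")

    for service in manifest.get("services", []):         # transports respond?
        endpoint = service.get("endpoint") if isinstance(service, dict) else None
        if not endpoint:
            problems.append(f"service without an endpoint: {service!r}")
            continue
        try:
            urllib.request.urlopen(endpoint, timeout=10).close()
        except (OSError, ValueError) as exc:              # HTTP errors, DNS failures, bad URLs
            problems.append(f"{endpoint} unreachable or erroring: {exc}")
    return problems

if __name__ == "__main__":
    issues = check_manifest(sys.argv[1])
    for issue in issues:
        print("FAIL:", issue)
    sys.exit(1 if issues else 0)
```

That covers manifest parsing, required fields, and transport reachability; spec-version, schema-URL, and capability-namespace checks follow the same pattern.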
A clean Score gets you to "the agent has a fair chance." A clean Score plus a clean eval gets you to "the agent reliably completes the flow you care about."

Layer 3 is the one most readers haven't seen, so this section walks through what running an agent test against your own store actually involves.

The shape: you point a frontier agent (Claude, GPT, Gemini, Grok, Llama, whichever model you want to evaluate against) at your store's UCP manifest endpoint and give it a multi-turn shopping prompt. The agent does what an agent does: discovers your tools via the manifest, calls search_catalog against your products, evaluates the results, picks something, calls update_cart, and navigates checkout. The framework records every tool call, every response, every model decision, and the full token-by-token event stream. At the end of the session you get a structured report:

- Did the agent reach checkout_reached (full transaction completion)? Or did it stop at cart_created, search_only, or failed?
- How many tool calls did it make? How many succeeded? Which ones errored?
- How many tokens did the model consume?
- How long did the session take?
- If the agent failed, why?

That's the data layers 1 and 2 can't produce. Schema validation tells you what your store says; agent eval tells you what an agent does with what your store says. They're answering different questions.

For most stores, the first eval session is uncomfortable. The agent picks the wrong variant. Or it adds something to the cart and then stalls because the response shape isn't quite what it expected. Or it reaches ready_for_complete and can't move forward because your payment-handler declaration doesn't match what the agent has been trained to handle. Each of those is a fix you can make, and each fix lifts your real conversion rate the next time an actual user-facing agent shops your store.

A useful pattern from the Playground 1,000-session dataset: the same store gets meaningfully different outcomes across different models. A store that completes checkout 65% of the time on Claude Sonnet 4.5 might complete only 18% of the time on GPT-5.2: the same UCP implementation, the same shopping prompt, just a different model. That spread isn't because one model is "better." It's because each frontier model has its own quirks in how it handles tool calls, schemas, error responses, and ambiguous data. Models differ on:

- How they handle empty arrays vs missing fields
- Whether they follow up on a 4xx response or move on
- How aggressively they retry failed tool calls
- How they parse multi-line strings in description fields
- Whether they pass through optional metadata fields verbatim

The real-world implication: your customers don't all use the same agent. Some use ChatGPT-routed flows; some use Anthropic's; some use Google AI Mode; some use a custom agent built on Llama. Testing against just one model means catching only the bugs that one model surfaces, while shipping silent failures to everyone using a different one. Multi-model coverage is what gets you from "this passes for our internal demo" to "this works for real customer traffic."

UCP Playground supports head-to-head testing across 15+ frontier models, and the comparison view lets you run the same store against any two models on the same workload. We'd suggest at minimum testing against:

- One Anthropic model (Claude Opus or Sonnet)
- One OpenAI model (GPT-5.2 or GPT-4o)
- One Google model (Gemini 3.1 Pro or 2.5 Flash)

Three models cover most of the deployed-agent universe.
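To make both the report fields and the per-model spread concrete, here is a hedged sketch of how you might aggregate session reports by model and flag the ones that underperform. The record keys (`model`, `outcome`, `tool_calls`, `duration_ms`) are modelled on the report questions above, not on a documented Playground response schema.

```python
# Illustrative only: group per-session eval reports by model and flag any
# model whose checkout rate falls below a threshold. The field names mirror
# the report questions above; they are not an official schema.
from collections import defaultdict

def checkout_rates(sessions: list[dict]) -> dict[str, float]:
    by_model: dict[str, list[dict]] = defaultdict(list)
    for session in sessions:
        by_model[session["model"]].append(session)
    return {
        model: sum(s["outcome"] == "checkout_reached" for s in runs) / len(runs)
        for model, runs in by_model.items()
    }

def flag_weak_models(sessions: list[dict], threshold: float = 0.80) -> list[str]:
    # Any model completing checkout less often than `threshold` is worth
    # investigating before real customer traffic arrives on that agent.
    return sorted(m for m, rate in checkout_rates(sessions).items() if rate < threshold)

# Example shape of one session record (values invented for illustration):
# {"model": "claude-sonnet-4.5", "outcome": "checkout_reached",
#  "tool_calls": 9, "tool_errors": 0, "tokens": 48_000, "duration_ms": 21_000}
```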
If any of the three behaves badly against your store, you have a real problem worth fixing before more traffic arrives.

Manual eval is fine for one-off audits. If you're shipping changes regularly, you want this in CI. The Playground exposes a headless API for exactly that:

- `POST /api/v1/collections` defines a test (a sequence of prompts + models + stores)
- `POST /api/v1/collections/{id}/run` triggers the test
- `GET /api/v1/collection-runs/{id}` polls status and results

The pattern most teams ship first: a deploy-time test that triggers an eval after every UCP-related code change, asserts on key metrics, and fails the build if any of them regress. A reasonable assertion shape (a fuller polling-and-assertion sketch appears at the end of this post):

```yaml
# .github/workflows/ucp-eval.yml
- name: Run UCP eval
  run: |
    curl -X POST $PLAYGROUND_API/v1/collections/$COLLECTION_ID/run \
      -H "Authorization: Bearer $PLAYGROUND_TOKEN"
    # Poll, then assert:
    # - checkout_rate >= 80
    # - errors.total == 0
    # - avg_duration_ms < 30000
```

This is the same shape as Lighthouse CI for web performance: a regression catch you bolt onto your pipeline rather than rediscover in production. The UCP Playground Evals launch post walks through the full pattern with a worked example.

If you're starting from a fresh UCP implementation:

1. Run the validator against your manifest and fix any structural errors. This is the cheapest layer; do it first.
2. Get a UCP Score for your domain. Aim for B+ (70+) before moving to live testing; below that, you have surface-level gaps that will dominate the eval results and waste your test budget.
3. Run a Playground eval against your store with two different frontier models on a single shopping sequence, and fix whatever fails. Common first-time failures: variant-data ambiguity, response-shape inconsistencies, tool-argument validation.
4. Expand to three models once your single-model baseline works. Multi-model coverage is what catches the long-tail issues.
5. Wire the eval into CI once your implementation is stable. From this point on, every code change that touches UCP runs against real agents before it ships.

If you've already got a UCP implementation in production and are trying to figure out why agents aren't completing checkouts, skip step 2 and go straight to step 3. The eval will show you the specific failure mode, and you can backfill the score work later.

A store that's passed all three layers cleanly looks like this:

- Validator: green
- Score: A grade (85+) across Discovery, Conformance, and Capability Coverage
- Eval: 80%+ checkout rate against Claude Sonnet 4.5, Gemini 3 Flash, and one other model of your choice; under 5 s average tool-call latency; zero categorised errors across at least 20 sessions

That's the bar. The State of Agentic Commerce is tracking how many stores hit that bar: currently fewer than 1% of verified stores. The work to get from 99% conformance to 1% bar-clearing is mostly testing work.

The tools referenced in this post:

- Validator (free, instant): ucpchecker.com/ucp-validator
- Score (free, instant): ucpchecker.com/score
- Live agent eval (paid per session): ucpplayground.com/evals
- Multi-model comparison view: ucpplayground.com/models
- CI-ready eval API: documented at ucpplayground.com

Schema validation is necessary. It is not sufficient. The agents your customers use will run real flows against your store, and the only way to know whether those flows succeed is to run them yourself first. Test before they do.
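For reference, here is the fuller polling-and-assertion sketch mentioned in the CI section, written in Python against the three endpoints listed there. The response field names (`id`, `status`, `results.checkout_rate`, `results.errors.total`, `results.avg_duration_ms`) are assumptions made for illustration; check the Playground API documentation for the actual schema before wiring this into a build.

```python
# ci_ucp_eval.py: trigger a Playground eval run, poll it, and fail the build
# if any key metric regresses. Endpoints are the ones listed in the CI
# section; response field names are assumptions for illustration.
import json
import os
import sys
import time
import urllib.request

API = os.environ["PLAYGROUND_API"]            # assumed to include the /api prefix
TOKEN = os.environ["PLAYGROUND_TOKEN"]
COLLECTION_ID = os.environ["COLLECTION_ID"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def call(method: str, path: str) -> dict:
    req = urllib.request.Request(f"{API}{path}", method=method, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# 1. Trigger the run for a pre-defined collection.
run = call("POST", f"/v1/collections/{COLLECTION_ID}/run")
run_id = run["id"]                            # assumed response field

# 2. Poll until the run finishes (or give up after roughly an hour).
for _ in range(120):
    status = call("GET", f"/v1/collection-runs/{run_id}")
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(30)
else:
    sys.exit("eval run did not finish in time")

# 3. Assert on the same metrics as the workflow comment above.
results = status.get("results", {})
checks = {
    "checkout_rate >= 80": results.get("checkout_rate", 0) >= 80,
    "errors.total == 0": results.get("errors", {}).get("total", 1) == 0,
    "avg_duration_ms < 30000": results.get("avg_duration_ms", float("inf")) < 30000,
}
failed = [name for name, ok in checks.items() if not ok]
for name in failed:
    print("REGRESSION:", name)
sys.exit(1 if failed else 0)
```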