Introducing ARFBench: A time series question-answering benchmark based on real incidents
More than a trillion dollars are lost every year due to system failures. To resolve them, engineers must troubleshoot outages quickly. An important task in incident response involves analyzing observability metrics: time series data that capture the health of software systems. For example, an engineer for a service may use Datadog to answer questions like “When did latency start increasing?” and “What metrics outside of latency are also behaving abnormally?” to localize the root cause of the anomalous behavior (see the sketch below for a concrete example of such a question). These time series question-answering (TSQA) tasks are essential for engineers, and they are challenging, necessary capabilities for SRE models and agents.

In this work, we explore the degree to which AI models can perform TSQA tasks. To this end, we’re excited to introduce the Anomaly Reasoning Framework Benchmark (ARFBench), a TSQA benchmark derived from real internal incidents at Datadog, using Datadog’s own internal telemetry (Figure 1).

In this blog post, we’ll present three key takeaways from our benchmarking experiments:

- Existing models struggle: Leading LLMs, vision-language models (VLMs), and time series foundation models (TSFMs) have substantial room for improvement on ARFBench.
- Hybrid models help: We introduce a new hybrid TSFM-VLM model that yields comparable overall performance to top frontier […]
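To make the TSQA setting concrete, here is a minimal sketch of the first question above, “When did latency start increasing?”, framed as a programmatic task. The synthetic data, window size, and z-score heuristic are illustrative assumptions only; this is not ARFBench’s evaluation code or Datadog’s detection logic.

```python
import numpy as np

# Illustrative assumption: a synthetic latency series with a healthy
# baseline followed by a degraded regime beginning at t = 300.
rng = np.random.default_rng(0)
latency = np.concatenate([
    rng.normal(120, 5, 300),   # healthy baseline (~120 ms)
    rng.normal(190, 8, 200),   # degraded regime after an incident
])

def first_anomalous_index(series, window=30, z_threshold=3.0):
    """Return the first index whose value deviates from the trailing
    window's mean by more than z_threshold standard deviations."""
    for t in range(window, len(series)):
        baseline = series[t - window:t]
        mu, sigma = baseline.mean(), baseline.std()
        if sigma > 0 and abs(series[t] - mu) > z_threshold * sigma:
            return t
    return None

# Answers "When did latency start increasing?" for this synthetic series.
print(f"Latency starts increasing around t = {first_anomalous_index(latency)}")
```

In ARFBench, models are asked questions like this one directly over real incident telemetry, rather than relying on a hand-written detector.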
