# What Happens During an Incident (Part 4)
In Part 3, we separated signals on purpose: metrics tell you *where* to look, logs and traces tell you *what* happened, and audit tells you what can be *proven* later.

This article is an incident story. The examples are fictional, but the dynamics are real. An incident is the exact moment where panic starts:

- "Just ship everything to one place."
- "Just add the tenantId to every metric."
- "Just give everyone access so we can debug faster."

And for the sake of the article and AI, I bring in AWS DevOps Agent as an incident capability that can reduce cognitive load by correlating telemetry, code, and deployment context, and by showing its investigation steps transparently.

References:

- About AWS DevOps Agent: https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent.html
- DevOps Agent Incident Response (investigation timeline, root cause, mitigation plans, gaps): https://docs.aws.amazon.com/devopsagent/latest/userguide/devops-agent-incident-response.html

## Assumptions

The application is set up this way:

- 3 continents (EU / US / APAC)
- 2 regions per continent
- An EventBridge bus per region
- Many Lambda functions per region
- Data residency constraints

Other assumptions, already solved:

- an AWS Organization with SCPs and guardrails
- CloudTrail trails delivering to S3 with retention policies
- Structured application logs
- Metrics and alarms for key services
- Traces enabled for critical flows

## The message

Someone writes:

> CorrelationId: c-9f3a... blah blah. It worked yesterday.

When operating globally, this message is missing clues:

- We need to identify which region
- We need to stay inside the right boundary
- We need to answer fast without granting dangerous access
- We need to produce an explanation later (post-mortem, evidence)

This is where trust in the architecture becomes operational. I need a fast answer to two questions:

1) Is this global or regional?
2) Is this a customer-specific failure or a systemic failure?
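The "global or regional" question above is mostly a comparison of the same health metric across regions. As a minimal sketch (the region names and thresholds are illustrative assumptions, and in practice the error rates would come from CloudWatch, not hard-coded values):

```python
# Hypothetical triage helper: classify an incident as regional or global
# from per-region 5xx error rates (fraction of failing requests per region).
# The numbers below are made up for illustration.
REGIONS = {
    "eu-west-1": 0.18, "eu-central-1": 0.21,          # EU
    "us-east-1": 0.01, "us-west-2": 0.02,             # US
    "ap-southeast-1": 0.01, "ap-northeast-1": 0.01,   # APAC
}

def classify(error_rates: dict, threshold: float = 0.05) -> str:
    """Return 'global' if every region breaches the threshold,
    'regional: <regions>' if only some do, else 'healthy'."""
    breached = sorted(r for r, v in error_rates.items() if v > threshold)
    if not breached:
        return "healthy"
    if len(breached) == len(error_rates):
        return "global"
    return "regional: " + ", ".join(breached)

print(classify(REGIONS))  # regional: eu-central-1, eu-west-1
```

The point is not the code but the shape of the decision: one threshold, applied per region, immediately narrows the boundary you need to work inside.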
## Following the metrics

Based on the message, I start looking for the metric that represents the health of that service. I would look into:

- CloudFront 5xx error rate (if requests fail at the edge)
- API error rate or latency (if API Gateway/ALB)
- Downstream service errors (if the API is okay but something else fails later)

So I look at:

- Regional behaviour (EU vs US vs APAC backends)
- The last 15 minutes vs the last 24 hours
- Error rate and latency

This is a great moment for AWS DevOps Agent. DevOps Agent is designed to:

- Learn your resources and relationships
- Build a topology graph
- Introspect CloudWatch telemetry through the configured AWS account access
- Produce an investigation timeline and root-cause summaries when an investigation runs

### 09:18

Let's say metrics show:

- EU error rate rising
- US/APAC normal
- Service X is impacted
- Time window: since ~09:05
- A known correlationId from the customer: c-9f3a...

At this point, I have my coordinates:

- Region boundary: EU
- Capability boundary: invoice approval

### 09:22

Even with the correlationId, I still need a place to start. I usually start from where the request starts:

- The API handler Lambda
- The background process
- The first event or Step Functions execution

Once the entry point is established, I look for the log:

```json
{
  "level": "info",
  "msg": "Invoice approval requested",
  "correlationId": "c-9f3a...",
  "tenantId": "tenant-42",
  "region": "eu-west-1",
  "service": "invoice-api",
  "operation": "invoice.approve"
}
```

(`operation` here is a command.)

Now I can answer critical questions fast:

- Did we actually receive the request?
- Did we enqueue/publish the next step?
- Did we reject it immediately?

Suppose I see:

```json
{
  "level": "info",
  "msg": "Published event InvoiceApprovalRequested",
  "correlationId": "c-9f3a...",
  "eventBus": "eu-invoices",
  "service": "invoice-api"
}
```

This is important: the request entered the system and made it into the workflow.

At this point in the investigation, the instinct is to look for a trace. I have the correlationId. The system is instrumented. Tracing is enabled.
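Before moving on to traces, the correlationId lookup above is conceptually just a filter over structured log lines, which in CloudWatch would be a Logs Insights query like `filter correlationId = "c-9f3a..."`. Here is a local sketch of the same idea (log lines and the shortened `c-9f3a` id are illustrative stand-ins, modelled on the example entries above):

```python
import json

# Minimal sketch: filter structured JSON log lines by correlationId.
# In production this filtering happens server-side (CloudWatch Logs
# Insights); the entries below are simplified illustrative samples.
LOG_LINES = [
    '{"level":"info","msg":"Invoice approval requested","correlationId":"c-9f3a","service":"invoice-api"}',
    '{"level":"info","msg":"Unrelated request","correlationId":"c-0000","service":"invoice-api"}',
    '{"level":"info","msg":"Published event InvoiceApprovalRequested","correlationId":"c-9f3a","service":"invoice-api"}',
]

def by_correlation_id(lines, correlation_id):
    """Yield the messages of log entries matching the correlationId."""
    for line in lines:
        entry = json.loads(line)
        if entry.get("correlationId") == correlation_id:
            yield entry["msg"]

print(list(by_correlation_id(LOG_LINES, "c-9f3a")))
# ['Invoice approval requested', 'Published event InvoiceApprovalRequested']
```

This is why structured logs matter: the same two-line answer ("received, then published") falls out of one filter instead of a grep over free-form text.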
## Do we have a trace?

So the obvious question becomes: do we have a trace for this request?

The answer is: maybe. Tracing in production is almost always sampled. The exact request that failed may not have been captured, especially if the system is high-volume or if sampling rates are intentionally low to control cost and data volume.

When a trace does exist for the correlationId, I can see how the request moved across services, where time was spent, and which dependency failed or slowed down. But I'm usually not that lucky, so I try a different approach:

- Look for traces of similar requests in the same time window
- Move on and rely on logs

Now I have an idea: in the EU, service X is timing out.

The next question is: why did it start at 09:05? And immediately after:

- Did we deploy new code?
- Was there a configuration rollout?
- Did a feature flag change?

If nothing was deployed, that does not end the investigation. I start looking into dependency behaviour, load patterns, or environmental drift. This is why the audit workflow often looks like:

- CloudTrail delivered to S3
- queried later (often with Athena)
- correlated by time, principal, and resource

In this moment, AWS DevOps Agent can be genuinely useful:

- It maintains an investigation timeline
- It can report investigation gaps
- It can propose mitigation plans with stages

Note: let's say I confirm a configuration change, and not a transient AWS-side issue.

## Post-mortem time

I should be prepared to answer:

- What happened?
- Why did it happen?
- Why did it take us that long to detect?
- What was confusing?
- What data did we wish we had?
- What guardrail would have prevented it?
- What should we automate or change?

Again, this is where AWS DevOps Agent fits, as it includes a prevention feature that analyses multiple incidents and produces recommendations.
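The audit correlation step described above, CloudTrail events narrowed by time, principal, and resource, can be sketched locally. In practice this would be an Athena query over the CloudTrail bucket; the event shapes and names below are simplified, illustrative assumptions:

```python
from datetime import datetime, timedelta

# Hedged sketch: find mutating API calls in the window just before the
# incident started. Real CloudTrail records are richer; these are
# simplified stand-ins for illustration.
EVENTS = [
    {"eventTime": "2024-05-01T08:40:00Z", "eventName": "PutRule",
     "userIdentity": "deploy-role", "resource": "eu-invoices"},
    {"eventTime": "2024-05-01T09:03:30Z", "eventName": "UpdateFunctionConfiguration",
     "userIdentity": "deploy-role", "resource": "invoice-api"},
    {"eventTime": "2024-05-01T09:40:00Z", "eventName": "GetFunction",
     "userIdentity": "oncall-role", "resource": "invoice-api"},
]

def _parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def changes_before(events, incident_start, window_minutes=15):
    """Mutating (non-read) calls in the window before the incident start."""
    start = _parse(incident_start)
    lo = start - timedelta(minutes=window_minutes)
    return [
        e["eventName"]
        for e in events
        if lo <= _parse(e["eventTime"]) <= start
        and not e["eventName"].startswith("Get")
    ]

print(changes_before(EVENTS, "2024-05-01T09:05:00Z"))
# ['UpdateFunctionConfiguration']
```

A configuration change at 09:03 for a service that starts failing at 09:05 is exactly the kind of correlation that turns "why did it start at 09:05?" into a concrete lead.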
## By the way

AWS best practices for AWS DevOps Agent:

- Application-specific Agent Spaces
- Read-only access to shared dependency accounts
- Tagging shared resources to identify which applications use them
- Turning cross-team escalation procedures into runbooks

## Conclusion

If Part 3 was about why signals are different, this article is about what happens when you actually need them.

- Metrics help you get an idea of where things are happening
- Logs and traces let you understand what really happened
- Audit and identity give you something you can stand behind later

Incidents are the moments when the missing process shows up, while post-mortems are where that pain becomes, hopefully, better systems.
