IRAS: Building an Autonomous AI Agent for Incident Response

DEV Community
Krishna shakula

Incident response is broken. When alerts fire at 3 AM, on-call engineers wake up to handle routine triage, root cause analysis, and remediation planning: work that doesn't require human judgment, just time and attention. IRAS addresses this by automating the incident response workflow with an autonomous AI agent that keeps humans in control.

Most on-call incidents follow a predictable pattern:

1. Alert fires
2. Engineer wakes up, triages the alert
3. Engineer investigates root cause
4. Engineer creates a remediation plan
5. Engineer executes the plan
6. Engineer writes a post-mortem

For routine incidents (disk full, memory leak, failed job retry), steps 1-4 don't require human judgment. They require pattern matching and analysis, which is exactly what AI is good at. Yet engineers still get paged.

IRAS is an autonomous AI agent built on production-grade technology:

- FastAPI for HTTP endpoints
- LangGraph for multi-step agentic workflows
- Pydantic AI for structured outputs and validation
- Claude (Anthropic) for reasoning and analysis
- Python for the entire stack

When an alert fires, IRAS runs the analysis workflow autonomously:

Alert → Triage → RCA → Remediation Plan → Post-Mortem → Human Approval

Each step is handled by Claude, with structured outputs validated by Pydantic. The entire workflow is orchestrated by LangGraph as a state machine with approval gates.

Key metric: under 2 minutes from alert to remediation plan.

IRAS doesn't execute remediation automatically. Every step requires human approval:

- Triage approval: confirm the incident classification
- RCA approval: confirm the root cause analysis
- Remediation approval: approve the remediation plan before execution
- Post-mortem approval: review the generated post-mortem

AI does the heavy lifting. Humans stay in control.

IRAS isn't a prototype.
It's built for production:

- 99% test coverage with 292 passing tests
- Zero external test dependencies: mock clients included for local development
- Integrated observability: logging, PagerDuty, Slack support
- Fallback mock clients: test without external services
- Docker-ready: run locally or in production

The test suite includes:

- Unit tests for each workflow step
- Integration tests for the full incident response workflow
- Mock PagerDuty and Slack clients for isolated testing
- No external service dependencies

Run the full test suite locally:

```shell
pytest --cov=iras --cov-report=html
```

IRAS only requires an Anthropic API key:

```shell
git clone https://github.com/krishnashakula/IRAS
cd IRAS
export ANTHROPIC_API_KEY=your_key_here
docker-compose up
```

The mock clients are enabled by default, so you can test the full workflow without PagerDuty or Slack.

IRAS is structured as a LangGraph state machine:

```python
from typing import Dict, TypedDict

from langgraph.graph import StateGraph

# Alert, TriageResult, RCAResult, RemediationPlan, PostMortem and the
# *_node functions are defined elsewhere in the project.

# Define incident state
class IncidentState(TypedDict):
    alert: Alert
    triage: TriageResult
    rca: RCAResult
    remediation_plan: RemediationPlan
    post_mortem: PostMortem
    approvals: Dict[str, bool]

# Build workflow
graph = StateGraph(IncidentState)
graph.add_node("triage", triage_node)
graph.add_node("rca", rca_node)
graph.add_node("remediation", remediation_node)
graph.add_node("post_mortem", post_mortem_node)

# Add approval gates
# (approval nodes such as "approval_triage" must also be registered
# with add_node; that registration is not shown here)
graph.add_edge("triage", "approval_triage")
graph.add_edge("approval_triage", "rca")
# ... more edges
```

Each node uses Claude for analysis and Pydantic AI for structured outputs.

Mean Time To Resolution drops dramatically. Routine incidents get analyzed in 2 minutes instead of 30 minutes.

On-call engineers stop getting paged for incidents that don't require human judgment. Only serious incidents or approval decisions wake them up.

Every action requires human approval. AI is a tool, not a replacement.

Automatic post-mortem generation means every incident gets documented, even routine ones.
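To make the approval gates concrete, here is a minimal sketch in plain Python, without LangGraph, of a pipeline that pauses after each step until an approver signs off. It is not taken from the IRAS codebase: `run_with_approvals`, the `approve` callback, and the stand-in step functions are all hypothetical names, and the real nodes would call Claude rather than return fixed strings.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only (not from the IRAS codebase).
StepFn = Callable[[Dict[str, str]], str]   # reads state, returns this step's output
Approver = Callable[[str, str], bool]      # (step name, output) -> approved?


def run_with_approvals(
    alert: str,
    steps: List[Tuple[str, StepFn]],
    approve: Approver,
) -> Dict[str, dict]:
    """Run steps in order and stop at the first rejected approval."""
    state: Dict[str, str] = {"alert": alert}
    approvals: Dict[str, bool] = {}
    for name, step in steps:
        state[name] = step(state)                      # produce this step's output
        approvals[name] = approve(name, state[name])   # gate: ask a human
        if not approvals[name]:
            break                                      # later steps never run
    return {"state": state, "approvals": approvals}


# Stand-in step functions; real nodes would call an LLM here.
steps: List[Tuple[str, StepFn]] = [
    ("triage", lambda s: f"classified: {s['alert']}"),
    ("rca", lambda s: "root cause: log volume filled the disk"),
    ("remediation", lambda s: "plan: rotate logs, expand volume"),
    ("post_mortem", lambda s: "summary of " + s["rca"]),
]

# An approver that rejects the remediation plan.
result = run_with_approvals(
    "disk full on db-1", steps, lambda name, output: name != "remediation"
)
print(result["approvals"])
# → {'triage': True, 'rca': True, 'remediation': False}
```

Because the loop breaks on the first rejection, the post-mortem step never runs in this example. In a LangGraph version the same effect would presumably come from conditional edges or interrupts, which this sketch does not attempt to reproduce.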
IRAS integrates with:

- PagerDuty: Fetch alerts, update incident status
- Slack: Send notifications, get approvals
- Mock clients: Test without external services

Incident response is a solved problem for routine incidents. The analysis is predictable. The remediation is known. The only variable is human approval. IRAS automates the predictable parts and keeps humans in control of the decisions.

For on-call engineers, this means:

- Fewer 3 AM wake-ups
- Faster incident resolution
- Better post-mortems
- More time for strategic work

IRAS is open source and production-ready. Check it out: https://github.com/krishnashakula/IRAS

Built with Python, FastAPI, LangGraph, Pydantic AI, and Claude. 99% test coverage. Zero external test dependencies. Only requires an Anthropic API key.

Start automating your incident response today.
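As an illustration of the mock-client fallback mentioned above, the following sketch shows one common way to default to an in-memory client when no credentials are configured. Everything here is an assumption made for the example: the `Notifier` protocol, the `MockSlackClient` and `RealSlackClient` classes, the `make_notifier` factory, and the `SLACK_BOT_TOKEN` variable name are hypothetical and may not match how IRAS wires its clients.

```python
import os
from typing import List, Mapping, Protocol, Tuple


class Notifier(Protocol):
    """Minimal interface shared by the real and mock clients (hypothetical)."""

    def send(self, channel: str, text: str) -> None: ...


class MockSlackClient:
    """Records messages in memory instead of calling Slack."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, channel: str, text: str) -> None:
        self.sent.append((channel, text))


class RealSlackClient:
    """Placeholder for a real client; the network call is omitted here."""

    def __init__(self, token: str) -> None:
        self.token = token

    def send(self, channel: str, text: str) -> None:
        raise NotImplementedError("real Slack call omitted in this sketch")


def make_notifier(env: Mapping[str, str] = os.environ) -> Notifier:
    """Return the real client only when a token is present, else the mock."""
    token = env.get("SLACK_BOT_TOKEN")  # assumed variable name
    return RealSlackClient(token) if token else MockSlackClient()


# With no token configured, the factory falls back to the mock client.
notifier = make_notifier(env={})
notifier.send("#incidents", "Triage result ready for approval")
assert isinstance(notifier, MockSlackClient)
print(notifier.sent)
# → [('#incidents', 'Triage result ready for approval')]
```

Passing the environment in as a parameter keeps the factory easy to test, since a test can supply an empty mapping instead of modifying the process environment.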