AI-Powered Incident Response: How Claude Became Our Tier-1 On-Call

Building Our Claude Triage Pipeline Around Real Alert Context

Our team was getting paged 4-6 times per night for alerts that already had documented runbooks, and every one of those pages could have been handled automatically. The worst stretch was a Tuesday night when a FortiGate running FortiOS 7.4.3 generated repeated VPN tunnel flap alerts while the manufacturing floor kept running normally. I acknowledged the same PagerDuty incident three times, checked the same tunnel metrics, and followed the same runbook before 4am.

That was the point where we stopped treating AI incident response as a lab idea and started treating it as an operations problem. We built the first version on Ubuntu 22.04 with Python 3.11, a small FastAPI receiver, Claude for reasoning, and a read-only bundle of alert context from Prometheus, FortiAnalyzer, Windows Event Forwarding, and our CMDB. I cared less about making it clever and more about making it boring enough for production.

Sleep mattered.

The pipeline starts with a normalized incident object. We pass alert name, severity, source system, affected asset, last known maintenance window, related changes, prior incidents, and linked runbook IDs. Claude does not get a vague message like “VPN down.” It gets the same bundle I want when I am half-awake: what fired, what changed, what else is noisy, and whether the device is part of a production cell or an office network.
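
A minimal sketch of what such a normalized incident object could look like. The class name, field names, and sample values are illustrative assumptions, not our production schema; the point is only that every alert arrives with the same named slots for context, and that thin alerts are easy to spot.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NormalizedIncident:
    """One alert plus the context a responder would want (field names are assumed)."""
    alert_name: str
    severity: str
    source_system: str
    affected_asset: str
    network_zone: str  # assumed values: "production_cell" or "office"
    last_maintenance_window: Optional[str] = None
    related_changes: list[str] = field(default_factory=list)
    prior_incidents: list[str] = field(default_factory=list)
    runbook_ids: list[str] = field(default_factory=list)

    def missing_context(self) -> list[str]:
        """Return the names of context fields that are still empty."""
        optional = (
            "last_maintenance_window",
            "related_changes",
            "prior_incidents",
            "runbook_ids",
        )
        return [name for name in optional if not getattr(self, name)]


# Hypothetical asset name and runbook ID, for illustration only.
incident = NormalizedIncident(
    alert_name="vpn_tunnel_flap",
    severity="warning",
    source_system="FortiAnalyzer",
    affected_asset="fw-plant-01",
    network_zone="production_cell",
    runbook_ids=["RB-VPN-001"],
)
print(incident.missing_context())
# ['last_maintenance_window', 'related_changes', 'prior_incidents']
```

Listing the empty fields makes it cheap to refuse or down-rank triage when the bundle is closer to "VPN down" than to a real handoff.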

My opinion after running it for a month is simple: AI triage without operational context is just a faster way to write confident guesses.

Linking Claude To Runbooks Without Turning Them Into Prompt Soup

We tried two approaches. The first was inline documentation, where each alert included the full runbook text inside the prompt. That worked for five or six alerts, then became brittle when our runbooks grew, especially for firewall failover, EDR isolation, and PLC network segmentation procedures. The second approach used retrieval-augmented generation, with runbooks chunked by procedure, environment, command risk, and rollback section.

RAG won, but not because it sounded more modern. It won because I could update the runbook without touching the incident pipeline, and because Claude could cite the exact internal step it was following before proposing an action. We stored the runbook index in Postgres with pgvector, tagged each chunk by system owner, and required a minimum relevance score before Claude could use it as authority.
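
The relevance floor can be sketched in plain Python. This is a stand-in, not the pgvector query we run: it assumes each chunk carries an embedding list and uses cosine similarity computed in memory, with the same 0.82 threshold that appears in the prompt-building code. The chunk IDs, owners, and vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_runbook_chunks(query_vec, chunks, min_score=0.82, top_k=3):
    """Return up to top_k chunks whose similarity clears min_score, best first."""
    scored = []
    for chunk in chunks:
        score = cosine_similarity(query_vec, chunk["embedding"])
        if score >= min_score:
            # Copy the chunk so the stored index is never mutated.
            scored.append({**chunk, "score": score})
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:top_k]


# Hypothetical three-dimensional embeddings; real ones are much larger.
chunks = [
    {"id": "vpn-flap-validate", "owner": "network", "embedding": [1.0, 0.0, 0.0]},
    {"id": "edr-isolate-host", "owner": "security", "embedding": [0.0, 1.0, 0.0]},
]
hits = retrieve_runbook_chunks([0.9, 0.1, 0.0], chunks)
print([hit["id"] for hit in hits])  # ['vpn-flap-validate']
```

An empty result is a valid answer here: when nothing clears the floor, the alert has no runbook authority behind it and should go to a human.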

Bad retrieval is worse than no retrieval.

Our runbook chunks had to be written like instructions for a tired engineer, not like policy documents. I added command examples, expected output, rollback checks, and a field called automation_allowed. If that field was false, Claude could explain the next step but could not execute anything. That little field prevented more risk than any fancy prompt language we wrote.

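# Chunks scoring below this relevance threshold are never treated as authority.
MIN_RELEVANCE_SCORE = 0.82


def build_incident_prompt(alert, runbook_chunks):
    """Bundle the alert, its evidence, and the runbook steps Claude may act on."""
    # Keep only chunks that are both relevant enough and explicitly marked
    # automation_allowed; chunks failing either check are left out of this prompt.
    allowed_steps = [
        chunk for chunk in runbook_chunks
        if chunk["score"] >= MIN_RELEVANCE_SCORE and chunk["automation_allowed"] is True
    ]

    return {
        "alert": alert,
        "evidence": {
            "recent_changes": alert.get("recent_changes", []),
            "metrics_window": "15m",
            "asset_criticality": alert["asset"]["criticality"],
        },
        "runbook_steps": allowed_steps,
        "instruction": "Classify, validate, then recommend or execute only approved low-risk actions.",
    }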

I still prefer retrieval over giant prompts for incident response, because stale documentation is already dangerous and hiding it inside application code makes it worse.

Putting Guardrails Around Remediation Before Claude Touches Production

The first version ran remediation commands without confirming the alert was real, and during a monitoring glitch it restarted a service that was actually healthy. That was my mistake. The alert said an ingestion worker was down, the runbook said to restart it, and our automation did exactly that while the host metrics showed the process was alive and processing normally.

What I didn’t expect was how often the AI made a reasonable decision from bad evidence. The failure was not model behavior; it was my control plane. I had allowed a single monitoring source to authorize action, and in a manufacturing environment where noisy sensors, stale SNMP polling, and firewall log delays are normal, that was reckless.

Trust needs friction.

We changed the remediation model after that incident. Claude now has to classify the alert, verify it against at least two evidence sources when available, check the runbook action class, and pass a policy gate before anything runs. Low-risk actions include collecting diagnostics, opening a ticket, suppressing duplicate notifications, or restarting a non-production collector. Medium-risk actions require human approval in Slack or PagerDuty. High-risk actions are never automatic in our environment.

  • Read-only diagnostics are allowed for approved assets and service accounts.
  • Restart commands require confirmed symptoms from metrics and logs.
  • Firewall, IAM, and EDR containment actions require human approval.
  • Every action needs a rollback note before execution.
  • Claude must explain why no action was taken when evidence is weak.
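
The rules above can be sketched as a single gate function. This is a simplified illustration under stated assumptions: the Risk enum, the function name, and the argument shapes are hypothetical, and the per-asset allow list for read-only diagnostics is left out to keep it short.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def policy_gate(risk, confirming_sources, available_sources, has_rollback_note):
    """Return (decision, reason); decision is 'execute', 'needs_approval', or 'escalate'."""
    if risk is Risk.HIGH:
        return "escalate", "high-risk actions are never automatic"
    if not has_rollback_note:
        return "escalate", "no rollback note recorded for this action"
    # "At least two evidence sources when available": require two when two
    # exist, otherwise one, and never act on zero confirmations.
    required = max(1, min(2, available_sources))
    confirmed = len(set(confirming_sources))
    if confirmed < required:
        return "escalate", f"only {confirmed} of {required} required sources confirm the alert"
    if risk is Risk.MEDIUM:
        return "needs_approval", "medium-risk action needs human approval"
    return "execute", "low-risk action with confirmed evidence"


# The early mistake, replayed: one monitoring source said the worker was down
# while a second source (host metrics) was available and disagreed.
print(policy_gate(Risk.LOW, ["alert"], available_sources=2, has_rollback_note=True))
# ('escalate', 'only 1 of 2 required sources confirm the alert')
print(policy_gate(Risk.LOW, ["metrics", "logs"], available_sources=2, has_rollback_note=True))
# ('execute', 'low-risk action with confirmed evidence')
```

Returning the reason alongside the decision matters as much as the decision itself, because that string is what ends up in the incident note when no action is taken.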

My strongest view here is that an AI on-call system should earn execution rights one boring, audited action at a time.

Wiring PagerDuty And OpsGenie For AI-First Response

We did not replace PagerDuty or OpsGenie. We put Claude in front of them as a Tier-1 responder with a short leash. PagerDuty still owns escalation policy, schedules, audit history, and human notification. OpsGenie still handles some plant-specific teams that have not migrated yet. Claude receives the webhook first, enriches the incident, decides whether it is documented and safe, then either resolves, annotates, or escalates.

The useful pattern was not “AI closes tickets.” The useful pattern was “AI writes the first incident update before anyone wakes up.” If Claude cannot resolve the alert, it still adds asset context, matching runbook sections, recent deployments, dashboard links, and the exact validation checks it already performed. That turns a 3am page from a cold start into a handoff.
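
A rough sketch of how that first incident update might be assembled as plain text. The function name, section titles, and sample values are assumptions for illustration; posting the note to PagerDuty or OpsGenie is deliberately left out.

```python
def build_handoff_note(asset_context, runbook_sections, recent_deployments,
                       dashboard_links, checks_performed):
    """Render the first incident update as plain text for the incident timeline."""

    def section(title, items):
        # An empty section is stated explicitly so the reader knows it was checked.
        body = [f"  - {item}" for item in items] or ["  - none found"]
        return [f"{title}:"] + body

    lines = ["AI triage handoff"]
    lines += section("Asset context", asset_context)
    lines += section("Matching runbook sections", runbook_sections)
    lines += section("Recent deployments", recent_deployments)
    lines += section("Dashboards", dashboard_links)
    lines += section("Validation checks already performed", checks_performed)
    return "\n".join(lines)


# Hypothetical values, for illustration only.
note = build_handoff_note(
    asset_context=["fw-plant-01, production cell"],
    runbook_sections=["RB-VPN-001: validate tunnel state"],
    recent_deployments=[],
    dashboard_links=["(dashboard link goes here)"],
    checks_performed=["tunnel metrics reviewed for the last 15m"],
)
print(note)
```

Writing "none found" instead of omitting a section tells the human that the check ran and came back empty, which is different from the check never running.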

Context is the first remediation.

For PagerDuty, we used event orchestration to route known alert types into the AI triage service, then posted notes back through the incident API. For OpsGenie, we used alert webhooks and added a custom field called ai_disposition with values like auto_resolved, needs_approval, and human_required. We also added a timeout rule: if Claude did not finish within 90 seconds, the page went to the normal human rotation.

That timeout is non-negotiable in my environment, because a slow automation path during an actual outage is worse than no automation at all.
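
One way to sketch the timeout rule is with asyncio.wait_for from the standard library. The function names and the stand-in triage coroutines are hypothetical; the 90-second limit and the three disposition values come from the setup described above, and the demo uses a much shorter timeout so it finishes quickly.

```python
import asyncio

TRIAGE_TIMEOUT_SECONDS = 90
DISPOSITIONS = {"auto_resolved", "needs_approval", "human_required"}


async def triage_with_deadline(triage, alert, timeout=TRIAGE_TIMEOUT_SECONDS):
    """Run an async triage callable; fall back to the human rotation on timeout."""
    try:
        disposition = await asyncio.wait_for(triage(alert), timeout=timeout)
    except asyncio.TimeoutError:
        return "human_required"
    # Anything outside the known set is treated as "a human must look at this".
    return disposition if disposition in DISPOSITIONS else "human_required"


async def fast_triage(alert):
    """Stand-in for a triage call that finishes in time."""
    return "auto_resolved"


async def slow_triage(alert):
    """Stand-in for a triage call that overruns its deadline."""
    await asyncio.sleep(0.2)
    return "auto_resolved"


alert = {"name": "vpn_tunnel_flap"}
fast_result = asyncio.run(triage_with_deadline(fast_triage, alert))
slow_result = asyncio.run(triage_with_deadline(slow_triage, alert, timeout=0.05))
print(fast_result, slow_result)  # auto_resolved human_required
```

The fallback value is the safe one by design: a timeout or an unrecognized answer never resolves an incident, it only hands it to the normal rotation.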

Measure What Actually Changes On Call

We measured three things from day one: mean time to acknowledge, auto-resolution rate, and false-positive remediation. We also tracked unresolved incidents because I did not want the team celebrating fewer pages while hidden problems piled up. On-call page volume dropped 67% in the first month, with no increase in unresolved incidents. That was the first metric that made our plant leadership pay attention.

The before and after was sharp. Before Claude triage, our overnight average was 5.1 pages per shift. After the first month, it was 1.7 pages per shift. Human MTTA stayed around 18 minutes for escalated incidents, while AI MTTA averaged 2.4 minutes because it never had to wake up, find a laptop, connect to VPN, and remember which dashboard mattered.
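
As a quick check on how the headline percentage follows from the per-shift averages, here is the arithmetic as a small helper. The function name is an assumption; the inputs are the figures quoted above.

```python
def percent_reduction(before, after):
    """Percentage drop from before to after, relative to before."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# 5.1 pages per shift before, 1.7 after: (5.1 - 1.7) / 5.1 is about 66.7%.
page_drop = percent_reduction(5.1, 1.7)
print(round(page_drop))  # 67
```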

Numbers changed behavior.

False-positive remediation became the metric I watched hardest after my early mistake. We counted every action Claude took when the underlying condition was later judged invalid, stale, or harmless. The acceptable target was zero production-impacting false remediations, not “low.” I would rather wake an engineer for ten ambiguous alerts than let automation make one confident change to the wrong production system.
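
A minimal sketch of how that count could be computed from an action log. The record shape, field names, and sample rows are assumptions for illustration; only the three verdicts (invalid, stale, harmless) and the zero target come from the description above.

```python
FALSE_VERDICTS = {"invalid", "stale", "harmless"}


def false_remediation_report(actions):
    """Summarize actions taken on conditions later judged invalid, stale, or harmless."""
    false_actions = [a for a in actions if a["condition_verdict"] in FALSE_VERDICTS]
    production_impacting = [a for a in false_actions if a["production_impact"]]
    return {
        "total_actions": len(actions),
        "false_remediations": len(false_actions),
        "production_impacting": len(production_impacting),
        # The target is zero production-impacting false remediations, not "low".
        "target_met": len(production_impacting) == 0,
    }


# Hypothetical action log entries.
actions = [
    {"id": "a1", "condition_verdict": "valid", "production_impact": False},
    {"id": "a2", "condition_verdict": "stale", "production_impact": False},
    {"id": "a3", "condition_verdict": "harmless", "production_impact": True},
]
print(false_remediation_report(actions))
# {'total_actions': 3, 'false_remediations': 2, 'production_impacting': 1, 'target_met': False}
```

Keeping the production-impacting count separate from the overall false-remediation count lets the hard target stay at zero while the softer number is still tracked as a trend.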

My final opinion is that Claude belongs in Tier-1 incident response when we constrain it like a junior engineer with perfect recall, fast reading, and no production ego.

The win was not replacing my team; the win was letting my team sleep through the alerts our runbooks already knew how to handle.