ChatOps for Network Operations: Runbooks That Execute in Slack

Turning Incident Chaos Into Executable Runbooks

During a network outage, our team was running manual CLI commands across 12 devices at once, and no one knew what had already been done. One engineer was logged in to a FortiGate running FortiOS 7.4.3, another was checking Cisco access switches, and a third was asking in Slack whether anyone had already bounced an uplink on the distribution pair. The ticket had timestamps, but it did not show the real sequence of operational actions.

My first assumption was wrong. I thought our problem was lack of documentation, so I started by tightening the incident template. The real problem was that our actions were happening outside the system of record. We documented after the fact, which meant we documented from memory, which meant we argued about what happened when the plant manager wanted the line back up.

The break point came when parallel manual changes during an incident collided: two engineers tried to clear OSPF neighbors on the same device at the same time. Neither engineer was careless. Both were moving fast, both had terminal access, and both believed they were removing stale adjacency state from a routed manufacturing cell.

That was not teamwork.

I moved our most repeated network incident steps into Slack as executable runbooks. The first runbooks were not fancy: verify routing adjacency, collect interface counters, snapshot firewall sessions, check VPN tunnel state, and run read-only diagnostics against devices. I wanted commands that our team already trusted, with guardrails around who could run them, against which device group, and during which incident channel.

My opinion is that ChatOps only works when it starts with boring operations work, not dramatic automation dreams.

Designing Runbooks Around Network Tasks We Actually Perform

I designed our runbooks around tasks that had a clear operator intent and a repeatable command path. In our environment, that meant FortiGate firewalls on FortiOS 7.4.3, Ubuntu 22.04 jump hosts, Python 3.11 automation workers, and a mix of routed plant networks where timing mattered because PLC traffic did not care about our ticket hygiene.

Each runbook needed a name, a scope, an input schema, a command map, an approval rule, and an output format. I did not let the bot accept raw CLI strings from Slack. That was the line I refused to cross. If an engineer typed /netops clear-ospf device=fw-edge-02, the bot translated that intent into a known command only after checking role, incident channel, device lock, and approval status.

  • Read-only diagnostics ran immediately for approved network engineers.
  • State-changing actions required a second engineer approval in Slack.
  • Device locks prevented simultaneous destructive actions on the same node.
  • Every runbook emitted structured JSON before posting human-readable output.
  • Incident channels became the operational timeline, not just a chat room.
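
The translation from slash-command text to a known intent can be sketched roughly as follows. This is a minimal illustration, not our production parser: the RUNBOOKS registry, the Intent class, the show-counters verb, and the parse_command function are hypothetical names, and the role, channel, lock, and approval checks described above would run after this step.

```python
from dataclasses import dataclass

# Hypothetical registry: slash-command verb -> internal action and allowed parameters.
RUNBOOKS = {
    "clear-ospf": {"action": "clear_ospf_neighbor", "params": {"device"}, "destructive": True},
    "show-counters": {"action": "collect_interface_counters", "params": {"device"}, "destructive": False},
}


@dataclass(frozen=True)
class Intent:
    action: str
    params: dict
    destructive: bool


def parse_command(text: str) -> Intent:
    """Turn 'clear-ospf device=fw-edge-02' into a validated Intent, or raise ValueError."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty command")
    verb, pairs = tokens[0], tokens[1:]

    spec = RUNBOOKS.get(verb)
    if spec is None:
        # Anything outside the registry is rejected; raw CLI never passes through.
        raise ValueError(f"unknown runbook: {verb}")

    params = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not value:
            raise ValueError(f"expected key=value, got: {pair}")
        params[key] = value

    # Require exactly the declared parameters: no extras, none missing.
    if set(params) != spec["params"]:
        raise ValueError(f"{verb} takes exactly: {sorted(spec['params'])}")

    return Intent(spec["action"], params, spec["destructive"])
```

Calling parse_command("clear-ospf device=fw-edge-02") yields an intent for clear_ospf_neighbor, while free-form text such as "run ls -la" is rejected before it reaches any device.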

What I didn’t expect was how quickly the team stopped asking, “Did anyone run this yet?” The answer was visible in the channel, attached to the command, the device, the approver, the return code, and the output hash. That changed the room more than the automation itself.

Visibility beats heroics.

I still prefer small runbooks. A 20-step automated recovery workflow looks impressive in a demo, but during a production outage I want five targeted runbooks that are easy to inspect and hard to misuse. My bias is earned from too many nights watching automation fail because it tried to be clever while the network was already unstable.

Building A Slack Bot That Runs Commands Safely

Our Slack bot runs as a Python 3.11 service on Ubuntu 22.04, with Slack Bolt handling slash commands and a worker queue executing device actions. I kept the bot process separate from the command runner because Slack interactions need fast acknowledgments, while network commands can hang, retry, or return partial output when a device is under stress.
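
The split between acknowledging and executing can be sketched with the standard library alone. The handler below is shaped like a Bolt slash-command listener, which receives an ack callable and a command payload, but it is a stand-in under stated assumptions: the in-process queue.Queue takes the place of our real worker queue, and handle_netops, worker, jobs, and results are illustrative names.

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()
results = []


def handle_netops(ack, command):
    """Acknowledge first, then enqueue; Slack expects the ack within about 3 seconds."""
    ack(f"Queued: {command['text']}")
    jobs.put({"user_id": command["user_id"], "text": command["text"]})


def worker(run_job):
    """Drain the queue on a separate thread so a slow device never delays the ack."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel value stops the worker
            jobs.task_done()
            break
        try:
            results.append(run_job(job))
        finally:
            jobs.task_done()
```

In a real deployment the worker would live in a separate process so that a hung SSH session cannot take the Slack listener down with it.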

The command runner talks to devices through approved libraries and jump-host constraints. For FortiGate work, we used API calls where possible and SSH only where the API did not expose the operational command cleanly. For switch and router checks, we used locked command templates with explicit device inventory references. No one pasted credentials into Slack. No one passed arbitrary shell.
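
A locked command template can be sketched like this. Everything here is hypothetical: the INVENTORY entries, the TEMPLATES table, the check_vpn_tunnel action, and the render_command function are illustrative, and the command strings are deliberate placeholders rather than verified vendor syntax.

```python
import re

# Hypothetical inventory; real entries would come from the device inventory system.
INVENTORY = {
    "fw-edge-02": {"platform": "fortigate"},
    "sw-access-11": {"platform": "switch"},
}

# (platform, action) -> (template, required parameter names).
# The command text is a placeholder, not real CLI syntax.
TEMPLATES = {
    ("switch", "collect_interface_counters"): ("<counters-cmd> {interface}", {"interface"}),
    ("fortigate", "check_vpn_tunnel"): ("<tunnel-status-cmd> {tunnel}", {"tunnel"}),
}

# Parameter values may contain only characters that cannot chain or escape a command.
SAFE_VALUE = re.compile(r"[A-Za-z0-9/._-]+")


def render_command(action: str, device: str, **params: str) -> str:
    """Render a known template for a known device, or raise ValueError."""
    entry = INVENTORY.get(device)
    if entry is None:
        raise ValueError(f"device not in inventory: {device}")

    template = TEMPLATES.get((entry["platform"], action))
    if template is None:
        raise ValueError(f"no template for {action} on {entry['platform']}")

    text, required = template
    if set(params) != required:
        raise ValueError(f"{action} requires exactly: {sorted(required)}")

    for name, value in params.items():
        if not SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for {name}")

    return text.format(**params)
```

The point of the allowlist is that an operator can choose which template and which device, but never the shape of the command itself.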

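from dataclasses import dataclass
from datetime import datetime, timezone

# device_is_locked, lock_device, and run_known_command are helpers defined
# elsewhere and not shown here.


@dataclass
class RunbookRequest:
    incident_id: str
    user_id: str
    action: str
    device: str
    approved_by: str | None = None  # set once a second engineer approves


# Actions that change device state and therefore need a second approval.
DESTRUCTIVE_ACTIONS = {"clear_ospf_neighbor", "bounce_interface"}


def execute_runbook(req: RunbookRequest) -> dict:
    # Gate 1: state-changing actions never run without an approver.
    if req.action in DESTRUCTIVE_ACTIONS and not req.approved_by:
        return {"status": "blocked", "reason": "approval_required"}

    # Gate 2: refuse early if another operator already holds the device.
    # lock_device must still claim the lock atomically, because two requests
    # can both pass this check before either one takes the lock.
    if device_is_locked(req.device):
        return {"status": "blocked", "reason": "device_locked"}

    # Hold the device lock only for the duration of the command.
    with lock_device(req.device, owner=req.user_id):
        result = run_known_command(req.action, req.device)

    # Structured record posted to Slack and written to the audit store.
    return {
        "status": "complete",
        "incident_id": req.incident_id,
        "device": req.device,
        "action": req.action,
        "requested_by": req.user_id,
        "approved_by": req.approved_by,
        "finished_at": datetime.now(timezone.utc).isoformat(),
        "return_code": result.return_code,
        "output_hash": result.output_hash,
    }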

The code was less interesting than the constraints around it. We added per-device concurrency limits, channel binding, request expiration, and role-based action maps. A runbook launched from the wrong Slack channel failed. A destructive runbook without an incident ID failed. A second destructive request against a locked device failed with a message naming the active operator.
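
Two of those constraints are easy to get subtly wrong, so here is a minimal sketch of them. It assumes a single bot process; the DeviceLocks class, the check_request function, the ROLE_ACTIONS map, the role names, and the 300-second expiry are all hypothetical. The lock claims the device in one atomic step and returns the current owner on conflict, which is what allows the refusal message to name the active operator.

```python
import threading
import time

# Hypothetical role-to-action map and request lifetime.
ROLE_ACTIONS = {
    "netops-engineer": {"collect_interface_counters", "clear_ospf_neighbor"},
    "noc-viewer": {"collect_interface_counters"},
}
REQUEST_TTL_SECONDS = 300


class DeviceLocks:
    """Check-and-claim in one step, so two requests cannot both see 'unlocked'."""

    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}

    def try_acquire(self, device, owner):
        """Return None on success, or the current owner if the device is held."""
        with self._guard:
            current = self._owners.get(device)
            if current is not None:
                return current
            self._owners[device] = owner
            return None

    def release(self, device, owner):
        """Release only if the caller is the owner; otherwise do nothing."""
        with self._guard:
            if self._owners.get(device) == owner:
                del self._owners[device]


def check_request(role, action, channel_id, incident_channel_id, created_at, now=None):
    """Return a denial reason string, or None if the request may proceed."""
    now = time.time() if now is None else now
    if channel_id != incident_channel_id:
        return "wrong_channel"
    if now - created_at > REQUEST_TTL_SECONDS:
        return "request_expired"
    if action not in ROLE_ACTIONS.get(role, set()):
        return "action_not_permitted"
    return None
```

With more than one worker process, the same claim-or-report-owner behavior would need a shared store rather than an in-memory dictionary.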

Friction is a feature here.

I do not believe network ChatOps should feel like a general-purpose remote terminal. It should feel like a controlled cockpit with fewer switches, clearer labels, and permanent recording. That opinion has saved our team from turning Slack into a prettier outage amplifier.

Logging Approvals Where The Incident Actually Happens

The audit trail became the real win. Before ChatOps, our ticketing system showed that an incident existed and that someone eventually wrote notes. It did not natively show every operational action in real time. Slack did, once we forced the bot to post structured action cards for request, approval, execution, completion, and failure.
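
A structured event behind each action card might look like the sketch below. The field names, the build_event and to_json helpers, and the choice of SHA-256 for the output hash are assumptions made for illustration; the five event types are the ones named above.

```python
import hashlib
import json
from datetime import datetime, timezone

EVENT_TYPES = {"request", "approval", "execution", "completion", "failure"}


def build_event(event_type, incident_id, device, action, actor, output=None):
    """Build one audit event; hash the raw output instead of embedding it."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    event = {
        "type": event_type,
        "incident_id": incident_id,
        "device": device,
        "action": action,
        "actor": actor,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    if output is not None:
        # The hash lets a reader confirm later that stored output is unchanged.
        event["output_hash"] = hashlib.sha256(output.encode("utf-8")).hexdigest()
    return event


def to_json(event):
    """Serialize with sorted keys so identical events produce identical text."""
    return json.dumps(event, sort_keys=True)
```

The same dictionary can feed both the Slack card and the audit store, which keeps the two records from drifting apart.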

For destructive operations, I required a second engineer approval using an interactive Slack button. The approver could not be the requester. The approval message included the runbook name, device, command intent, incident ID, expected blast radius, and a short rollback note. The bot wrote the same event to our audit store with immutable timestamps and Slack message links.

After implementing ChatOps runbooks, incident change conflicts dropped to zero and mean incident documentation time dropped from 45 minutes to 5 minutes. Those numbers mattered because they reflected less confusion during the outage and less paperwork after the outage. The team still had to think. The bot just removed the guessing.

Five minutes changed behavior.

We also logged failed attempts. That mattered more than I expected because denied actions showed where our runbooks were unclear or where permissions did not match actual duty roles. A rejected command was not just a security event; it was feedback about operational design.
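
Turning denied actions into design feedback can be as simple as a tally. This sketch assumes each logged record carries an action plus the status and reason fields that the runbook executor returns when it blocks a request; denial_summary is a hypothetical helper.

```python
from collections import Counter


def denial_summary(records):
    """Count blocked requests by (action, reason) to show where runbooks cause friction."""
    return Counter(
        (r["action"], r["reason"])
        for r in records
        if r.get("status") == "blocked"
    )
```

A pair that keeps topping denial_summary(records).most_common() is usually a sign of an unclear runbook or a role map that no longer matches real duties.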

My opinion is that an audit trail written after the outage is evidence of memory, while an audit trail written during the outage is evidence of work.

Connecting Slack Actions To Change Records Without Slowing Response

We still use change management and tickets. I did not replace those systems, and I would not want to. The difference is that our ticket stopped being the place where engineers tried to reconstruct reality. Instead, the ticket receives links, runbook records, approval events, output hashes, and final incident notes generated from Slack activity.

For emergency changes, the bot creates or updates the change record with the incident ID, device list, runbook names, requester, approver, and execution result. For standard changes, the same runbook can run in a planned change channel with a scheduled window and tighter pre-check requirements. I like that symmetry because it keeps emergency work from becoming a separate culture with weaker habits.
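
Assembling that change record from runbook results can be sketched as below. It assumes each result is a dictionary with the same keys the runbook executor returns (incident_id, device, action, requested_by, approved_by, status, return_code); build_change_record is a hypothetical helper, and the actual ticketing API call is left out.

```python
def build_change_record(incident_id, results):
    """Collapse runbook results for one incident into a ticket-ready summary."""
    relevant = [r for r in results if r["incident_id"] == incident_id]
    return {
        "incident_id": incident_id,
        "devices": sorted({r["device"] for r in relevant}),
        "runbooks": sorted({r["action"] for r in relevant}),
        "requesters": sorted({r["requested_by"] for r in relevant}),
        # Read-only runbooks have no approver, so skip empty values.
        "approvers": sorted({r["approved_by"] for r in relevant if r.get("approved_by")}),
        "executions": [
            {
                "device": r["device"],
                "action": r["action"],
                "status": r["status"],
                "return_code": r.get("return_code"),
            }
            for r in relevant
        ],
    }
```

The ticket gets the curated summary, while the raw events stay in the audit store.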

The integration pattern was simple: Slack captured intent and collaboration, the bot executed controlled actions, the audit store preserved raw event detail, and the ticketing system received curated records. That separation kept each system honest. Slack was not the database. The ticket was not the terminal. The bot was not the policy owner.

Boundaries keep tools useful.

The strongest internal objection was that engineers would lose speed. In practice, they gained it. No one had to ask who was logged into what device, no one had to paste command output into a ticket, and no one had to defend an emergency action with half-remembered timestamps the next morning.

My opinion is that change management should absorb operational evidence automatically, not punish engineers for failing to become court reporters during an outage.

Keep Runbooks Small, Visible, And Opinionated

I keep our ChatOps runbooks intentionally narrow. A good network runbook performs one operational job, exposes the decision point clearly, and leaves a trail that another engineer can read under pressure. If a runbook needs a paragraph of explanation in Slack, I split it or redesign it.

I also review runbooks after incidents the same way I review firewall policy changes. If the output was noisy, I trim it. If the approval prompt was vague, I rewrite it. If engineers bypassed the bot and went straight to CLI, I ask why before I blame the process. Sometimes the runbook is missing a parameter. Sometimes the device inventory is stale. Sometimes the automation is slower than a skilled operator, and that deserves an honest fix.

The biggest cultural shift was that our incident channel became the shared operational surface. We could see checks, approvals, failures, retries, and command results without screen sharing three terminals and hoping everyone followed along. That changed who could contribute during an outage, especially engineers who understood the system but were not the person holding the active SSH session.

The record became live.

I do not treat ChatOps as a replacement for network engineering judgment. I treat it as a way to make judgment visible, sequenced, and accountable when the plant floor is waiting. For our environment, that is the difference between a group of skilled people typing fast and a team executing with one shared timeline.

My opinion is that the audit trail is the product. The command execution is just how we earn it.

The best incident record is not written after the outage; it is created one approved action at a time.
