Python Network Config Backup: Automating Multi-Vendor Device Snapshots

When The FortiGate Flash Went Bad

A FortiGate suffered a corrupt flash during a power event on our plant floor, and the first call I got was not gentle. The firewall was running FortiOS 7.4.3, it had been stable for months, and then a short utility power disturbance turned into a dead boot path. We had production lines waiting on segmented traffic, vendor VPNs, and OT historian feeds. We recovered from backup in 2 hours instead of doing a full rebuild because automated backups were already running every night.

I remember opening the latest config snapshot, checking the Git timestamp, and feeling the pressure drop in the room. The replacement unit still needed licensing, interface mapping, and validation, but we were restoring known-good intent instead of trying to reconstruct years of firewall policy from memory and screenshots.

Backups changed the incident.

My team had been treating network config backups as operational hygiene, but that event moved them into the same mental bucket as UPS maintenance and tested restore procedures. I do not care how clean a network diagram looks; if I cannot recover the running config quickly, that diagram is decoration. In production, config history is infrastructure.

Retrieving Multi-Vendor Configs With Netmiko

Our backup job runs on Ubuntu 22.04 with Python 3.11 and Netmiko 4.3.0. The environment is mixed: Fortinet firewalls, Cisco IOS switches, Aruba access switches, and a handful of older edge devices that nobody wants to touch until budget season. I started with Netmiko because it gave me predictable SSH handling without forcing us into a heavy automation platform before we had the basics right.

The core pattern is simple: load an inventory, connect with the right device type, issue the vendor-specific show command, normalize the filename, and write the raw text exactly as collected. I avoid over-processing configs at capture time because the backup job should preserve evidence, not interpret it.

from datetime import datetime
from pathlib import Path
from netmiko import ConnectHandler

BACKUP_ROOT = Path("/opt/netcfg-backups")
RUN_DATE = datetime.now().strftime("%Y-%m-%d")

COMMANDS = {
    "cisco_ios": "show running-config",
    "fortinet": "show full-configuration",
    "aruba_os": "show running-config",
}

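def backup_device(device):
    # Pick the vendor-specific command for this platform.
    command = COMMANDS[device["device_type"]]
    target_dir = BACKUP_ROOT / device["site"] / device["hostname"]
    target_dir.mkdir(parents=True, exist_ok=True)

    # "site" and "hostname" are inventory metadata; ConnectHandler rejects
    # keyword arguments it does not recognize, so pass only connection fields.
    conn_params = {k: v for k, v in device.items() if k not in ("site", "hostname")}

    with ConnectHandler(**conn_params) as conn:
        output = conn.send_command(command, read_timeout=90)

    # Store the raw text exactly as collected.
    backup_file = target_dir / f"{RUN_DATE}.cfg"
    backup_file.write_text(output, encoding="utf-8")

    return backup_file, len(output.splitlines())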

What I didn’t expect was how many device quirks showed up only after the job ran for a full week. One platform needed a longer read timeout. One switch inserted paging despite the session preparation. One firewall cluster returned a slightly different header after failover. I stopped pretending all network gear behaves like clean API clients; SSH automation has to be boring, defensive, and vendor-aware. That is my preference.
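
The quirks above (slow platforms, leaked pager prompts, transient hiccups) can be handled with a few small helpers. This is a minimal sketch, not the job's actual code: the timeout values, the pager strings, and the names `read_timeout_for`, `has_paging_artifacts`, and `fetch_with_retry` are all assumptions for illustration.

```python
import time

# Hypothetical per-platform read timeouts in seconds; tune them from real runs.
READ_TIMEOUTS = {"cisco_ios": 90, "fortinet": 180, "aruba_os": 120}
DEFAULT_READ_TIMEOUT = 90

# Assumed pager prompts; confirm the exact strings on your own platforms.
PAGING_MARKERS = ("--More--", "-- MORE --")


def read_timeout_for(device_type):
    """Return the read timeout for a platform, falling back to the default."""
    return READ_TIMEOUTS.get(device_type, DEFAULT_READ_TIMEOUT)


def has_paging_artifacts(output):
    """True when pager prompts leaked into the captured text."""
    return any(marker in output for marker in PAGING_MARKERS)


def fetch_with_retry(fetch, attempts=3, delay_seconds=5.0):
    """Call fetch() until it succeeds or attempts run out, then re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
```

Inside `backup_device`, the `send_command` call could be wrapped in a lambda and passed to `fetch_with_retry`, with `read_timeout_for(device["device_type"])` replacing the fixed 90 seconds.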

Keeping Configuration History In Git

Once the files land on disk, I commit them into a private Git repository. I know some teams use object storage or backup appliances, and those can work, but Git gives my team diffs, blame, branchable experiments, and a history model every engineer already understands. When a production VLAN disappears from a trunk, I want to see the exact line change, not browse a zip archive named after a date.

Our repository layout follows sites and hostnames because that matches how we troubleshoot at 2 a.m. A daily scheduled job commits only real changes, so a quiet network does not create noise. The commit message includes the backup date and the job run ID from our scheduler.
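
One way to get "commit only real changes" is to stage the verified files and ask Git whether the index differs from HEAD before committing. The sketch below assumes the `git` CLI is on the path and that a committer identity is already configured; `commit_if_changed`, `commit_message`, and the message format are hypothetical. The `run` parameter exists only so the function can be exercised without a real repository.

```python
import subprocess


def commit_message(run_date, run_id):
    """Hypothetical message format carrying the backup date and scheduler run ID."""
    return f"Config backup {run_date} (run {run_id})"


def commit_if_changed(repo_dir, paths, message, run=subprocess.run):
    """Stage the given paths and commit only when something is actually staged.

    Returns True when a commit was created, False when there was nothing to commit.
    """
    run(["git", "add", "--", *paths], cwd=repo_dir, check=True)

    # `git diff --cached --quiet` exits 0 when the index matches HEAD
    # and 1 when staged changes exist.
    staged = run(["git", "diff", "--cached", "--quiet"], cwd=repo_dir)
    if staged.returncode == 0:
        return False

    run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
    return True
```

Staging explicit paths rather than the whole tree keeps unverified files out of the commit.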

Diffs beat memory.

We run a basic local Git flow on the backup host and mirror the repository to an internal Git server. The backup host has no reason to accept inbound management traffic from user networks, and the Git remote is restricted to a service account with write access to that one repository. That design is not glamorous, but it is easy to explain during an audit and easy to rebuild.

  • I keep one file per device per captured day for restore certainty.
  • I also keep a latest.cfg pointer for fast human access.
  • I commit only after verification passes for that device.
  • I tag monthly checkpoints before planned maintenance windows.
  • I restrict repository access because configs contain sensitive topology.
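
The first two bullets can be sketched as a single write step. This assumes `latest.cfg` is a plain copy of the newest dated file rather than a symlink, which is a guess made because copies behave the same on every filesystem and Git host; `write_snapshot` is a hypothetical name.

```python
import shutil
from pathlib import Path


def write_snapshot(target_dir, run_date, output):
    """Write the dated snapshot, then refresh latest.cfg beside it."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    dated = target_dir / f"{run_date}.cfg"
    dated.write_text(output, encoding="utf-8")

    # Assumption: latest.cfg is a copy, not a symlink.
    shutil.copyfile(dated, target_dir / "latest.cfg")
    return dated
```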

Git is not a backup strategy by itself, but for network configuration history it is one of the most useful tools we run.

Managing Credentials Without Hardcoding Passwords

My first version of the script had credentials in a local YAML file. I knew better, and I still did it because I wanted the job working before the next outage. That was my mistake. The file permissions were tight, but hardcoded secrets age badly, rotate badly, and get copied into places they were never meant to live.

We moved device credentials into a secrets manager and passed short-lived values into the Python 3.11 process at runtime. For a few legacy switches, we still had shared accounts while waiting for TACACS+ cleanup, but those exceptions were visible and tracked. The point was not perfection on day one; the point was removing hidden credential debt from the script.
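
How the values reach the process depends on the secrets manager. A minimal sketch, assuming they arrive as environment variables injected by the scheduler at launch: the variable names, the `NETCFG` prefix, and `load_credentials` are hypothetical, and a real deployment might call the secrets manager's own client instead.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when an expected credential is absent, so the run stops loudly."""


def load_credentials(env=os.environ, prefix="NETCFG"):
    """Read device credentials injected at runtime; never fall back to a file."""
    username = env.get(f"{prefix}_USERNAME")
    password = env.get(f"{prefix}_PASSWORD")
    if not username or not password:
        raise MissingCredentialError(
            f"{prefix}_USERNAME and {prefix}_PASSWORD must be set for this run"
        )
    return {"username": username, "password": password}
```

The returned dictionary can be merged into each inventory entry before connecting, so the inventory file itself never holds a secret.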

The worse mistake came later. Three Cisco switches in the backup script were failing silently with wrong credentials, and we discovered this when we needed one of those configs after a failure. The job reported completion because the wrapper caught the exception, logged a warning, and kept moving. I had optimized for continuing the run, but I had failed to make missing backups loud enough.

Silent failure is failure.
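
Continuing the run and failing loudly are compatible if every device outcome is recorded and the process exit code reflects any miss. A minimal sketch of that wrapper; `run_all` and `exit_code` are hypothetical names, and `backup_one` stands in for whatever function performs a single device backup.

```python
def run_all(devices, backup_one):
    """Back up every device, recording (hostname, ok, error) for each one."""
    results = []
    for device in devices:
        try:
            backup_one(device)
            results.append((device["hostname"], True, ""))
        except Exception as exc:
            # Keep going, but keep the evidence: exception type and message.
            results.append((device["hostname"], False, f"{type(exc).__name__}: {exc}"))
    return results


def exit_code(results):
    """Return 1 if any device failed, so the scheduler marks the run as failed."""
    return 1 if any(not ok for _, ok, _ in results) else 0
```

A caller would finish with `sys.exit(exit_code(results))`, which turns one bad credential into a failed job rather than a warning buried in a log.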

After fixing credential management and adding per-device backup verification, backup success rate went from 87% to 99.6% across 180 devices. That number mattered because it changed the conversation with operations. We were no longer saying, “the backup job ran.” We were saying, “179 of 180 devices produced verified configs last night, and this one failed for authentication.” That is the standard I trust.

Proving Each Snapshot Was Captured

Verification is where the system became dependable. A config file existing on disk is not enough. I check that the captured output is larger than a minimum size, contains expected vendor markers, and does not contain login failure text, and I record the resulting device status in a small SQLite 3.45 database used by the reporting job. For FortiOS 7.4.3, I expect recognizable configuration blocks. For Cisco IOS 15.2, I expect running-config structure, hostname, and interface sections.
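
Those content checks fit in one function that returns a list of problems, where an empty list means the snapshot passed. This is a sketch under assumptions: the 500-byte floor, the marker strings, and the failure phrases are illustrative and should be tuned against real output from each platform.

```python
MIN_BYTES = 500  # hypothetical floor; real configs are usually far larger

# Assumed markers per Netmiko device type; platforms not listed get size
# and failure-text checks only.
VENDOR_MARKERS = {
    "cisco_ios": ("hostname ", "interface "),
    "fortinet": ("config system global",),
}

# Assumed failure phrases. A login banner containing one of these would
# trigger a false positive, so banners may need an allow-list.
FAILURE_TEXT = ("authentication failed", "login incorrect", "access denied")


def verify_snapshot(device_type, output, min_bytes=MIN_BYTES):
    """Return a list of problems found in the captured text; empty means OK."""
    problems = []

    size = len(output.encode("utf-8"))
    if size < min_bytes:
        problems.append(f"too small: {size} bytes")

    for marker in VENDOR_MARKERS.get(device_type, ()):
        if marker not in output:
            problems.append(f"missing marker: {marker!r}")

    lowered = output.lower()
    for phrase in FAILURE_TEXT:
        if phrase in lowered:
            problems.append(f"failure text: {phrase!r}")

    return problems
```

Returning every problem rather than stopping at the first gives the morning report enough detail to separate a truncated capture from a rejected login.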

I also record line count, byte count, SHA-256 hash, start time, end time, command used, device type, and exception text if anything fails. That metadata helped us find a firewall that was reachable but returning incomplete output during a CPU spike. Without byte and line thresholds, the job would have stored a partial config and smiled at us.
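
The metadata can be computed from the captured text and stored with the standard library alone. A minimal sketch, assuming a single status table: the table name, column names, and the functions `snapshot_metadata` and `record_status` are hypothetical, not the schema the reporting job actually uses.

```python
import hashlib
import sqlite3

# Hypothetical schema; one row per backup attempt.
SCHEMA = """
CREATE TABLE IF NOT EXISTS backup_status (
    hostname    TEXT NOT NULL,
    device_type TEXT NOT NULL,
    command     TEXT NOT NULL,
    started_at  TEXT NOT NULL,
    ended_at    TEXT NOT NULL,
    line_count  INTEGER NOT NULL,
    byte_count  INTEGER NOT NULL,
    sha256      TEXT NOT NULL,
    error       TEXT NOT NULL DEFAULT ''
)
"""


def snapshot_metadata(output):
    """Compute line count, UTF-8 byte count, and SHA-256 of the captured text."""
    data = output.encode("utf-8")
    return {
        "line_count": len(output.splitlines()),
        "byte_count": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def record_status(conn, hostname, device_type, command,
                  started_at, ended_at, output, error=""):
    """Insert one row describing this attempt and return the computed metadata."""
    meta = snapshot_metadata(output)
    conn.execute(
        "INSERT INTO backup_status VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (hostname, device_type, command, started_at, ended_at,
         meta["line_count"], meta["byte_count"], meta["sha256"], error),
    )
    conn.commit()
    return meta
```

Comparing today's byte and line counts against the previous row for the same hostname is one way to catch a capture that is technically non-empty but suspiciously short.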

Reports go to our team chat every morning. I want the message short: total devices, verified backups, failed backups, changed configs, and links to the Git diff. If a switch fails two days in a row, it becomes a ticket. If a core device fails once, it becomes a same-day investigation. Network backup monitoring deserves urgency because restore time is decided before the outage starts.
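
The report and the escalation rules can be expressed as two small functions. In this sketch, `build_report` and `escalation` are hypothetical names, the "watch" outcome for a first failure on a non-core device is an assumption, and posting to chat and linking Git diffs are left out because they depend on the chat and Git server in use.

```python
def build_report(results, changed_hosts):
    """Build the short morning summary.

    results maps hostname -> True (verified) or False (failed);
    changed_hosts lists devices whose config differs from the last commit.
    """
    failed = sorted(host for host, ok in results.items() if not ok)
    lines = [
        f"Devices: {len(results)}",
        f"Verified: {len(results) - len(failed)}",
        f"Failed: {len(failed)}",
        f"Changed: {len(changed_hosts)}",
    ]
    if failed:
        lines.append("Failed hosts: " + ", ".join(failed))
    return "\n".join(lines)


def escalation(is_core, consecutive_failures):
    """Map a device's failure streak to an action."""
    if consecutive_failures <= 0:
        return "none"
    if is_core:
        return "same-day investigation"  # a core device failing once is enough
    if consecutive_failures >= 2:
        return "ticket"  # two days in a row
    return "watch"  # assumption: a first miss on a non-core device is only noted
```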

I like simple controls that survive real incidents. In our plant, daily config snapshots, Git history, verified credentials, and loud failures have saved more time than any fancy dashboard. Automation is only useful when I can prove it worked.

Make Restores Boring Before The Outage

The last part of our process is restore rehearsal. Once a quarter, we pick a device class and prove that we can rebuild from the stored config. We do not restore blindly into production, but we validate syntax, compare firmware assumptions, and check whether secrets, certificates, VPN objects, and local accounts need separate handling. That exercise exposed gaps a normal backup report never would have caught.

For FortiGate units, I keep notes tied to FortiOS 7.4.3 because restore behavior can shift between releases. For Cisco switches, I document boot variables, VLAN database handling, and anything that lives outside the running config. For Linux-based network services on Ubuntu 22.04, I back up the application config and the systemd unit files together because one without the other wastes time.

Recovery should feel rehearsed.

The strongest lesson from our corrupt flash event was not that Python saved us. Python 3.11, Netmiko 4.3.0, Git 2.43, and a scheduler were just the parts. The real win was deciding that every production network device needed a verified, version-controlled snapshot before we were allowed to call the environment recoverable. I have no patience now for unmanaged config risk hiding behind uptime charts.

A device without a verified backup is not configured; it is temporarily lucky.