Design the Management Plane to Stand Alone
During a core switch failure, we had no out-of-band access to 60 servers. Recovery required physical data center access and added 4 hours of downtime. I was standing in a manufacturing data center with production supervisors waiting on MES recovery, and my team could not even power-cycle the servers that hosted the tools we needed.
My first assumption was wrong. I believed our IPMI/iDRAC network was out-of-band because it used a separate VLAN, separate IP range, and separate firewall policy on FortiOS 7.4.3. In fact, that network ran through the same core switches that failed, so the management plane depended on the very infrastructure it needed to manage.
That hurt.
I now treat physical separation as the first design requirement, not a nice add-on. Our dedicated OOB switches use separate power, separate uplinks, separate cabling paths, and a routing table that does not require the production core to be alive. Logical separation matters, but VLANs alone are not survivability; they are convenience wrapped in 802.1Q tags.
My design rule is simple: if a production core, aggregation switch, firewall pair, or hypervisor cluster can fail and remove my access to management ports, I have not built out-of-band management. I have built a prettier dependency graph, and I do not trust pretty dependency graphs in a plant outage.
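One rough way to exercise that rule is to take the routed hops between the jump host and a management interface and confirm that none of them is a production device. The following is a minimal sketch, not our actual tooling: the function name `path_touches_production` and every address in it are hypothetical placeholders, and the hop list is assumed to be copied from a traceroute run on the jump host.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical production device addresses (placeholders, not real inventory).
PRODUCTION_HOPS=("192.0.2.1" "192.0.2.2")

# Returns 0 if any hop passed as an argument is a production device, 1 otherwise.
path_touches_production() {
  local hop prod
  for hop in "$@"; do
    for prod in "${PRODUCTION_HOPS[@]}"; do
      if [[ "$hop" == "$prod" ]]; then
        echo "dependency: path crosses production device ${hop}"
        return 0
      fi
    done
  done
  return 1
}

# Illustrative hop list toward a BMC; the second hop is a production core address.
if path_touches_production "198.51.100.1" "192.0.2.1" "198.51.100.20"; then
  echo "not out-of-band"
else
  echo "path is independent of production"
fi
```

A check like this only sees routed hops. A shared layer 2 switch never appears in a traceroute, so cabling and power still have to be verified in the rack.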
Separate IPMI, iDRAC, and iLO From Production Access
We standardized server management around IPMI, Dell iDRAC, and HPE iLO because each one gives my team remote power control, virtual media, hardware health, and console access below the operating system. That matters when Ubuntu 22.04 is wedged, a NIC driver update goes sideways, or a Windows server never reaches the login screen after patching.
I keep those interfaces on a dedicated management subnet with strict access control. My jump host runs hardened Ubuntu 22.04, Python 3.11 tooling for inventory checks, and no general internet browsing. Authentication goes through named admin accounts, not shared passwords taped inside a password vault note from six years ago.
Management ports are powerful.
The trap I see most often is treating IPMI like a harmless side channel. In our environment, IPMI can mount ISO images, reset servers, change boot order, and expose sensor data that attackers would love during reconnaissance. I block production workstation access, log every session, and allow only a short list of security and infrastructure admin sources.
- Dedicated OOB switch ports for every server management interface
- Separate OOB addressing with no default route through production core switches
- Named administrative accounts with MFA on the jump path
- Firmware baselines tracked for iDRAC, iLO, and BMC controllers
- Firewall rules reviewed whenever a new rack or line system is added
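The firmware baseline item in the list above lends itself to a small drift check. This is a sketch under stated assumptions: the `host type version` input format, the function names `baseline_for` and `check_firmware_drift`, and the version strings are all hypothetical, and real baselines would come from your own records.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical baseline versions per controller type (illustrative values only).
baseline_for() {
  case "$1" in
    idrac) echo "7.00.00.00" ;;
    ilo)   echo "3.00" ;;
    *)     return 1 ;;
  esac
}

# Reads "host type version" lines on stdin, prints each mismatch,
# and returns 1 if any controller differs from its baseline.
check_firmware_drift() {
  local host kind version expected drift=0
  while read -r host kind version; do
    # Skip blank lines and comments.
    [[ -z "$host" || "$host" == \#* ]] && continue
    if ! expected="$(baseline_for "${kind:-unknown}")"; then
      echo "${host}: no baseline recorded for type '${kind:-unknown}'"
      drift=1
    elif [[ "$version" != "$expected" ]]; then
      echo "${host}: firmware ${version} differs from baseline ${expected}"
      drift=1
    fi
  done
  return "$drift"
}

# Illustrative inventory; the second entry is deliberately off baseline.
if check_firmware_drift <<'EOF'
idrac-rack1-01 idrac 7.00.00.00
ilo-rack2-04 ilo 2.99
EOF
then
  echo "all controllers match baseline"
else
  echo "firmware drift found"
fi
```

The comparison is a plain string match rather than a version ordering, which keeps the check honest: anything that is not exactly the baseline gets reported.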
What I didn’t expect was how many “temporary” exceptions had become permanent. Once my team reviewed the rulebase, we found vendor VPN access to management addresses that no one could tie to an active support contract. In my opinion, unmanaged exceptions are just delayed incidents.
Add Console Servers for the Gear Without a BMC
Servers were only half the problem. Switches, routers, firewalls, storage controllers, and UPS systems still needed console access when their management IPs disappeared. We deployed console servers with serial connections into the network racks, then connected those console servers only to the OOB network.
Serial access feels old until the fancy path breaks. During one FortiGate maintenance window on FortiOS 7.4.3, a routing change cut off the web interface and SSH at the same time. The serial console let us watch the commit behavior, revert the bad route, and avoid walking across the plant floor during a night shift change.
Serial saved us.
I label both ends of every console cable, record the port mapping in source-controlled inventory, and test access quarterly. A console server with stale port labels is worse than annoying because it steals time when everyone is already tense. Manufacturing outages do not wait politely while I guess whether port 18 is the core switch or the old SAN fabric.
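#!/usr/bin/env bash
# Quick reachability check: try a TCP connection to SSH (port 22) on each OOB endpoint.
set -euo pipefail
HOSTS=("idrac-rack1-01" "ilo-rack2-04" "console-rack3-01")
failed=0
for host in "${HOSTS[@]}"; do
  # /dev/tcp is a Bash feature; timeout caps each attempt at 3 seconds.
  if timeout 3 bash -c 'cat < /dev/null > "/dev/tcp/$1/22"' _ "$host" 2>/dev/null; then
    echo "${host}: ssh reachable"
  else
    echo "${host}: ssh unreachable"
    failed=1
  fi
done
# Exit non-zero if any host was unreachable so a scheduler or wrapper notices.
exit "$failed"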
I do not pretend that a short Bash check replaces monitoring, but I like simple tools that fail loudly. The best console design is boring, documented, and tested before the outage, and I prefer boring access paths over heroic recoveries.
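The console port mapping can get the same boring treatment. Here is a minimal sketch that flags duplicate ports and blank labels before anyone has to trust the file during an outage. The `port,label` CSV layout, the function name `validate_port_map`, and the sample labels are assumptions for illustration, not our actual inventory format.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reads "port,label" lines on stdin. Reports blank labels and ports that
# appear more than once, and returns 1 if any problem was found.
validate_port_map() {
  awk -F',' '
    /^[ \t]*(#|$)/ { next }   # skip comments and blank lines
    {
      port = $1; label = $2
      if (label == "")  { printf "port %s: missing label\n", port; bad = 1 }
      if (seen[port]++) { printf "port %s: listed more than once\n", port; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Illustrative mapping; port 18 is listed twice on purpose.
if validate_port_map <<'EOF'
# port,label
17,core-sw-a
18,core-sw-b
18,old-san-fabric
EOF
then
  echo "port map looks consistent"
else
  echo "port map needs attention"
fi
```

Running something like this as a pre-commit or CI step on the inventory repository would catch a stale or duplicated entry at review time. It cannot tell whether the cable is actually plugged into the port the file claims, which is why the quarterly hands-on test still matters.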
Give the OOB Network Its Own 4G LTE Exit
The dedicated OOB switches fixed the core-switch dependency, but my team still had a site-edge dependency. If the WAN circuit or perimeter firewall failed, remote staff could lose access to the management network even though the OOB fabric inside the data center stayed healthy. That was better, but not good enough.
We added a 4G LTE router as a controlled backup path into the OOB network, with VPN access restricted to our security team and infrastructure admins. The LTE path does not carry production traffic, does not provide broad LAN access, and does not become a shadow remote-access platform for convenience requests.
The backup path earned its rack space.
After deploying a dedicated OOB network with 4G LTE backup, the next core switch failure was resolved remotely in 40 minutes instead of requiring physical access. Before that change, the same class of failure forced a trip to the data center and added 4 hours of downtime. That is the difference between an ugly maintenance event and a production incident that gets discussed in the morning operations meeting.
I also learned to monitor the LTE path like production infrastructure even though it should rarely carry traffic. SIM status, signal strength, data usage, VPN reachability, and certificate expiration all go into our normal alerting. A backup path that has not been tested recently is not a backup path; it is a hope with an invoice.
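The "tested recently" part is easy to turn into an alert condition. Below is a minimal sketch that treats the LTE path as stale when its last successful end-to-end test is too old. The 30-day threshold, the function name `lte_path_is_stale`, and the idea of storing the last success as epoch seconds are assumptions for illustration, not a description of our alerting.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed policy: the LTE path is stale after 30 days without a successful test.
MAX_AGE_DAYS=30

# Args: last successful test (epoch seconds), current time (epoch seconds).
# Returns 0 if the path is stale, 1 if it was tested recently enough.
lte_path_is_stale() {
  local last_ok="$1" now="$2"
  local age_days=$(( (now - last_ok) / 86400 ))
  (( age_days > MAX_AGE_DAYS ))
}

# Illustrative run: pretend the last successful test was 45 days ago.
now="$(date +%s)"
last_ok=$(( now - 45 * 86400 ))
if lte_path_is_stale "$last_ok" "$now"; then
  echo "LTE backup path test is overdue"
else
  echo "LTE backup path was tested recently"
fi
```

The same pattern extends to the other signals in that list: record the time of the last good reading and alert on age, so a silent modem looks the same as a failed one.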
Build the Network That Survives the Failure
My current standard is blunt: OOB must survive the failure domain it is meant to repair. If the failed component can take down access to its own management interface, my team has designed a dependency, not an escape route.
We now review OOB as part of every rack build, firewall refresh, and server purchase. I ask where the management cable lands, which switch powers it, which route carries it, which account can reach it, and which path remains when production is dark. Those questions slow down design meetings, but they speed up recovery when the plant is waiting.
Recovery access is infrastructure.
I still care about clean diagrams, naming standards, and automation, but I care more about independent reachability under stress. In my experience, a reliable out-of-band network is not glamorous security architecture; it is the quiet system that keeps my team from turning a network failure into a building access problem. I think every serious manufacturing environment should treat it as mandatory.

