FortiGate HA Failover That Wasn’t: Debugging an Active-Passive Cluster That Never Failed Over

FortiGate HA Failover That Wasn’t: Debugging an Active-Passive Cluster That Never Failed Over

The Night Our Secondary Stayed Quiet

Primary unit went offline during maintenance and the secondary never promoted itself. I was standing in our manufacturing control room watching FortiGate sessions disappear while the plant network stayed dark, and the HA pair that I had described as “covered” in change meetings sat there like two independent firewalls with matching labels.

Our environment was a FortiGate active-passive cluster running FortiOS 7.2 at the time, later rebuilt and tested on FortiOS 7.4.3. We had PLC support networks, engineering workstations, vendor VPNs, MES servers, and a few ugly legacy paths that nobody wanted to touch during production hours. The outage lasted 3 hours before we restored the primary path manually.

I was wrong.

My first assumption was that the secondary had a firmware or config sync problem. I chased that angle too long. The real mistake was simpler and worse: heartbeat interfaces were on the same switch as the management traffic, so a switch failure killed both simultaneously. We had designed redundancy on paper and then routed the failure detector through the same failure domain. That is bad engineering, and I own it.

My opinion now is blunt: if heartbeat isolation is not physically boring, the HA design is not mature.

How Heartbeat Detection Failed Without Looking Broken

FortiGate HA heartbeat detection depends on cluster members seeing each other across configured HA heartbeat interfaces. In active-passive mode, the secondary does not promote itself just because the primary has an issue; it promotes when HA election logic says the peer is gone, priority rules allow takeover, monitored interfaces support the decision, and the cluster state is clean enough to act.

That distinction mattered in our rack. The management switch carried admin access, monitoring, and both heartbeat links. When the switch dropped during maintenance, both FortiGates lost the path used to judge peer health at the same time. From the firewall perspective, the story was not a clean primary failure. It was a messy shared network event.

Silence is a signal.

The part I disliked most was how normal the configuration looked during casual review. The cluster showed synchronized before the maintenance window. Config checks passed. The secondary had the expected priority. Nothing in the GUI screamed that our heartbeat design depended on one cheap layer-2 device staying alive.

What I didn’t expect was how quickly my team’s confidence collapsed once we traced the topology. We had treated “HA status synchronized” as proof of resilience, but sync status only proved that config replication worked at that moment. It did not prove that failover would happen under the failure mode we actually created.

My opinion: HA status lights are comfort indicators, not evidence.

Heartbeat Patterns I No Longer Accept

After the outage, I pulled our old diagrams, switch configs, FortiGate HA settings, and monitoring notes into a small review script written in Python 3.11 on Ubuntu 22.04. The script did not make decisions for us, but it forced a simple inventory: which physical ports carried heartbeat, which switches they landed on, which VLANs were shared, and which cables crossed the same patch panel.

The bad patterns were familiar once I stopped defending the original design:

  • Heartbeat links placed on the same access switch as management traffic.
  • Both HA ports patched through the same switch stack without independent power.
  • Monitored interfaces configured too broadly, causing noisy elections during minor access-layer events.
  • No scheduled failover test after firmware upgrades or cabling changes.
  • SNMP alerts watching cluster state but ignoring heartbeat port health.

One switch is not redundancy.

We moved to two dedicated heartbeat paths with separate switching and power, then documented the cabling as part of our firewall standard. We also kept one direct cable path available for controlled testing because direct HA links remove a surprising amount of ambiguity when I want to prove peer detection.

The before-and-after metric changed the conversation with management. Before the fix, the secondary never promoted and the outage lasted 3 hours. After fixing heartbeat isolation, failover completed in 8 seconds in test. That single number did more than my diagram ever did.

My opinion: a failover architecture without a measured failover time is just a hope with rack ears.

Test Failover Without Turning Maintenance Into Drama

I do not test HA by yanking random cables during production anymore. I use a written procedure, a rollback owner, a console session on both units, and a production-impact boundary that everyone understands before the first command runs. Our safest tests happen during maintenance windows with application owners watching the flows that matter, not just ICMP from my laptop.

My baseline procedure now starts with state, not action. I verify config sync, session pickup, monitored interfaces, firmware version, priority, override behavior, and routing neighbors. On FortiOS 7.4.3, I also record the exact HA checksum and session count before touching anything because arguments after a failed test are cheaper when the baseline is captured.

get system ha status
diagnose sys ha checksum show
diagnose sys ha status
get router info routing-table all
diagnose sys session stat
execute ha manage 1

Then I test narrowly. First I lower priority on the primary or issue a controlled HA failover command, depending on the approved window and platform behavior. I watch event logs, upstream switch MAC movement, VPN re-establishment, SD-WAN health checks, and critical manufacturing flows. If the MES path and vendor VPN recover but a lab VLAN drops, I still care, but I do not confuse that with a plant outage.

Small tests expose large lies.

You may also find this useful: Check out our guide on Python Network Config Backup: Automating Multi-Vendor Device Snapshots for more practical tips.

My rollback is equally plain. If promotion does not complete inside our target window, I restore priority, re-enable the original primary, and stop the test. I do not improvise inside an outage clock unless life safety or production safety requires it. For us, the 8-second failover test became the acceptance bar, and any later change that cannot meet it gets treated as a regression.

My opinion: controlled failover tests should feel boring, because excitement means the plan was incomplete.

Monitor Sync Like A Failure Is Already Forming

After the fix, I changed our monitoring from “is the firewall reachable” to “is the cluster ready to lose a member.” That sounds subtle, but the data model changes. Reachability tells me a box answers. HA readiness tells me whether the environment can absorb the next maintenance mistake, failed PSU, bad switch reload, or surprise fiber move.

We alert on HA role changes, checksum mismatch, heartbeat interface down, split-brain indicators, monitored interface failures, and unexpected cluster member count. SNMP still helps, but I prefer event log forwarding into our SIEM because the sequence matters. A heartbeat flap followed by a role election is different from a clean administrative failover, and I want that story preserved.

Logs need context.

I also added a small daily check from Ubuntu 22.04 using Python 3.11 to compare the expected cluster member serial numbers against what FortiGate reports. That check is intentionally simple. If the secondary disappears quietly, I want a plain alert before the next change window, not a forensic discovery after the plant manager calls.

We still review HA manually after FortiOS upgrades, including the move from FortiOS 7.2 to FortiOS 7.4.3 in our lab validation path. I do not trust upgrade notes alone for failover behavior in our environment. Manufacturing networks have enough old dependencies that “supported” and “safe for our plant” are not the same claim.

My opinion: monitoring should prove readiness, not decorate uptime.

Prove The Cluster Before Trusting It

I now treat every active-passive firewall pair as untested until I have watched the passive unit take over under controlled conditions. Not reviewed. Not synchronized. Not licensed. Tested. The difference matters because HA is a behavior, and behavior only exists when the system is forced to act.

Our failed maintenance window changed how I write firewall standards. Heartbeat links get physical separation. Management traffic stays out of the heartbeat path. Failover testing has an owner, a schedule, and a measured recovery target. Version numbers, cabling paths, and test results live beside the diagram because diagrams age badly when nobody attaches evidence.

Evidence beats confidence.

The secondary never promoted because we built a shared dependency into the one mechanism responsible for detecting failure. That is the kind of design flaw that hides behind green dashboards for years. I would rather find it during a planned Tuesday night test than during a production line stoppage with operators waiting and vendors asking why their tunnel vanished.

My opinion: an HA cluster that has never failed over is not insurance; it is an unanswered question.

The first real HA test is the moment the passive firewall becomes responsible for the plant.

External References


·

·