When Our Core Would Not Sit Still
Our core OSPF domain was reconverging 40-60 times per hour, and users reported intermittent connectivity issues that IT couldn’t reproduce. I had operators on the plant floor losing access to MES screens for five seconds, then everything came back before our help desk could remote in. My team saw no obvious firewall drops on FortiOS 7.4.3, no WAN outage, and no clean switch port-down event.
I made the wrong first assumption. I blamed a recent access-layer change because the timing looked suspicious, and I spent the first hour checking VLAN trunks, ACL counters, and authentication logs instead of trusting what the routing table was trying to tell me.
The core routers were not failing. They were obeying OSPF perfectly.
Our log sample showed 52 OSPF reconvergence events in one hour before the fix. After replacing the faulty fiber patch cable and implementing BFD for faster failure detection, our OSPF reconvergence events dropped to 0 in the next 24-hour monitoring window. That was the moment I stopped treating route flapping as a routing problem first; in a real manufacturing network, I treat it as a physical stability problem until the evidence proves otherwise.
How I Watched OSPF Decide a Neighbor Was Dead
In our environment, OSPF neighbor health depended on hello packets arriving on time across the point-to-point core links. When a router went a full dead interval without receiving a hello from a neighbor, it declared that neighbor down and re-originated its router LSA, and every router in the area reran SPF and updated its routing table. That sequence is normal. The damage came from the repetition.
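The dead-timer rule is easy to reason about once it is written down. The sketch below is a simplified model, not router code: the function name, the arrival times, and the assumption that the timer is armed at t=0 are all illustrative. Each received hello re-arms the timer, and the neighbor is declared down the first time a full dead interval passes with no hello.

```python
from typing import Iterable, Optional


def dead_timer_expiry(hello_arrivals: Iterable[float],
                      dead_interval: float,
                      observe_until: float) -> Optional[float]:
    """Return the time the neighbor would be declared down, or None.

    Simplified model: the dead timer is armed at t=0 and re-armed by every
    hello that arrives before it expires. Times are in seconds.
    """
    deadline = dead_interval
    for arrival in sorted(hello_arrivals):
        if arrival >= deadline:
            # The timer already expired before this hello showed up.
            break
        deadline = arrival + dead_interval
    return deadline if deadline <= observe_until else None


# Hypothetical 1-second hellos with a 4-second dead interval.
healthy = [1, 2, 3, 4, 5, 6, 7, 8, 9]
short_burst = [1, 2, 3, 6, 7, 8, 9]   # two hellos lost: adjacency survives
long_burst = [1, 2, 3, 8, 9]          # four hellos lost: adjacency drops

print(dead_timer_expiry(healthy, 4, 10))
print(dead_timer_expiry(short_burst, 4, 10))
print(dead_timer_expiry(long_burst, 4, 10))
```

With these made-up arrival times, only the long loss burst expires the timer, which is the pattern a marginal link produces: most bursts are survivable, and the occasional long one drops the adjacency.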
Each neighbor drop triggered SPF work across the OSPF area. A single unstable uplink caused routers that were otherwise healthy to recalculate paths, update next hops, and briefly move traffic through alternate links. Our FortiGate firewalls running FortiOS 7.4.3 did not cause the issue, but stateful inspection made the user complaints louder because short path changes exposed fragile application sessions.
Small failures echoed loudly.
What I didn’t expect was how quiet the interface looked at first glance. The port never went administratively down, SNMP did not scream, and our dashboard stayed mostly green. The degraded fiber patch cable was causing link-layer errors that triggered OSPF neighbor drops without generating a port-down event, which made the problem feel random until I lined up CRC counters, OSPF adjacency logs, and SPF timestamps.
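The "up but dirty" condition can be checked mechanically. What follows is a minimal sketch, assuming counters are polled at a fixed interval into simple records; the `CounterSample` type, its field names, and the sample values are hypothetical and not tied to any SNMP library. It flags poll intervals where the interface stayed operationally up while the cumulative CRC counter kept climbing.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CounterSample:
    timestamp: int    # poll time in seconds (hypothetical)
    oper_up: bool     # operational state at poll time
    crc_errors: int   # cumulative CRC error counter


def silent_error_intervals(samples: Sequence[CounterSample],
                           min_delta: int = 1) -> List[Tuple[int, int, int]]:
    """Return (start, end, delta) for intervals that were up but accumulating CRCs."""
    flagged = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur.crc_errors - prev.crc_errors
        # A negative delta means the counter was cleared or wrapped; skip it.
        if prev.oper_up and cur.oper_up and delta >= min_delta:
            flagged.append((prev.timestamp, cur.timestamp, delta))
    return flagged


polls = [
    CounterSample(0, True, 100),
    CounterSample(60, True, 100),
    CounterSample(120, True, 137),
    CounterSample(180, True, 137),
    CounterSample(240, True, 190),
]
print(silent_error_intervals(polls))
```

An alert built on this kind of delta would have fired on the degraded cable even though the port-down alert never did.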
My opinion is simple: if OSPF keeps reconverging and the application team says “random,” I start with the link, not the routing protocol.
Tune Hello and Dead Timers With Restraint
Our default OSPF timers were stable for years: 10-second hellos and a 40-second dead interval on broadcast-style segments, with faster values on point-to-point core links where we had tighter control. I have seen teams lower timers aggressively because they want faster failover, then accidentally create a routing domain that punishes every brief packet-loss burst.
Timer tuning has a place, but I do not use it as a bandage for bad optics, dirty fiber, overloaded CPUs, or duplex mismatches. OSPF itself requires both neighbors to agree on hello and dead intervals, on Cisco platforms and everywhere else, or the adjacency never forms. That makes timer changes operationally risky in a mixed maintenance window, especially when security appliances, distribution switches, and core routers all participate in the same area design.
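! Core point-to-point link with fast hello and dead timers
interface TenGigabitEthernet1/1/1
 description CORE-OSPF-LINK-A
 ip ospf network point-to-point
 ip ospf hello-interval 1
 ip ospf dead-interval 4
 ip ospf 10 area 0
!
! Process-level adjacency logging plus SPF and LSA throttling (start, hold, max in ms)
router ospf 10
 log-adjacency-changes detail
 timers throttle spf 50 200 5000
 timers throttle lsa 50 200 5000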
Fast timers expose weak links faster.
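The `timers throttle spf 50 200 5000` values are start, hold, and maximum wait in milliseconds. As a rough illustration of why throttling matters during a flap storm, here is a simplified model of the backoff, assuming topology changes keep arriving during every wait so the hold time doubles each round until it reaches the maximum. The function name is hypothetical, and real platforms also reset the backoff after a quiet period, which this sketch ignores.

```python
from typing import List


def spf_wait_sequence(start_ms: int, hold_ms: int, max_ms: int, runs: int) -> List[int]:
    """Waits before each successive SPF run under continuous churn (simplified)."""
    waits = []
    wait = start_ms        # first SPF runs after the initial start delay
    next_hold = hold_ms    # subsequent waits begin at the hold value
    for _ in range(runs):
        waits.append(wait)
        wait = min(next_hold, max_ms)
        next_hold = min(next_hold * 2, max_ms)
    return waits


print(spf_wait_sequence(50, 200, 5000, 8))
# → [50, 200, 400, 800, 1600, 3200, 5000, 5000]
```

Under this model a persistently flapping link pushes SPF runs out to five seconds apart, which protects CPU but also means the area converges more slowly while the flap lasts.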
I now treat OSPF timer work as a controlled engineering change. My team documents the current values, confirms platform support, tests CPU behavior, and verifies that logging can distinguish a true failure from an unstable physical layer. I prefer conservative OSPF timers plus BFD over ultra-aggressive OSPF timers alone because routing protocols should not have to act like cable testers.
Use BFD When Failure Detection Needs Precision
BFD gave us the clean failure signal we wanted without forcing OSPF hellos to carry all detection responsibility. We configured BFD on the core routed links so neighbor failure detection moved into a lightweight protocol built for rapid liveness checks. OSPF still handled topology and SPF, but BFD told it when a forwarding path was genuinely unusable.
Our operational stack around the fix was plain: Cisco core routing, FortiOS 7.4.3 at the security edge, Ubuntu 22.04 for our monitoring collector, and Python 3.11 scripts parsing syslog into hourly reconvergence counts. The Python job was not fancy, but it made the before/after argument impossible to dodge: 52 events before, 0 events after.
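A parser of this kind fits in a few lines. The version below is an illustrative sketch, not the production job. It assumes the collector writes ISO 8601 timestamps followed by the hostname, it treats every `%OSPF-5-ADJCHG` transition from FULL to DOWN as one reconvergence event, and the sample lines, hostname, and neighbor address are made up.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable

# Matches an adjacency loss, e.g.
# "<iso-timestamp> <host> %OSPF-5-ADJCHG: Process 10, Nbr 10.255.0.2 on Te1/1/1 from FULL to DOWN, ..."
ADJ_DOWN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+.*%OSPF-5-ADJCHG:.*\bNbr (?P<nbr>\S+) "
    r"on (?P<intf>\S+) from FULL to DOWN"
)


def hourly_adjacency_drops(lines: Iterable[str]) -> Dict[str, int]:
    """Count FULL-to-DOWN adjacency changes per clock hour."""
    counts: Counter = Counter()
    for line in lines:
        match = ADJ_DOWN.search(line)
        if match is None:
            continue
        stamp = datetime.fromisoformat(match.group("ts"))
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return dict(counts)


sample = [
    "2024-03-04T09:12:41+00:00 core1 %OSPF-5-ADJCHG: Process 10, Nbr 10.255.0.2 on "
    "TenGigabitEthernet1/1/1 from FULL to DOWN, Neighbor Down: Dead timer expired",
    "2024-03-04T09:12:52+00:00 core1 %OSPF-5-ADJCHG: Process 10, Nbr 10.255.0.2 on "
    "TenGigabitEthernet1/1/1 from LOADING to FULL, Loading Done",
    "2024-03-04T09:47:03+00:00 core1 %OSPF-5-ADJCHG: Process 10, Nbr 10.255.0.2 on "
    "TenGigabitEthernet1/1/1 from FULL to DOWN, Neighbor Down: Dead timer expired",
    "2024-03-04T10:05:19+00:00 core1 %OSPF-5-ADJCHG: Process 10, Nbr 10.255.0.2 on "
    "TenGigabitEthernet1/1/1 from FULL to DOWN, Neighbor Down: Dead timer expired",
]
print(hourly_adjacency_drops(sample))
```

Counting only the FULL to DOWN transition avoids double counting the recovery messages that follow each drop.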
- I enabled detailed OSPF adjacency logging before changing timers.
- I checked CRC, input error, and optical receive levels on both ends.
- I replaced the suspect fiber patch cable before blaming software.
- I enabled BFD only on controlled core links with known neighbors.
- I watched SPF counters for a full production shift after the change.
Evidence beats instinct.
BFD is not magic. If I enable it everywhere without understanding link quality and platform load, I can create a faster failure loop instead of a more stable network. Used carefully, though, BFD is exactly the tool I want in a manufacturing core where five seconds of routing uncertainty can interrupt scanners, HMIs, badge systems, and production reporting. My bias is firm here: sub-second detection belongs in BFD, not in reckless OSPF timer cuts.
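The arithmetic behind that bias is short. In BFD asynchronous mode (RFC 5880), the local detection time is the remote system's detect multiplier times the agreed remote transmit interval, which is the larger of the local required minimum receive interval and the remote desired minimum transmit interval. The sketch below applies that formula; the 300 ms interval and multiplier of 3 are example values chosen for illustration, not the settings from our core.

```python
def bfd_detection_time_ms(local_required_min_rx_ms: int,
                          remote_desired_min_tx_ms: int,
                          remote_detect_mult: int) -> int:
    """Local BFD detection time in asynchronous mode, in milliseconds."""
    agreed_interval = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return agreed_interval * remote_detect_mult


# Example values only: 300 ms intervals on both sides, multiplier 3.
bfd_ms = bfd_detection_time_ms(300, 300, 3)
ospf_dead_ms = 4 * 1000  # the 4-second dead interval from the core link config

print(bfd_ms)                 # → 900
print(ospf_dead_ms - bfd_ms)  # → 3100
```

With those example values BFD would report a dead path roughly three seconds before the OSPF dead timer would, without shortening the OSPF timers any further.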
Trace Flaps With Logs, Counters, and Time
The breakthrough came when I stopped reading each log line alone and built a timeline. OSPF adjacency dropped, SPF ran, traffic shifted, users complained, and interface counters increased. The missing piece was that the physical interface stayed up the whole time, so our first alerting path never fired.
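The key correlation in that timeline can be expressed as a small check: which adjacency drops had no link-down event anywhere near them? The following is a minimal sketch, assuming both event streams have already been parsed into `datetime` values; the function name, the 5-second window, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def drops_without_link_down(adj_drops: Sequence[datetime],
                            link_downs: Sequence[datetime],
                            window_s: int = 5) -> List[datetime]:
    """Return adjacency drops with no link-down event within +/- window_s seconds."""
    window = timedelta(seconds=window_s)
    return [
        drop for drop in adj_drops
        if not any(abs(drop - down) <= window for down in link_downs)
    ]


adj_drops = [
    datetime(2024, 3, 4, 9, 12, 41),
    datetime(2024, 3, 4, 9, 20, 5),
]
link_downs = [datetime(2024, 3, 4, 9, 20, 3)]

for drop in drops_without_link_down(adj_drops, link_downs):
    print(drop.isoformat())
```

Drops that survive this filter are the ones worth chasing into CRC counters and optical levels, because nothing at the port-state level explains them.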
On Cisco gear, I used detailed adjacency logging and targeted debug during a maintenance window. I do not leave broad OSPF debug running in production because noisy control-plane logging can become its own problem. I capture narrow evidence, export it, and compare it with switch interface counters and firewall session logs.
show ip ospf neighbor detail
show ip ospf interface tenGigabitEthernet1/1/1
show ip ospf statistics
show interface tenGigabitEthernet1/1/1 counters errors
debug ip ospf adj
undebug all
The cable never looked guilty from ten feet away.
Once the bad patch cable was replaced, the CRC increments stopped, the OSPF neighbor stayed full, and our reconvergence counter flatlined. BFD then gave us deterministic failure detection for future hard failures, which is different from masking the original issue. I like that distinction because permanent fixes should remove instability, while detection improvements should make the next fault easier to prove.
Stabilize the Link Before Blaming OSPF
I still see engineers treat OSPF flapping like a configuration puzzle first. My experience says otherwise. In our plant, the routing protocol was the messenger, and the message was that one physical path could not be trusted. The protocol did exactly what it was designed to do: protect reachability when a neighbor looked dead.
My team now keeps a short runbook for this class of incident. We pull OSPF neighbor history, compare SPF counters, inspect interface errors, check optics, validate timer consistency, and only then change protocol behavior. We also keep our Ubuntu 22.04 collector and Python 3.11 parser close to the routing logs because exact counts end arguments faster than screenshots.
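The timer-consistency step in that runbook lends itself to a quick script. This is a hedged sketch, assuming the hello and dead values for both ends of each link have already been collected into plain dictionaries; the link names, data layout, and function name are illustrative and not part of any vendor tooling.

```python
from typing import Dict, List, Tuple

LinkEnds = Tuple[Dict[str, int], Dict[str, int]]


def timer_mismatches(links: Dict[str, LinkEnds]) -> List[Tuple[str, str, int, int]]:
    """Return (link, field, value_a, value_b) for every hello/dead disagreement."""
    problems = []
    for name, (end_a, end_b) in sorted(links.items()):
        for field in ("hello", "dead"):
            if end_a[field] != end_b[field]:
                problems.append((name, field, end_a[field], end_b[field]))
    return problems


# Hypothetical inventory: link B has a dead-interval mismatch.
links = {
    "CORE-OSPF-LINK-A": ({"hello": 1, "dead": 4}, {"hello": 1, "dead": 4}),
    "CORE-OSPF-LINK-B": ({"hello": 1, "dead": 4}, {"hello": 1, "dead": 40}),
}
print(timer_mismatches(links))
# → [('CORE-OSPF-LINK-B', 'dead', 4, 40)]
```

Running a check like this before and after a timer change catches the half-applied change that would otherwise show up as an adjacency that never forms.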
One unstable link can embarrass an entire design.
The hard lesson was not that OSPF reconverges. I already knew that. The hard lesson was that a degraded cable could create enough link-layer loss to drop neighbors while staying quiet enough to avoid the normal port-down alarms. My opinion after that incident is blunt: before I tune OSPF, I prove the wire, the optic, and the error counters are clean.

