Policy-Based Routing vs Route-Based: The Mistake That Broke Our Failover

Our dual-WAN failover worked in the lab but silently dropped traffic in production during a primary link outage.

The Lookup Order That Burned Us

I was running FortiOS 7.4.3 on a FortiGate cluster at our manufacturing facility, with two internet circuits, SD-WAN health checks, static default routes, and a handful of source-based exceptions for production VLANs. In the lab, I pulled WAN1, watched the route table prefer WAN2, and called the failover design ready. In production, when the carrier dropped WAN1 during a maintenance window, our ERP terminals, barcode scanners, and a few inspection stations stopped reaching cloud services even though the firewall still had a valid default route through WAN2.

My first assumption was wrong: I thought the routing table would rescue the traffic once WAN1 disappeared. I had treated policy-based routing as a routing preference, when FortiGate was treating it as an instruction. The specific PBR rule matched source subnets from our production floor and forced that traffic to WAN1 regardless of the route-table failover path. The firewall did exactly what I configured, not what I meant.

That distinction matters.

On FortiGate, policy routes are evaluated before the normal destination routing lookup for matching traffic. If a packet matches a policy route by incoming interface, source, destination, protocol, or service, FortiGate can select the configured outgoing interface and gateway before the main route table has a chance to make the decision. Route-based design lets the routing table, static route distance, SD-WAN rules, or dynamic routing protocol decide the next hop. Policy-based routing inserts a higher-priority decision point, and in my environment that priority became a blackhole.
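The ordering described above can be sketched as a toy model in Python. This is a deliberately simplified illustration, not FortiOS internals: the class names, the `healthy` map, the interface names, and the 10.10.5.7 IT host are all assumptions made for the example, and real FortiOS may apply additional validity checks to a policy route that this model ignores.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Dict, List, Optional


@dataclass
class PolicyRoute:
    source: str         # source subnet the rule matches
    out_interface: str  # interface the rule forces


@dataclass
class Route:
    destination: str    # destination prefix
    out_interface: str
    distance: int       # lower distance wins among equal prefixes


def select_interface(src: str, dst: str,
                     policy_routes: List[PolicyRoute],
                     routes: List[Route],
                     healthy: Dict[str, bool]) -> Optional[str]:
    """Toy lookup: policy routes first, then the routing table."""
    src_ip, dst_ip = ip_address(src), ip_address(dst)

    # Step 1: the first matching policy route wins. Nothing in this step
    # consults link health, which is the gap the article describes.
    for rule in policy_routes:
        if src_ip in ip_network(rule.source):
            return rule.out_interface

    # Step 2: unmatched traffic falls through to the routing table, where
    # routes over unhealthy links are treated as withdrawn.
    candidates = [
        r for r in routes
        if dst_ip in ip_network(r.destination) and healthy[r.out_interface]
    ]
    if not candidates:
        return None
    best = min(candidates,
               key=lambda r: (-ip_network(r.destination).prefixlen, r.distance))
    return best.out_interface


policy_routes = [PolicyRoute("10.42.18.0/24", "wan1")]
routes = [Route("0.0.0.0/0", "wan1", 10), Route("0.0.0.0/0", "wan2", 20)]
healthy = {"wan1": False, "wan2": True}  # simulated WAN1 outage

print(select_interface("10.42.18.25", "8.8.8.8", policy_routes, routes, healthy))  # wan1
print(select_interface("10.10.5.7", "8.8.8.8", policy_routes, routes, healthy))    # wan2
```

The production host is still sent to wan1 while the hypothetical IT host fails over to wan2, which is the partial outage pattern in miniature.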

I like route-based designs for most firewall edge work because they fail in ways that are easier to reason about at 2 a.m. PBR is useful, but it should feel like an exception, not the foundation.

Why The Rule Survived The WAN Failure

The mistake was simple and ugly: a PBR rule for specific source subnets forced traffic to WAN1 regardless of its health. We had added it months earlier to keep licensing traffic and vendor support sessions pinned to our primary ISP address. The rule matched our manufacturing VLANs, specified WAN1 as the output interface, and pointed at the WAN1 next-hop gateway. It looked harmless while WAN1 was healthy because traffic moved exactly where we expected.

During the outage, WAN1 was physically up from the FortiGate perspective for part of the event, but the upstream path was unusable. Even when the default route shifted toward WAN2, the policy route still matched the same sources and still selected WAN1. I had health checks on the SD-WAN member, but I had not tied the source-based exception to a failover-aware decision. That gap was enough to take down the traffic that mattered most.

Silent drops are worse than hard failures.

What I didn’t expect was how clean the route table looked while users were still broken. From the CLI, I could see the backup route installed, the WAN2 next hop reachable, and normal test traffic from an IT subnet working. That sent us chasing DNS, application allowlists, and possible ISP filtering before we remembered the old policy route. Once we looked at the packet path from the affected source subnet, the behavior was obvious.

  • PBR matched before the normal route lookup for the affected traffic.
  • The rule selected WAN1 even though WAN2 had a valid backup route.
  • SD-WAN health checks did not automatically protect that standalone PBR rule.
  • IT subnet tests passed because they did not match the same policy route.
  • The outage appeared partial, which made the first triage pass misleading.

I do not trust source-based routing rules unless I can prove their failure behavior under the same conditions that production will see.

Tracing The Decision With Packet Flow Debug

We diagnosed it with packet flow debug on FortiOS 7.4.3, using a source IP from one of the affected production subnets and a known external destination. I ran the debug session over SSH from an Ubuntu 22.04 jump box, and I later used Python 3.11 to parse timestamps from the session logs, but the decisive evidence came straight from the FortiGate debug output. The packet was not dying because the route table lacked a backup route. It was being steered before that route mattered.

diagnose debug reset
diagnose debug flow filter clear
diagnose debug flow filter saddr 10.42.18.25
diagnose debug flow filter daddr 8.8.8.8
diagnose debug flow show function-name enable
diagnose debug flow show iprope enable
diagnose debug console timestamp enable
diagnose debug flow trace start 20
diagnose debug enable

# Generate test traffic from 10.42.18.25, then stop debugging.

diagnose debug disable
diagnose debug flow trace stop
diagnose debug reset

The debug showed the source subnet matching the policy route and selecting the WAN1 path. At that point, routing table failover was no longer the primary question. The real question was whether a policy decision sitting above the route table contradicted it. In our case one did, and it had been there long enough that nobody remembered it during the first ten minutes of the incident.

The firewall was not confused.

I also checked the active routing table, policy route table, and session behavior because cached sessions can make troubleshooting feel inconsistent. We cleared only the affected test sessions, not the entire production session table, because our plant systems were already degraded and I did not want to create a second outage while proving the first one. The command output aligned: backup route present, policy route matched, WAN1 selected, traffic dropped upstream.

Packet flow debug remains my preferred FortiGate truth source because it shows the decision chain instead of the design I thought I had built.

Fixing Source-Based Routing Without Breaking Failover

The fastest production fix was to remove the conflicting PBR rule, which restored failover within one routing convergence cycle, approximately 30 seconds. Before the change, the affected production subnet had 0 successful external test connections out of 25 attempts during the WAN1 outage window. After the change, it had 25 successful connections out of 25 attempts through WAN2, measured from the same test host and destination set.

That was the before and after we needed.
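A before-and-after count like that can be collected with a small script run from the affected subnet. The sketch below is one way to do it in Python, not the exact method used during the incident: it assumes a plain TCP handshake is a good enough success signal, and the function names and the injectable `probe` parameter are illustrative.

```python
import socket
from typing import Callable, Iterable, Tuple

Target = Tuple[str, int]  # (host, port)


def tcp_connect(target: Target, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to (host, port) completes in time."""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:  # covers timeouts, resets, and unreachable networks
        return False


def count_successes(targets: Iterable[Target], attempts: int,
                    probe: Callable[[Target], bool] = tcp_connect) -> Tuple[int, int]:
    """Cycle through the targets for `attempts` probes.

    Returns (successes, attempts) so the two runs can be compared directly.
    """
    target_list = list(targets)
    successes = 0
    for i in range(attempts):
        if probe(target_list[i % len(target_list)]):
            successes += 1
    return successes, attempts
```

Calling `count_successes(destinations, 25)` once during the outage and once after the fix, with the same `destinations` list and from the same host, keeps the comparison honest. The `probe` argument exists so the counting logic can be exercised without touching the network.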

After service was restored, we rebuilt the exception in a safer way. For our environment, the better design was to keep the default WAN decision inside SD-WAN and use SD-WAN rules or route tags where possible instead of standalone PBR pinned to a single dead-end interface. Where we still needed source-based behavior, we documented the dependency, tested link-failure behavior, and avoided rules that named only one physical WAN path without a monitored alternative.

I also changed our review checklist. Any FortiGate change that touches routing now includes a PBR audit, because the route table alone is not enough evidence. We check policy routes, SD-WAN service rules, static route distances, health-check behavior, and real packet flow from the source subnet that matters. Testing from an admin laptop is not equivalent to testing from a CNC cell, an HMI workstation, or a shipping label printer VLAN.

My opinion is that PBR should have an expiration date unless someone can clearly defend why it still exists.

Build The Failover Test Around The Exception

I now test failover by starting with the weird traffic, not the default traffic. If there is a policy route, vendor tunnel, source NAT exception, licensing subnet, inspection bypass, or application-specific path, that is the first thing I test when a circuit fails. The normal route table usually works. The exceptions are where production outages hide.

Our safer implementation pattern is plain: prefer route-based failover, keep source-based steering inside mechanisms that understand link health, and verify with packet flow debug before calling the design complete. If I must use PBR, I keep the match narrow, document the owner, include the reason in the rule comment, and test what happens when the selected path is gone. A rule that only works during perfect network conditions is not a routing design. It is a future incident.
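Those PBR conditions are easy to check mechanically if the rules are tracked in a hand-maintained inventory. The Python sketch below assumes such an inventory exists; the record fields, the `max_hosts` threshold, and the sample rules are hypothetical and are not exported from or defined by FortiOS.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import List


@dataclass
class PolicyRouteRecord:
    rule_id: int
    source: str                   # matched source subnet
    out_interface: str
    comment: str = ""             # expected to carry the owner and the reason
    failover_tested: bool = False # has the selected path been failed in a test?


def audit(rule: PolicyRouteRecord, max_hosts: int = 256) -> List[str]:
    """Return a list of findings for one rule; an empty list means it passes."""
    findings = []
    if ip_network(rule.source).num_addresses > max_hosts:
        findings.append("match is broader than the allowed size")
    if not rule.comment.strip():
        findings.append("no owner or reason in the comment")
    if not rule.failover_tested:
        findings.append("behavior with the selected path down is untested")
    return findings


inventory = [
    PolicyRouteRecord(1, "10.42.0.0/16", "wan1"),
    PolicyRouteRecord(2, "10.42.18.0/24", "wan1",
                      comment="owner: plant network team; reason: licensing",
                      failover_tested=True),
]

for rule in inventory:
    print(rule.rule_id, audit(rule))
```

Rule 1 collects all three findings and rule 2 passes. The threshold and the fields are a starting point to adapt, not a standard.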

Production does not care that the lab passed.

The lesson I kept was not that policy-based routing is bad. The lesson was that PBR has authority, and authority needs guardrails. In FortiGate environments, especially in manufacturing networks where old exceptions accumulate around uptime promises, I want the routing table and failover logic to stay visible, boring, and predictable. When I override them, I want that override treated like a loaded change, because that is exactly what it is.
