Cloud WAF Rule Tuning: Reducing False Positives Without Disabling Protection

Trace the Managed Rule Group False Positives First

Our AWS WAF blocked 400 legitimate API requests on the first day it went live, and the engineering team disabled it within 6 hours to restore service. I was on the manufacturing facility’s IT security team that approved the rollout, and I still remember the first escalation because it sounded like a normal application outage until we saw the 403s stacked behind CloudFront.

We had enabled AWS managed rule groups directly in block mode without baseline traffic analysis. That was my mistake. I assumed the managed rules would be conservative enough for our API traffic because we were not hosting a public forum, upload portal, or anything that looked unusually risky. Instead, the SQLi detection triggered on legitimate JSON payloads from our plant reporting system that contained SQL-like syntax.

That assumption did not survive contact with production.

Our environment was not exotic. We had API Gateway in front of Python 3.11 services, Ubuntu 22.04 workers, and a separate edge stack where FortiOS 7.4.3 protected remote access into the plant network. The problem was not that AWS WAF was bad. The problem was that our payloads contained query fragments, operator names, and report definitions that looked hostile when viewed without application context.
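To make that concrete, here is a minimal sketch of the kind of pre-launch check that would have warned us. The payload shape, field names, and keyword list are all hypothetical, and the regular expression is only a rough stand-in for what a managed SQLi rule inspects; it is not AWS's detection logic.

```python
import json
import re

# Hypothetical keyword list: a rough proxy, not the managed rule's real logic.
SQL_LIKE = re.compile(r"\b(select|union|where|join|drop|insert)\b", re.IGNORECASE)


def sql_like_fields(payload, prefix=""):
    """Return the JSON paths of string values that contain SQL-like keywords."""
    hits = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            hits.extend(sql_like_fields(value, f"{prefix}/{key}"))
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            hits.extend(sql_like_fields(value, f"{prefix}/{index}"))
    elif isinstance(payload, str) and SQL_LIKE.search(payload):
        hits.append(prefix)
    return hits


# Hypothetical saved-view definition, similar in spirit to our reporting payloads.
sample = json.loads("""
{
  "name": "Line 3 scrap by shift",
  "definition": "select shift, sum(scrap) where line = 3",
  "columns": ["shift", "scrap"]
}
""")

print(sql_like_fields(sample))
```

Running something like this over the sample payloads from the application team would have pointed at the fields worth watching before any rule was allowed to block.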

When a WAF blocks a customer dashboard or a production reporting workflow, nobody in the business calls it a security control doing its job. They call it downtime, and they are right to do that. My opinion now is simple: managed rules are a starting position, not a launch plan.

Run Count Mode Before Block Mode

After that rollback, we rebuilt the rollout around count mode. Every managed rule group stayed attached, but actions were overridden to count while we collected logs and compared matched rules against real user behavior. I wanted three full weeks because our facility traffic had weekday production patterns, weekend maintenance windows, and month-end reporting spikes that did not show up in a two-day test.

The safe method was boring, and that was the point.

We treated the WAF like a change to routing, authentication, or DNS. We opened a change record, named owners for review, and agreed that no rule could move to block until we had evidence from real traffic. I also asked the application team to give us sample payloads for the endpoints most likely to trip SQLi, XSS, size restriction, and known bad input rules.

  • Attach managed rule groups with rule action overrides set to count.
  • Send full WAF logs to a dedicated S3 bucket with retention defined upfront.
  • Review top terminating and non-terminating rule matches by endpoint.
  • Separate obvious hostile traffic from legitimate application payloads.
  • Create narrow exclusions only where we had repeatable evidence.
  • Move rules to block in phases, not as one big switch.
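The first step in that list can be sketched as a small helper that builds one web ACL rule entry in the shape the WAFv2 API expects, for example in the Rules list passed to boto3's update_web_acl. This is an illustration under assumptions, not our exact configuration: the function name is hypothetical, and the individual rule names passed in should be confirmed with describe-managed-rule-group for the version in use.

```python
def count_mode_rule(group_name, priority, rule_names):
    """Build a web ACL rule that keeps a managed group attached but only counts matches."""
    return {
        "Name": group_name,
        "Priority": priority,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": group_name,
                # Override each listed rule's action to Count instead of its default.
                "RuleActionOverrides": [
                    {"Name": name, "ActionToUse": {"Count": {}}}
                    for name in rule_names
                ],
            }
        },
        # Leave the group-level override untouched; per-rule overrides do the work.
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": group_name,
        },
    }


# Rule names below are assumptions; verify them against the managed group itself.
rule = count_mode_rule(
    "AWSManagedRulesSQLiRuleSet", 10, ["SQLi_BODY", "SQLi_QUERYARGUMENTS"]
)
```

Per-rule overrides make the later phases easier, because moving one rule to block means deleting one entry from the list rather than restructuring the group.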

After running all rules in count mode for 3 weeks and adding 11 targeted exclusions, the WAF went live in block mode with zero false-positive incidents in the following 60 days. That metric changed the conversation with engineering because we were no longer asking them to trust a control; we were showing them production evidence. I think count mode is the difference between security engineering and security gambling.

Write Exclusions That Preserve the Rule

The wrong fix would have been disabling the SQLi managed rule group. Under pressure, that was exactly what people wanted, and I understood why. If one rule breaks a line-of-business API, the fastest path back to green is to turn the whole thing off. My job was to make that option unnecessary.

Small exclusions beat big exceptions.

We created exclusions around specific request components, paths, and labels. For example, one endpoint accepted JSON report definitions with strings like select, where, and join because operators built saved views in the manufacturing dashboard. That did not mean all SQLi checks should disappear from the site. It meant that one JSON field on one API path needed different treatment.

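# Pull sampled requests that matched the SQLi rule group so we can inspect real payloads.
# --rule-metric-name must match the MetricName in the rule's visibility config, and the
# time window has to fall within the previous three hours.
aws wafv2 get-sampled-requests \
  --web-acl-arn arn:aws:wafv2:us-east-1:123456789012:regional/webacl/prod-api/abcd1234 \
  --rule-metric-name AWSManagedRulesSQLiRuleSet \
  --scope REGIONAL \
  --time-window StartTime=2026-04-01T00:00:00Z,EndTime=2026-04-01T01:00:00Z \
  --max-items 100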

We also avoided exclusions based only on source IP. In a plant environment, IP-based exceptions are tempting because internal systems have stable addresses, but they become undocumented trust shortcuts. I preferred matching on URI path, JSON body field, header context, and AWS WAF labels because those conditions explained why the traffic was legitimate.
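One way to express a path-and-label exclusion is to leave the noisy managed rule in count so it still labels requests, then add a custom rule later in the evaluation order that blocks on that label everywhere except the exempt path. The sketch below builds such a rule in the WAFv2 API shape. The function name, rule name, and path are hypothetical, and the label key is an assumption that should be checked against the labels that actually appear in the logs.

```python
def block_label_outside_path(label_key, exempt_prefix, priority):
    """Block requests carrying a managed-rule label unless the URI path is exempt."""
    has_label = {"LabelMatchStatement": {"Scope": "LABEL", "Key": label_key}}
    on_exempt_path = {
        "ByteMatchStatement": {
            "SearchString": exempt_prefix.encode("utf-8"),
            "FieldToMatch": {"UriPath": {}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "STARTS_WITH",
        }
    }
    return {
        "Name": "block-sqli-body-outside-exempt-path",  # hypothetical rule name
        # Must be evaluated after the managed group so the label already exists.
        "Priority": priority,
        "Statement": {
            "AndStatement": {
                "Statements": [
                    has_label,
                    {"NotStatement": {"Statement": on_exempt_path}},
                ]
            }
        },
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "block-sqli-body-outside-exempt-path",
        },
    }


exclusion = block_label_outside_path(
    "awswaf:managed:aws:sql-database:SQLi_Body",  # assumed label key; verify in logs
    "/api/reports/definitions",                   # hypothetical endpoint
    priority=20,
)
```

The SQLi check still blocks on every other path, and the rule itself documents why the one exempt endpoint is treated differently.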

What I didn’t expect was how many false positives came from internal automation rather than external users. Our nightly jobs sent dense JSON bodies that no human would create through the UI, and those bodies were exactly where generic detection became noisy. My opinion is that every WAF tuning review needs machine-to-machine traffic in scope from day one.

Build a Logging Workflow We Can Actually Use

At first, our logs were technically available but practically useless. We had records in S3, but analysts had to dig through raw JSON while engineers were asking which endpoint broke and why. That gap matters during an incident because slow answers create pressure to disable protection.

Visibility has to be fast.

We moved WAF logs into a predictable S3 prefix, queried them with Athena, and built saved queries for rule ID, URI path, action, labels, client IP, and sampled body indicators. We did not store sensitive body data longer than needed, and we worked with application owners to understand which fields were safe to inspect. The goal was not unlimited packet archaeology. The goal was quick rule decisions.

The most useful query grouped matches by rule, path, and action over a rolling window. That let us see whether a rule fired once against obvious scanning or 900 times against the same legitimate endpoint. We also tracked count-to-block candidates in a shared review sheet so the tuning decision had a visible owner, date, and reason.
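We ran that grouping in Athena, but the same idea can be sketched in Python against newline-delimited WAF log records. The field names used here (terminatingRuleId, nonTerminatingMatchingRules, ruleGroupList, httpRequest.uri) follow the AWS WAF log format as I understand it, so check them against your own records; the function names are hypothetical.

```python
import json
from collections import Counter


def rule_matches(record):
    """Yield (rule_id, action) for every rule that matched in one WAF log record."""
    terminating = record.get("terminatingRuleId")
    if terminating not in (None, "Default_Action"):
        yield terminating, record.get("action")
    # Count-mode matches at the web ACL level.
    for rule in record.get("nonTerminatingMatchingRules") or []:
        yield rule.get("ruleId"), rule.get("action")
    # Count-mode matches on individual rules inside managed rule groups.
    for group in record.get("ruleGroupList") or []:
        for rule in group.get("nonTerminatingMatchingRules") or []:
            yield rule.get("ruleId"), rule.get("action")


def summarize(lines):
    """Count matches per (rule, path, action) across an iterable of JSON log lines."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        path = (record.get("httpRequest") or {}).get("uri")
        for rule_id, action in rule_matches(record):
            counts[(rule_id, path, action)] += 1
    return counts
```

Calling summarize(open_log_file).most_common(20) on a day of logs gives the same "once against a scanner or 900 times against one endpoint" view, which is usually enough to decide whether a rule needs an exclusion or is ready for block.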

I care less about perfect dashboards than repeatable questions. Which rule matched? Which endpoint was hit? Was the request authenticated? Did the same payload work before the WAF? Did the application team confirm it was expected behavior? If the workflow cannot answer those quickly, the WAF will lose the first political fight it enters.

Keep Protection On When Pressure Builds

The hardest part of WAF tuning is not syntax. It is keeping enough confidence in the control that nobody reaches for the off switch when a high-priority workflow fails. We learned that confidence comes from staged enforcement, clear rollback criteria, and exclusions that are narrow enough to defend six months later.

Pressure exposes weak tuning.

When we relaunched, we did not move every rule into block at once. We started with rules that had clean count-mode histories, then handled noisier detections after exclusions were reviewed. We kept an emergency procedure, but it disabled individual rules or rule actions before it disabled the whole web ACL. That distinction mattered because it preserved most of the protection even during troubleshooting.
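The phasing decision can be reduced to a simple, reviewable check. The sketch below is a hypothetical model of it, not a tool we shipped: the RuleHistory fields stand in for columns in our review sheet, and the 21-day default mirrors the three weeks of count mode described earlier.

```python
from dataclasses import dataclass


@dataclass
class RuleHistory:
    rule_id: str
    days_observed: int              # days the rule has run in count mode
    unreviewed_matches: int         # count-mode matches nobody has classified yet
    confirmed_false_positives: int  # legitimate traffic matched with no exclusion in place


def promotion_phases(histories, min_days=21):
    """Split rules into those ready for block mode and those held back for review."""
    ready, hold = [], []
    for history in histories:
        clean = (
            history.days_observed >= min_days
            and history.unreviewed_matches == 0
            and history.confirmed_false_positives == 0
        )
        (ready if clean else hold).append(history.rule_id)
    return ready, hold


# Hypothetical review-sheet rows.
ready, hold = promotion_phases([
    RuleHistory("SQLi_QUERYARGUMENTS", 21, 0, 0),
    RuleHistory("SQLi_BODY", 21, 0, 4),
    RuleHistory("SizeRestrictions_BODY", 9, 12, 0),
])
```

Anything in the hold list goes back through exclusion review rather than being dropped, which is what keeps the emergency path from becoming "disable the web ACL".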

I now treat cloud WAF deployment like firewall policy work at the plant edge. I would never push a broad FortiOS 7.4.3 policy change on a Friday without traffic review, logging, and a rollback plan, so I should not treat AWS WAF differently just because it has managed rule groups and a friendly console. Managed does not mean context-aware.

My final opinion is blunt: a WAF that gets disabled during the first outage was never really deployed. It was only tested on production users. Count mode gives us the evidence to keep protection turned on when the business is watching.

A WAF false positive is not a minor alert when it blocks real work; it is an outage with a security label.
