Syslog at Scale: Aggregating 10 Million Events per Day Without Data Loss

When the Audit Found the Gaps

Our rsyslog-based pipeline was dropping 2-3% of events during business hours peaks, and we discovered it during a compliance audit when log gaps were found between firewall denies, endpoint alerts, and badge-system events. I still remember the first review call because my team had dashboards showing green ingestion, green Elasticsearch nodes, and green disk utilization, yet the auditor had a sampled sequence from FortiGate firewalls running FortiOS 7.4.3 that proved events had vanished.

My first assumption was wrong. I blamed the devices. I thought our firewalls, Windows collectors, and Ubuntu 22.04 relay hosts were sending uneven timestamps or duplicate sequence numbers, but the problem was inside our logging path. rsyslog was writing directly to Elasticsearch with no buffering, and Elasticsearch indexing spikes caused TCP timeouts and dropped log events.

That hurt.

At manufacturing scale, logs arrive in ugly waves. Shift changes create authentication bursts, robotic cell controllers chatter after maintenance windows, badge readers spike at lunch, and vulnerability scanners add their own noise. Our normal day was about 10 million events, but business-hour peaks compressed traffic into narrow windows where a pipeline that looked fine on an hourly graph failed minute by minute.

Design the Pipeline Around Back Pressure

We rebuilt the architecture around the idea that every downstream system eventually says “wait,” and the upstream side must have somewhere honest to put data. Our new path became device syslog to regional Fluentd aggregators, Fluentd to Kafka, Kafka to Elasticsearch consumers, and Elasticsearch to retention tiers. The key change was not cosmetic. Kafka became the place where pressure could accumulate without pretending everything was still real time.

I kept rsyslog in a few edge spots because it is reliable, small, and familiar on Linux relays, but I stopped letting it carry the whole compliance story. Fluentd gave my team better parsing control, cleaner routing, and plugin behavior we could test under Python 3.11 helper scripts that generated predictable load. Kafka gave us replay, lag visibility, and a hard separation between collection and indexing.

Buffers buy time.

The before-and-after metric convinced the skeptics faster than any architecture diagram. Adding Kafka as a buffer layer between Fluentd aggregators and Elasticsearch reduced event loss from 2.3% to 0.03% under the same load. I still do not like 0.03% for audit logging, but it moved the failure mode from silent loss to measurable lag, and that is a much better problem to fight.

Put Kafka Where Spikes Used to Break Us

Kafka did not make the pipeline magically reliable. It made unreliability visible and recoverable. We used three brokers on Ubuntu 22.04, replication factor 3 for compliance topics, producer acknowledgments set to all, and Fluentd persistent buffers enabled on local NVMe. That combination gave us two layers of protection: short local survival during network jitter and durable broker storage during Elasticsearch slowdowns.

We split topics by source class instead of dumping every log into one stream. Firewall traffic, identity logs, endpoint security, application events, and plant-floor telemetry each had separate topics with different retention and consumer scaling. That mattered because noisy non-compliance telemetry should never starve FortiOS 7.4.3 security events or domain controller audit logs.

Firewall and VPN events landed in a high-priority Kafka topic with seven-day retention.
Identity events used stricter parsing rules and dead-letter routing for malformed records.
Endpoint telemetry used wider partitions because volume changed sharply during update windows.
Plant-floor controller logs were normalized slowly to avoid blocking security events.
Dead-letter queues were reviewed daily instead of being treated as a trash bin.

What I didn’t expect was how much calmer the Elasticsearch cluster became. We had tuned heap, shards, refresh intervals, and index templates for months, but those changes only helped after Kafka stopped turning every collection surge into an indexing emergency. My opinion after that rebuild is simple: Elasticsearch should search logs, not absorb every bad timing decision in the pipeline.

Choose Aggregators for Operations, Not Fashion

I tested rsyslog, Fluentd, and Logstash under the same replay set because I wanted fewer opinions and more packet captures. rsyslog was excellent at fast forwarding and simple filtering, but our parsing logic had become too complex for the team to maintain cleanly. Logstash had strong filters, especially for teams already deep into Elastic patterns, but it consumed more memory than I wanted on regional collectors. Fluentd sat in the middle with enough structure, acceptable resource use, and plugins that matched our routing plan.

Our Fluentd nodes ran with file buffers, explicit chunk limits, and backoff settings that we could explain during an audit. I cared less about perfect elegance and more about whether a night-shift engineer could understand where an event sat at 2:00 a.m. during a WAN issue. Operational clarity beat theoretical throughput.

You may also find this useful: Check out our guide on Python Network Config Backup: Automating Multi-Vendor Device Snapshots for more practical tips.

Simple wins at 2:00 a.m.

The basic health check looked boring, which is usually a good sign. We compared sent counts from device classes, Fluentd input counts, Kafka topic offsets, consumer commits, and Elasticsearch document counts over the same five-minute window. I used Python 3.11 scripts for reconciliation because I wanted repeatable checks outside vendor dashboards, and I wanted CSV output the compliance team could archive.

#!/usr/bin/env bash
set -euo pipefail

TOPIC="firewall-fortios"
WINDOW_MINUTES=5

kafka-consumer-groups.sh \
  --bootstrap-server kafka01:9092 \
  --describe \
  --group elasticsearch-firewall-consumers

curl -s "http://fluentd01:24220/api/plugins.json" | jq '.plugins[] | {id, retry_count, buffer_queue_length}'

python3.11 reconcile_counts.py \
  --topic "$TOPIC" \
  --window-minutes "$WINDOW_MINUTES" \
  --elasticsearch-index "firewall-*"

I still like rsyslog for what it is good at, and I still like Logstash when Elastic-heavy enrichment is the center of the job. For our environment, Fluentd plus Kafka gave my team the best balance of speed, inspectability, and failure recovery, and I would choose that again.

Keep Measuring Loss, Lag, and Replay

Monitoring the pipeline meant admitting that “logs are flowing” was not a control. We built alerts around Kafka consumer lag, Fluentd retry counts, buffer queue length, broker disk growth, Elasticsearch indexing latency, and dead-letter volume. I also wanted source-side counters where possible, especially on firewalls and authentication systems, because pipeline metrics without source truth can still lie.

We added synthetic audit markers every minute from controlled Ubuntu 22.04 relays. Each marker carried a source ID, sequence number, and UTC timestamp, then flowed through the same Fluentd, Kafka, and Elasticsearch path as production events. Missing markers became a page. Late markers became a warning. Duplicate markers became a parsing investigation.

Trust counts.

The hardest argument was about acceptable loss. Some people heard 0.03% and treated it as close enough because the graph looked clean. I disagreed then, and I disagree now. For compliance logging, event loss is not just a reliability metric; it is an evidence problem, and the pipeline has to prove when it is healthy and when it is behind.

My team eventually treated Kafka lag like a queue at a shipping dock. A queue is fine when trucks are arriving faster than receiving can unload, but only if the queue is bounded, visible, and draining. Silent drops are different. Silent drops are a broken chain of custody wearing a green dashboard.

A logging pipeline is only trustworthy when it can slow down without lying.

Syslog at Scale: Aggregating 10 Million Events per Day Without Losing Data

When the Audit Found the Gaps

Design the Pipeline Around Back Pressure

Put Kafka Where Spikes Used to Break Us

Choose Aggregators for Operations, Not Fashion

Keep Measuring Loss, Lag, and Replay

External References