When Confluence Failed During Failover
During a P1 incident, one of our engineers could not find the firewall failover runbook in Confluence. The document existed, but our team had named it “HA pair recovery procedure” two years earlier during a FortiOS 7.4.3 upgrade. Our search phrase was “firewall failover runbook,” and Confluence gave us stale meeting notes, a vendor PDF, and one page about switch redundancy.
I was watching our OT network dashboard while a packaging line was losing telemetry from a PLC cell. We had redundant FortiGate firewalls, redundant WAN circuits, and a perfectly usable runbook. We still burned minutes because our documentation system expected calm, exact keywords from people working under incident pressure.
I made the wrong first assumption: I blamed bad documentation hygiene. After the post-incident review, I saw the real failure. Keyword search fails when incident adrenaline changes how people phrase queries; semantic search via RAG matches intent, not exact keywords. That distinction mattered more than any folder cleanup campaign I had ever pushed.
Search was the bottleneck.
Our baseline was ugly but measurable. Across 27 incident reviews, our engineers averaged 8 minutes finding the right operational document in Confluence, and our query success rate was 64%. After deploying a RAG system over 340 operational runbooks, incident query success rate improved from 64% to 91%. I care more about that number than about any polished chatbot demo.
Build RAG Around Operations, Not Demos
Our architecture stayed deliberately boring. We exported Confluence pages through the API, normalized the HTML, split documents into operational chunks, embedded those chunks, stored them in a vector database, and sent the retrieved context to Claude through the Claude API, using the model claude-3-5-sonnet-20241022. The ingestion service ran on Ubuntu 22.04 with Python 3.11, scheduled from our existing automation host.
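To make the "normalized the HTML" step concrete, here is a minimal sketch using only the Python standard library. It assumes the page body has already been fetched as an HTML string; the class name, the function name, and the set of block-level tags are illustrative choices, not the production ingestion code.

```python
from html.parser import HTMLParser


class RunbookTextExtractor(HTMLParser):
    """Collect visible text, putting block-level elements on their own lines."""

    # Assumed set of block-level tags; adjust for the markup you actually export.
    BLOCK_TAGS = {"p", "div", "li", "pre", "tr", "br", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def normalize_html(html: str) -> str:
    """Return plain text with one non-empty line per block element."""
    parser = RunbookTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace (including non-breaking spaces) inside each line.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


sample = "<h2>Force secondary active</h2><p>Confirm&nbsp;HA state <b>first</b>.</p>"
print(normalize_html(sample))
```

Keeping headings on their own lines matters later, because the chunker can only respect section boundaries that survive normalization.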
I did not want the system answering from general internet knowledge during an incident. I wanted it grounded in our runbooks, our device names, our escalation paths, and our maintenance windows. When I asked, “How do we force firewall B active if A is flapping?” the system needed to pull our HA recovery procedure, not a generic Fortinet blog post.
Our request flow looked like this:
```python
from rag_ops import retrieve_chunks, ask_llm

query = "FortiGate HA failover is stuck, how do we force secondary active?"

chunks = retrieve_chunks(
    query=query,
    collection="ops_runbooks_v3",
    top_k=6,
    filters={"environment": "manufacturing", "approved": True},
)

answer = ask_llm(
    model="claude-3-5-sonnet-20241022",
    system="Answer only from retrieved operational documentation.",
    context=chunks,
    question=query,
)

print(answer.with_citations())
```
That small filter on approved documentation saved us from a real mess. We had draft pages, vendor staging notes, and old network diagrams still living in Confluence. A RAG system does not magically know which page our team trusts; our metadata had to say so. My opinion is simple: operational RAG without document governance is just faster confusion.
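As a rough illustration of what that filter has to do, the sketch below checks chunk metadata against required values and fails closed when a field is missing. The function name and the sample chunk records are hypothetical; in practice the vector store would usually apply the filter inside the query itself.

```python
def passes_filters(metadata: dict, required: dict) -> bool:
    """True only if every required key is present with the required value."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in required.items()
    )


# Hypothetical chunk records for illustration only.
chunks = [
    {"id": "ha-recovery-03", "metadata": {"environment": "manufacturing", "approved": True}},
    {"id": "vendor-staging-01", "metadata": {"environment": "manufacturing", "approved": False}},
    {"id": "old-diagram-07", "metadata": {"environment": "manufacturing"}},  # never reviewed
]

required = {"environment": "manufacturing", "approved": True}
trusted = [c["id"] for c in chunks if passes_filters(c["metadata"], required)]
print(trusted)
```

The page with no `approved` field is excluded along with the explicitly unapproved one, which is the behavior you want when governance metadata is incomplete.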
Chunk Runbooks Like Engineers Use Them
Chunking decided whether retrieval felt sharp or vague. At first, I split every page into fixed 800-token windows with 100-token overlap because that was the default in a library example. That worked for policy documents and failed on runbooks. A firewall failover page has prerequisites, commands, expected outputs, rollback steps, and escalation contacts; splitting blindly can separate the command from the warning that makes it safe.
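The failure is easy to reproduce at small scale. This sketch uses whitespace-separated words as a stand-in for tokens and a tiny window instead of 800/100, both of which are simplifications; the runbook sentence is made up for the demonstration.

```python
def fixed_windows(tokens, size, overlap):
    """Split tokens into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not tokens:
        return []
    step = size - overlap
    # Stop once the remaining tokens are already covered by the previous window.
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]


# Made-up runbook text: a warning followed by the action it protects.
text = ("WARNING confirm config sync before you proceed . "
        "Then run the approved failover command on firewall B")
windows = fixed_windows(text.split(), size=8, overlap=2)

for window in windows:
    print(" ".join(window))
```

The window that contains "run the approved failover command" no longer contains the warning, so a retriever can return the action without the condition that makes it safe.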
We changed our chunking around operational intent. Each chunk carried the parent page title, section heading, device family, owner team, last reviewed date, and severity scope. We kept command blocks attached to the paragraph that explained when to run them. We also embedded page titles alongside chunk text because engineers often remember symptoms, not document names.
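One way to represent such a chunk is sketched below. The field names mirror the metadata listed above, but the class itself, the sample values, and the exact layout of the embedded text are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookChunk:
    page_title: str
    section_heading: str
    device_family: str
    owner_team: str
    last_reviewed: str  # ISO date string, e.g. "2024-01-15"
    severity_scope: str
    body: str

    def embedding_text(self) -> str:
        """Text sent to the embedding model: title and heading, then the body."""
        return f"{self.page_title}\n{self.section_heading}\n\n{self.body}"


# Hypothetical values; only the title and heading come from the runbook discussed here.
chunk = RunbookChunk(
    page_title="HA pair recovery procedure",
    section_heading="Force secondary active",
    device_family="FortiGate",
    owner_team="network-ops",
    last_reviewed="2024-01-15",
    severity_scope="P1",
    body="Run this step only after confirming the primary is flapping.",
)
print(chunk.embedding_text())
```

Because the title and heading are part of the embedded text, a query phrased around the symptom can still land on a page whose name nobody remembers.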
Context beats cleverness.
Our best-performing chunks followed a few rules:
- Keep one operational action and its validation steps together.
- Attach rollback notes to the same chunk as the risky change.
- Preserve exact device names, interface labels, VLAN IDs, and FortiOS 7.4.3 command syntax.
- Store source links, page owners, and review dates as metadata.
- Exclude abandoned drafts unless a human explicitly approves them.
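The first two rules can be approximated by splitting on section headings instead of token counts, so an action, its validation, and its rollback note stay in one chunk. This is a minimal sketch that assumes the normalized text marks headings with a `## ` prefix; that marker, the function name, and the sample runbook text are all hypothetical.

```python
def split_by_heading(text, marker="## "):
    """Return (heading, body) pairs, one per headed section."""
    sections = []
    heading, body = "", []

    def flush():
        content = "\n".join(body).strip()
        if heading or content:
            sections.append((heading, content))

    for line in text.splitlines():
        if line.startswith(marker):
            flush()  # close the previous section before starting a new one
            heading, body = line[len(marker):].strip(), []
        else:
            body.append(line)
    flush()
    return sections


# Made-up runbook text for the demonstration.
doc = """## Force secondary active
Run the approved failover command.
Validate: the secondary reports as primary.
Rollback: re-enable the original unit.
## Escalate
Page the network on-call."""

for heading, body in split_by_heading(doc):
    print(f"[{heading}]\n{body}\n")
```

Sections that grow too long for one embedding would still need a secondary split, which this sketch does not attempt.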
What I did not expect was how much the section headings mattered. “Force secondary active” retrieved better than “Step 4” even when the body text was identical. We started rewriting headings during runbook reviews, not for style, but for retrieval accuracy. I now treat headings as operational metadata, and I think most teams underinvest there.
Choose the Vector Store for Failure Modes
For our scale, vector database selection was less about raw benchmark speed and more about backup, access control, operational familiarity, and boring recovery behavior at 2 a.m. We tested pgvector 0.5.1 on PostgreSQL 15.4, OpenSearch 2.11, and Qdrant 1.7.4. All three could handle 340 runbooks without breathing hard.
We chose pgvector because our team already operated PostgreSQL, our backup process already passed audits, and our access model was easy to explain. OpenSearch had stronger hybrid search knobs, and Qdrant felt clean, but the best database for our environment was the one my team could restore confidently during a bad shift.
Latency was fine. Trust was harder.
We stored embeddings, chunk text, metadata, checksum, source URL, and ingestion timestamp. Every answer returned citations back to Confluence. We also kept a retrieval-only mode so engineers could see matching documents without generated prose. That mattered for skeptical senior admins who wanted evidence before they trusted a generated answer during a firewall event.
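A stored record along those lines might look like the sketch below. The dictionary layout, the function names, and the example URL are assumptions; the checksum is a SHA-256 of the chunk text, which is one reasonable way to detect whether a chunk changed between syncs.

```python
import hashlib
from datetime import datetime, timezone


def build_record(chunk_text, embedding, metadata, source_url):
    """Assemble one index row with a content checksum and ingestion time."""
    return {
        "chunk_text": chunk_text,
        "embedding": embedding,
        "metadata": metadata,
        "checksum": hashlib.sha256(chunk_text.encode("utf-8")).hexdigest(),
        "source_url": source_url,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def needs_reembedding(stored_record, new_text):
    """Re-embed only when the chunk text changed since the last sync."""
    new_checksum = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    return stored_record["checksum"] != new_checksum


record = build_record(
    chunk_text="Force secondary active: confirm HA state first.",
    embedding=[0.12, -0.03, 0.88],  # placeholder vector
    metadata={"approved": True},
    source_url="https://confluence.example.invalid/ha-pair-recovery",  # hypothetical
)
print(record["checksum"][:12], record["ingested_at"])
```

Comparing checksums before calling the embedding model keeps routine syncs cheap and makes unexpected content changes visible.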
I also added deletion handling early. If a runbook was retired in Confluence, the next sync marked its chunks inactive instead of leaving orphaned operational advice in the index. RAG systems rot quietly when source lifecycle is ignored. My opinion: stale retrieval is more dangerous than slow retrieval.
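The sync step can be sketched as a soft delete. The row shape, the `page_id` and `active` fields, and the function name are hypothetical; the point is only that chunks whose source page no longer exists are flagged, not left in place and not silently dropped.

```python
def mark_retired_chunks(index_rows, live_page_ids):
    """Deactivate chunks whose source page is gone; return the affected chunk ids."""
    deactivated = []
    for row in index_rows:
        if row["active"] and row["page_id"] not in live_page_ids:
            row["active"] = False  # soft delete keeps history for audits
            deactivated.append(row["chunk_id"])
    return deactivated


# Hypothetical index state and the set of page ids returned by the latest export.
index_rows = [
    {"chunk_id": "c1", "page_id": "p-100", "active": True},
    {"chunk_id": "c2", "page_id": "p-200", "active": True},
    {"chunk_id": "c3", "page_id": "p-200", "active": True},
]
live_page_ids = {"p-100"}

print(mark_retired_chunks(index_rows, live_page_ids))
```

Retrieval then has to filter on the active flag, otherwise the soft delete changes nothing for the engineer asking the question.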
Measure What Engineers Actually Do
I did not measure token counts first. I measured whether engineers found the right document. Our evaluation set came from real incident queries, post-maintenance questions, and phrases pulled from chat logs after removing sensitive details. We labeled the expected runbook for each query, then tracked whether the correct document appeared in the top three retrieved results.
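The top-three check is simple enough to sketch in a few lines. The function name, the shape of the evaluation set, and the stand-in retriever are assumptions; a real run would call the actual retrieval layer and use the labeled incident queries.

```python
def top_k_hit_rate(eval_set, retrieve, k=3):
    """Fraction of queries whose expected document appears in the top k results."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    hits = sum(1 for query, expected in eval_set if expected in retrieve(query)[:k])
    return hits / len(eval_set)


# Stand-in retriever returning ranked document ids; illustrative data only.
fake_results = {
    "firewall failover runbook": ["meeting-notes", "ha-pair-recovery", "switch-redundancy"],
    "plc cell lost telemetry": ["vendor-pdf", "meeting-notes", "wan-circuits", "plc-telemetry"],
}


def fake_retrieve(query):
    return fake_results.get(query, [])


eval_set = [
    ("firewall failover runbook", "ha-pair-recovery"),
    ("plc cell lost telemetry", "plc-telemetry"),
]
print(top_k_hit_rate(eval_set, fake_retrieve, k=3))
```

The second query counts as a miss because the right document sits in fourth place, which is the distinction this metric is meant to expose.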
Before deployment, Confluence search found the right document for 64% of test queries. Our first RAG prototype reached 82%. After metadata filters, heading cleanup, and chunk boundary fixes, we reached 91%. The remaining failures were mostly ambiguous acronyms and runbooks that were genuinely missing.
Adoption needed less ceremony than I expected. We put the interface where engineers already worked: the incident channel and our internal operations portal. The answer format stayed strict: short answer, cited source, confidence note, and escalation owner. If the system lacked enough context, it said so and gave the closest documents instead of pretending.
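A strict answer format with an explicit fallback could be modeled as below. The dataclass, the score threshold of 0.75, the hit dictionaries, and the URLs are all hypothetical; the sketch only shows the shape of "answer with citations, or admit the gap and list the closest documents".

```python
from dataclasses import dataclass


@dataclass
class IncidentAnswer:
    short_answer: str
    citations: list
    confidence_note: str
    escalation_owner: str


def build_answer(draft, hits, escalation_owner, min_score=0.75):
    """Return a grounded answer, or a fallback that lists the closest documents."""
    strong = [h for h in hits if h["score"] >= min_score]
    if not strong:
        ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
        return IncidentAnswer(
            short_answer="Not enough approved context to answer this question.",
            citations=[h["url"] for h in ranked[:3]],
            confidence_note="low: closest documents listed, no answer generated",
            escalation_owner=escalation_owner,
        )
    return IncidentAnswer(
        short_answer=draft,
        citations=[h["url"] for h in strong],
        confidence_note=f"grounded in {len(strong)} approved chunk(s)",
        escalation_owner=escalation_owner,
    )


# Hypothetical retrieval hits with similarity scores.
weak_hits = [
    {"url": "https://confluence.example.invalid/switch-redundancy", "score": 0.41},
    {"url": "https://confluence.example.invalid/ha-pair-recovery", "score": 0.58},
]
print(build_answer("Force firewall B active.", weak_hits, "network-oncall"))
```

The threshold would need tuning against real queries; a fixed number is used here only to keep the example short.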
No fake certainty.
We reviewed failed queries every Friday for a month. Some fixes belonged in code, but many belonged in documentation. Missing synonyms, lazy headings, old ownership fields, and duplicate runbooks all surfaced quickly. RAG did not replace documentation discipline; it exposed where our discipline had slipped. I like tools that make weak process visible.
Keep the System Accountable During Incidents
My final design rule is that RAG should support incident command, not become incident command. I do not want a model deciding whether we fail over a firewall, restart an MES connector, or isolate a switch stack. I want it to find the approved procedure, explain the relevant step, and show exactly where that guidance came from.
We added audit logs for every query, retrieved chunk, generated answer, and clicked citation. That gave us a feedback loop without turning engineers into data-entry clerks. During reviews, we could see whether the system helped, missed, or pointed at a weak runbook. The audit trail also made our security manager more comfortable because we could prove the system was grounded in internal approved documents.
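An audit entry of that kind can be a single JSON line per query. The field names and the function below are assumptions, not the schema used in production; the sketch just shows that query, retrieved chunks, answer, and clicked citations fit in one append-only record.

```python
import json
from datetime import datetime, timezone


def audit_record(query, chunk_ids, answer, clicked_citations):
    """Serialize one query's trail as a JSON line for an append-only log."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved_chunk_ids": list(chunk_ids),
        "answer": answer,
        "clicked_citations": list(clicked_citations),
    }
    return json.dumps(entry, sort_keys=True)


# Illustrative values only.
line = audit_record(
    query="How do we force firewall B active if A is flapping?",
    chunk_ids=["ha-recovery-03", "ha-recovery-04"],
    answer="Follow the 'Force secondary active' section.",
    clicked_citations=["ha-recovery-03"],
)
print(line)
```

An empty `clicked_citations` list on an answered query is itself a useful signal during reviews, since it suggests the answer was trusted without checking or ignored entirely.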
I still keep the system scoped. It does not ingest vendor forums, Slack rumors, or unresolved design notes. It does not answer without citations. It does not hide low confidence behind fluent language. Those constraints make the tool less flashy, and I prefer it that way.
The win was not artificial intelligence replacing experienced operators. The win was experienced operators getting to the right procedure faster while the plant floor was waiting. In our environment, that is the kind of automation I trust.

