We spent three months building what we thought would be a simple automation pipeline: ingest runbooks, make them searchable, let Claude answer ops questions from them. By week four, we had four broken prototypes, two abandoned vector databases, and a Slack thread so long that no one read it anymore. What finally worked looked almost nothing like our original design.
Why the First Architecture Failed
Our first attempt was the obvious one: dump everything into a vector database, embed queries, retrieve top-k chunks, feed them to Claude. It worked in demos. It failed in production because our runbooks were written by six different engineers over four years, each with a different formatting style, level of detail, and assumption about who the reader was.
Retrieval quality was the bottleneck, not generation. The model was good — the context it received was garbage. We were retrieving chunks that were semantically similar to the query but operationally useless. A query like “how do we handle FortiGate policy sync failures” would surface three different documents that each assumed the reader already knew the other two.
What I didn’t expect was that adding more documents made things worse, not better. Each new runbook diluted retrieval precision. By month two, we had 340 documents and a retrieval hit rate that had dropped from 71% to 44%.
The RAG Architecture That Actually Held Up
We rebuilt around a two-stage retrieval model. The indexing pipeline and the query pipeline are now completely separate services, which sounds obvious in retrospect but was not how we started.
Indexing pipeline (runs on commit to the runbook repo):
- Document ingestion with metadata extraction (author, last-updated, affected-systems tags)
- Semantic chunking by section heading, not by fixed token count
- Dual embedding: dense vectors via text-embedding-3-large + sparse BM25 index
- Chunk-level confidence score assigned at index time based on document freshness and cross-references
- Storage in Qdrant with payload filters for system tags
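As a rough illustration of the "chunk by section heading" step, here is a minimal sketch. It assumes the runbooks are Markdown with `#`-style headings, which the steps above do not state; `Chunk` and `chunk_by_heading` are hypothetical names, not part of the pipeline described here.

```python
import re
from dataclasses import dataclass

# Assumption: sections start with Markdown '#'-style headings.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading: str
    text: str


def chunk_by_heading(document: str) -> list[Chunk]:
    """Split a document into one chunk per heading-delimited section."""
    chunks: list[Chunk] = []
    heading = "(preamble)"  # label for any text before the first heading
    lines: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping empty sections.
        body = "\n".join(lines).strip()
        if body:
            chunks.append(Chunk(heading=heading, text=body))

    for line in document.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading = match.group(2)
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks
```

Splitting on headings keeps each procedure intact, whereas a fixed token window can cut a step list in half. A real indexer would likely also cap very long sections.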
Query pipeline (handles live ops questions):
- Query rewriting step: Claude rewrites the raw question into two variants before retrieval
- Hybrid retrieval: dense + sparse results merged via Reciprocal Rank Fusion
- Re-ranking with a cross-encoder before final context assembly
- Confidence threshold gate: if the top result scores below 0.65, the system says so rather than hallucinating
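The merge and gate steps above can be sketched in a few lines. This is an illustration, not the production code: the function names are hypothetical, `k=60` is the constant commonly used for Reciprocal Rank Fusion, and the gate assumes the 0.65 threshold is applied to the cross-encoder's score for the top result (RRF scores themselves are far smaller than that).

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def apply_confidence_gate(
    reranked: list[tuple[str, float]], threshold: float = 0.65
) -> Optional[list[tuple[str, float]]]:
    """Return the results only if the top re-ranker score clears the threshold.

    Returning None signals "no confident answer" so the caller can say so
    instead of passing weak context to the model.
    """
    if not reranked or reranked[0][1] < threshold:
        return None
    return reranked


# Usage: fuse dense and sparse rankings before re-ranking.
dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF only looks at ranks, so the dense and sparse scores never need to be normalized against each other.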
After the rebuild, retrieval hit rate went from 44% to 79% on the same document set, above the 71% we started with. The re-ranking step alone accounted for about 18 percentage points of that gain.
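For readers who want to track a similar number, here is one common way to compute a hit rate. The definition is an assumption (a query counts as a hit when at least one known-relevant document appears in the top k results), and `hit_rate` is a hypothetical helper.

```python
def hit_rate(
    results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant document in the top k.

    results maps each query to its ranked document IDs; relevant maps each
    query to the set of IDs a human judged relevant.
    """
    if not results:
        return 0.0
    hits = sum(
        1
        for query, ranked in results.items()
        if relevant.get(query, set()) & set(ranked[:k])
    )
    return hits / len(results)
```

Measuring on a fixed set of labeled queries is what makes before-and-after comparisons on the same document set meaningful.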
Where Claude Code and MCP Changed the Pipeline
The RAG layer handles search. Claude Code with MCP handles everything else. We run three MCP servers locally, all connected via stdio, which keeps latency under 200 ms per tool call compared to the 2+ seconds we saw with HTTP-based tool endpoints.
The MCP servers we actually use in production:
- Runbook MCP: exposes our RAG query pipeline as a tool. Claude can call search_runbooks(query, system_tag) and get back ranked results with source citations.
- FortiGate MCP: wraps our FortiOS REST API. Claude can read policy tables, check HA status, and pull interface stats. It is read-only by design, with every call logged.
- Incident MCP: connects to our ticketing system. Claude can open tickets, add comments, and escalate, but cannot close or resolve without a human approval step.
The governance rule we enforced from day one: every MCP tool that writes or changes state requires an explicit confirmation before execution. Claude proposes the action, and a human approves it in Slack via a bot command. Read-only tools run autonomously.
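A minimal sketch of that rule, independent of any MCP SDK: `Tool`, `call_tool`, and the `approve` callback are all hypothetical names. In a setup like ours, the callback would be the piece that posts to Slack and waits for the bot command.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ApprovalDenied(Exception):
    """Raised when a human declines a state-changing tool call."""


@dataclass(frozen=True)
class Tool:
    name: str
    handler: Callable[..., Any]
    writes_state: bool  # True for anything that changes external state


# Hypothetical callback type: returns True once a human has approved the call.
Approver = Callable[[str, dict[str, Any]], bool]


def call_tool(tool: Tool, args: dict[str, Any], approve: Approver) -> Any:
    """Run read-only tools directly; gate state-changing tools on approval."""
    if tool.writes_state and not approve(tool.name, args):
        raise ApprovalDenied(f"{tool.name} was not approved")
    return tool.handler(**args)
```

Putting the check in one dispatch function, rather than inside each tool, means a newly added write tool is gated by default as long as it is flagged `writes_state=True`.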
Integrating Codex for Code Generation Tasks
Claude Code handles reasoning and tool orchestration well. Codex handles bulk code generation tasks better, especially when the task is parallelizable. We use both, but we are deliberate about which work goes where.
Our current split: Claude Code owns the agentic loop — it decides what to do, calls MCP tools, and assembles the final response. Codex runs as a subagent for tasks like generating Python scripts from a spec, writing test cases from a runbook, or producing Ansible playbooks from a network change request. Codex can spawn multiple specialized agents in parallel, which matters when we need to generate both the implementation and the tests at the same time.
The handoff works via a file-based IPC pattern. Claude Code writes a spec to a shared directory, triggers a Codex exec subprocess, polls for a completion marker, and reads the result back. It is not elegant. It is reliable, which matters more in a production ops context.
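import subprocess

# Claude Code triggers Codex for a parallel generation task.
# full_prompt and result_file are assumed to be defined earlier in the handoff code.
result = subprocess.run(
    ["codex", "exec", "--sandbox", "workspace-write", "-o", str(result_file), "-"],
    input=full_prompt,  # prompt is passed on stdin
    text=True,
    encoding="utf-8",
    timeout=300,  # seconds; raises subprocess.TimeoutExpired if exceeded
    capture_output=True,
)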
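The polling half of the handoff can be sketched as follows. The marker file name, timeout, and polling interval are assumptions chosen for illustration, and `wait_for_marker` is a hypothetical helper rather than our production code.

```python
import time
from pathlib import Path


def wait_for_marker(
    marker: Path, timeout_s: float = 300.0, interval_s: float = 1.0
) -> bool:
    """Poll until the completion marker exists or the timeout elapses.

    Returns True if the marker appeared, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(interval_s)
    # One last check in case the marker landed during the final sleep.
    return marker.exists()


# Usage (hypothetical layout of the shared directory):
# job_dir = Path("/shared/jobs/1234")
# if wait_for_marker(job_dir / "DONE"):
#     output = (job_dir / "result.md").read_text(encoding="utf-8")
```

Writing the marker only after the result file is fully written is what makes this pattern safe: the reader never sees a half-written result.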
The Observability Gap We Ignored Too Long
We shipped the pipeline without proper observability and paid for it. For the first six weeks in production, we had no visibility into which RAG retrievals were leading to good answers versus which were leading Claude to confidently say wrong things. We knew the output but not the path.
We added LangSmith tracing to the query pipeline and started tagging every retrieval with the confidence score, the query rewrite variants, and the final re-ranked order. Within two weeks of having that data, we found that 23% of our low-confidence retrievals were caused by a single category of runbooks: ones written before we standardized on a section heading format in 2024. We re-indexed those 47 documents and saw average retrieval confidence jump by 11 points.
The lesson: instrument your retrieval pipeline the same way you instrument application code. If you cannot see why a retrieval succeeded or failed, you cannot improve it.
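To make that concrete without tying it to a specific tracing product, here is a sketch of the kind of record worth emitting per retrieval and the aggregation that surfaces a problem category. It uses only the standard library; the field names, the `doc_category` tag, and the reuse of 0.65 as the low-confidence cutoff are assumptions.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class RetrievalTrace:
    query: str
    rewrites: list[str]      # query rewrite variants sent to retrieval
    reranked_ids: list[str]  # final order after re-ranking
    confidence: float        # score of the top result
    doc_category: str        # hypothetical tag, e.g. "legacy-format"


def to_log_line(trace: RetrievalTrace) -> str:
    """Serialize a trace as one JSON line for whatever log sink is in use."""
    return json.dumps(asdict(trace), sort_keys=True)


def low_confidence_share(
    traces: list[RetrievalTrace], threshold: float = 0.65
) -> dict[str, float]:
    """Share of low-confidence retrievals attributable to each category."""
    low = [t for t in traces if t.confidence < threshold]
    if not low:
        return {}
    counts = Counter(t.doc_category for t in low)
    return {category: n / len(low) for category, n in counts.items()}
```

A breakdown like this is what lets a single category of documents stand out instead of hiding inside an average.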
What We Would Build Differently
Start with the retrieval quality problem before building the agent layer. We wasted four weeks building agent orchestration around a broken retrieval foundation. The agent layer amplifies whatever the retrieval gives it — bad retrieval produces confidently wrong agents.
Enforce the read-write boundary in MCP from the start. Adding human approval steps after the fact is harder than building them in. Every write-capable tool should require explicit confirmation before you have a reason to regret it.
For teams starting this in 2026, the stack worth considering: Qdrant or Weaviate for the vector store, LlamaIndex for the retrieval orchestration layer, Claude Code for the agentic loop, MCP servers for tool integration, and LangSmith or Arize Phoenix for observability. That is not the only way to build this. It is the configuration that caused us the least production pain.

