Claude Code Multi-Agent Workflows: Orchestrating Complex Infrastructure Tasks

Choosing Multi-Agent Only When One Claude Session Starts Lying

Our quarterly security audit required pulling data from six different systems and generating a report, and one agent doing all of it produced garbage output. I had FortiGate policy exports from FortiOS 7.4.3, Tenable CSVs, Microsoft Defender alerts, switch inventory, firewall change tickets, and a pile of exception notes from our manufacturing network. The first Claude Code session looked confident, stitched everything together, and quietly mixed up asset IDs between production lines.

My first assumption, which turned out to be wrong, was that more tools inside one session would make the job cleaner. I gave one agent shell access, Python 3.11 scripts, CSV parsing instructions, report formatting rules, and audit criteria. It could retrieve data, but it could not maintain enough role discipline. By the time it wrote the report, it was part analyst, part ETL job, and part policy reviewer, and none of those jobs were being done well.

That failure pushed my team toward multi-agent workflows. I use a single Claude session when the task has one dominant shape: refactor this script, inspect this log set, write this parser, or explain one misconfigured VPN policy. I use multi-agent orchestration when the work has separate judgment domains that should not bleed into each other.

Scope beats speed.

Our audit pipeline now runs four agents: collector, normalizer, control reviewer, and report writer. The collector never writes conclusions. The report writer never calls production tools. The reviewer never mutates files. That role separation feels slower on paper, but quarterly audit preparation time dropped from 3 days of manual work to 4 hours with the multi-agent pipeline. My opinion is simple: if one agent has to remember too many promises, I have already designed the workflow badly.

Designing Orchestrators And Subagents With The Anthropic Python SDK

I run the orchestration layer from Ubuntu 22.04 using Python 3.11 and Anthropic SDK 0.30. The orchestrator is not the smartest agent in my setup. It is more like a shift lead: assign narrow work, check budgets, validate handoffs, and refuse to improvise when a subagent returns ambiguous data.

In my environment, the orchestrator prompt contains the audit goal, allowed agent roles, budget limits, and output contracts. Each subagent receives a smaller prompt with one job and one schema. For example, the FortiOS collector gets firewall export paths, object naming conventions, and a JSON schema for policies. It does not get report tone, management language, or vulnerability scoring rules.

from anthropic import Anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()

# One narrow system prompt per role; no role sees another role's instructions.
AGENTS = {
    "collector": "Return raw evidence only. No conclusions.",
    "normalizer": "Convert evidence into the shared audit schema.",
    "reviewer": "Evaluate controls against approved criteria.",
    "writer": "Write the final narrative from reviewed findings only.",
}

def call_agent(role, task, context, max_tokens=1800):
    """Send one task to one role and return the text of its reply."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_tokens,
        temperature=0,
        system=AGENTS[role],
        messages=[
            {"role": "user", "content": f"TASK:\n{task}\n\nCONTEXT:\n{context}"}
        ],
    )
    # For a plain (non-tool) call, the first content block holds the text reply.
    return message.content[0].text

def run_audit_pipeline(inputs):
    """Run the four roles in order; each stage sees only the previous stage's output."""
    raw = call_agent("collector", "Extract firewall and scanner evidence.", inputs)
    normalized = call_agent("normalizer", "Map raw evidence to audit schema.", raw)
    reviewed = call_agent("reviewer", "Find control gaps and cite evidence.", normalized)
    # The writer gets a larger token budget for readable prose.
    return call_agent("writer", "Draft the quarterly audit report.", reviewed, 2400)

Small prompts fail cleaner.

What I didn’t expect was how much better the system got when I made the orchestrator boring. The first version tried to repair bad subagent output, merge partial findings, and invent fallback logic. The current version rejects malformed output and retries once with the validation error attached. I prefer dumb orchestration with strict contracts over clever orchestration with hidden judgment.
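The reject-and-retry-once rule can be sketched roughly as follows. This is a minimal illustration, not the production orchestrator: the contract in REQUIRED_KEYS, the MalformedOutput exception, and the call_with_one_retry helper are hypothetical names, and the `call` argument stands in for any function with the same shape as call_agent above.

```python
import json

# Hypothetical output contract: every record must carry these keys.
REQUIRED_KEYS = {"evidence_id", "source", "payload"}

class MalformedOutput(Exception):
    """Raised when a subagent fails validation twice in a row."""

def validate(raw_text):
    """Return (records, None) when the output is valid, else (None, error)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, list):
        return None, "expected a JSON list of records"
    for i, rec in enumerate(data):
        if not isinstance(rec, dict):
            return None, f"record {i} is not an object"
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            return None, f"record {i} missing keys: {sorted(missing)}"
    return data, None

def call_with_one_retry(call, role, task, context):
    """Call a subagent, retry once with the validation error, then give up."""
    data, error = validate(call(role, task, context))
    if error is None:
        return data
    # Attach the validation error so the same role can repair its own output.
    retry_task = (
        f"{task}\n\nYour previous output failed validation: {error}\n"
        "Return corrected JSON only."
    )
    data, error = validate(call(role, retry_task, context))
    if error is not None:
        # No repair, no merging, no fallback logic: the orchestrator stays boring.
        raise MalformedOutput(f"{role}: {error}")
    return data
```

With the earlier function, usage would look like call_with_one_retry(call_agent, "collector", task, inputs). The orchestrator never edits the output itself; it either accepts a valid payload or raises.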

Passing Context Without Flooding Every Agent

Context passing is where multi-agent systems either become useful or turn into expensive fog. I do not pass everything verbatim. I pass raw evidence only to agents that need raw evidence. I pass summaries only when the next agent needs decisions, counts, or relationships rather than original logs.

For firewall work, I pass FortiOS 7.4.3 policy rows verbatim to the normalizer because a single source, destination, or service field can change the finding. For executive report writing, I pass reviewed findings with evidence IDs, not the full firewall export. The writer needs confidence and traceability, not 6,000 lines of address objects.

  • I pass raw logs verbatim when field-level accuracy matters.
  • I pass normalized JSON when agents need consistent names across systems.
  • I pass summaries when the next step requires judgment, not extraction.
  • I pass evidence IDs with every finding so I can trace claims backward.
  • I strip credentials, session tokens, and irrelevant workstation names before handoff.
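
As a rough sketch of those handoff rules, the helper below builds a role-specific payload. Everything here is illustrative: the SECRET_PATTERNS list, the redact and build_handoff names, and the finding fields (evidence_id, summary, raw) are assumptions, and a real deployment would need redaction patterns matched to its own export formats.

```python
import re

# Hypothetical pattern for key=value or key: value secrets; not exhaustive.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key)\s*[=:]\s*\S+"),
]

def redact(text):
    """Replace secret values with a placeholder, keeping the key name."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return text

def build_handoff(findings, target_role):
    """Give the writer summaries and every other role the raw rows."""
    handoff = []
    for finding in findings:
        if not finding.get("evidence_id"):
            # Every claim must stay traceable back to its source.
            raise ValueError("finding without evidence_id cannot be handed off")
        item = {"evidence_id": finding["evidence_id"]}
        if target_role == "writer":
            item["summary"] = redact(finding["summary"])
        else:
            item["raw"] = redact(finding["raw"])
        handoff.append(item)
    return handoff
```

The point of the design is that the payload is assembled per role, so the writer cannot receive raw exports by accident.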

The best handoff is narrow.

I learned to treat context as a controlled interface, not a shared memory pool. Our manufacturing network has line controllers, jump hosts, wireless scanners, and vendor VPN accounts that all look meaningful when dumped into one giant context window. Most of them are irrelevant to a specific audit question. My opinion: context discipline is security discipline, especially when AI touches infrastructure evidence.

Controlling Cost And Token Budgets In Production

My most expensive mistake was embarrassingly simple. My orchestrator agent had no budget limits on subagent calls, so a bad loop condition spawned 47 subagents and burned through API credits in 8 minutes. The root cause was a retry loop that treated every schema warning as a reason to launch a fresh reviewer instead of asking the same reviewer to repair one field.

I fixed that by adding hard caps in three places: max subagent calls per run, max retries per role, and max tokens per role. I also added a run ledger that records the intended task before any call is made. If the same role receives the same task hash twice, the orchestrator stops and asks for human review. That one guardrail caught two bad loops during testing.
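
A minimal sketch of such a ledger is shown below, assuming the orchestrator asks for permission before every call. The RunLedger class, its authorize method, and both exception names are hypothetical, and hashing only the task text with SHA-256 is one possible definition of a task hash.

```python
import hashlib
from collections import Counter

class BudgetExceeded(Exception):
    """Raised when a call would exceed one of the hard caps."""

class DuplicateTask(Exception):
    """Raised when a role is asked to repeat an identical task."""

class RunLedger:
    def __init__(self, max_calls, max_retries_per_role, max_tokens_per_role):
        self.max_calls = max_calls
        self.max_retries_per_role = max_retries_per_role
        self.max_tokens_per_role = max_tokens_per_role
        self.calls = 0
        self.retries = Counter()
        self.seen = set()

    def authorize(self, role, task, max_tokens, is_retry=False):
        """Record the intended task before the call; raise instead of spending."""
        digest = hashlib.sha256(task.encode("utf-8")).hexdigest()
        if self.calls >= self.max_calls:
            raise BudgetExceeded("max subagent calls per run reached")
        # Roles without a configured cap get 0 tokens, so they are denied.
        if max_tokens > self.max_tokens_per_role.get(role, 0):
            raise BudgetExceeded(f"token request too large for role {role}")
        if is_retry and self.retries[role] >= self.max_retries_per_role:
            raise BudgetExceeded(f"max retries reached for role {role}")
        if (role, digest) in self.seen:
            # Same role, same task hash: stop and escalate to a human.
            raise DuplicateTask(f"{role} already received task {digest[:12]}")
        self.seen.add((role, digest))
        self.calls += 1
        if is_retry:
            self.retries[role] += 1
        return digest
```

Because authorize runs before the API call, a runaway loop fails on its first repeated task instead of after the credits are spent. A retry that attaches the validation error produces a different hash, so legitimate repairs still pass.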

Budgets are design constraints.

My production settings are conservative. The collector gets more tokens because raw evidence can be bulky. The reviewer gets fewer tokens but stricter output rules. The writer gets enough room for readable prose, but it cannot call tools or request new evidence. I also pin model names, SDK versions, and runtime versions in logs: claude-3-5-sonnet-20241022, Anthropic SDK 0.30, Python 3.11, and Ubuntu 22.04. Reproducibility matters more to me than squeezing one more clever answer out of a vague prompt.
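
One way to capture those pinned versions at the start of each run is sketched below. The runtime_record function name and the dictionary keys are assumptions; the version lookups use only the standard library.

```python
import platform
from importlib import metadata

def runtime_record(model):
    """Collect the versions needed to reproduce a run."""
    try:
        sdk_version = metadata.version("anthropic")
    except metadata.PackageNotFoundError:
        # Keep the record usable even where the SDK is not installed.
        sdk_version = "not installed"
    return {
        "model": model,
        "anthropic_sdk": sdk_version,
        "python": platform.python_version(),
        "os": platform.platform(),
    }

record = runtime_record("claude-3-5-sonnet-20241022")
```

Writing this record as the first log entry of a run means a later diff between two runs starts with the environment, not the prompts.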

I do not trust a multi-agent workflow until it fails predictably. If the Tenable export is missing, the collector should fail. If FortiOS 7.4.3 policies do not match the expected schema, the normalizer should fail. If a finding has no evidence ID, the writer should refuse it. My opinion: any AI automation that cannot say no is not production automation.

Debug Multi-Agent Flows By Following The Evidence

When a multi-agent workflow produces a bad result, I debug the handoffs first. I look for the first point where evidence changed shape without a clear reason. In our audit pipeline, most failures come from three places: over-compressed summaries, missing IDs, or agents that were allowed to infer instead of cite.

My logs are plain on purpose. Every subagent call records role, task hash, input token estimate, output token count, schema validation result, retry count, and evidence IDs touched. I do not log secrets, and I redact hostnames that expose vendor access paths. The goal is not surveillance of the model. The goal is reconstructing the chain of custody from raw firewall rule to final audit sentence.
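
A per-call log record along those lines could look like the sketch below. The CallRecord dataclass, the helper names, and the four-characters-per-token estimate are all assumptions for illustration; the estimate is a rough heuristic, not the model's tokenizer.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass
class CallRecord:
    role: str
    task_hash: str
    input_token_estimate: int
    output_tokens: int
    schema_valid: bool
    retry_count: int
    evidence_ids: list = field(default_factory=list)

def estimate_tokens(text):
    """Rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)

def make_record(role, task, prompt_text, output_tokens,
                schema_valid, retry_count, evidence_ids):
    """Build one record; only a hash of the task is stored, never its text."""
    return CallRecord(
        role=role,
        task_hash=hashlib.sha256(task.encode("utf-8")).hexdigest()[:16],
        input_token_estimate=estimate_tokens(prompt_text),
        output_tokens=output_tokens,
        schema_valid=schema_valid,
        retry_count=retry_count,
        evidence_ids=list(evidence_ids),
    )

def to_log_line(record):
    """Serialize to one JSON line with stable key order for easy diffing."""
    return json.dumps(asdict(record), sort_keys=True)
```

With the Anthropic SDK, the output token count can be read from message.usage.output_tokens on the response. Logging a task hash instead of the task text keeps prompts and evidence out of the log while still letting two calls be compared.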

Trace the first distortion.

I also keep sample fixtures from real but sanitized runs. One fixture contains a FortiOS 7.4.3 rule with an expired temporary exception. Another contains duplicate asset names from two production lines. Another contains a vulnerability scanner finding that looks severe until the compensating firewall rule is considered. These fixtures make regressions obvious when I change prompts, SDK settings, or schema fields.

The final check is still human. I read the findings that affect production segmentation, vendor remote access, and safety-adjacent systems before the report leaves my team. Multi-agent Claude Code workflows have saved us days of repetitive work, but I treat them as structured assistants, not autonomous auditors. My opinion after running this in a real manufacturing security environment: the power is not parallelism; the power is making every agent small enough to be held accountable.

The moment I made each agent responsible for less, the whole workflow became responsible for more.