Claude API in Production: Rate Limits, Cost Control, and Reliability Engineering

When Our Tier Limits Started Showing Teeth

Our AI automation pipeline started hitting rate limit errors during business hours after we scaled from 3 tools to 11 tools in the same month. At 10:17 a.m. on a Tuesday, our ticket triage assistant, vendor-risk summarizer, firewall-change explainer, and SOP generator all started throwing 429 responses at once. I was watching logs on Ubuntu 22.04, running a Python 3.11 worker pool behind our internal automation gateway, and the pattern looked familiar: nothing was broken, but everything was waiting.

My first assumption was wrong. I thought we had a bad deployment or a hung queue consumer, so I rolled back a service that had nothing to do with the failure. The real issue was that our Claude API usage had crossed into a new operating pattern. We were no longer sending occasional requests from isolated tools. We were running a shared production dependency during the same hours our plant engineers, help desk, compliance team, and security analysts were online.

Limits are architecture.

For us, the practical lesson was that Anthropic API tiers were not just account metadata. Tier 3 changed what we could do, but it did not remove the need to shape traffic. We had to track requests per minute, tokens per minute, and spend limits separately because each one failed differently. Request limits hurt small chatty tools. Token limits hurt long document processing. Spend limits created a slower, uglier failure, because nobody wants to find out at 2 p.m. that the automation budget has been silently burning since 6 a.m.

I now treat Claude API tier data the same way I treat FortiOS 7.4.3 firewall session tables: useful only when it is visible before the outage. My opinion is simple: if my production AI system does not know its current tier, remaining budget, and live throttle state, then it is not production-ready.
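To make "live throttle state" concrete, here is a minimal sketch of turning response headers into something a worker can check before sending more traffic. It is illustrative rather than the production implementation. The header names are an assumption based on the rate-limit headers Anthropic documents, so verify them against the current API reference. `ThrottleState`, `parse_throttle_state`, and the floor values are hypothetical names and numbers. Tier and spend data are not covered here and would have to come from wherever the account exposes them.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class ThrottleState:
    """Snapshot of the remaining budget reported by the last response."""
    requests_remaining: Optional[int]
    tokens_remaining: Optional[int]
    retry_after_s: Optional[float]

    def is_tight(self, min_requests: int = 5, min_tokens: int = 10_000) -> bool:
        """True when the server asked us to wait or a known budget is below its floor."""
        if self.retry_after_s is not None:
            return True
        if self.requests_remaining is not None and self.requests_remaining < min_requests:
            return True
        if self.tokens_remaining is not None and self.tokens_remaining < min_tokens:
            return True
        return False


def _to_number(value: Optional[str], cast: Callable) -> Optional[float]:
    """Convert a header value, returning None when it is missing or malformed."""
    try:
        return cast(value)
    except (TypeError, ValueError):
        return None


def parse_throttle_state(headers: Mapping[str, str]) -> ThrottleState:
    # Header names are assumed; HTTP header names are case-insensitive.
    lowered = {key.lower(): value for key, value in headers.items()}
    return ThrottleState(
        requests_remaining=_to_number(lowered.get("anthropic-ratelimit-requests-remaining"), int),
        tokens_remaining=_to_number(lowered.get("anthropic-ratelimit-tokens-remaining"), int),
        retry_after_s=_to_number(lowered.get("retry-after"), float),
    )
```

A scheduler could call `parse_throttle_state` on every response and stop releasing low-priority work while `is_tight()` is true. Unknown values are treated as "no information" rather than as zero, so a missing header never triggers a false throttle.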

Why Prompt Caching Paid Back First

The fastest cost reduction came from prompt caching. We had long, stable system prompts describing our environment, safety rules, output schema, escalation paths, and manufacturing-specific terminology. Every request was resending that same context. That was waste hiding in plain sight, and I missed it during the first build because I was focused on response quality instead of request shape.

What I didn’t expect was how boring the fix would be. We marked the reusable system context as cacheable, kept volatile user input outside that block, and watched the usage graph flatten within two billing cycles. Monthly API costs dropped from $580 to $340 after implementing prompt caching and request batching, a reduction of roughly 41% overall. Prompt caching alone accounted for a 38% cut before batching finished the job.

Two lines changed the bill.

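from anthropic import Anthropic

# Placeholder values for illustration only. In the pipeline these come from
# the static prompt template and the incoming ticket, respectively.
SECURITY_AUTOMATION_SYSTEM_PROMPT = "<static policy, schema, and terminology text>"
incident_payload = "<ticket text for this request>"

# Reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=900,
    system=[
        {
            "type": "text",
            # Stable context only; this text must be identical across requests.
            "text": SECURITY_AUTOMATION_SYSTEM_PROMPT,
            # Marks the block above as cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        # Volatile, per-request input stays outside the cached block.
        {"role": "user", "content": incident_payload}
    ],
)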

We had to be disciplined about what went into the cached section. Static policy guidance belonged there. Live ticket IDs, device names, user names, timestamps, and investigation notes did not. Once we made that separation, caching became predictable enough to trust. I also added a unit test that fails if dynamic incident fields leak into the cached prompt template.
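A leak test of that kind can be small. The sketch below is an assumption about how such a check might look, not the actual test from the pipeline: the pattern set, the `INC`/`TKT` ticket prefixes, and the function names are all hypothetical, and a real template would need patterns tuned to its own field formats.

```python
import re

# Hypothetical patterns for values that should never appear in cached context.
DYNAMIC_PATTERNS = {
    "template placeholder": re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}"),
    "ticket id": re.compile(r"\b(?:INC|TKT)-?\d{4,}\b"),
    "iso timestamp": re.compile(r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}"),
    "ipv4 address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_dynamic_leaks(cached_prompt: str) -> list[str]:
    """Return the names of every dynamic pattern found in the cached prompt text."""
    return [name for name, pattern in DYNAMIC_PATTERNS.items() if pattern.search(cached_prompt)]


def test_cached_prompt_has_no_dynamic_fields():
    # Stand-in for the real cached template, which would be loaded from its source file.
    cached_prompt = "You are a security automation assistant. Respond in JSON."
    assert find_dynamic_leaks(cached_prompt) == []
```

The placeholder pattern is deliberately strict: it also flags unrendered `{field}` markers, which catches a template that was moved into the cached block before formatting.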

The opinion I landed on: prompt caching should be the default for any Claude workflow with stable instructions, not an optimization project saved for later.

How I Fixed Retry Logic After Making It Worse

I built retry logic without exponential backoff. When we hit the rate limit, all retries fired simultaneously and made the situation worse for 12 minutes. That was my mistake, and it was the kind that looks reasonable in a code review until production traffic turns it into a synchronized hammer.

The bad version retried after a fixed two-second delay. Under normal transient errors, it looked fine. Under a real rate-limit event, 40 workers slept for the same two seconds, woke up together, retried together, failed together, and repeated the cycle. Our queue depth climbed from 118 jobs to 1,742 jobs before I disabled two lower-priority automations.

Retries need manners.

Our working pattern uses exponential backoff, jitter, a maximum retry window, and respect for server-provided retry hints when available. I also split retryable failures from permanent failures. A malformed prompt, bad JSON schema, or policy rejection should not loop just because the HTTP client sees an exception. Rate limits, connection resets, and certain 5xx responses can retry, but only with spreading behavior.

  • Use exponential backoff with random jitter for every retryable Claude API call.
  • Cap retries by elapsed time, not only by attempt count.
  • Honor retry-after headers before calculating my own delay.
  • Separate low-priority automation from analyst-facing workflows.
  • Log token usage, status code, model, and queue age on every failure.
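
The rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production retry code: `RetryableError` is a hypothetical wrapper that the calling code would raise for rate limits, connection resets, and retryable 5xx responses, and the base delay, cap, and retry window are placeholder numbers.

```python
import random
import time
from typing import Callable, Optional


class RetryableError(Exception):
    """Hypothetical wrapper for failures that are safe to retry."""

    def __init__(self, message: str = "", retry_after_s: Optional[float] = None):
        super().__init__(message)
        self.retry_after_s = retry_after_s


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0,
                  retry_after_s: Optional[float] = None,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter exponential backoff, added on top of any server hint."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    jitter = rng() * ceiling
    # Jitter is added to the hint so workers with the same hint do not wake together.
    return jitter if retry_after_s is None else retry_after_s + jitter


def call_with_retries(fn: Callable, max_elapsed_s: float = 120.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic,
                      rng: Callable[[], float] = random.random):
    """Retry fn on RetryableError until the elapsed-time budget would be exceeded."""
    start = clock()
    attempt = 0
    while True:
        try:
            return fn()
        except RetryableError as exc:
            delay = backoff_delay(attempt, retry_after_s=exc.retry_after_s, rng=rng)
            if clock() - start + delay > max_elapsed_s:
                raise  # out of time budget: surface the failure instead of looping
            sleep(delay)
            attempt += 1
```

Any exception that is not a `RetryableError`, such as a validation failure on a malformed prompt, propagates immediately without a retry. The `sleep`, `clock`, and `rng` parameters exist so the behavior can be tested without real waiting.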

After the fix, a rate-limit event degraded throughput instead of collapsing the pipeline. That distinction matters. I do not need every AI job to complete immediately, but I do need the system to fail slowly, visibly, and fairly.

Where Batching and Queuing Belong

Batching helped once we stopped pretending every request deserved real-time treatment. In our environment, a firewall-change explanation requested by an on-call engineer is interactive. A nightly batch of vendor questionnaire summaries is not. Treating both the same was lazy engineering.

We moved high-volume AI workloads into queues with explicit priority lanes. The analyst-facing lane stayed small and fast. The bulk-processing lane absorbed document summaries, control-mapping drafts, and repetitive classification jobs. A scheduler released work based on current token pressure, business hours, and the age of queued jobs. Python 3.11 made the worker side straightforward, but the important design choice was organizational, not syntactic.
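
One way to picture the priority lanes is a single heap ordered by lane, then age. The sketch below is an illustration under assumptions, not the scheduler described above: the lane constants, the 900-second aging threshold, and the per-tick token budget are hypothetical, and the rule "hold bulk work during business hours unless it has aged out" is one possible reading of the release policy.

```python
import heapq
import itertools
from dataclasses import dataclass, field

INTERACTIVE, BULK = 0, 1  # lower lane number is released first


@dataclass(order=True)
class Job:
    lane: int
    enqueued_at: float
    seq: int  # tie-breaker so ordering never falls through to the payload
    name: str = field(compare=False)
    est_tokens: int = field(compare=False)


class LaneScheduler:
    def __init__(self, max_bulk_age_s: float = 900.0):
        self._heap: list[Job] = []
        self._seq = itertools.count()
        self.max_bulk_age_s = max_bulk_age_s

    def submit(self, name: str, lane: int, est_tokens: int, now: float) -> None:
        heapq.heappush(self._heap, Job(lane, now, next(self._seq), name, est_tokens))

    def release(self, token_budget: int, now: float, business_hours: bool) -> list[str]:
        """Pop the jobs that fit this tick's token budget; requeue the rest."""
        released, held = [], []
        while self._heap:
            job = heapq.heappop(self._heap)
            aged = now - job.enqueued_at >= self.max_bulk_age_s
            # Bulk work waits during business hours unless it has been queued too long.
            deferred = job.lane == BULK and business_hours and not aged
            if deferred or job.est_tokens > token_budget:
                held.append(job)
                continue
            token_budget -= job.est_tokens
            released.append(job.name)
        for job in held:
            heapq.heappush(self._heap, job)
        return released
```

Calling `release` once per scheduling tick with the current token headroom keeps the analyst-facing lane ahead of bulk work, while the aging rule prevents bulk jobs from starving indefinitely.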

Queues reveal priorities.

Request batching worked best when inputs shared the same instruction set and response schema. We batched small classification tasks together, especially when the cached system prompt was identical. We did not batch incident narratives that needed independent traceability, because debugging one combined response at 3 a.m. is a tax I refuse to pay.
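
The grouping rule can be sketched as follows. This assumes batching means combining several small inputs into one request, which is an interpretation of the text rather than a confirmed detail; it is not a wrapper around any vendor batch endpoint. The task fields `prompt_id`, `schema_id`, and `needs_traceability` are hypothetical names.

```python
from collections import defaultdict


def build_batches(tasks: list[dict], max_items: int = 20) -> list[list[dict]]:
    """Group tasks that share a prompt and schema; keep traceable tasks alone."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for task in tasks:
        if task.get("needs_traceability"):
            # A unique key per task guarantees it is never combined with others.
            groups[("solo", task["id"])].append(task)
        else:
            groups[(task["prompt_id"], task["schema_id"])].append(task)

    batches = []
    for items in groups.values():
        # Split large groups so a single request never carries too many items.
        for start in range(0, len(items), max_items):
            batches.append(items[start:start + max_items])
    return batches
```

Each returned batch would become one request whose response is split back into per-task results, so the `max_items` cap also bounds how much output has to be untangled when one batch misbehaves.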

The before-and-after behavior was obvious. Before batching, our automation gateway averaged 64 Claude requests per minute during peak morning usage. After batching and queue shaping, that dropped to 29 requests per minute while total completed jobs increased because we stopped wasting request overhead on tiny prompts. My view is that batching is not about squeezing every penny out of the model; it is about making traffic smooth enough that humans can trust the system during busy hours.

Monitor Degradation Before Users Notice

Monitoring changed how we talked about reliability. At first, my dashboard only showed success and failure counts. That was not enough. A Claude workflow can be technically successful and still be operationally degraded if latency doubles, cached-token hit rate falls, queue age climbs, or fallback responses increase.

We now watch p50, p95, and p99 latency separately by tool. We track input tokens, output tokens, cached tokens, retry count, queue age, rejection rate, and cost per workflow. I also added a daily report that compares spend against expected plant activity, because shutdown weeks and audit weeks have very different baselines. The goal is not pretty graphs. The goal is knowing when the system is starting to bend.

Latency lies by itself.

The best alert we added was not for outright failure. It fires when p95 latency exceeds 18 seconds for analyst-facing workflows while queue age rises above 240 seconds. That combination catches trouble earlier than HTTP errors alone. We also alert when cached-token percentage drops below 55%, because that usually means someone changed a prompt template and accidentally pushed stable context into the dynamic section.
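
Both alert conditions reduce to a few comparisons. The sketch below uses the thresholds from the paragraph above; everything else is an assumption, including the nearest-rank percentile method, the function names, and the definition of cached-token percentage as cached input tokens divided by total input tokens.

```python
import math
from typing import Optional


def percentile(values: list[float], pct: int) -> Optional[float]:
    """Nearest-rank percentile; returns None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cached_token_pct(cached_tokens: int, total_input_tokens: int) -> Optional[float]:
    """Share of input tokens served from cache, or None when there was no input."""
    if total_input_tokens <= 0:
        return None
    return 100.0 * cached_tokens / total_input_tokens


def evaluate_alerts(latencies_s: list[float], queue_age_s: float,
                    cached_tokens: int, total_input_tokens: int) -> list[str]:
    alerts = []
    p95 = percentile(latencies_s, 95)
    # Degradation needs BOTH signals: slow responses and a growing backlog.
    if p95 is not None and p95 > 18.0 and queue_age_s > 240.0:
        alerts.append("degraded: p95 latency and queue age both high")
    pct = cached_token_pct(cached_tokens, total_input_tokens)
    if pct is not None and pct < 55.0:
        alerts.append("cache: cached-token percentage below 55%")
    return alerts
```

Requiring both latency and queue age keeps a single slow request from paging anyone, while an empty sample produces no alert instead of a false one.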

Reliability engineering for Claude API workloads feels closer to network operations than application development. I need capacity awareness, backpressure, failure isolation, and boring dashboards that tell the truth. My final opinion is blunt: if my team can monitor FortiOS 7.4.3 traffic patterns with discipline, we can monitor AI traffic with the same seriousness.

The model was not the fragile part; our traffic pattern was.