From Toy to Production: What Breaks First
The systems that work perfectly in staging fail in production for reasons that are never about the model. Rate limits, inconsistent outputs at scale, state management, and graceful degradation — what actually breaks and how to engineer around it.
Staging Is Not Production
You have built the reliability layer (articles two through four). You have mitigated hallucinations (five). Your RAG system retrieves correctly (six through seven). Your eval suite passes at 93% (eight). Cost is within budget (nine). Deploy.
The first three days in production teach you things staging never could.
This is not a failure of process. Staging cannot replicate production because production has characteristics that are fundamentally difficult to reproduce: real traffic distributions, concurrent users, sustained load over time, adversarial inputs, and interactions between components that only emerge at scale.
This article covers what breaks first, why it breaks, and what you build to handle it. Think of it as the runtime failure taxonomy — the class of failures that appear after the feature ships, not before.
Failure Class 1: Rate Limits
LLM APIs enforce rate limits on tokens per minute and requests per minute. In development, you make requests sequentially, well under any limit. In production, concurrent users trigger concurrent requests. Spikes are common and unpredictable.
What you see:
HTTP 429 Too Many Requests
Retry-After: 20
Without rate limit handling, these become 500s to users. With a naive retry (immediate, no backoff), they amplify the problem — you respond to a 429 by hammering the API with the same requests, which triggers more 429s.
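Before adding a limiter, make retries themselves well-behaved: honor Retry-After when present, and otherwise back off exponentially with jitter. A sketch, assuming the client surfaces errors with a `status` and an optional `retryAfterSeconds` field (both names are hypothetical; adapt to your SDK's error shape):

```typescript
// Hypothetical error shape; adapt to your client library.
interface ApiError extends Error {
  status?: number;
  retryAfterSeconds?: number;
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function callWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const e = err as ApiError;
      const retryable = e.status === 429 || e.status === 503;
      if (!retryable || attempt >= maxAttempts - 1) throw err;
      // Honor Retry-After if the API sent one; otherwise back off
      // exponentially, with jitter to avoid synchronized retries.
      const baseMs = e.retryAfterSeconds
        ? e.retryAfterSeconds * 1000
        : 500 * 2 ** attempt;
      await sleep(baseMs + Math.random() * 250);
    }
  }
}
```

Bounding attempts matters: under a sustained outage, unbounded retries only deepen the backlog.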
What you build:
import Bottleneck from "bottleneck"; // Or implement manually
// Rate limiter: max 50 requests/minute, 90K tokens/minute
const limiter = new Bottleneck({
minTime: 1200, // ms between requests (50 req/min = 1200ms)
reservoir: 50, // max 50 requests
reservoirRefreshAmount: 50,
reservoirRefreshInterval: 60_000,
});
async function callModelRateLimited(prompt: string): Promise<string> {
return limiter.schedule(() => callModelRaw(prompt));
}
This queues requests and spaces them out to stay within your limit. Excess requests wait in the queue or are rejected at the application layer, instead of being forwarded to the API only to fail with a 429.
For higher-throughput workloads:
Use a limiter that tracks token usage, not just request count. Estimate input + output tokens per request and stop issuing new requests once you are within 10% of your tokens-per-minute limit. A simple fixed-window counter is enough to start:
class TokenWindowLimiter {
private tokenCount = 0;
private readonly limit: number;
private lastRefillTime = Date.now();
constructor(tokensPerMinute: number) {
this.limit = tokensPerMinute;
}
async acquire(estimatedTokens: number): Promise<void> {
const now = Date.now();
if (now - this.lastRefillTime >= 60_000) {
// New window: reset the counter
this.tokenCount = 0;
this.lastRefillTime = now;
}
if (this.tokenCount + estimatedTokens > this.limit * 0.9) {
// Wait out the remainder of the current window, plus a buffer
const waitMs = 60_000 - (now - this.lastRefillTime) + 1_000;
await new Promise((r) => setTimeout(r, waitMs));
this.tokenCount = 0;
this.lastRefillTime = Date.now();
}
this.tokenCount += estimatedTokens;
}
}
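The acquire call needs a per-request token estimate. A rough heuristic of about four characters per token for English prose is usually close enough for throttling; this approximates, and does not replace, the provider's tokenizer:

```typescript
// Approximate token count: ~4 characters per token for English prose.
// For throttling, overestimating slightly is safer than underestimating,
// and TPM limits typically count both input and output tokens.
function estimateTokens(text: string, expectedOutputTokens = 512): number {
  const inputTokens = Math.ceil(text.length / 4);
  return inputTokens + expectedOutputTokens;
}
```

Pass the estimate to the limiter before each call, using your configured max_tokens as the expected output.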
Monitoring: Track 429_rate (the fraction of requests that hit rate limits). If it stays above 2%, you need a higher API rate-limit tier or application-level request queuing sized for your burst traffic.
Failure Class 2: Output Drift
You deployed your feature. Eval suite passes at 93%. Two weeks pass. You run the eval again: 79%. Nothing changed in your code.
What changed: the model was updated. LLM providers update models silently. The same model name (claude-sonnet-4-5, gpt-4o) can point to different model versions over time, and each version can produce meaningfully different outputs for the same prompt.
Symptoms:
- Validation failure rate increases gradually over days
- Some test cases that previously passed now fail
- Retry rate increases
What you build:
First, detect it. Run your eval suite on a schedule — daily is not too frequent. Alert on pass rate drops of more than 3 percentage points from the rolling 7-day average.
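The detection rule itself is small: compare the latest pass rate against the rolling 7-day average. A sketch (the scheduler and alert transport are up to you):

```typescript
// Alert when the latest eval pass rate drops more than `thresholdPts`
// percentage points below the rolling average of recent runs.
function shouldAlertOnDrift(
  history: number[], // pass rates as fractions, oldest first, e.g. [0.93, ...]
  latest: number,
  windowSize = 7,
  thresholdPts = 3
): boolean {
  const window = history.slice(-windowSize);
  if (window.length === 0) return false;
  const avg = window.reduce((a, b) => a + b, 0) / window.length;
  return (avg - latest) * 100 > thresholdPts;
}
```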
Second, track model versions explicitly:
async function callModel(prompt: string): Promise<ModelResponse> {
const response = await client.messages.create({
model: "claude-sonnet-4-5",
max_tokens: 1024,
messages: [{ role: "user", content: prompt }],
});
// Log model version if available (Anthropic includes it in response)
logger.info({
event: "model_call",
model: response.model, // Often includes version suffix
inputTokens: response.usage.input_tokens,
outputTokens: response.usage.output_tokens,
});
return response;
}
When an eval regression appears, check whether the model field in your logs changed. If it did, that is your root cause. If it did not, the failure is in your code or data.
Third, pin model versions when precise behavior is critical:
// Use specific version strings when available
const MODEL_ID = process.env.MODEL_ID ?? "claude-sonnet-4-5-20251201";
Pinning prevents silent updates. The downside: you miss improvements and eventually the pinned version is deprecated. Treat model version updates as a deployment event: run eval suite before and after, confirm pass rate, then promote.
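The promote step can be gated mechanically. A minimal sketch; shouldPromote, the pass-rate field, and the one-point tolerance are all assumptions to adapt:

```typescript
interface EvalResult {
  passRate: number; // fraction of eval cases passing, e.g. 0.93
}

// Hypothetical promotion gate: accept a new model version only if the
// eval pass rate does not regress beyond a small tolerance.
function shouldPromote(
  baseline: EvalResult,
  candidate: EvalResult,
  maxRegression = 0.01 // allow at most a 1-percentage-point drop
): boolean {
  return candidate.passRate >= baseline.passRate - maxRegression;
}
```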
Failure Class 3: Inconsistent Outputs at Scale
In staging, you tested 50 inputs. In production, day one brings 30,000 inputs. The long tail of your input distribution includes patterns you never anticipated.
What happens:
Production input: "Can u help w/ reseting my pw? thx"
Your prompt expects: formal, complete sentences in English
Model behavior: interprets the informal text differently, produces inconsistent output
Validation: passes (JSON is structurally correct)
What user sees: technically correct JSON, but tonally wrong output
The model handles informal input differently than your prompt examples suggested. Not a failure — it passed validation — but quality is noticeably lower.
What you build:
First, input normalization. Before sending to the model, standardize the input:
function normalizeInput(text: string): string {
return text
.replace(/\s+/g, " ") // Collapse whitespace
.trim()
.slice(0, MAX_INPUT_CHARS); // Enforce length limit
// Do NOT correct spelling — this can change meaning
}
Second, input distribution logging. Log the length and basic statistics of every input:
function logInputProfile(input: string): void {
const words = input.split(/\s+/).length;
const hasCode = /```|def |function |class /.test(input);
const language = detectLanguage(input); // Optional: langdetect library
logger.info({
event: "input_profile",
charLength: input.length,
wordCount: words,
hasCode,
language,
});
}
When quality issues appear, pull input_profile logs and compare the input characteristics of failing vs. passing cases. You will find clusters — "inputs over 2,000 characters have 15% higher failure rate," "non-English inputs always fail step 3."
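Finding those clusters is a small aggregation over the logged profiles. A sketch that buckets inputs by length and computes a failure rate per bucket (the passed field is an assumed join of input_profile logs with validation outcomes):

```typescript
interface InputProfile {
  charLength: number;
  passed: boolean;
}

// Bucket inputs by length and compute a failure rate per bucket, so
// clusters like "inputs over 2,000 chars fail more often" surface.
function failureRateByLengthBucket(
  profiles: InputProfile[],
  bucketSize = 500
): Map<number, number> {
  const totals = new Map<number, { total: number; failed: number }>();
  for (const p of profiles) {
    const bucket = Math.floor(p.charLength / bucketSize) * bucketSize;
    const entry = totals.get(bucket) ?? { total: 0, failed: 0 };
    entry.total++;
    if (!p.passed) entry.failed++;
    totals.set(bucket, entry);
  }
  const rates = new Map<number, number>();
  for (const [bucket, { total, failed }] of totals) {
    rates.set(bucket, failed / total);
  }
  return rates;
}
```

The same shape works for any logged dimension: language, hasCode, word count.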
Third, add input type-specific test cases to your eval suite whenever you identify a new cluster in production failures.
Failure Class 4: State and Conversation Management
For stateless features (classify this document, summarize this text), state is not a concern. For conversational features, it becomes a significant failure surface.
The failure: You need the last N turns of conversation history. You store it in memory per server instance. With three server instances, a user whose turn 1 lands on instance 1 may have turn 2 routed to instance 2, which has no history. Outputs become contextless and confused.
What you build:
Store conversation state in a shared service (Redis, database), not in server memory:
interface ConversationTurn {
role: "user" | "assistant";
content: string;
timestamp: string;
}
async function getConversationHistory(
conversationId: string,
maxTurns: number = 10
): Promise<ConversationTurn[]> {
const history = await redis.lrange(`conv:${conversationId}`, -maxTurns * 2, -1); // 2 list entries per turn (user + assistant)
return history.map((h) => JSON.parse(h) as ConversationTurn);
}
async function appendToConversation(
conversationId: string,
turn: ConversationTurn,
ttlSeconds: number = 3600
): Promise<void> {
await redis.rpush(`conv:${conversationId}`, JSON.stringify(turn));
await redis.expire(`conv:${conversationId}`, ttlSeconds);
}
The token budget problem: A 10-turn conversation with 400 tokens per turn is 4,000 tokens of history in the context. At 100 turns (a long session), it is 40,000 tokens. This exceeds budgets and degrades quality.
Implement a windowed history that keeps only the most recent turns within a token budget:
async function getWindowedHistory(
conversationId: string,
tokenBudget: number = 3000
): Promise<ConversationTurn[]> {
const allHistory = await getConversationHistory(conversationId, 50);
// Fit as many recent turns as possible within the token budget
let tokenCount = 0;
const included: ConversationTurn[] = [];
for (let i = allHistory.length - 1; i >= 0; i--) {
const turnTokens = estimateTokens(allHistory[i].content);
if (tokenCount + turnTokens > tokenBudget) break;
included.unshift(allHistory[i]); // Add to front (oldest first)
tokenCount += turnTokens;
}
return included;
}
Trim from the oldest end. Recent context is more relevant than old context.
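When an early turn carries a constraint that must survive (the user said "reply in French" in turn 2 of 40), summarize the trimmed turns instead of dropping them. A sketch; summarizeTurns is a hypothetical helper that calls the model with a cheap summarization prompt, and windowed is assumed to be the most recent suffix of allHistory, as getWindowedHistory produces:

```typescript
interface ConversationTurn {
  role: "user" | "assistant";
  content: string;
  timestamp: string;
}

// Hypothetical: compress the turns that fell outside the window into
// one synthetic turn, so early constraints survive as a summary.
async function getHistoryWithSummary(
  allHistory: ConversationTurn[],
  windowed: ConversationTurn[], // most recent suffix of allHistory
  summarizeTurns: (turns: ConversationTurn[]) => Promise<string>
): Promise<ConversationTurn[]> {
  const dropped = allHistory.slice(0, allHistory.length - windowed.length);
  if (dropped.length === 0) return windowed;
  const summary = await summarizeTurns(dropped);
  return [
    {
      role: "assistant",
      content: `Summary of earlier conversation: ${summary}`,
      timestamp: new Date().toISOString(),
    },
    ...windowed,
  ];
}
```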
Failure Class 5: Cascade Failures
An LLM call is a network request to an external service. External services go down. When the model API is degraded or unavailable, every request that depends on it fails. If you have no fallback, the outage takes down your entire feature.
Cascades compound. If your AI feature is in a critical path (a classifier that routes support tickets, an extractor that normalizes incoming payments), downstream services that depend on it fail too. One model API incident becomes a multi-team incident.
What you build:
Graceful degradation — a defined behavior for every AI-dependent path when the model is unavailable:
type DegradedResponse =
| { strategy: "queue"; queueId: string } // Process async when service recovers
| { strategy: "default"; value: unknown } // Return a safe default value
| { strategy: "error"; message: string } // Fail explicitly, don't hide the failure
| { strategy: "cached"; value: unknown }; // Return a recently cached result
async function callWithGracefulDegradation<T>(
modelCall: () => Promise<T>,
fallback: () => DegradedResponse
): Promise<T | DegradedResponse> {
if (circuitBreaker.isOpen()) {
logger.warn({ event: "degraded_response", reason: "circuit_open" });
return fallback();
}
try {
const result = await modelCall();
circuitBreaker.recordSuccess();
return result;
} catch (err) {
circuitBreaker.recordFailure();
logger.error({ event: "model_call_failed", error: String(err) });
return fallback();
}
}
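The snippet above assumes a circuitBreaker object. A minimal count-based breaker (open after a run of consecutive failures, allow a trial call after a cooldown) can be this small:

```typescript
// Minimal circuit breaker: opens after `failureThreshold` consecutive
// failures, then permits trial calls once `cooldownMs` has elapsed.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 30_000
  ) {}

  isOpen(): boolean {
    if (this.openedAt === null) return false;
    // After the cooldown, let requests through again (half-open);
    // one more failure will re-open the breaker immediately.
    return Date.now() - this.openedAt < this.cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = Date.now();
    }
  }
}
```

Production breakers often add a rate-based trigger (failure percentage over a window) rather than a raw count; the consecutive-failure version is the simplest correct starting point.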
Define the fallback strategy per feature. For a summarization feature, returning a cached summary (if available) or a "summary unavailable" message is acceptable. For a fraud detection classifier, "classify as 'review'" (the safest default) is better than an error.
The queue strategy for async jobs:
async function classifyDocumentWithFallback(documentId: string): Promise<string> {
const result = await callWithGracefulDegradation(
() => classifyDocument(documentId),
() => ({
strategy: "queue" as const,
queueId: `reprocess-${documentId}-${Date.now()}`
})
);
if ("strategy" in result && result.strategy === "queue") {
// Queue for reprocessing when service recovers
await reprocessQueue.push({ documentId, queueId: result.queueId });
return "pending"; // Signal that classification is deferred
}
return result as string;
}
The queue-and-reprocess pattern decouples the feature's availability from the model API's availability. When the API recovers, the queue drains and documents are classified in order. No data is lost.
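Draining that queue on recovery can start as a simple loop. A sketch with an in-memory array standing in for a durable queue (Redis, SQS); classify and isHealthy are supplied by the caller:

```typescript
interface QueuedJob {
  documentId: string;
  queueId: string;
}

// Drain pending jobs once the model API is healthy again. Stops early
// if health degrades mid-drain, leaving remaining jobs queued.
async function drainReprocessQueue(
  queue: QueuedJob[],
  classify: (documentId: string) => Promise<string>,
  isHealthy: () => boolean
): Promise<number> {
  let processed = 0;
  while (queue.length > 0 && isHealthy()) {
    const job = queue.shift()!;
    await classify(job.documentId);
    processed++;
  }
  return processed;
}
```

Run it on a timer or on a circuit-breaker "closed" event; because it checks health each iteration, a relapse during the drain does not re-trigger the outage.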
Failure Class 6: Load-Induced Instability
You simulated 10 concurrent requests in staging. Production peaks at 200 concurrent requests. The model API starts returning 503s (overloaded). Your retry logic retries. Now you have 200 × 3 = 600 near-simultaneous requests. The 503 rate increases. Connection pool exhaustion starts. Request queues pile up. P99 latency goes from 2 seconds to 45 seconds.
Simulate load before you hit it:
// Simple load test: N concurrent requests, M total
async function loadTest(
concurrency: number,
totalRequests: number,
requestFn: () => Promise<unknown>
): Promise<{ successRate: number; p99Ms: number }> {
const latencies: number[] = [];
let successes = 0;
let completed = 0;
const semaphore = new Bottleneck({ maxConcurrent: concurrency });
const requests = Array.from({ length: totalRequests }, (_, i) =>
semaphore.schedule(async () => {
const start = Date.now();
try {
await requestFn();
successes++;
latencies.push(Date.now() - start);
} catch {
latencies.push(Date.now() - start);
} finally {
completed++;
}
})
);
await Promise.all(requests);
latencies.sort((a, b) => a - b);
const p99 = latencies[Math.floor(latencies.length * 0.99)];
return {
successRate: successes / totalRequests,
p99Ms: p99,
};
}
Run this at concurrency levels of 10, 25, 50, 100 before launch. Look for the inflection point where success rate starts dropping or p99 climbs sharply. That is your effective throughput ceiling. Set your request queue limit slightly below it.
Under load, retries become the enemy. At 100 concurrent requests with a 5% failure rate, you have 5 failing requests that each retry 3 times = 15 additional requests, raising effective concurrency to 115 → higher failure rate → more retries. Classic retry storm.
Fix: implement backpressure at the application layer. If your request queue exceeds a threshold, reject new requests with 429 instead of queuing them:
const REQUEST_QUEUE_LIMIT = 50;
async function handleRequest(req: Request, res: Response): Promise<void> {
if (requestQueue.size >= REQUEST_QUEUE_LIMIT) {
res.status(429).json({ error: "Service at capacity. Please retry in a few seconds." });
metrics.increment("request.shed");
return;
}
// Normal processing...
}
Shedding requests gracefully is better than accepting them and failing them silently after 30 seconds.
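The requestQueue referenced above can be as simple as an in-flight counter. A minimal, framework-agnostic sketch:

```typescript
// Track in-flight work and reject when at capacity. This is the
// simplest form of application-layer backpressure.
class InFlightLimiter {
  private inFlight = 0;

  constructor(private readonly limit: number) {}

  get size(): number {
    return this.inFlight;
  }

  // Returns false when at capacity; the caller should respond 429.
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) return false;
    this.inFlight++;
    return true;
  }

  release(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
}
```

Call tryAcquire() at the top of the handler, return 429 immediately on false, and call release() in a finally block so slots are freed even when requests error out.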
What Goes in Your Production Runbook
Every AI feature should have a runbook that answers these questions before launch:
- What behavior does the user see if the model API is down? Define it. Test it.
- What is your rate limit? What happens when you hit it? Verify the limiter is in place.
- What is your p95 latency under expected peak load? Measure it, do not guess.
- How do you detect model version drift? Confirm eval suite is running on a schedule.
- What is the fallback for every AI-dependent path? One per feature.
- What is your circuit breaker threshold? Test manual circuit open/close scenarios.
These questions are not difficult. The answers are usually straightforward once you have built the infrastructure. The danger is shipping without them.
What Is Next
Articles two through ten have built the feature-level reliability layer. Article eleven zooms out: how do you design AI systems — not individual features, but pipelines and agents — that are maintainable, debuggable, and correctly scoped? The architectural choices that separate a thoughtful system from a pile of model calls.