Making LLM Output Reliable
JSON mode, schema enforcement, validation pipelines, and retry strategies — the complete reliability layer that sits between the model and your downstream systems.
The Reliability Problem
In article two, you built a retry loop. In article three, you designed prompts with explicit output schemas. Those two things get you a long way. What they do not give you is reliability at scale.
At 100 requests/day, a 3% format failure rate means 3 bad responses. Acceptable if you have a fallback. At 100K requests/day, that same rate is 3,000 failures. If each failure reaches a user as an error, that is 3,000 bad user experiences per day, compounding over time.
This article is the complete reliability layer: schema enforcement at the API level, a validation pipeline that goes beyond JSON parsing, and retry strategies with more intelligence than "try again." Everything builds on the mental model from article one and the structures from articles two and three.
Level 1: JSON Mode and Structured Output APIs
Both OpenAI and Anthropic expose structured output modes that constrain the model at the token generation level — not just the instruction level.
The distinction matters. When you write "return only JSON" in a prompt, you have instructed the model. The model is a probabilistic system; it follows the instruction most of the time. At some frequency, it does not — and the failure mode is unpredictable (prose preamble, markdown fences, truncated JSON).
When you use JSON mode or structured output schemas, the model's token sampling is constrained so it cannot produce syntactically invalid JSON. This eliminates the most common class of parsing failures entirely.
OpenAI Structured Outputs
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";
const SummarySchema = z.object({
summary: z.string().min(10).max(500),
wordCount: z.number().int().positive(),
keyPoints: z.array(z.string()).min(1).max(5),
});
const client = new OpenAI();
const response = await client.beta.chat.completions.parse({
model: "gpt-4o-mini",
messages: [
{ role: "system", content: "You are a document summarizer." },
{ role: "user", content: documentText },
],
response_format: zodResponseFormat(SummarySchema, "summary"),
});
const result = response.choices[0].message.parsed;
// result is fully typed: { summary: string; wordCount: number; keyPoints: string[] }
response.choices[0].message.parsed is already parsed and typed. The manual JSON.parse() step is gone. The structural validity guarantee is provided by the API.
Gotcha: structured outputs constrain the structure, not the content. The model will return valid JSON with the correct schema. It may still return a wordCount that does not match the actual word count of the summary field. Semantic validation is still your responsibility.
Anthropic Tool Calling as Schema Enforcement
Anthropic does not expose a direct JSON mode on its base message API, but tool calling achieves the same effect:
const response = await client.messages.create({
model: "claude-sonnet-4-5",
max_tokens: 1024,
tools: [
{
name: "return_summary",
description: "Return the document summary in structured format.",
input_schema: {
type: "object",
properties: {
summary: { type: "string", description: "Document summary, 50–150 words" },
wordCount: { type: "integer", description: "Word count of the summary" },
keyPoints: {
type: "array",
items: { type: "string" },
minItems: 1,
maxItems: 5,
},
},
required: ["summary", "wordCount", "keyPoints"],
},
},
],
tool_choice: { type: "tool", name: "return_summary" },
messages: [{ role: "user", content: documentText }],
});
const toolUse = response.content.find((c) => c.type === "tool_use");
if (!toolUse || toolUse.type !== "tool_use") {
throw new Error("Model did not call the expected tool");
}
const result = toolUse.input as {
summary: string;
wordCount: number;
keyPoints: string[];
};
tool_choice: { type: "tool", name: "return_summary" } forces the model to call that specific tool. The input_schema constrains the structure of the tool call arguments.
When to use tool calling vs. prompting: Use tools whenever you have a well-defined output schema. Prompting for JSON output should be a last resort or a fallback for models that do not support structured output modes.
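When prompting for JSON is the fallback, pair it with a defensive parser that strips the failure modes you can anticipate. A minimal sketch (the helper name is illustrative, not from any library):

```typescript
// Hypothetical helper: extract a JSON object from raw model text.
// Handles the two most common failure modes when prompting for JSON:
// markdown code fences and a prose preamble before the object.
function extractJson(raw: string): unknown {
  // Strip ```json ... ``` fences if present
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = fenced ? fenced[1] : raw;
  // Fall back to the first {...} span if the text has a preamble
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) {
    throw new Error("No JSON object found in model output");
  }
  return JSON.parse(candidate.slice(start, end + 1));
}
```

This does not replace structured output modes; it narrows the blast radius when you cannot use them.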
Level 2: The Validation Pipeline
API-level schema enforcement eliminates structural failures. Semantic failures remain. The validation pipeline is where you catch them.
A validation pipeline is a sequence of checks that runs after every model response before the result leaves your system. Design it in layers:
Model Output
│
▼
[Structural Check] ← Does the output parse? Are required fields present?
│ pass
▼
[Type Check] ← Are field types correct? Arrays are arrays, numbers are numbers?
│ pass
▼
[Range Check] ← Are values within acceptable bounds?
│ pass
▼
[Business Logic Check] ← Are values consistent with domain rules?
│ pass
▼
Accepted Output → Downstream Systems
Each layer has a distinct failure mode. Do not merge them into one function.
interface ValidationResult {
valid: boolean;
errors: string[];
stage: "structural" | "type" | "range" | "business" | "none";
}
function validateSummary(data: unknown): ValidationResult {
// Layer 1: Structural
if (typeof data !== "object" || data === null) {
return { valid: false, errors: ["Output is not an object"], stage: "structural" };
}
const d = data as Record<string, unknown>;
// Layer 2: Type
const typeErrors: string[] = [];
if (typeof d.summary !== "string") typeErrors.push("summary must be a string");
if (typeof d.wordCount !== "number") typeErrors.push("wordCount must be a number");
if (!Array.isArray(d.keyPoints)) typeErrors.push("keyPoints must be an array");
if (typeErrors.length > 0) {
return { valid: false, errors: typeErrors, stage: "type" };
}
// Layer 3: Range
const rangeErrors: string[] = [];
const wordCountActual = (d.summary as string).split(/\s+/).filter(Boolean).length;
if ((d.wordCount as number) < 10 || (d.wordCount as number) > 500) {
rangeErrors.push(`wordCount ${d.wordCount} is outside acceptable range [10, 500]`);
}
if ((d.keyPoints as unknown[]).length < 1 || (d.keyPoints as unknown[]).length > 5) {
rangeErrors.push(`keyPoints length ${(d.keyPoints as unknown[]).length} is outside [1, 5]`);
}
if (rangeErrors.length > 0) {
return { valid: false, errors: rangeErrors, stage: "range" };
}
// Layer 4: Business logic
const businessErrors: string[] = [];
const deviation = Math.abs(wordCountActual - (d.wordCount as number));
if (deviation > 10) {
businessErrors.push(
`wordCount field (${d.wordCount}) deviates from actual (${wordCountActual}) by ${deviation}`
);
}
if ((d.keyPoints as string[]).some((p) => typeof p !== "string" || p.trim() === "")) {
businessErrors.push("keyPoints contains empty or non-string entries");
}
if (businessErrors.length > 0) {
return { valid: false, errors: businessErrors, stage: "business" };
}
return { valid: true, errors: [], stage: "none" };
}
The stage field is not cosmetic. It determines how the retry logic responds.
- A failure at structural or type: retry with a reinforced format instruction.
- A failure at range: retry with an explicit constraint: "wordCount must be between 10 and 500."
- A failure at business: the model produced something internally inconsistent. Lower the temperature on retry and add an explicit instruction about the inconsistency.
Level 3: Retry Strategies by Failure Type
Article two introduced the concept of different retry behaviors per error type. Here is the full decision tree.
type RetryStrategy =
| { action: "retry_same" }
| { action: "retry_modified"; addition: string }
| { action: "retry_reduced_temp" }
| { action: "fallback_model"; model: string }
| { action: "fail" };
function determineRetry(
validation: ValidationResult,
attempt: number,
maxAttempts: number
): RetryStrategy {
if (attempt >= maxAttempts) return { action: "fail" };
switch (validation.stage) {
case "structural":
case "type":
return {
action: "retry_modified",
addition:
"Your previous response did not match the required structure. " +
"Return ONLY the JSON object with the exact fields: summary (string), " +
"wordCount (integer), keyPoints (array of strings). No other text.",
};
case "range":
return {
action: "retry_modified",
addition: `Constraint violated: ${validation.errors.join("; ")}. ` +
"Adjust your response to meet these constraints.",
};
case "business":
return {
action: "retry_reduced_temp",
};
default:
return { action: "retry_same" };
}
}
And the retry loop that uses this:
async function callModelWithFullRetry(
text: string,
options: { maxAttempts?: number; temperature?: number } = {}
): Promise<SummaryResult> {
const maxAttempts = options.maxAttempts ?? 3;
let temperature = options.temperature ?? 0.7;
let promptAddition = "";
let lastError: string = "unknown";
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
let raw: unknown;
try {
raw = await callModel(text + promptAddition, { temperature });
} catch (err) {
// Network/transient failures: delay and retry
lastError = err instanceof Error ? err.message : String(err);
if (attempt < maxAttempts) {
await delay(500 * Math.pow(2, attempt - 1));
continue;
}
break;
}
const validation = validateSummary(raw);
if (validation.valid) {
if (attempt > 1) {
// Log successful retries for classifier tuning
logger.info({ event: "retry_succeeded", attempt, stage: validation.stage });
}
return raw as SummaryResult;
}
lastError = validation.errors.join("; ");
const strategy = determineRetry(validation, attempt, maxAttempts);
switch (strategy.action) {
case "retry_modified":
promptAddition = "\n\n" + strategy.addition;
break;
case "retry_reduced_temp":
temperature = Math.max(0.1, temperature - 0.3);
break;
case "fail":
break;
case "retry_same":
await delay(300);
break;
}
if (strategy.action === "fail") break;
}
throw new Error(`Model call failed after ${maxAttempts} attempts. Last error: ${lastError}`);
}
Key behaviors:
- Temperature reduction for business logic failures. When the model's output is structurally correct but internally inconsistent, lower temperature pushes it toward its highest-probability output, which is usually the most consistent one. This is a real signal, not a guess.
- Prompt augmentation preserves the original instructions. The addition is appended to the original prompt, so on each retry the model sees the full original instructions plus the correction feedback. Because each retry replaces the previous addition rather than stacking corrections, the prompt does not grow unwieldy over repeated failures.
- Transient retries use exponential backoff. An immediate retry under a rate limit makes the situation worse.
500ms → 1s → 2s is a safe pattern.
- Log successful retries. A retry that eventually succeeds is still a failure signal. If your retry success rate is 8%, your initial prompt is not constraining the model tightly enough.
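The delay helper used in the retry loop is not defined in the excerpt above; a minimal sketch, with jitter added so parallel clients do not retry in lockstep (the 25% jitter fraction is an assumption, not from the source):

```typescript
// Compute the backoff for a given attempt: 500ms base, doubled per attempt,
// with up to 25% random jitter so parallel clients do not retry in lockstep.
function backoffMs(attempt: number, baseMs = 500, jitter = 0.25): number {
  const exponential = baseMs * Math.pow(2, attempt - 1);
  return exponential * (1 + Math.random() * jitter);
}

function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

In the retry loop, `await delay(backoffMs(attempt))` replaces the fixed `delay(500 * Math.pow(2, attempt - 1))`.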
The Circuit Breaker
High retry rates cause a cascade: you are already paying 3–6x the token cost per request, and if many requests are retrying in parallel, you amplify load on the model API, potentially causing more rate limit errors, which cause more retries.
A circuit breaker breaks this loop:
class CircuitBreaker {
private failureCount = 0;
private lastFailureTime = 0;
private state: "closed" | "open" | "half-open" = "closed";
constructor(
private readonly failureThreshold: number = 5,
private readonly recoveryTimeMs: number = 30_000
) {}
isOpen(): boolean {
if (this.state === "open") {
const timeSinceFailure = Date.now() - this.lastFailureTime;
if (timeSinceFailure > this.recoveryTimeMs) {
this.state = "half-open";
return false;
}
return true;
}
return false;
}
recordSuccess(): void {
this.failureCount = 0;
this.state = "closed";
}
recordFailure(): void {
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.failureCount >= this.failureThreshold) {
this.state = "open";
}
}
}
const breaker = new CircuitBreaker(5, 30_000);
async function callModelWithCircuitBreaker(prompt: string): Promise<unknown> {
if (breaker.isOpen()) {
throw new Error("Circuit breaker open: model calls suspended");
}
try {
const result = await callModel(prompt);
breaker.recordSuccess();
return result;
} catch (err) {
breaker.recordFailure();
throw err;
}
}
The circuit breaker activates after five consecutive failures and stays open for 30 seconds. During that window, calls fail immediately without hitting the model API. This prevents retry storms and gives the upstream service time to recover.
When to set the threshold: Five failures is reasonable for a single instance. If you run multiple instances, a single circuit breaker per instance is fine — you do not need distributed circuit breaker state for most use cases. A shared rate limit counter (Redis-based) is a separate concern.
Semantic Validation: Beyond Structure
Structural validation catches format failures. It does not catch content failures — cases where the model produces a valid, well-structured response that is factually wrong or off-task.
For simple outputs, structural validation is usually sufficient. For higher-stakes outputs (legal, medical, financial domains), you add semantic validation:
Option 1: Rule-based semantic checks
function validateSemanticContent(
summary: SummaryResult,
sourceDocument: string
): string[] {
const errors: string[] = [];
const sourceWords = new Set(sourceDocument.toLowerCase().split(/\W+/));
// Check: summary does not introduce entities not in source
const properNouns = summary.summary.match(/\b[A-Z][a-z]+\b/g) ?? [];
for (const noun of properNouns) {
if (!sourceWords.has(noun.toLowerCase())) {
errors.push(`Potential hallucination: "${noun}" not found in source document`);
}
}
return errors;
}
This is imperfect — it will have false positives (capitalized words at sentence starts). Tune the heuristic for your domain.
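One way to reduce the sentence-start false positives: skip the first word of each sentence, where capitalization is grammatical rather than a signal of a named entity. A sketch, assuming English prose and simple punctuation-based sentence splitting:

```typescript
// Collect capitalized words, skipping the first word of each sentence,
// where capitalization is expected and carries no entity signal.
function midSentenceProperNouns(text: string): string[] {
  const nouns: string[] = [];
  for (const sentence of text.split(/[.!?]\s+/)) {
    const words = sentence.split(/\s+/).filter(Boolean);
    // Skip words[0]: sentence-initial capitals are grammatical
    for (const word of words.slice(1)) {
      if (/^[A-Z][a-z]+$/.test(word)) nouns.push(word);
    }
  }
  return nouns;
}
```

This still misses multi-word entities and abbreviations; for stricter needs, a proper NER library is the next step up.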
Option 2: LLM-as-judge
const judgePrompt = `You are a quality assessor for document summaries.
SOURCE DOCUMENT:
${sourceDocument}
GENERATED SUMMARY:
${summary.summary}
Assess: Does the summary contain any claims not supported by the source document?
Return JSON: {"verdict": "pass"|"fail", "issues": ["<issue>", ...]}`;
const judgment = await callModel(judgePrompt, { temperature: 0 });
LLM-as-judge adds cost and latency. Use it for high-stakes tasks, not bulk processing. At temperature 0, the judge model is highly consistent within a session. Across model versions, calibrate the judge periodically against a ground truth dataset.
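The judge's verdict needs defensive parsing too, since it is itself model output. A sketch assuming the verdict schema from the prompt above (`parseJudgment` is an illustrative helper name):

```typescript
// Parse and validate the judge model's verdict before acting on it.
interface Judgment {
  verdict: "pass" | "fail";
  issues: string[];
}

function parseJudgment(raw: string): Judgment {
  const parsed = JSON.parse(raw) as Partial<Judgment>;
  if (parsed.verdict !== "pass" && parsed.verdict !== "fail") {
    throw new Error(`Unexpected verdict: ${String(parsed.verdict)}`);
  }
  return { verdict: parsed.verdict, issues: parsed.issues ?? [] };
}
```

A failed judgment typically routes the summary back through the retry loop with the judge's issues appended as correction feedback.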
The Failure Walkthrough: Broken JSON
Scenario: A structured extraction service classifies documents as contract, invoice, or correspondence and extracts a key date. The model is gpt-3.5-turbo, which predates the switch to structured outputs.
Observed in production:
✅ {"type": "contract", "date": "2025-01-15"} — 81%
⚠️ ```json\n{"type": "contract", "date": "2025-01-15"}\n``` — 11%
⚠️ {"type": "contract", "date": "2025-01-15", "notes": "payment terms 30 days"} — 5%
❌ {"type": "contract", "date": January 15} — 2% (date not quoted → parse error)
❌ This document is a contract dated January 15th, 2025. — 1%
Four failure modes, four different causes.
The markdown fence case (11%): Model trained on vast amounts of markdown follows the pattern of wrapping code in fences. Your instruction says "return only JSON" but the model treats it as prose context, not a hard constraint. Fix: use JSON mode (OpenAI) or tool calling (Anthropic), or add the regex extraction fallback from article two.
The extra fields case (5%): Model is being "helpful" by including additional information it found. Fix: add an explicit constraint: "Return only the fields in the schema. Do not add extra fields." Also validate on the consumer side by keeping only the fields you need.
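Consumer-side field picking is a few lines; a sketch using the extraction example's fields as an assumed schema:

```typescript
// Keep only the schema's keys, silently dropping any extras the model added.
function pickFields(
  data: Record<string, unknown>,
  keys: readonly string[]
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of keys) {
    if (key in data) out[key] = data[key];
  }
  return out;
}

// Usage: pickFields(parsed, ["type", "date"]) discards a stray "notes" field.
```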
The unquoted value (2%): Model outputs a date without quotes, producing syntactically invalid JSON. Happens most on dates, numbers-that-look-like-text, and boolean-like strings. Fix: structured output mode eliminates this entirely. Alternatively, add an explicit format instruction: "date must be a string in ISO 8601 format (YYYY-MM-DD)."
The prose response (1%): The model ignored all format instructions for this input. The input was likely a very short one-line document. The model, seeing minimal content, defaulted to a natural-language response. Fix: add a minimum document length check. If the input is fewer than 20 words, return an error rather than calling the model.
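The minimum-length guard sits in front of the model call; a sketch where the 20-word threshold is illustrative and should be tuned against your own traffic:

```typescript
// Reject inputs too short to classify reliably, before spending a model call.
// Returns an error message, or null if the document is long enough.
function checkMinimumLength(document: string, minWords = 20): string | null {
  const wordCount = document.split(/\s+/).filter(Boolean).length;
  if (wordCount < minWords) {
    return `Document too short to classify: ${wordCount} words (minimum ${minWords})`;
  }
  return null; // null = ok to proceed
}
```

A rejected input is cheaper and more honest than a confident model answer about a one-line document.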
Putting It Together: Reliability Layer Checklist
Before any structured output feature ships to production:
- [ ] API-level structured output mode is active (JSON mode / tool calling / response format)
- [ ] Validation pipeline runs on every response: structural → type → range → business
- [ ] Each validation failure stage maps to a distinct retry strategy
- [ ] Retry loop has a hard maximum (≤3 attempts) with exponential backoff on transient failures
- [ ] Prompt modification is additive on retry (original instructions preserved)
- [ ] Circuit breaker is in place for model call failures
- [ ] All validation failures are logged with stage and error detail
- [ ] Retry rate is tracked as a metric — alert if above 5%
- [ ] Semantic validation is included for higher-stakes domains
What Is Next
You now have a reliability layer. Your system produces structurally correct output most of the time and recovers gracefully when it does not.
The remaining problem: correctness. The output is correctly formatted, but is it right? In article five, we look at hallucination — why the model confidently states false things, what your reliability layer cannot catch, and what you have to build to mitigate it.