Debugging AI Systems (Your First Real Struggle)
When something goes wrong in a multi-component AI system, where do you start? Tracing prompt-to-output, identifying failure source, structured logging, and the systematic method that beats guessing every time.
The Debugging Problem Is Different
In deterministic systems, debugging has a clear structure. An unexpected output points to a code path. You read the code, find the logic error, fix it. Reproducibility is a given — run the same input and get the same wrong output every time.
AI system debugging has three properties that break this structure.
Non-determinism. The same input at temperature > 0 can produce a different output every time. A failure that occurs 10% of the time cannot be reproduced on demand. "Run it again" is not a debugging strategy.
Multi-stage pipelines. A bad final output could be caused by any stage: a prompt that misguides the model, a retrieval step that fetches wrong context, a validator that lets bad output through, a postprocessor that corrupts a valid response. The failure point is not obvious.
Opaque internals. You cannot inspect the model's "reasoning." You can observe inputs and outputs. Everything in between is a black box.
This article presents that systematic method: not heuristics, but a structured process that reliably isolates failure sources.
The Core Method: Trace, Not Guess
The failure mode in most AI debugging efforts: developers look at the bad output and try to intuit the cause. "The model must be confused by X." This is guessing. It is sometimes right. When it is wrong, it wastes hours.
The alternative: build observability that makes the cause visible, then read the evidence.
Every request that passes through your system should produce a trace. A trace is a record of the full state at each pipeline stage:
interface StageTrace {
stageName: string;
input: unknown;
output: unknown;
latencyMs: number;
tokensUsed?: { input: number; output: number };
error?: string;
}
interface RequestTrace {
requestId: string;
timestamp: string;
stages: StageTrace[];
finalOutput: unknown;
overallLatencyMs: number;
success: boolean;
}
When you investigate a failure, you pull the trace for that request and read it stage by stage. The failure is at the first stage where the output is wrong given its input.
class Tracer {
private stages: StageTrace[] = [];
private readonly requestId: string;
private readonly start: number;
constructor(requestId: string) {
this.requestId = requestId;
this.start = Date.now();
}
async trace<T>(
stageName: string,
fn: () => Promise<T>
): Promise<T> {
const stageStart = Date.now();
try {
const output = await fn();
this.stages.push({
stageName,
input: null, // Populated by wrapper
output,
latencyMs: Date.now() - stageStart,
});
return output;
} catch (err) {
this.stages.push({
stageName,
input: null,
output: null,
latencyMs: Date.now() - stageStart,
error: err instanceof Error ? err.message : String(err),
});
throw err;
}
}
flush(success: boolean, finalOutput: unknown): RequestTrace {
const trace: RequestTrace = {
requestId: this.requestId,
timestamp: new Date().toISOString(),
stages: this.stages,
finalOutput,
overallLatencyMs: Date.now() - this.start,
success,
};
logger.info({ event: "request.trace", ...trace });
return trace;
}
}
Usage in a RAG pipeline:
async function handleQuery(query: string, requestId: string): Promise<string> {
const tracer = new Tracer(requestId);
let finalOutput = null;
try {
const chunks = await tracer.trace("retrieval", () => retrieve(query));
const context = await tracer.trace("context_build", () =>
Promise.resolve(buildContext(chunks))
);
const answer = await tracer.trace("model_call", () => callModel(context, query));
const validated = await tracer.trace("validation", () =>
Promise.resolve(validateAnswer(answer))
);
finalOutput = validated;
tracer.flush(true, finalOutput);
return validated;
} catch (err) {
tracer.flush(false, null);
throw err;
}
}
The Failure Taxonomy: Where Things Break
Map every failure you encounter to one of these categories. Once categorized, the investigation path is determined.
Category 1: Input Failure
The input itself is the problem. Malformed, empty, out-of-domain.
Symptoms: Failure on a subset of inputs with no obvious pattern in the output. Other inputs work fine.
Diagnosis: Log and profile all inputs (article ten covered this). Find the statistical differences between failing inputs and passing ones:
async function profileFailureVsSuccess(): Promise<void> {
const failing = await db.query(`
SELECT input_char_length, has_code_block, language, input_type
FROM request_logs
WHERE success = false AND created_at > NOW() - INTERVAL '24 hours'
`);
const passing = await db.query(`
SELECT input_char_length, has_code_block, language, input_type
FROM request_logs
WHERE success = true AND created_at > NOW() - INTERVAL '24 hours'
`);
// Compute distribution differences
console.log("Avg length (failing)", average(failing.rows.map((r) => r.input_char_length)));
console.log("Avg length (passing)", average(passing.rows.map((r) => r.input_char_length)));
}
If failing inputs have a statistically distinct characteristic (length, presence of code, detected language), you have your root cause.
Fix: Add preprocessing to handle the problematic input class, or add an explicit constraint to your prompt that addresses it, or reject the input with an informative error.
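The "reject with an informative error" option can be a small guard in front of the pipeline. A minimal sketch, assuming the failing class is empty or extremely long inputs; the threshold and reason strings here are illustrative, not tuned values:

```typescript
// Sketch: reject a hypothetical problematic input class before it reaches
// the model. Adjust checks to match the class your profiling identified.
interface InputCheck {
  ok: boolean;
  reason?: string;
}

function guardInput(input: string): InputCheck {
  const trimmed = input.trim();
  if (trimmed.length === 0) {
    return { ok: false, reason: "empty_input" };
  }
  if (trimmed.length > 20_000) {
    // Illustrative cap; derive the real bound from your failing-input profile
    return { ok: false, reason: "input_too_long" };
  }
  return { ok: true };
}
```

Rejecting early is often better than preprocessing: the user gets an actionable error, and the failing class stops polluting your success-rate metrics.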
Category 2: Retrieval Failure (RAG)
The model's answer is wrong because the retrieved context is wrong — irrelevant, missing the relevant chunk, or stale.
Symptoms: Good answers on questions about popular topics, bad answers on niche or recent ones. Answers cite information from unrelated sources.
Diagnosis: Pull the retrieval stage from the trace. Look at:
- topSimilarity: if below 0.80, retrieval quality is poor
- docsRetrieved: if 0 or 1, the query has no close match in your knowledge base
- Which documents were retrieved: do they contain the answer?
async function diagnoseRetrievalFailure(
query: string,
traceId: string
): Promise<void> {
const stage = await getTraceStage(traceId, "retrieval");
const docs = stage.output as RetrievedDocument[];
console.log(`Top similarity: ${docs[0]?.similarity ?? 0}`);
console.log(`Docs retrieved: ${docs.length}`);
docs.forEach((doc, i) => {
console.log(`\n--- Doc ${i + 1} (sim: ${doc.similarity}) ---`);
console.log(doc.content.slice(0, 300));
});
// Does any retrieved doc contain the expected answer?
const expected = "..."; // What you expected the answer to be about
const coverageCheck = docs.some((d) => d.content.includes(expected));
console.log(`Expected content in retrieved docs: ${coverageCheck}`);
}
Fix paths:
- Low similarity: the knowledge base lacks coverage — add the relevant documents
- Wrong documents retrieved: adjust chunking or add metadata filtering (articles six and seven)
- Stale content: re-index and add a freshness check
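The freshness check can be as simple as flagging retrieved documents whose index timestamp is older than a window. A sketch, assuming each indexed document carries an `indexedAt` timestamp; the 90-day window is an illustrative choice:

```typescript
// Sketch: surface stale documents so they are visible in the trace.
interface IndexedDoc {
  id: string;
  indexedAt: string; // ISO 8601
}

function staleDocs(
  docs: IndexedDoc[],
  maxAgeDays: number,
  now: Date = new Date()
): IndexedDoc[] {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return docs.filter((d) => new Date(d.indexedAt).getTime() < cutoff);
}
```

Log the result alongside `topSimilarity` in the retrieval stage; a high-similarity but stale hit is a distinct failure signature from a low-similarity miss.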
Category 3: Prompt Failure
The retrieved context is correct, but the prompt misguides the model away from the available answer.
Symptoms: The trace shows correct retrieval, but the model's output does not reflect the retrieved content. The answer is wrong even though the right information was in the context.
Diagnosis: Extract the exact prompt that was sent to the model from the trace (this is why you log full prompts, not just metadata):
// Log full prompts during debugging; redact PII in production
logger.debug({
event: "prompt_debug",
requestId,
fullSystemPrompt: systemPrompt,
fullUserPrompt: userPrompt.slice(0, 5000), // Truncate if very long
contextChunks: retrievedChunks.map((c) => ({ id: c.id, similarity: c.similarity })),
});
With the full prompt in hand, test it in isolation:
- Copy the exact prompt from the trace.
- Send it to the model directly (using the API playground or a REPL).
- Run it 5–10 times. Observe the distribution of outputs.
- If the model consistently ignores the correct context: the prompt constraint is insufficient.
- If the model sometimes gets it right: temperature or non-determinism issue.
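Steps 3 and 4 above can be scripted. A minimal replay harness, assuming `callModel` is whatever function wraps your provider SDK (injected here so the harness itself has no provider dependency):

```typescript
// Sketch: replay one prompt N times and summarize the output distribution.
async function sampleOutputs(
  callModel: (prompt: string) => Promise<string>,
  prompt: string,
  runs = 10
): Promise<Map<string, number>> {
  const counts = new Map<string, number>();
  for (let i = 0; i < runs; i++) {
    const out = (await callModel(prompt)).trim();
    counts.set(out, (counts.get(out) ?? 0) + 1);
  }
  return counts;
}
```

If one wrong answer dominates the distribution, the prompt is the problem; a near-even split between right and wrong answers points at temperature or inherent non-determinism.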
Fix: Apply the prompt engineering principles from article three. Add an explicit constraint that covers the failure case.
Category 4: Validation Failure
The model produced a correct answer but the validation layer rejected it.
Symptoms: High retry rate, but on investigation the retried output is actually correct. Users report errors on responses that seem fine.
Diagnosis: Pull validation failures from logs. Check if they consistently fail on the same check:
const failuresByCheck = await db.query(`
SELECT check_name, COUNT(*) as count
FROM validation_failures
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY check_name
ORDER BY count DESC
`);
If 80% of validation failures are on wordCount_range_check and the produced summaries are actually appropriate length, your range bounds are too tight.
Fix: Relax the overly strict check, or fix it to correctly implement the intended constraint. This is a false positive in your validation layer, not a model failure.
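The corrected check should also report the measured value, so future false positives are obvious in the logs. A sketch of a word-count range check; the default bounds are illustrative and should be derived from the distribution of outputs users actually accepted:

```typescript
// Sketch: a range check that returns the measured value, not just pass/fail.
function wordCountCheck(
  text: string,
  min = 10,
  max = 400
): { pass: boolean; wordCount: number } {
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  return { pass: wordCount >= min && wordCount <= max, wordCount };
}
```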
Category 5: Postprocessing Failure
The model returned a valid, correct response, but your postprocessing code corrupted it.
Symptoms: The raw model output (in the trace) looks correct. The final output delivered to the user is wrong.
Diagnosis: Compare the model_call stage output to the final output in the trace. If they differ, the postprocessor introduced the error.
// Common culprits: JSON.parse, regex extraction, field mapping
// Always trace the raw model output separately from the parsed output
logger.debug({
event: "postprocess_debug",
requestId,
rawModelOutput: rawOutput,
parsedResult: parsedOutput,
parsingMethod: "json_parse", // one of "json_parse" | "regex_extract" | "structured_output"
});
Postprocessing bugs are ordinary code bugs, fully reproducible and debuggable with standard tools.
The Debugging Workflow
When a failure is reported, follow this process in order. Do not skip steps.
Step 1: Find the trace.
failure_report → requestId → pull RequestTrace from logs
If you do not have a requestId from the failure report, search by input content, user ID, or timestamp range.
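That fallback search is worth wrapping in a helper. A sketch, assuming the `request_logs` table used earlier and a pg-style `db.query(sql, params)` client; the column names are illustrative:

```typescript
// Sketch: find candidate requestIds near a reported failure time.
interface QueryResult {
  rows: { request_id: string }[];
}

async function findCandidateTraces(
  db: { query: (sql: string, params: unknown[]) => Promise<QueryResult> },
  userId: string,
  around: Date,
  windowMinutes = 30
): Promise<string[]> {
  const from = new Date(around.getTime() - windowMinutes * 60_000).toISOString();
  const to = new Date(around.getTime() + windowMinutes * 60_000).toISOString();
  const res = await db.query(
    `SELECT request_id FROM request_logs
     WHERE user_id = $1 AND created_at BETWEEN $2 AND $3
     ORDER BY created_at DESC`,
    [userId, from, to]
  );
  return res.rows.map((r) => r.request_id);
}
```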
Step 2: Identify the failing stage.
Walk the trace stages in order. At each stage, ask: "Is this output correct given this input?" The first stage where the answer is "no" is the failure source.
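The mechanical part of that walk can be automated: any stage that recorded an `error` is an immediate candidate. Semantic deviations (output that is wrong but did not throw) still need a human eye. A sketch against the `RequestTrace` shape defined earlier:

```typescript
// Sketch: return the first stage that recorded an error, or null if none did.
function firstFailingStage(trace: {
  stages: { stageName: string; error?: string }[];
}): string | null {
  for (const stage of trace.stages) {
    if (stage.error) return stage.stageName;
  }
  return null;
}
```

When this returns null but the final output is still wrong, you are looking at a semantic failure (categories 2 or 3), and you read the stage outputs by hand.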
Step 3: Reproduce in the failing stage.
// Reproduce the failing stage with its exact inputs from the trace
async function reproduceFailure(trace: RequestTrace, stageName: string): Promise<void> {
const failingStage = trace.stages.find((s) => s.stageName === stageName);
if (!failingStage) throw new Error(`Stage ${stageName} not in trace`);
// Run the stage function directly with the traced input
const result = await runStageByName(stageName, failingStage.input);
console.log("Reproduced output:", result);
}
Running the specific stage in isolation confirms the failure is reproducible at that stage.
Step 4: Determine failure category.
Use the taxonomy above. The stage name and the nature of the failure (wrong output, exception, validation failure) usually determine the category.
Step 5: Apply the fix and add a test case.
Every debugging session should add at least one test case to your eval suite. The failing input is now a known failure mode. Add it:
{
id: `regression-${Date.now()}`,
description: "Regression: query about product SSO when docs predate the change",
input: failingInput,
checks: [
(output) => {
// The expected correct behavior after the fix
return output.answer === null || output.answer.includes("not available");
}
]
}
The test case prevents the same failure from silently re-entering the system.
Structured Logging That Actually Helps
Good logs make debugging a 10-minute investigation instead of a 4-hour one. Bad logs make debugging a process of reading thousands of irrelevant lines.
What every log entry needs:
interface LogEntry {
timestamp: string; // ISO 8601, always UTC
requestId: string; // Unique per request — makes filtering trivial
event: string; // namespaced: "retrieval.chunk_fetched", "validation.failed"
// Event-specific fields below
}
The requestId is the single most important field. Every log entry in a request — from the first HTTP handler to the last postprocessor — should include the same requestId. This lets you view the entire lifecycle of one request in chronological order:
requestId="req-abc123" event="http.received" latencyMs=0
requestId="req-abc123" event="retrieval.started"
requestId="req-abc123" event="retrieval.complete" docsRetrieved=5 topSimilarity=0.84
requestId="req-abc123" event="model_call.started" inputTokens=1240
requestId="req-abc123" event="model_call.complete" outputTokens=180 latencyMs=1840
requestId="req-abc123" event="validation.failed" stage="range" error="wordCount=12 below minimum 20"
requestId="req-abc123" event="retry.started" attempt=2
requestId="req-abc123" event="model_call.complete" outputTokens=220 latencyMs=1620
requestId="req-abc123" event="validation.pass"
requestId="req-abc123" event="http.response" status=200 totalLatencyMs=4100
With this log structure, debugging proceeds like reading a story. You see every event, in order, with all relevant data.
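Threading the requestId through every function signature gets tedious. One way to avoid it in Node is `AsyncLocalStorage`, so every log call picks the id up implicitly; a sketch (the `log` and `withRequest` names are illustrative):

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Sketch: implicit requestId propagation across async calls.
const requestContext = new AsyncLocalStorage<{ requestId: string }>();

function log(
  event: string,
  fields: Record<string, unknown> = {}
): { timestamp: string; requestId: string; event: string; [k: string]: unknown } {
  const requestId = requestContext.getStore()?.requestId ?? "unknown";
  // Spread fields first so the core fields cannot be overwritten
  const entry = { ...fields, timestamp: new Date().toISOString(), requestId, event };
  console.log(JSON.stringify(entry));
  return entry;
}

function withRequest<T>(requestId: string, fn: () => T): T {
  return requestContext.run({ requestId }, fn);
}
```

The HTTP handler wraps the whole request in `withRequest(requestId, ...)` once, and every log line inside that async context carries the id automatically.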
What not to log:
- Full user inputs in production (PII risk) — log length, type, hash for correlation
- Full model prompts in production at INFO level — log at DEBUG with a flag that can be enabled during debugging sessions
- Redundant state — if you log retrieval.started with the query, do not also log the query again in model_call.started
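The "log a hash for correlation" advice looks like this in practice. A sketch using Node's built-in crypto; truncating the digest is an illustrative choice that keeps log lines short while leaving collisions negligible for correlation purposes:

```typescript
import { createHash } from "node:crypto";

// Sketch: log a stable fingerprint of the input instead of the raw text,
// so repeated failures on the same input correlate without storing PII.
function inputFingerprint(input: string): { inputHash: string; inputCharLength: number } {
  return {
    inputHash: createHash("sha256").update(input).digest("hex").slice(0, 16),
    inputCharLength: input.length,
  };
}
```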
Observability Checklist
Before any AI feature goes to production:
- [ ] Each request gets a unique requestId that propagates through every log entry
- [ ] All pipeline stages are traced (input, output, latency)
- [ ] Raw model output is logged separately from parsed/validated output
- [ ] Validation failures are logged with stage name and error detail
- [ ] Retry events are logged with attempt number and reason
- [ ] Cost data (input tokens, output tokens) is logged per request
- [ ] Retrieval quality signals (top similarity, docs count) are logged for RAG features
- [ ] Logs are structured (JSON), not prose
- [ ] A mechanism exists to enable prompt-level debug logging per-request without redeployment
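The last checklist item can be a per-request header check rather than a config redeploy. A minimal sketch; the header name and token allowlist are illustrative, and in a real deployment this must sit behind authentication:

```typescript
// Sketch: enable prompt-level debug logging for a single request via a header.
function debugEnabled(
  headers: Record<string, string | undefined>,
  allowlist: Set<string>
): boolean {
  const token = headers["x-debug-token"];
  return token !== undefined && allowlist.has(token);
}
```

The handler passes the result into the logger so prompt-level DEBUG entries are emitted only for that one request, with no restart and no global log-volume spike.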
Where This Series Ends
Twelve articles. One mental model, built layer by layer.
You have the complete picture: what AI engineering actually is (article one), how to build the first feature (two), how to make prompts reliable (three), how to make outputs reliable (four), how to reduce hallucinations (five), how to add retrieval (six and seven), how to measure correctness (eight), how to control cost and latency (nine), what breaks in production (ten), how to design systems that compose (eleven), and how to debug them when they do not (twelve).
The pattern underlying all of it: observe → analyze → improve → measure → ship. Every failure is diagnostic information. Every evaluation failure is a test case. Every cost spike is a constraint that was missing from the original design.
AI engineering is not about the model. It is about everything you build around it.