
Evaluating LLM Outputs (Your First Eval System)

Building the evaluation infrastructure that lets you know if your AI system is actually working — test datasets, scoring criteria, automation, and a continuous loop that catches regressions before users do.

Why Evaluation Is Not Optional

Every team knows they should test their AI system. Almost no one does it systematically before the first regression ships.

The pattern: you build the feature, test it manually on ten inputs, it looks good, you ship. Three weeks later a user complains about a bad answer. You investigate, realize the failure happens on 8% of production inputs (which you did not test), add a fix, ship again. This continues in reactive loops until the system is stable enough to stop generating daily complaints — which takes months.

Evaluation infrastructure breaks this loop. It replaces "test manually on a few examples" with a systematic process that catches failures before they reach users. The concept is the same as automated testing in any other software — the difference is that AI outputs cannot be compared for exact equality, so your test suite evaluates properties, not values.

This article builds the minimal viable evaluation system. No frameworks required. No external services. Raw code that you can implement today.


The Test Dataset

Everything starts here. A test dataset is a set of (input, criteria) pairs where each pair represents a behavioral requirement of the system.

What is NOT a test case:

input: "Summarize this article: <text>"
expected_output: "The article discusses the impact of..."  // exact string match

LLM outputs are non-deterministic. An exact string match fails on every run where the model produces a slightly different phrasing of a correct answer. This evaluates the phrasing, not the behavior.
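To make the distinction concrete, here is a minimal sketch (the two phrasings and check names are illustrative):

```typescript
// Two correct outputs for the same input, phrased differently across runs.
const runA = "The article covers Q3 revenue growth and margin expansion.";
const runB = "Q3 revenue grew while margins expanded, the article reports.";

// Exact-match testing: passes at most one run, by accident.
const exactMatch = (output: string) => output === runA;

// Property testing: passes any correct phrasing.
const mentionsRevenue = (output: string) => output.toLowerCase().includes("revenue");
const underFiftyWords = (output: string) => output.split(/\s+/).length <= 50;

console.log(exactMatch(runA), exactMatch(runB));           // true false
console.log(mentionsRevenue(runA), mentionsRevenue(runB)); // true true
console.log(underFiftyWords(runA), underFiftyWords(runB)); // true true
```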

What IS a test case:

interface EvalCase {
  id: string;
  input: Record<string, unknown>;
  checks: Array<(output: unknown) => boolean | string>;
  description: string;
}

A check is a function that takes the output and returns true (pass), false (fail), or a string (the string is an error message — implicitly a fail). Checks evaluate properties:

  • Is the output valid JSON?
  • Does the summary field exist and contain between 20 and 200 words?
  • Do all cited sources appear verbatim in the source document?
  • Is the classification one of the valid category values?
  • Does the response NOT contain the phrase "as an AI language model"?
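A few of these bullet points, sketched as check functions (the output shapes are assumptions for illustration):

```typescript
type Check = (output: unknown) => boolean | string;

// Is the output valid JSON? (for systems that return a raw string)
const isValidJson: Check = (output) => {
  if (typeof output !== "string") return "output is not a string";
  try { JSON.parse(output); return true; } catch { return "output is not valid JSON"; }
};

// Does the summary field contain between 20 and 200 words?
const summaryWordRange: Check = (output) => {
  const o = output as { summary?: unknown };
  if (typeof o?.summary !== "string") return "summary field missing or not a string";
  const words = o.summary.split(/\s+/).filter(Boolean).length;
  return words >= 20 && words <= 200 ? true : `summary has ${words} words, expected 20-200`;
};

// Does the response avoid boilerplate disclaimers?
const noAiDisclaimer: Check = (output) =>
  String(output).toLowerCase().includes("as an ai language model")
    ? "response contains AI disclaimer boilerplate"
    : true;
```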

Building your first dataset:

Start with 20 cases covering three categories:

  1. Common cases (10): Representative inputs from your expected usage. The everyday case.
  2. Edge cases (7): Short inputs, long inputs, empty inputs, inputs in the wrong language, inputs with missing fields.
  3. Known hard cases (3): Inputs where you already know the system historically produced bad output. Every production failure is a new test case.

Twenty cases is not comprehensive. It is the minimum from which a pass threshold is meaningful. Grow it over time — add 3–5 cases every week as you discover new failure modes.


Defining Checks

Checks should be precise enough to catch real failures and broad enough not to reject correct outputs on minor variation.

For the summarization system from articles two through four:

import type { SummaryResult } from "./types";

const summaryEvalCases: EvalCase[] = [
  {
    id: "summary-basic-001",
    description: "Standard article summary: structural and length checks",
    input: {
      text: `The quarterly earnings report showed revenue of $1.2B, up 15% year over year.
             The growth was driven primarily by enterprise segment expansion in North America.
             Operating margins improved 2.3 percentage points to 23.1%.`,
      maxWords: 100,
    },
    checks: [
      (output: unknown) => {
        if (typeof output !== "object" || output === null) return "output is not an object";
        const o = output as Partial<SummaryResult>;
        if (typeof o.summary !== "string") return "summary is not a string";
        if (typeof o.wordCount !== "number") return "wordCount is not a number";
        if (!Array.isArray(o.keyPoints)) return "keyPoints is not an array";
        return true;
      },
      (output: unknown) => {
        const o = output as SummaryResult;
        const words = o.summary.split(/\s+/).filter(Boolean);
        if (words.length > 100) return `summary exceeds 100 words (${words.length})`;
        if (words.length < 15) return `summary too short (${words.length} words)`;
        return true;
      },
      (output: unknown) => {
        const o = output as SummaryResult;
        if (o.keyPoints.length < 1 || o.keyPoints.length > 5)
          return `keyPoints length invalid: ${o.keyPoints.length}`;
        return true;
      },
      (output: unknown) => {
        const o = output as SummaryResult;
        // Grounding check: the summary should reference the source's subject matter
        if (!o.summary.includes("revenue") && !o.summary.includes("earnings"))
          return "summary does not reference financial content from source";
        return true;
      },
    ],
  },
  {
    id: "summary-short-input-002",
    description: "Short document (under 50 words) should return error object",
    input: { text: "Short.", maxWords: 100 },
    checks: [
      (output: unknown) => {
        if (typeof output !== "object" || output === null) return "output is not an object";
        const o = output as Record<string, unknown>;
        if (!("error" in o)) return "expected error field for short document";
        return true;
      },
    ],
  },
  {
    id: "summary-no-hallucination-003",
    description: "Summary should not introduce entities not in source",
    input: {
      text: "The board approved a $500M share buyback program.",
      maxWords: 50,
    },
    checks: [
      (output: unknown) => {
        const o = output as SummaryResult;
        // Check that fabricated numbers (anything other than $500M) do not appear
        const fabricatedNumbers = o.summary.match(/\$[\d.]+[BMK]?\b/g) ?? [];
        const invalidNums = fabricatedNumbers.filter((n) => n !== "$500M");
        if (invalidNums.length > 0)
          return `Summary contains numbers not in source: ${invalidNums.join(", ")}`;
        return true;
      },
    ],
  },
];

The hallucination check (case 003) is imperfect — it only catches fabricated dollar amounts. That is the point. Each check should target one specific failure mode. Multiple focused checks are better than one complex check.
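A broader variant of the same idea, flagging capitalized entities in the summary that never appear in the source (a sketch of one possible check, and still a heuristic — proper-noun detection by capitalization is approximate):

```typescript
// Return capitalized words in the summary that do not appear in the source.
// The stopword list is a small illustrative set, not exhaustive.
function fabricatedEntities(source: string, summary: string): string[] {
  const entities = summary.match(/\b[A-Z][a-zA-Z]+\b/g) ?? [];
  const stopwords = new Set(["The", "A", "An", "This", "It"]);
  return [...new Set(entities)].filter(
    (e) => !stopwords.has(e) && !source.includes(e)
  );
}
```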


The Evaluation Runner

interface EvalResult {
  caseId: string;
  description: string;
  passed: boolean;
  failures: string[];
  latencyMs: number;
  raw: unknown;
}

async function runEvalSuite(
  evalCases: EvalCase[],
  system: (input: Record<string, unknown>) => Promise<unknown>
): Promise<{ results: EvalResult[]; passRate: number }> {
  const results: EvalResult[] = [];

  for (const evalCase of evalCases) {
    const start = Date.now();
    let raw: unknown = null;
    const failures: string[] = [];

    try {
      raw = await system(evalCase.input);

      for (const check of evalCase.checks) {
        const result = check(raw);
        if (result !== true) {
          failures.push(typeof result === "string" ? result : "check failed");
        }
      }
    } catch (err) {
      failures.push(`System threw: ${err instanceof Error ? err.message : String(err)}`);
    }

    results.push({
      caseId: evalCase.id,
      description: evalCase.description,
      passed: failures.length === 0,
      failures,
      latencyMs: Date.now() - start,
      raw,
    });
  }

  const passRate = results.filter((r) => r.passed).length / results.length;
  return { results, passRate };
}

Running it:

const PASS_THRESHOLD = 0.90; // 90% pass rate required

async function runAndReport() {
  const { results, passRate } = await runEvalSuite(summaryEvalCases, async (input) => {
    const { text, maxWords } = input as { text: string; maxWords?: number };
    return summarizeWithRetry(text, maxWords);
  });

  for (const result of results) {
    if (!result.passed) {
      console.log(`\nFAIL: ${result.caseId} — ${result.description}`);
      result.failures.forEach((f) => console.log(`  ↳ ${f}`));
    }
  }

  console.log(`\nPass rate: ${(passRate * 100).toFixed(1)}%`);

  if (passRate < PASS_THRESHOLD) {
    process.exit(1); // Fail CI
  }
}

runAndReport();

Run this in CI on every prompt change and on a scheduled basis (daily or weekly) against production. If pass rate drops, the pipeline fails. An alert fires. You investigate before users notice.


LLM-as-Judge: Scaling Semantic Evaluation

Rule-based checks catch structural and surface-level failures. They cannot catch semantic failures: the summary is factually wrong but structurally perfect, or the key points are accurate but miss the document's main conclusion.

Enter LLM-as-judge: use a second model call to evaluate the output against semantic criteria.

interface JudgmentResult {
  passes: boolean;
  score: number; // 1–5
  reasoning: string;
  issues: string[];
}

async function judgeOutput(
  input: string,
  output: string,
  criteria: string[]
): Promise<JudgmentResult> {
  const criteriaText = criteria.map((c, i) => `${i + 1}. ${c}`).join("\n");

  const judgePrompt = `You are an output quality evaluator. Assess the output against each criterion.

INPUT:
${input}

OUTPUT:
${output}

EVALUATION CRITERIA:
${criteriaText}

Return JSON:
{
  "score": <1-5 integer: 1=poor, 3=acceptable, 5=excellent>,
  "passes": <boolean: true if score >= 3 and no critical failures>,
  "reasoning": "<one sentence summary of your assessment>",
  "issues": ["<issue>", ...]
}`;

  const response = await callModel(judgePrompt, { temperature: 0 });
  return JSON.parse(response) as JudgmentResult;
}
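Note that JSON.parse throws if the judge wraps its answer in prose or markdown code fences, which models sometimes do. A hedged parsing helper (a sketch; the fallback defaults are assumptions):

```typescript
// JudgmentResult as defined earlier in this article
interface JudgmentResult {
  passes: boolean;
  score: number; // 1–5
  reasoning: string;
  issues: string[];
}

function parseJudgment(response: string): JudgmentResult {
  // Strip markdown code fences if the model added them
  const cleaned = response
    .replace(/^```(?:json)?\s*/i, "")
    .replace(/\s*```$/, "")
    .trim();
  // Fall back to the first {...} block if prose surrounds the JSON
  const json = cleaned.startsWith("{") ? cleaned : cleaned.match(/\{[\s\S]*\}/)?.[0];
  if (!json) throw new Error("judge response contains no JSON object");
  const parsed = JSON.parse(json) as Partial<JudgmentResult>;
  // Default defensively so a malformed judgment counts as a fail, not a crash
  return {
    passes: parsed.passes === true,
    score: typeof parsed.score === "number" ? parsed.score : 1,
    reasoning: typeof parsed.reasoning === "string" ? parsed.reasoning : "",
    issues: Array.isArray(parsed.issues) ? parsed.issues : [],
  };
}
```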

Calibrating the judge:

The judge model is itself an LLM. It has its own failure modes. Before trusting it, calibrate it:

  1. Collect 30 outputs you have manually labeled as good/bad.
  2. Run the judge over all 30.
  3. Measure agreement: what fraction does the judge classify the same way you did?
  4. If agreement is below 80%, your criteria are too vague — make them more specific.

In code:

async function calibrateJudge(
  calibrationSet: Array<{ input: string; output: string; humanLabel: "pass" | "fail" }>,
  criteria: string[]
): Promise<number> {
  let agreements = 0;

  for (const sample of calibrationSet) {
    const judgment = await judgeOutput(sample.input, sample.output, criteria);
    const judgeLabel = judgment.passes ? "pass" : "fail";
    if (judgeLabel === sample.humanLabel) agreements++;
  }

  return agreements / calibrationSet.length;
}

Run this once before deploying LLM-as-judge for a new feature. A calibration score below 0.80 means your criteria need refinement.


The Continuous Evaluation Loop

One-time evals tell you if the system is good today. Continuous evaluation tells you if it stays good.

Architecture:

Production Traffic
       │
       ▼
 Sample 5% of requests ──► Eval Queue
                                │
                                ▼
                         Run rule-based checks
                                │
                                ▼
                    For failures: LLM-as-judge
                                │
                                ▼
                         Pass rate dashboard
                                │
                          Below threshold?
                                │
                          Alert fires

The sampling rate (5%) is a starting point. At 100K requests/day, 5% is 5,000 evals. At 1M requests/day, reduce to 1% (10,000 evals). The goal is enough volume for statistical significance, not evaluating everything.
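"Enough volume for statistical significance" can be made concrete with a normal-approximation confidence interval on the pass rate (a sketch; the Wilson interval is more accurate at extreme rates or small samples):

```typescript
// 95% confidence interval half-width for an observed pass rate over n samples
function passRateMarginOfError(passRate: number, n: number): number {
  const z = 1.96; // z-score for 95% confidence
  return z * Math.sqrt((passRate * (1 - passRate)) / n);
}

// At 5,000 evals/day and a 90% pass rate, the margin is under 1 point:
//   passRateMarginOfError(0.9, 5000) ≈ 0.0083
// At only 100 evals it is nearly 6 points — too noisy to alert on:
//   passRateMarginOfError(0.9, 100) ≈ 0.0588
```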

Implementation:

// Sample and evaluate production traffic
async function evalProductionSample(
  request: { input: Record<string, unknown>; output: unknown },
  samplingRate: number = 0.05
): Promise<void> {
  if (Math.random() > samplingRate) return; // Skip most requests

  const checks = getRuleChecksForRequestType(request);
  const failures: string[] = [];

  for (const check of checks) {
    const result = check(request.output);
    if (result !== true) failures.push(typeof result === "string" ? result : "failed");
  }

  const passed = failures.length === 0;

  await metrics.record({
    event: "eval.production_sample",
    passed,
    failures: failures.slice(0, 3), // Don't log full failure lists to keep metrics lightweight
    timestamp: new Date().toISOString(),
  });

  // If rule-based checks fail, trigger async judge evaluation
  if (!passed) {
    await judgeQueue.push({ request, failures });
  }
}
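Here `metrics.record` and `judgeQueue.push` are assumed interfaces — whatever metrics sink and queue your stack provides. For local development they can be stubbed in memory (a sketch; the names mirror the code above):

```typescript
// Minimal in-memory stand-ins for the metrics sink and judge queue
const metrics = {
  events: [] as Array<Record<string, unknown>>,
  async record(event: Record<string, unknown>): Promise<void> {
    this.events.push(event);
  },
};

const judgeQueue = {
  items: [] as Array<{ request: unknown; failures: string[] }>,
  async push(item: { request: unknown; failures: string[] }): Promise<void> {
    this.items.push(item);
    // A real queue would hand this to a worker that calls the judge asynchronously.
  },
};
```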

Alerting:

Alert when the rolling 24-hour pass rate drops below your threshold. For a new system, start with 90%. Raise it as you iterate.

The alert should include:

  • Current pass rate vs threshold
  • Most common failure reason (by frequency)
  • Sample of failing inputs (for investigation)
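The rolling 24-hour pass rate can be computed from the recorded sample events (a sketch; the event shape follows the `metrics.record` call above, and the `minSamples` guard is an assumption to avoid alerting on noise):

```typescript
interface EvalEvent {
  passed: boolean;
  timestamp: string; // ISO 8601, as recorded by evalProductionSample
}

// Pass rate over the trailing window; returns null if there are too few
// samples in the window to be meaningful
function rollingPassRate(
  events: EvalEvent[],
  windowMs: number = 24 * 60 * 60 * 1000,
  minSamples: number = 100,
  now: number = Date.now()
): number | null {
  const cutoff = now - windowMs;
  const recent = events.filter((e) => Date.parse(e.timestamp) >= cutoff);
  if (recent.length < minSamples) return null;
  return recent.filter((e) => e.passed).length / recent.length;
}
```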

The Failure Walkthrough: No Clear Metric

Scenario: A team ships a meeting summary feature. It works well at launch. Three months later, users complain about summaries being too long. The team shortens them. Two weeks later, different users complain that key decisions are missing from the summaries.

What broke: There was no defined success metric. "Good summary" was not defined. The team oscillated between competing preferences with no objective measure of which version was better.

Fix — define the metric before shipping:

const meetingSummaryChecks = [
  // Length: not too long, not too short
  (o: MeetingSummary) => {
    const words = o.summary.split(/\s+/).length;
    if (words > 200) return `summary too long: ${words} words`;
    if (words < 40) return `summary too short: ${words} words`;
    return true;
  },
  // Must reference all speakers if 2+ people are named in transcript
  (o: MeetingSummary, ctx: { speakerCount: number }) => {
    if (ctx.speakerCount >= 2) {
      // Rough heuristic: capitalized words as a proxy for speaker names.
      // A stricter check would match against the actual speaker list.
      const speakersReferenced = o.summary.match(/\b[A-Z][a-z]+\b/g) ?? [];
      if (speakersReferenced.length < 2)
        return "summary does not reference multiple speakers from transcript";
    }
    return true;
  },
  // Must include action items field
  (o: MeetingSummary) => {
    if (!Array.isArray(o.actionItems) || o.actionItems.length === 0)
      return "no action items extracted — check if transcript contains decisions";
    return true;
  },
];
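The speaker check takes a context argument, which the earlier `EvalCase` shape does not provide. One way to thread per-case context through is to widen the case type (a sketch; `ContextualEvalCase` and `buildContext` are hypothetical names):

```typescript
interface ContextualEvalCase<O, C> {
  id: string;
  input: Record<string, unknown>;
  // Context (e.g. speakerCount derived from the transcript) is computed once
  // per case and passed to every check alongside the output.
  buildContext: (input: Record<string, unknown>) => C;
  checks: Array<(output: O, ctx: C) => boolean | string>;
}

// Run all checks for one case, collecting failure messages
function runChecks<O, C>(evalCase: ContextualEvalCase<O, C>, output: O): string[] {
  const ctx = evalCase.buildContext(evalCase.input);
  return evalCase.checks
    .map((check) => check(output, ctx))
    .filter((r): r is string => typeof r === "string");
}
```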

When a user complains summaries are "too long," you check your eval data and find that average length has crept from 80 words to 165 words over six weeks — a prompt regression. You have a metric. You fix the root cause rather than guessing.

When another user complains about missing decisions, you check: your actionItems check shows a 94% pass rate, so the issue is either scope (which decisions count?) or a specific input type not in your eval set. You add test cases for that input type.

The metric does not make the complaint go away. It tells you which problem you actually have.


Summary: Minimum Viable Eval System

| Component | Minimum Version | Upgrade Path |
|---|---|---|
| Test dataset | 20 hand-crafted cases | Grow to 100+ via production failures |
| Checks | Rule-based (structural, range) | Add LLM-as-judge for semantic |
| Runner | Simple loop, CLI output | CI integration with exit codes |
| Continuous eval | Random sampling, logged | Dashboard with rolling pass rates and alerts |
| Coverage | Common cases + known edge cases | Adversarial + domain-specific + regression cases |

Run evals before every prompt change. Run a sample of production traffic continuously. Treat a pass rate drop the same way you treat a failing test in CI — stop, investigate, fix.


What Is Next

You now have the core infrastructure: a reliable service (articles two through four), hallucination mitigations (five), retrieval that works (six through seven), and evaluation that measures it (eight). Article nine confronts the constraints that limit what you can actually ship: token cost, latency, and how to engineer around both.