
First LLM App: From API Call to Working Feature

Building a real text summarizer API from scratch — handling latency, malformed responses, retries, and the gap between 'it works locally' and a feature you can actually ship.

Where We Left Off

In the first article, you built the mental model: every AI system is a pipeline of stages, the model is just one component, and everything around it is engineering. You saw the failure modes in theory. Now you build something real and hit them in practice.

This article constructs a minimal LLM-backed service — a text summarization API — from the ground up. The MVP takes four hours to write. The actual engineering takes the rest of the week. That gap is what this article is about.

Assumptions: you have API access to at least one LLM provider (OpenAI, Anthropic, or Gemini), a server runtime (Node.js or .NET — examples are in TypeScript, trivially portable), and you have read article one of this series. If you have not, go back — the mental model there is used directly here without re-explanation.


What We Are Building

A POST /summarize endpoint that accepts a text body and returns a structured summary.

Inputs:

{ "text": "...", "maxWords": 150 }

Expected output:

{
  "summary": "...",
  "wordCount": 134,
  "keyPoints": ["...", "...", "..."]
}

This is deliberately simple. Simple enough that the naive version almost works. Complex enough that the naive version reliably fails in production.


Iteration 1 — The Naive Version

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function summarize(text: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-haiku-3-5",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: `Summarize this text: ${text}`,
      },
    ],
  });

  return response.content[0].text;
}

This works. Run it against a few inputs manually and you get back sensible prose. Commit it, ship it.

What breaks immediately in production:

  1. The model returns plain prose. Your frontend expects JSON. The caller receives unstructured text, cannot parse a wordCount, cannot render keyPoints as a list. The UI breaks.

  2. Users paste arbitrarily large inputs. A 50,000-token document takes 40 seconds and costs $0.50 per request. Your $50/day budget evaporates in two hours.

  3. response.content[0].text throws if content is empty, which happens on certain error conditions. No error handling means a 500 with an unhandled exception leak.

  4. The timeout is set by... nothing. If the model takes 30 seconds on a long document, the request hangs. Your load balancer kills it at 30s. The client gets a TCP reset, not an error message.

None of these are edge cases. All of them happen within the first day of real traffic.


Iteration 2 — Add Structure and Constraints

Start with the prompt. This is where most of the leverage is, and it costs nothing to improve.

const SUMMARIZE_SYSTEM_PROMPT = `You are a document summarizer. Your job is to produce concise, accurate summaries.

Return ONLY a JSON object with this exact structure — no other text, no markdown code fences, no explanations:
{
  "summary": "<summary text, max {{maxWords}} words>",
  "wordCount": <integer: actual word count of the summary>,
  "keyPoints": ["<point>", "<point>", "<point>"]
}

Rules:
- The summary must be factually grounded in the source document only. Do not add information.
- Include exactly 3 key points.
- wordCount must accurately reflect the word count of the summary field.
- Do not include any text before or after the JSON object.`;

Two things happened here that matter. First, the output format is specified precisely — not "return JSON" but "return this exact structure." Second, there are explicit prohibitions: no markdown fences, no explanations. The model will naturally add preamble ("Sure, here's a summary:") unless you explicitly remove that affordance.

Now add input constraints:

const MAX_INPUT_CHARS = 12_000;  // ~3,000 tokens, well within context limits
const MAX_OUTPUT_TOKENS = 512;   // Enough for any reasonable summary
const MODEL = "claude-haiku-3-5"; // Extracted so you can swap models without touching logic

export async function summarize(
  text: string,
  maxWords: number = 150
): Promise<string> {
  // Truncate input. Crude — chunking is better, covered later in this series.
  const truncated = text.length > MAX_INPUT_CHARS
    ? text.slice(0, MAX_INPUT_CHARS) + "\n\n[Document truncated]"
    : text;

  const systemPrompt = SUMMARIZE_SYSTEM_PROMPT.replace("{{maxWords}}", String(maxWords));

  const response = await client.messages.create({
    model: MODEL,
    max_tokens: MAX_OUTPUT_TOKENS,
    system: systemPrompt,
    messages: [{ role: "user", content: truncated }],
  });

  return response.content[0].text;
}

This is better. Input length is bounded. Output length is bounded. The prompt specifies format and constraints. The model variable is extracted so you can swap it without touching business logic.

What still breaks: the output is still raw text. You need to parse it and validate it before it leaves the function.


Iteration 3 — Parse, Validate, Type

interface SummaryResult {
  summary: string;
  wordCount: number;
  keyPoints: string[];
}

function parseAndValidate(raw: string): SummaryResult {
  let parsed: unknown;

  try {
    parsed = JSON.parse(raw);
  } catch {
    // Common failure: model wraps JSON in markdown fences
    // Attempt to extract the JSON object before failing
    const match = raw.match(/\{[\s\S]*\}/);
    if (!match) {
      throw new Error(`Model returned non-JSON output. Raw: ${raw.slice(0, 200)}`);
    }
    try {
      parsed = JSON.parse(match[0]);
    } catch {
      throw new Error(`Extracted content did not parse as JSON. Raw: ${raw.slice(0, 200)}`);
    }
  }

  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Parsed value is not an object");
  }

  const data = parsed as Record<string, unknown>;

  if (typeof data.summary !== "string" || data.summary.trim() === "") {
    throw new Error(`Invalid or missing 'summary' field`);
  }
  if (typeof data.wordCount !== "number" || data.wordCount <= 0) {
    throw new Error(`Invalid or missing 'wordCount' field: ${data.wordCount}`);
  }
  if (
    !Array.isArray(data.keyPoints) ||
    data.keyPoints.length < 2 ||
    data.keyPoints.some((p) => typeof p !== "string")
  ) {
    throw new Error(`Invalid 'keyPoints' field: ${JSON.stringify(data.keyPoints)}`);
  }

  return {
    summary: data.summary,
    wordCount: data.wordCount,
    keyPoints: data.keyPoints as string[],
  };
}

This parser does three things. First, it attempts direct JSON parsing. Second, it falls back to extracting a JSON object from the response text using a regex — this handles cases where the model prefixes the JSON with a sentence despite your instruction. Third, it validates field types and constraints. Casting the parsed value to the interface is not enough; TypeScript types are erased at runtime, so wordCount: "one hundred" would satisfy the compiler and break downstream code.

The regex fallback (/\{[\s\S]*\}/) is a real pattern used in production. It is not elegant, and it should not be required — but until you achieve 99%+ format compliance on the model tier you are using, it is the difference between a graceful response and a 500.

Failure worth logging: when the regex path triggers, log it. Each occurrence is a signal that your prompt is not constraining the model tightly enough on that input pattern. You want this rate trending toward zero.
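To make the fallback concrete, here is the same regex run against a hypothetical non-compliant response (the preamble and sign-off are invented for illustration):

```typescript
// Mirrors the fallback regex used in parseAndValidate above.
function extractJson(raw: string): string | null {
  const match = raw.match(/\{[\s\S]*\}/);
  return match ? match[0] : null;
}

// A made-up non-compliant response: preamble before the JSON, commentary after.
const noisy =
  "Sure, here is the summary you asked for:\n" +
  '{"summary": "Short.", "wordCount": 1, "keyPoints": ["a", "b", "c"]}\n' +
  "Let me know if you need anything else!";

const extracted = extractJson(noisy);
// The greedy match spans from the first "{" to the last "}", dropping the chatter
console.log(extracted !== null && JSON.parse(extracted).wordCount); // prints 1
```

Note the greedy quantifier: it grabs from the first brace to the last, which is exactly what you want when the model wraps one JSON object in prose, but it would misbehave if the model ever emitted two separate objects.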


Iteration 4 — Handle Latency and Timeouts

The model API is a network call to a third-party service. It can be slow. It can time out. It can return 429 (rate limit), 500 (server error), or 503 (service unavailable). None of these are exceptional; all of them happen in production.

Add a timeout wrapper:

async function callWithTimeout<T>(
  fn: () => Promise<T>,
  timeoutMs: number,
  label: string
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer); // Otherwise a fast response leaves a live timer holding the event loop open
  }
}

Then wrap the model call:

const raw = await callWithTimeout(
  () =>
    client.messages
      .create({ model: "claude-haiku-3-5", max_tokens: MAX_OUTPUT_TOKENS, system: systemPrompt, messages: [{ role: "user", content: truncated }] })
      .then((r) => r.content[0].text),
  8_000, // 8 second timeout — tune based on your p95 latency
  "model call"
);

The timeout value matters. Too low and you abort requests that would have succeeded. Too high and slow requests pile up, exhausting your connection pool. Start at 2–3x your measured p95 latency. If your p95 is 2.5 seconds, set the timeout to 7 seconds.

Measure your actual p95. Do not guess. Run 100 requests through a staging environment and log response times before picking a timeout value.
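A minimal sketch of that measurement step. The latency values below are illustrative rather than real measurements, and percentile uses the simple nearest-rank method:

```typescript
// Nearest-rank percentile: the smallest value such that at least p% of
// samples are at or below it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Hypothetical latencies (ms) collected from 10 staging requests.
const latenciesMs = [1200, 1400, 1500, 1700, 1800, 2100, 2300, 2500, 3100, 6200];

const p95 = percentile(latenciesMs, 95);
console.log(p95);                    // p95 for this sample: 6200
console.log(Math.round(p95 * 2.5)); // a starting timeout between 2x and 3x p95
```

With a sample this small the p95 is just the slowest request; collect at least a few hundred data points before trusting the number.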


Iteration 5 — Add Retries

Retries are not about being defensive against rare events. Transient errors from LLM APIs happen at 1–3% rates on average, and during peak periods or model updates they spike. Without retries, that 1–3% becomes user-facing failures.

The wrong retry implementation:

// WRONG: retry the same thing, blindly
for (let i = 0; i < 3; i++) {
  try {
    return await callModel(text);
  } catch {}
}

This retries on all errors, including validation failures, which a retry will not fix. It also has no backoff, so three rapid retries under a 429 rate limit make the situation worse.

The right implementation distinguishes failure types:

type FailureKind = "transient" | "validation" | "timeout" | "fatal";

function classifyError(err: unknown): FailureKind {
  if (err instanceof Error) {
    const msg = err.message;
    if (msg.includes("timed out")) return "timeout";
    if (msg.includes("rate limit") || msg.includes("429") || msg.includes("503") || msg.includes("529")) return "transient";
    if (msg.includes("Invalid") || msg.includes("non-JSON")) return "validation";
  }
  return "fatal";
}

async function summarizeWithRetry(
  text: string,
  maxWords: number = 150
): Promise<SummaryResult> {
  let lastError: Error | null = null;
  let promptSuffix = "";

  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      const raw = await callWithTimeout(
        () => callModel(text + promptSuffix, maxWords),
        8_000,
        `model call attempt ${attempt}`
      );
      return parseAndValidate(raw);
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      const kind = classifyError(err);

      if (kind === "fatal") throw lastError;

      if (kind === "validation") {
        // Reinforce formatting on next attempt
        promptSuffix =
          "\n\nIMPORTANT: Your previous response did not match the required JSON format. " +
          "Return ONLY the JSON object. No other text.";
      }

      // Don't sleep before giving up on the final attempt
      if (attempt === 3) break;

      if (kind === "transient" || kind === "timeout") {
        // Exponential backoff: 500ms, then 1000ms
        await new Promise((r) => setTimeout(r, 500 * 2 ** (attempt - 1)));
      }
    }
  }

  throw new Error(`Summarization failed after 3 attempts. Last error: ${lastError?.message}`);
}

Key decisions here:

  • Validation failures get a modified prompt. Retrying the same prompt that produced malformed JSON will produce malformed JSON again with the same probability. Reinforcing the constraint changes the prompt and gives the model new signal.
  • Transient errors get a delay. Retrying instantly under a rate limit amplifies the problem. Exponential backoff gives the API time to recover.
  • Fatal errors are not retried. An authentication error will not resolve on retry. Neither will a prompt that causes a content policy violation. Retrying wastes time and budget.
  • A hard cap of 3 attempts. Never loop unbounded. Define your maximum and enforce it.
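One refinement the retry loop above omits is jitter: randomizing each delay so that many clients rate-limited at the same moment do not all retry in lockstep. A sketch of the full-jitter variant (the 500ms base and 8s cap are illustrative values, not recommendations from this article):

```typescript
// Full-jitter exponential backoff: sleep a random duration between 0 and
// min(cap, base * 2^(attempt-1)), so simultaneous retries spread out.
function backoffMs(attempt: number, baseMs = 500, capMs = 8_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling);
}

// attempt 1 draws from 0–500ms, attempt 2 from 0–1000ms, attempt 3 from 0–2000ms
console.log(backoffMs(1), backoffMs(2), backoffMs(3));
```

For a single low-traffic service the fixed schedule is fine; jitter starts to matter once many instances share one rate limit.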

The Full Service

Wire these pieces into a route handler:

import express from "express";

const app = express();
app.use(express.json());

app.post("/summarize", async (req, res) => {
  const { text, maxWords } = req.body;

  if (typeof text !== "string" || text.trim().length === 0) {
    return res.status(400).json({ error: "text is required and must be a non-empty string" });
  }
  if (maxWords !== undefined && (typeof maxWords !== "number" || maxWords < 10 || maxWords > 500)) {
    return res.status(400).json({ error: "maxWords must be a number between 10 and 500" });
  }

  const requestStart = Date.now();

  try {
    const result = await summarizeWithRetry(text, maxWords ?? 150);

    console.log(JSON.stringify({
      event: "summarize.success",
      latencyMs: Date.now() - requestStart,
      inputLength: text.length,
      wordCount: result.wordCount,
    }));

    return res.json(result);
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown error";

    console.error(JSON.stringify({
      event: "summarize.error",
      latencyMs: Date.now() - requestStart,
      inputLength: text.length,
      error: message,
    }));

    return res.status(500).json({
      error: "Summarization failed",
      detail: message,
    });
  }
});

app.listen(3000, () => console.log("Listening on :3000"));

Notice the structured log entries. Not console.log("summarize succeeded"). Structured JSON with specific fields you will query later: latency, input length, word count. These are the observability primitives you need when you start seeing production issues and need to correlate behavior with inputs.


The Failure Walkthrough — Slow + Malformed

Scenario: You ship this service. Everything works in staging. Day two of production, you start seeing latency spikes. p95 goes from 2.1 seconds to 11 seconds. Separately, a small percentage of responses come back as 500s.

Diagnosing the latency spike:

Check your logs. Look at the distribution of inputLength. You will find that a cohort of users is pasting full research papers — 80,000+ characters. Your truncation at 12,000 characters is working, but the model is still producing long outputs because those inputs are dense and the model is trying to compress a lot of information into the summary. Additionally, you have no request-level timeout — the callWithTimeout wraps the model call but your HTTP server has no overall request timeout.

Fix: Add a server-level timeout. Express does not have a built-in timeout — use the connect-timeout middleware or set a socket timeout on the server. Set it to 15 seconds, log any request that hits it.
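A sketch of the socket-timeout route on a plain node:http server (app.listen in Express returns the same http.Server, so the call carries over; the 15s value and the log shape follow the conventions used above):

```typescript
import http from "node:http";

// Stand-in handler; in the real service this is the Express app.
const server = http.createServer((req, res) => {
  res.end("ok");
});

// Destroy any socket idle for more than 15s, and log the event so
// timed-out requests show up in your metrics instead of vanishing.
server.setTimeout(15_000, (socket) => {
  console.error(JSON.stringify({ event: "request.timeout", timeoutMs: 15_000 }));
  socket.destroy();
});
```

This guards the whole request lifecycle; the per-call timeout from iteration 4 still matters because you want to retry a slow model call before the socket-level deadline kills the entire request.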

Diagnosing the 500s:

Check your error logs for event: "summarize.error". You will see error messages like Invalid 'keyPoints' field: null. The model is returning "keyPoints": null for certain short inputs where there are genuinely no discrete key points.

Fix: Two options. First, make keyPoints nullable in your schema and handle it on the consumer side. Better: update your prompt to explicitly handle this case: "If the document has no distinct key points, return an empty array: []." If you take the second route, also relax the keyPoints.length check in parseAndValidate (it currently requires at least 2 entries) so an empty array passes validation. The model's default behavior when under-constrained is to return null for optional array fields. Specify the edge case explicitly.

This pattern repeats throughout AI engineering: you discover a failure on a specific input class, you add an explicit constraint to the prompt for that class, you test against that class in your evaluation suite. The prompt grows with the system's experience.


What You Should Measure From Day One

You have structured logs. Now define what you actually monitor:

| Metric | What it tells you |
|---|---|
| latencyMs p50 / p95 / p99 | Where your tail latency lives |
| Error rate by error type | Validation failures vs transient vs fatal |
| Retry rate | What fraction of requests needed a retry |
| Input length distribution | Who's sending large inputs, whether truncation is working |
| wordCount distribution | Whether summaries are staying within bounds |

Set up a simple dashboard on these from the beginning. You do not need Datadog on day one — a cron job that scans logs and outputs a daily summary works fine. When something breaks, you will have the baseline to compare against.
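A sketch of what that log-scanning job might compute, assuming the newline-delimited JSON shape emitted by the route handler (the sample entries below are invented, and dailySummary is a hypothetical name):

```typescript
interface LogEntry {
  event: string;
  latencyMs: number;
  inputLength: number;
}

// Aggregate one day of log lines into the core metrics from the table above.
function dailySummary(lines: string[]) {
  const entries = lines.map((l) => JSON.parse(l) as LogEntry);
  const latencies = entries.map((e) => e.latencyMs).sort((a, b) => a - b);
  const errors = entries.filter((e) => e.event === "summarize.error").length;
  // Nearest-rank percentile over the sorted latencies
  const pick = (p: number) =>
    latencies[Math.max(0, Math.ceil((p / 100) * latencies.length) - 1)];
  return {
    total: entries.length,
    errorRate: errors / entries.length,
    p50: pick(50),
    p95: pick(95),
  };
}

// Illustrative entries in the shape the route handler logs.
const sample = [
  '{"event":"summarize.success","latencyMs":1800,"inputLength":4000}',
  '{"event":"summarize.success","latencyMs":2400,"inputLength":9000}',
  '{"event":"summarize.error","latencyMs":8100,"inputLength":11800}',
  '{"event":"summarize.success","latencyMs":2100,"inputLength":5200}',
];

console.log(dailySummary(sample)); // { total: 4, errorRate: 0.25, p50: 2100, p95: 8100 }
```

Pipe the output somewhere visible (a Slack webhook, an email) and you have a poor man's dashboard that costs nothing to run.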


Where the Gap Is

The jump from "API call that works" to "feature that ships" is almost entirely infrastructure around the model call. The model itself did not change between iteration 1 and iteration 5. Everything that changed was:

  • Prompt structure and constraints
  • Input bounding
  • Output parsing and validation
  • Error classification and retry logic
  • Structured logging
  • Timeout handling

That list is the minimum viable engineering layer for any LLM feature. Skip any item and you have a demo, not a feature.

In the next article: prompt engineering in depth — the specific techniques that close the gap between "sometimes correct" and "reliably correct" outputs.


Appendix: Minimal .env Configuration

ANTHROPIC_API_KEY=sk-ant-...
MODEL_ID=claude-haiku-3-5
MAX_INPUT_CHARS=12000
MAX_OUTPUT_TOKENS=512
REQUEST_TIMEOUT_MS=8000
MAX_RETRY_ATTEMPTS=3

Pull all tuneable values from environment variables. When you discover that MAX_INPUT_CHARS=12000 is too low for your use case, you change one line in your deployment configuration, not five places in code.
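One way to centralize that, sketched here with a hypothetical numberFromEnv helper and the defaults from the table above:

```typescript
// Read one numeric tunable from the environment, falling back to a default.
// Fails loudly on non-numeric values rather than silently misconfiguring.
function numberFromEnv(
  name: string,
  fallback: number,
  env: Record<string, string | undefined> = process.env
): number {
  const raw = env[name];
  const parsed = raw === undefined ? fallback : Number(raw);
  if (!Number.isFinite(parsed)) {
    throw new Error(`${name} must be numeric, got: ${raw}`);
  }
  return parsed;
}

// Single typed config object the rest of the service imports.
const config = {
  modelId: process.env.MODEL_ID ?? "claude-haiku-3-5",
  maxInputChars: numberFromEnv("MAX_INPUT_CHARS", 12_000),
  maxOutputTokens: numberFromEnv("MAX_OUTPUT_TOKENS", 512),
  requestTimeoutMs: numberFromEnv("REQUEST_TIMEOUT_MS", 8_000),
  maxRetryAttempts: numberFromEnv("MAX_RETRY_ATTEMPTS", 3),
};
```

Reading everything once at startup also means a typo in the deployment config crashes the process immediately, instead of surfacing as strange behavior hours later.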


Appendix: Quick Reference — Types of LLM API Errors

| HTTP Status | Meaning | Retry Strategy |
|---|---|---|
| 400 | Bad request (your input) | Do not retry. Fix the prompt or input. |
| 401 | Authentication failure | Do not retry. Fix your API key. |
| 429 | Rate limit exceeded | Retry with exponential backoff. Consider a queue for burst traffic. |
| 500 | Model server error | Retry up to 2x with backoff. |
| 503 | Model overloaded | Retry with backoff. May indicate sustained platform issues. |
| Timeout | Request took too long | Retry once with backoff. If persistent, reduce max_tokens or input length. |

The error classification function from iteration 5 should map to this table. Every error type that reaches your caller without being handled is a gap in your reliability layer.
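A sketch of that mapping as a pure function. The retryStrategyFor name and its strategy labels are hypothetical, and how you extract the numeric status depends on your SDK's error type:

```typescript
type RetryStrategy = "no-retry" | "backoff" | "backoff-once";

// Map an HTTP status (or its absence, for timeouts and network failures)
// to the retry strategies in the table above.
function retryStrategyFor(status: number | undefined): RetryStrategy {
  switch (status) {
    case 400: // bad request: fix the input, retrying changes nothing
    case 401: // auth failure: fix the key, retrying changes nothing
      return "no-retry";
    case 429: // rate limited: back off, consider queueing bursts
    case 500: // model server error
    case 503: // model overloaded
    case 529: // overload status some providers use alongside 503
      return "backoff";
    case undefined: // no status reached us: treat as timeout/network failure
      return "backoff-once";
    default: // anything unrecognized: fail fast rather than loop
      return "no-retry";
  }
}

console.log(retryStrategyFor(429));      // prints backoff
console.log(retryStrategyFor(401));      // prints no-retry
console.log(retryStrategyFor(undefined)); // prints backoff-once
```

Keeping this as a standalone function makes the reliability policy testable in isolation, without mocking a single model call.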