Intro to AI System Design
Pipelines vs agents, orchestration patterns, when orchestration is the wrong abstraction — the architectural decisions that determine whether your AI system is maintainable or a debugging nightmare.
The Architecture Question
Articles two through ten built one thing: a reliable AI feature. A single endpoint that calls a model, validates output, handles failures, measures cost, and degrades gracefully. The engineering principles in those articles apply regardless of complexity.
Now scale the problem. What happens when you need ten features? When one AI call needs to feed another? When the system needs to decide which path to take based on intermediate results? When the user expects a system that reasons across multiple steps?
You need to think architecturally. This article is the decision framework: when to use pipelines, when to use agents, and when neither is the right abstraction.
Two Approaches: Pipelines and Agents
Pipelines are sequences of steps executed in a defined order. Each step takes an input, produces an output, and passes it to the next step. The flow is predetermined. The developer decides at design time what happens when.
Agents are LLM-driven systems where the model decides what to do next. The model receives a goal, available tools, and current state, and selects actions. The flow is dynamic — the sequence of steps depends on the model's decisions at runtime.
This is not a spectrum. It is a choice that has significant implications for reliability, debuggability, and cost. Most teams should start with pipelines and move to agents only when they have a specific reason.
Pipelines: The Default Choice
A pipeline converts one complex task into a sequence of simpler ones. Each step is implemented separately, tested separately, and debugged separately.
Example: document analysis pipeline
Input: Raw document
        │
        ▼
[Step 1: Classify] → { type: "contract" | "invoice" | "report" }
        │
        ▼
[Step 2: Extract Fields] → { parties: [...], date: "...", amount: ... }
  (varies by type)
        │
        ▼
[Step 3: Validate] → { valid: true, issues: [...] }
        │
        ▼
[Step 4: Summarize] → { summary: "...", keyPoints: [...] }
        │
        ▼
Output: Structured result
Each step uses a focused, single-purpose prompt. Compare this to a single "analyze this document and give me everything" prompt. The single-prompt version has higher failure ambiguity — when it fails, which part failed? The pipeline version has isolated failure points that are easy to log and debug.
Implementing a pipeline step:
interface PipelineContext {
  documentId: string;
  rawContent: string;
  results: Record<string, unknown>;
}

type PipelineStep = {
  name: string;
  fn: (ctx: PipelineContext) => Promise<unknown>;
};

// `logger` is whatever structured logger your service already uses
async function runPipeline(
  context: PipelineContext,
  steps: PipelineStep[]
): Promise<PipelineContext> {
  let ctx = { ...context };
  for (const step of steps) {
    const stepStart = Date.now();
    try {
      const result = await step.fn(ctx);
      ctx = {
        ...ctx,
        results: { ...ctx.results, [step.name]: result },
      };
      logger.info({
        event: "pipeline.step_complete",
        step: step.name,
        documentId: ctx.documentId,
        latencyMs: Date.now() - stepStart,
      });
    } catch (err) {
      logger.error({
        event: "pipeline.step_failed",
        step: step.name,
        documentId: ctx.documentId,
        error: String(err),
      });
      throw new Error(`Pipeline failed at step '${step.name}': ${err}`);
    }
  }
  return ctx;
}
Usage:
const analysisPipeline: PipelineStep[] = [
  { name: "classify", fn: (ctx) => classifyDocument(ctx.rawContent) },
  {
    name: "extract",
    fn: (ctx) => {
      const type = (ctx.results["classify"] as ClassifyResult).type;
      return extractFields(ctx.rawContent, type);
    },
  },
  { name: "validate", fn: (ctx) => validateExtraction(ctx.results["extract"]) },
  { name: "summarize", fn: (ctx) => summarizeDocument(ctx.rawContent) },
];

const result = await runPipeline(
  { documentId: "doc-001", rawContent: documentText, results: {} },
  analysisPipeline
);
Each step is independently testable. Each step's output is logged. When the pipeline fails midway, you know exactly which step failed and what its input was.
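Independent testability is concrete: a step is just an async function from context to result, so it can be exercised with a stubbed model call and plain assertions. A minimal sketch; `makeClassifyStep` and the fake model are hypothetical helpers for illustration, not part of the pipeline code above:

```typescript
// Same shape as the PipelineContext used in runPipeline above
interface PipelineContext {
  documentId: string;
  rawContent: string;
  results: Record<string, unknown>;
}

// A classify step whose model call is injected, so a test can substitute
// a deterministic fake for the real API client (hypothetical helper)
function makeClassifyStep(callModel: (content: string) => Promise<string>) {
  return {
    name: "classify",
    fn: async (ctx: PipelineContext) => {
      const raw = await callModel(ctx.rawContent);
      const parsed = JSON.parse(raw) as { type: string };
      if (!["contract", "invoice", "report"].includes(parsed.type)) {
        throw new Error(`Unexpected document type: ${parsed.type}`);
      }
      return parsed;
    },
  };
}

// Exercise the step with a fake model that returns a fixed classification
const step = makeClassifyStep(async () => JSON.stringify({ type: "invoice" }));
const stepResult = await step.fn({
  documentId: "doc-test",
  rawContent: "Invoice #123 ...",
  results: {},
});
console.log("classify step result:", stepResult);
```

Because the model call is injected rather than imported, the same step runs against the real client in production and a canned response in tests.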
When pipelines work best:
- The task can be decomposed into stages with known dependencies
- The sequence of steps is always the same (or branches on defined conditions)
- Each step's output is well-defined enough to be validated
- Parallelism is possible for independent steps
Pipeline parallelism:
Steps with no dependencies on each other can run concurrently:
async function runParallelSteps(
  ctx: PipelineContext,
  steps: PipelineStep[]
): Promise<PipelineContext> {
  const results = await Promise.all(steps.map((step) => step.fn(ctx)));
  const merged = { ...ctx.results };
  steps.forEach((step, i) => {
    merged[step.name] = results[i];
  });
  return { ...ctx, results: merged };
}

// Classify and extract metadata in parallel, then summarize after
const ctx1 = await runParallelSteps(initialCtx, [classifyStep, extractMetadataStep]);
const ctx2 = await runPipeline(ctx1, [summarizeStep]);
Parallelism reduces total latency without changing cost (the same model calls are made either way). Use it whenever step outputs are independent.
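Step isolation also makes failure handling composable: because a step is just a named async function, cross-cutting concerns like retries can be layered on as wrappers without touching step logic. A sketch; the retry counts and backoff delays are illustrative defaults, not recommendations:

```typescript
type StepFn<C> = (ctx: C) => Promise<unknown>;

// Wrap any step function with bounded retries and exponential backoff.
// Attempt counts and delays here are illustrative, not recommendations.
function withRetry<C>(fn: StepFn<C>, attempts = 3, baseDelayMs = 200): StepFn<C> {
  return async (ctx: C) => {
    let lastErr: unknown;
    for (let attempt = 1; attempt <= attempts; attempt++) {
      try {
        return await fn(ctx);
      } catch (err) {
        lastErr = err;
        if (attempt < attempts) {
          // Exponential backoff: baseDelayMs, 2x, 4x, ...
          await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
        }
      }
    }
    throw lastErr;
  };
}

// Demo: a flaky step that fails twice, then succeeds on the third attempt
let calls = 0;
const flaky = withRetry(async (_ctx: object) => {
  calls++;
  if (calls < 3) throw new Error("transient failure");
  return "ok";
}, 3, 1);
const out = await flaky({});
console.log(out, "after", calls, "calls");
```

Only retry steps whose failures are plausibly transient (timeouts, rate limits); retrying a step that fails on malformed input just triples its cost.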
Agents: When You Actually Need Them
An agent is appropriate when the sequence of steps cannot be determined in advance because it depends on intermediate results or environmental state that is not known at design time.
Example question: "Is this contract valid and does it comply with our standard terms?" This might require: reading the contract, identifying clause types, checking each clause against a policy database, looking up specific regulations if certain clause types appear, generating a compliance report. The number of lookups depends on the contract's content. You cannot predefine the pipeline.
The minimal agent pattern:
interface Tool {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema
  fn: (params: Record<string, unknown>) => Promise<unknown>;
}

interface AgentState {
  goal: string;
  messages: MessageParam[];
  toolResults: Record<string, unknown>;
}

// `client`, MessageParam, TextBlock, and ToolUseBlock are assumed to come
// from the Anthropic SDK, imported and instantiated elsewhere
async function runAgent(
  goal: string,
  tools: Tool[],
  maxIterations: number = 10
): Promise<string> {
  const state: AgentState = {
    goal,
    messages: [{ role: "user", content: goal }],
    toolResults: {},
  };
  for (let iteration = 0; iteration < maxIterations; iteration++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 1024,
      system: `You are an autonomous agent. Complete the given goal using the provided tools.
When the goal is complete, summarize your findings and stop using tools.`,
      tools: tools.map((t) => ({
        name: t.name,
        description: t.description,
        input_schema: t.parameters,
      })),
      messages: state.messages,
    });

    // Check whether the model called a tool
    const toolUse = response.content.find(
      (c): c is ToolUseBlock => c.type === "tool_use"
    );
    if (!toolUse) {
      // Model returned a final text response — task complete
      const textBlock = response.content.find(
        (c): c is TextBlock => c.type === "text"
      );
      return textBlock?.text ?? "Task complete.";
    }

    // Execute the tool
    const tool = tools.find((t) => t.name === toolUse.name);
    if (!tool) throw new Error(`Unknown tool: ${toolUse.name}`);
    const toolResult = await tool.fn(toolUse.input as Record<string, unknown>);
    state.toolResults[toolUse.name] = toolResult;

    // Append the assistant turn (keeping its tool_use block intact so the
    // tool_result can reference it) and the tool result to the history
    state.messages.push(
      { role: "assistant", content: response.content },
      {
        role: "user",
        content: [
          {
            type: "tool_result",
            tool_use_id: toolUse.id,
            content: JSON.stringify(toolResult),
          },
        ],
      }
    );
  }
  throw new Error(`Agent did not complete within ${maxIterations} iterations`);
}
The maxIterations parameter is not optional. Without it, an agent that gets confused can loop indefinitely, calling tools repeatedly and exhausting your budget. Set it low enough to catch runaway agents quickly (10–20 is usually sufficient for real tasks).
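An iteration cap bounds the number of loop turns, but token spend per turn varies widely, so a complementary guard tracks cumulative usage and aborts once a budget is exhausted. A sketch; the field names mirror the per-call token counts that chat APIs typically return, and the limits are illustrative:

```typescript
// Accumulates token usage across agent iterations and aborts once a
// budget is exhausted. Field names mirror the usage object returned by
// typical chat APIs (input and output token counts per call).
class TokenBudget {
  private spent = 0;
  constructor(private readonly maxTokens: number) {}

  record(usage: { input_tokens: number; output_tokens: number }): void {
    this.spent += usage.input_tokens + usage.output_tokens;
  }

  check(): void {
    if (this.spent > this.maxTokens) {
      throw new Error(`Token budget exceeded: ${this.spent} > ${this.maxTokens}`);
    }
  }

  get total(): number {
    return this.spent;
  }
}

// Inside the agent loop this would be:
//   budget.record(response.usage); budget.check();
const budget = new TokenBudget(1000);
budget.record({ input_tokens: 400, output_tokens: 200 });
budget.check(); // 600 <= 1000, fine
budget.record({ input_tokens: 500, output_tokens: 100 });
let aborted = false;
try {
  budget.check(); // 1200 > 1000, throws
} catch {
  aborted = true;
}
console.log("total tokens:", budget.total, "aborted:", aborted);
```

Iteration caps catch agents that loop; token budgets catch agents that do not loop but burn context on every turn. Production agents usually want both.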
When to Avoid Agents
Agents have two significant costs that pipelines do not:
1. Reliability. The model decides what to do next. When the model makes a wrong decision — calls the wrong tool, misinterprets a result, loops on a subproblem — there is no deterministic path back to correct behavior. You cannot catch this with structural validation. The agent may produce a coherent, plausible-sounding final answer that is based on incorrect tool calls.
2. Debuggability. A pipeline failure tells you exactly which step failed and what its input was. An agent failure requires you to trace through the entire conversation history — all intermediate tool calls and results — to understand what went wrong. This is substantially harder.
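The practical mitigation is to make that history cheap to inspect: record one structured event per iteration (decision taken, tool called, latency) so a failed run can be reconstructed from logs instead of re-run. A minimal sketch of such a trace recorder; the event shape is an assumption for illustration, not a standard:

```typescript
interface AgentTraceEvent {
  runId: string;
  iteration: number;
  decision: "tool_call" | "final_answer";
  toolName?: string;
  resultPreview?: string; // truncated tool result, to keep logs bounded
  latencyMs: number;
}

class AgentTrace {
  private events: AgentTraceEvent[] = [];
  constructor(private readonly runId: string) {}

  record(event: Omit<AgentTraceEvent, "runId">): void {
    this.events.push({ runId: this.runId, ...event });
  }

  // Compact, ordered view of what the agent did -- the first thing to
  // look at when a run produces a wrong answer
  summary(): string[] {
    return this.events.map((e) =>
      e.decision === "tool_call"
        ? `#${e.iteration} tool=${e.toolName} (${e.latencyMs}ms)`
        : `#${e.iteration} final answer (${e.latencyMs}ms)`
    );
  }
}

const trace = new AgentTrace("run-042");
trace.record({ iteration: 0, decision: "tool_call", toolName: "lookup_policy", latencyMs: 820 });
trace.record({ iteration: 1, decision: "final_answer", latencyMs: 640 });
console.log(trace.summary());
```

In the agent loop, one `record` call per iteration is enough to answer "which tool did it call, in what order, and how long did each take" without replaying the run.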
Rules for agent vs pipeline:
Use a pipeline when:
- The task has a known sequence of stages
- Each stage has a well-defined input and output
- All branches in the flow are deterministic
- Reliability > flexibility
Use an agent when:
- The sequence of steps genuinely cannot be known in advance
- The task requires exploring unknown state (web search, database discovery)
- The tool set is large and the model needs to select the right subset per task
- You have the observability infrastructure to debug agent behavior
The common mistake: using agents for tasks that could be pipelines. "We need flexibility" is not a reason to use agents. Agents are harder to test, harder to debug, and harder to make reliable. Use them only when the task's dynamic nature makes pipelines inadequate.
Orchestration Patterns
Fan-out / Fan-in: Run multiple analyses in parallel, aggregate results.
                    ┌── [Sentiment Analysis] ───┐
Input Document ─────┼── [Topic Classification] ─┼──► [Aggregator] ──► Final Result
                    └── [Entity Extraction] ────┘
Useful when you need multiple independent analyses of the same input. All three branches run concurrently. The aggregator combines results.
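In code, fan-out/fan-in is just Promise.all over the branches plus an aggregation function. A sketch; the three analysis functions are stand-ins for model calls, with toy logic so the example is self-contained:

```typescript
// Each branch is an independent analysis of the same input. In a real
// system these would be model calls; here they are deterministic stand-ins.
async function analyzeSentiment(doc: string) {
  return { sentiment: doc.includes("great") ? "positive" : "neutral" };
}
async function classifyTopic(doc: string) {
  return { topic: doc.includes("refund") ? "billing" : "general" };
}
async function extractEntities(doc: string) {
  return { entities: doc.match(/[A-Z][a-z]+/g) ?? [] };
}

// Fan-out: run all branches concurrently; fan-in: merge into one result
async function analyzeDocument(doc: string) {
  const [sentiment, topic, entities] = await Promise.all([
    analyzeSentiment(doc),
    classifyTopic(doc),
    extractEntities(doc),
  ]);
  return { ...sentiment, ...topic, ...entities };
}

const combined = await analyzeDocument("Acme was great about my refund");
console.log(combined);
```

Total latency is roughly the slowest branch rather than the sum of all three, which is the entire point of the pattern.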
Conditional branching: The path depends on an intermediate result.
async function processDocument(document: string): Promise<ProcessedDocument> {
  const classification = await classifyDocument(document);
  if (classification.type === "contract") {
    return processContract(document, classification);
  } else if (classification.type === "invoice") {
    return processInvoice(document, classification);
  } else {
    return processGeneric(document, classification);
  }
}
This is a pipeline with conditional routing, not an agent. The branching logic is deterministic and developer-defined. The model only classifies; the routing decision is code.
Map-reduce: Process a collection, reduce to a summary.
async function summarizeDocumentCollection(documents: string[]): Promise<string> {
  // Map: summarize each document
  const summaries = await Promise.all(
    documents.map((doc) => summarizeDocument(doc))
  );
  // Reduce: synthesize summaries into one
  const combinedPrompt = summaries
    .map((s, i) => `Document ${i + 1}: ${s.summary}`)
    .join("\n\n");
  return synthesizeSummaries(combinedPrompt);
}
```
Useful when:
- Individual documents are too large for one context window (map over chunks of each document, then summarize the summaries)
- You need a cross-document synthesis that would consume too much context if the full texts were processed together
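One caveat in the map phase: Promise.all launches every call at once, which can hit provider rate limits on large collections. A small concurrency limiter keeps the map phase bounded. A sketch; the limit of 2 and the demo workload are illustrative:

```typescript
// Map over items with at most `limit` tasks in flight at a time --
// a bounded alternative to a bare Promise.all for large collections
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: increments happen synchronously between awaits
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Demo: track how many tasks actually run simultaneously
let inFlight = 0;
let peak = 0;
const mapped = await mapWithConcurrency([1, 2, 3, 4, 5], 2, async (n) => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise((r) => setTimeout(r, 5));
  inFlight--;
  return n * 10;
});
console.log(mapped, "peak concurrency:", peak);
```

Swapping the bare Promise.all in summarizeDocumentCollection for a call like this trades a little latency for predictable request rates.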
The Over-Engineering Failure
Scenario: A team has a feature that classifies customer feedback as positive, neutral, or negative. They build it as an agent with three tools: a sentiment analyzer, a topic extractor, and a priority scorer. The agent calls all three on every input, synthesizes the results, and returns a classification.
Reality: A single prompt with the classification schema does this in one model call. The agent approach makes at least four model calls per request (the orchestrator plus the three tools), multiplying cost several times over, adds orchestration latency, and introduces three additional failure points. When a classification is wrong, the team has to trace through three tool call results to find the error.
The fix: Recognize that the task is simple enough for a pipeline step (or even a single prompt) and does not require dynamic tool selection. Decompose into tools only when the task genuinely requires it.
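For contrast, the single-call version is roughly: one prompt embedding the schema, then one structural validation of the response. A sketch; the prompt and client call are elided, and the schema shape is illustrative. The validator below is the check that replaces the agent's three-tool synthesis:

```typescript
type FeedbackClass = {
  sentiment: "positive" | "neutral" | "negative";
  topics: string[];
  priority: "low" | "medium" | "high";
};

// Validate the model's JSON output against the expected schema. This
// structural check is the only post-processing the single-call version needs.
function parseFeedbackClass(raw: string): FeedbackClass {
  const parsed = JSON.parse(raw) as Partial<FeedbackClass>;
  const sentiments = ["positive", "neutral", "negative"];
  const priorities = ["low", "medium", "high"];
  if (
    !sentiments.includes(parsed.sentiment ?? "") ||
    !priorities.includes(parsed.priority ?? "") ||
    !Array.isArray(parsed.topics)
  ) {
    throw new Error(`Response does not match schema: ${raw}`);
  }
  return parsed as FeedbackClass;
}

// In production, `raw` would be the model's response to a single
// classification prompt that embeds this schema
const raw = '{"sentiment":"negative","topics":["billing"],"priority":"high"}';
const classified = parseFeedbackClass(raw);
console.log(classified);
```

One call, one validation, one failure point: when this misclassifies, there is exactly one prompt and one response to inspect.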
The test: can you write out the full sequence of steps as deterministic code? If yes — if you can enumerate all branches in the flow — use a pipeline. If the step sequence literally cannot be determined without executing the task and observing intermediate state, an agent may be warranted.
What Comes Next
You know how to build reliable features (articles two through four), avoid hallucinations (five), add retrieval (six through seven), evaluate systematically (eight), control cost (nine), handle production failures (ten), and design system architecture (eleven).
Article twelve is the final piece: debugging. When something goes wrong in a system with multiple AI components, how do you find the failure? That requires tracing, structured logging, and a systematic method for isolating failure sources — the skills that separate an engineer who ships AI systems from one who ships them and hopes they work.