
What AI Engineering Actually Means (and What It Is Not)

A complete mental model for reasoning about AI systems in production — covering architecture, reliability, evaluation, and the layers most engineers skip when they call it done after the first API response.

Why This Article Exists

Most writing about AI falls into two categories: API tutorials that show you how to get a response from a model in fifteen lines of code, or hype pieces that describe AI as a paradigm shift without telling you what to actually build. Neither helps you ship a real feature that behaves reliably in production.

The gap is system-level thinking. How do you design around a component that is probabilistic? How do you test something that does not have a fixed output? How do you control cost without degrading quality? How do you know if your AI feature is actually working?

This article builds the complete mental model. By the end, you will be able to look at any AI system — yours or someone else's — and reason about where it is fragile, what is missing, and how to improve it. No external resources required.

Target audience: engineers building real product features, not researchers training models.


Precise Definitions — Remove Confusion Early

The terms get conflated constantly. Keeping them distinct matters because the engineering concerns are completely different.

ML Engineering is the discipline of training models. It involves datasets, loss functions, training loops, model architecture, hyperparameter tuning, and deployment of the resulting artifacts. The output is a model. Most engineers working in product companies will never do this work directly.

LLM Usage is calling a pretrained model via an API — OpenAI, Anthropic, Gemini, a self-hosted OSS model. You send tokens in and get tokens back. This is the starting point, not the finished product.

AI Engineering is designing and operating systems that are built around pretrained models. The model is one component. AI engineering is concerned with everything surrounding it: reliability, evaluation, cost management, user experience under non-determinism, and iterative improvement. This is the core topic of this article.

The critical distinction: in ML engineering, you have control over the model itself. In AI engineering, you have no control over the model — you work around it. That shift in constraint defines the entire discipline.


The Core Mental Model

Every AI system, regardless of surface complexity, follows the same pipeline:

Input → Preprocessing → Model → Postprocessing → Output → Evaluation → Feedback Loop

Internalize this. When something breaks, it broke at one of these stages.

Input is whatever enters the system: a user message, an API call payload, a scheduled job trigger, a document, an image. Inputs are often noisy, ambiguous, malformed, or adversarial. The system must handle all of it.

Preprocessing transforms raw input into what the model actually receives. This includes prompt construction, context injection, retrieval (if you are pulling relevant documents), conversation history management, and any normalization of the input. This stage is where most of the engineering leverage lives.

Model is the probabilistic text generator. Given a sequence of tokens, it predicts the next token according to a learned probability distribution. That's the complete mechanistic description. Everything the model "knows" or "decides" reduces to this. It is not a database. It is not a reasoning engine with guaranteed correctness. It is a very capable pattern-completion system.

Postprocessing takes raw model output and makes it usable: parsing JSON, extracting structured fields, stripping unwanted prefixes, normalizing formats, applying business logic. If the model returns {"action": "create_task", "title": "..."}, postprocessing is what reads that and calls your task creation function.

Evaluation determines whether the output was correct, useful, and appropriately formatted. Evaluation can be manual (a human reviews a sample), automated (a second model grades the output), or rule-based (did the JSON parse? Did the required fields exist?). Most teams skip this entirely and treat "it returned something" as success.

Feedback Loop closes the cycle. Evaluation results feed back into prompt changes, system improvements, and model selection. Without this loop, the system degrades silently over time as usage patterns shift.

The key architectural insight: the model is just one component. A failure in your system could originate at any stage. A team that treats AI engineering as "configure the model call" will spend months debugging the wrong layer.
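The stages above can be sketched as plain functions. This is a minimal illustration, not a reference implementation: `call_model` is a stub standing in for whatever API you use, and the prompt and checks are invented for the example. The point is the shape of the system, not any particular provider.

```python
import json

def preprocess(user_input: str) -> str:
    # Prompt construction: instructions wrapped around the normalized input.
    return f"Summarize in one sentence:\n\n{user_input.strip()}"

def call_model(prompt: str) -> str:
    # Stub standing in for the real model API call.
    return '{"summary": "stub output"}'

def postprocess(raw_output: str) -> dict:
    return json.loads(raw_output)

def evaluate(result: dict) -> bool:
    # Rule-based check: did we get the field we need?
    return isinstance(result.get("summary"), str)

def run_pipeline(user_input: str) -> dict:
    prompt = preprocess(user_input)
    raw = call_model(prompt)
    result = postprocess(raw)
    if not evaluate(result):
        raise ValueError("evaluation failed")  # a signal for the feedback loop
    return result
```

When something breaks, the question becomes "which of these functions misbehaved?" — far easier to answer than "why is the AI wrong?"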


Deterministic vs Non-Deterministic Systems

A standard backend function is deterministic: given the same input, it produces the same output every time. You can write a unit test that asserts exact equality and run it ten thousand times with confidence.

LLM-based systems are non-deterministic by design. The model samples from a probability distribution. Even at temperature zero (the most deterministic setting available), subtle implementation differences across API versions can cause output drift. At any nonzero temperature, the same prompt will produce meaningfully different outputs across calls.

This has direct engineering implications:

Testing becomes probabilistic. You cannot assert output === expectedString. You assert that the output satisfies a set of properties: it is valid JSON, it contains a "summary" field, the summary is under 200 words, it does not include content from outside the source document. Your test suite evaluates a distribution of outputs, not a single one.
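A property check of this kind might look like the sketch below (field names and limits are illustrative). Note that the hardest property — "no content from outside the source document" — needs a semantic check or an LLM judge and is omitted here.

```python
import json

def check_properties(raw_output: str) -> list[str]:
    """Return the list of violated properties (empty list = pass)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    failures = []
    if not isinstance(data.get("summary"), str):
        failures.append("missing 'summary' field")
    elif len(data["summary"].split()) > 200:
        failures.append("summary over 200 words")
    return failures

# Evaluate a distribution of sampled outputs, not a single one.
samples = ['{"summary": "short and valid"}', "sorry, here is your summary..."]
pass_rate = sum(1 for s in samples if not check_properties(s)) / len(samples)
```

The pass rate over many samples, compared against a threshold, replaces the exact-equality assertion of deterministic testing.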

Debugging is harder. A bug that only appears 15% of the time, with no reproducible seed, requires you to think statistically. You need logging of actual inputs and outputs, not just error traces.

Guardrails are mandatory. Because the model can always produce unexpected output, every production system needs constraints that catch and handle cases where the output is structurally or semantically wrong.

Failure case: A team tests their AI feature manually a dozen times, sees correct output, ships it. In production, 8% of responses fail because the model occasionally returns a different JSON structure. There are no retries and no validation. Users see an error.

Fix: Write property-based evaluations during development, run them against 50–100 generated samples, and define a pass threshold (e.g., 95% structural validity) as your acceptance criterion before shipping.


Why "Calling an API" Is Not AI Engineering

The naive implementation looks like this:

user_input → send to model → return model_output to user

It works in a demo. It fails in production. Here is what happens:

Hallucinations. The model confidently states facts that are false, cites sources that do not exist, or generates plausible-sounding but incorrect content. With no validation layer, this reaches users verbatim.

Format inconsistency. You ask for JSON. Sometimes you get JSON. Sometimes you get JSON wrapped in a markdown code block. Sometimes you get a prose explanation followed by JSON. If your postprocessing expects raw JSON and gets a code block, it throws.

Latency spikes. Model inference time is variable. A response that normally takes 800ms can take 4 seconds under load or with a longer-than-expected output. If your system has no streaming, no timeout handling, and no user feedback, it looks broken.

Cost explosion. Without input length controls, users can send arbitrarily large inputs. Without output length limits, the model can generate arbitrarily large outputs. At scale, unconstrained token usage destroys your budget.

Missing layers that the naive approach lacks:

  • Validation: is the output structurally and semantically correct?
  • Retries: if validation fails, can the system recover?
  • Evaluation: across all requests, how often is the output actually good?
  • Logging: what inputs and outputs are flowing through the system?

Calling an API is the first five minutes. AI engineering is everything after that.


Anatomy of a Real AI Feature

A production AI feature is a pipeline of components, each responsible for one concern:

Request → [Prompt Builder] → [Context Provider] → [Model Interface] → [Output Validator] → [Retry/Fallback] → Response
                                                                                ↓
                                                                      [Evaluation Layer]

Prompt Builder takes the user request and constructs the full model input: system prompt, instructions, constraints, format specification, examples. It is parameterized, not hardcoded. It produces a versioned, testable artifact.

Context Provider enriches the prompt with relevant information: retrieved documents (RAG), user history, session context, tool outputs. This is what allows the model to operate on information beyond its training data.

Model Interface is the abstraction over the actual API call. It handles authentication, timeout configuration, retry on transient errors (5xx, rate limits), model selection, and parameter setting (temperature, max tokens, response format).

Output Validator checks that the model's response is usable. Structural checks: valid JSON, required fields present, types correct. Semantic checks: content is relevant, length is within bounds, no forbidden patterns. If validation fails, the validator signals the retry logic.

Retry/Fallback Logic decides what to do on failure. Options include: retry the same prompt (useful for transient issues), retry with a modified prompt (add "return only valid JSON, no prose"), reduce temperature, switch to a different model (fallback to a smaller/faster model for retries), or return a graceful degraded response.

Evaluation Layer runs asynchronously, scoring outputs against defined criteria. This feeds a dashboard or alert system so you know when quality degrades before users tell you.

Example: A feature that generates meeting summaries. Naive version: send transcript, return summary. Engineered version: prompt builder adds format instructions and length constraints, context provider injects speaker names and meeting metadata, validator checks that all speakers are referenced and no fabricated topics appear, retry logic reduces temperature on the second attempt if validation fails.


Prompt Is Code — Treat It Like Code

A prompt is not a string you type into a chat interface. It is a software artifact that determines the behavior of your system. It should be treated with the same engineering discipline as code.

Version your prompts. Store them in your repository. Track changes with commit history. Every prompt change is a potential regression — you need to know what changed and when.

Use structured templates. Parameterize your prompts so variables (user input, context, examples) are injected cleanly. This prevents prompt injection vulnerabilities and makes the template testable in isolation.

You are a document summarizer. Summarize the following document in {max_words} words or fewer.
Focus on: {focus_areas}.
Return JSON: {"summary": "...", "key_points": ["...", "..."]}.

Document:
{document_text}
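Filling that template in code might look like this. The defaults and names (`build_prompt`, the focus area) are invented for the example; the useful habits are keeping the template as a module-level constant and escaping literal JSON braces with `{{ }}` so `str.format` leaves them alone.

```python
SUMMARIZER_TEMPLATE = """\
You are a document summarizer. Summarize the following document in {max_words} words or fewer.
Focus on: {focus_areas}.
Return JSON: {{"summary": "...", "key_points": ["...", "..."]}}.

Document:
{document_text}"""

def build_prompt(document_text: str, max_words: int = 100,
                 focus_areas: str = "key decisions") -> str:
    # Explicit keyword injection keeps the template testable in isolation.
    return SUMMARIZER_TEMPLATE.format(
        max_words=max_words,
        focus_areas=focus_areas,
        document_text=document_text,
    )
```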

Use few-shot examples strategically. Including 2–3 examples of correct input-output pairs in the prompt dramatically improves output consistency for complex tasks. The model learns the expected format and reasoning style from the examples, not just from the instruction.
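One common way to deliver few-shot examples is as prior conversation turns, so the model completes the pattern. A minimal sketch (the example pairs here are invented):

```python
# Few-shot pairs expressed as prior user/assistant turns.
FEW_SHOT = [
    {"role": "user", "content": "Summarize: The meeting covered Q3 budget cuts."},
    {"role": "assistant", "content": '{"summary": "Q3 budget cuts were discussed."}'},
    {"role": "user", "content": "Summarize: The team shipped the new login flow."},
    {"role": "assistant", "content": '{"summary": "The new login flow shipped."}'},
]

def build_messages(user_input: str) -> list[dict]:
    # The real request comes last, after the examples set the pattern.
    return FEW_SHOT + [{"role": "user", "content": f"Summarize: {user_input}"}]
```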

Understand regression risk. Changing a prompt to improve performance on one case can silently degrade performance on others. Before any prompt change, run your evaluation suite against the new prompt and compare pass rates.

Failure case: A developer tweaks the prompt to fix a specific user complaint. The change improves that one case and breaks output format for 12% of existing inputs. There is no evaluation suite. The regression ships undetected.

Fix: Before any prompt change, run it against a curated set of test inputs (start with 20–30 representative examples, grow to hundreds) and compare the output quality metrics. Gate prompt deploys on evaluation pass rate.


The Reliability Layer — The Most Ignored Part

Reliability in AI systems is not optional. It is the layer that separates a demo from a production feature.

Schema enforcement means specifying the exact structure you expect from the model and validating against it programmatically. If you need JSON with a summary string and a confidence number between 0 and 1, check that. Do not pass raw model output to downstream systems.

Modern LLM APIs offer structured output modes (JSON mode, function calling schemas) that constrain the model to produce syntactically valid JSON. Use them. They eliminate an entire class of parsing failures.

Validation rules go beyond syntax. Business logic validation: does the summary reference the source document? Does the extracted date fall within a reasonable range? Is the generated code syntactically valid? These checks are domain-specific and cannot be delegated to the API.

Retry strategies:

  • Same prompt, transient error: For 5xx errors or rate limits. Exponential backoff.
  • Modified prompt: For semantic failures. Add constraints, rephrase the instruction, add an explicit example.
  • Reduced temperature: When output is correct in structure but inconsistently formatted. Lower temperature increases determinism.
  • Fallback model: When primary model is unavailable or too slow. A smaller model with a simpler prompt can handle degraded-but-functional responses.
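The dispatch logic for these strategies can be sketched as a small function plus a backoff calculator. The error-kind names are invented for the example; map them to whatever your model interface actually raises.

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    # Exponential backoff with jitter: grows 0.5s, 1s, 2s, ... up to the cap.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def next_action(error_kind: str) -> str:
    # Map the failure type to one of the strategies listed above.
    if error_kind in ("server_error", "rate_limited"):
        return "retry_same"        # transient: same prompt, after backoff
    if error_kind == "invalid_structure":
        return "retry_modified"    # add constraints or lower temperature
    if error_kind == "model_unavailable":
        return "fallback_model"
    return "give_up"
```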

Guardrails vs strict constraints. Guardrails are soft checks that flag potential issues (output might be off-topic) but allow the response through. Strict constraints are hard gates (output must parse as JSON, output must not contain PII patterns). Use strict constraints on output structure; use guardrails for semantic quality where strict rejection would be too aggressive.

Failure case: Retry logic retries any failed request three times. A prompt that consistently produces invalid output causes three identical failing calls, triples the cost, and returns an error to the user after maximum latency.

Fix: Distinguish between transient failures (retry same) and persistent failures (modify prompt or fall back). Add a circuit breaker pattern: if a given prompt+model combination has a failure rate above a threshold, route to fallback immediately without exhausting retries.
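A minimal sketch of that circuit breaker, assuming a sliding window of recent outcomes per prompt+model combination (class and parameter names are illustrative):

```python
from collections import deque

class PromptCircuitBreaker:
    """Route straight to fallback when a prompt+model combo keeps failing."""

    def __init__(self, window: int = 20, threshold: float = 0.5):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def is_open(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet to judge
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) >= self.threshold
```

Before each call, check `is_open()`; if the circuit is open, skip the primary model and go directly to the fallback, saving both cost and latency.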


Evaluation — The Missing Discipline

Most AI features ship without any evaluation infrastructure. The heuristic is "I tested it a few times and it looked good." This is the equivalent of shipping code with no tests because manual QA passed once.

Why "looks good" is not sufficient: You cannot manually review thousands of requests. Output quality degrades silently as usage patterns change. Prompt changes cause regressions you do not catch. A feature that works for the 10 inputs you tested may fail on 15% of inputs in production.

Build a test dataset. Start small: 20–50 examples covering the common case, edge cases, and known hard cases. Each example includes an input and the criteria for a correct output (not always an exact expected string — often a set of properties). Grow this dataset continuously as you find new failure modes.

Define expected behavior precisely. For a summarization system: the summary should be under 150 words, should not contain information not present in the source document, should include the main conclusion, should be in English. These are testable. "Should be a good summary" is not.
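Those criteria translate directly into named boolean checks. A sketch — the "nothing outside the source" property is approximated crudely here by word overlap; a real system would use a semantic check or an LLM judge:

```python
def check_summary(summary: str, source: str) -> dict:
    # Each criterion from the text as a named, testable boolean.
    words = summary.split()
    normalized = {w.lower().strip(".,") for w in words}
    # Crude grounding proxy: how many summary words never appear in the source?
    novel = {w for w in normalized if w and w not in source.lower()}
    return {
        "under_150_words": len(words) <= 150,
        "nonempty": len(words) > 0,
        "mostly_grounded": len(novel) <= max(1, len(normalized) // 2),
    }
```

Each key maps to one requirement, so a failure report tells you exactly which property broke rather than "the summary was bad."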

Evaluation levels:

  1. Rule-based: Regex, schema validation, length checks. Fast, deterministic, catches structural failures.
  2. Heuristic: ROUGE scores, embedding similarity. Statistical, requires calibration.
  3. LLM-as-judge: A second model evaluates the first model's output against criteria. Scalable, handles semantic quality, but adds cost and latency.
  4. Human: Expensive, slow, but ground truth. Reserve for calibrating automated evaluators.

Continuous evaluation loop: Run your evaluation suite on a sample of live traffic. Alert when pass rates drop below threshold. Review failures to identify new failure modes and add them to the dataset. This loop is what lets the system improve over time instead of degrading.

Example: A team builds a legal clause extraction feature. Their eval dataset has 40 contracts. They define: all extracted clauses must appear verbatim in the source, each clause must have a type label from a fixed taxonomy, no clauses from other documents may appear. They run evals on every prompt change and on a 5% sample of production traffic. They catch a regression three days after a prompt update before any user files a complaint.


Cost and Latency Are First-Class Concerns

Token-based pricing means every character in and out of the model costs money. At small scale this is invisible. At scale it becomes the primary engineering constraint.

Understand the token cost model. You pay for input tokens (everything in the prompt, including system instructions, context, examples) and output tokens (the model's response). A prompt with a 2,000-token context injected on every request, called 10 million times per month, is 20 billion input tokens. That number needs to be on your design spreadsheet before you write the first line of code.
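The arithmetic from that paragraph, worked out with a hypothetical per-token price (check your provider's current pricing before relying on any number here):

```python
context_tokens_per_request = 2_000
requests_per_month = 10_000_000

input_tokens_per_month = context_tokens_per_request * requests_per_month
# = 20,000,000,000 tokens — twenty billion, as in the text

PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical $/1M input tokens
monthly_context_cost = (input_tokens_per_month / 1_000_000
                        * PRICE_PER_MILLION_INPUT_TOKENS)
```

At that hypothetical price, the injected context alone costs $60,000 per month before a single output token is generated — which is why context size is a design decision, not an afterthought.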

Sources of latency:

  • Network: Round-trip to the model API. Typically 50–200ms baseline.
  • Model inference: Scales with output length. A 2,000-token output takes roughly 10x the inference time of a 200-token output.
  • Retrieval: If you use RAG, vector search and document fetching add latency before the model call even starts.
  • Postprocessing: Validation, retries, and logging all add time.

Basic optimizations:

Shorter prompts. Every token in the system prompt is paid on every call. Compress instructions. Remove redundant phrasing. Use examples only when they meaningfully improve output quality (measure this).

Caching. For inputs that recur (same document, same query pattern), cache model outputs. Even a simple in-memory or Redis cache on a stable prompt hash can eliminate 30–50% of model calls in many applications.
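The prompt-hash cache can be this small. A sketch using an in-process dict — in production you would likely swap in Redis with a TTL, and `call_model` is whatever your model interface exposes:

```python
import hashlib

_cache: dict[str, str] = {}  # in production: Redis with a TTL

def cached_call(prompt: str, call_model) -> str:
    # Key on a stable hash of the full prompt; identical prompts hit the cache.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]
```

Note the caveat this implies: the cache only helps if the prompt is byte-identical, so nondeterministic elements (timestamps, request IDs) must stay out of the cached portion of the prompt.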

Model selection. Use the smallest model that meets your quality bar. A smaller model at one-fifth the cost and half the latency is not a compromise — it is an engineering win, if it passes your evaluation threshold. Reserve large, expensive models for the tasks that actually require them.

Streaming. For user-facing features, stream the model's output token-by-token instead of waiting for the full response. Time-to-first-token of 200ms feels responsive; waiting 3 seconds for the complete response does not, even if the total time is the same.
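The consumer side of streaming is a loop that emits each token as it arrives and records time-to-first-token. A sketch with a stubbed token stream standing in for a real streaming API:

```python
import time

def fake_token_stream():
    # Stub standing in for a streaming model response.
    for tok in ["The ", "summary ", "is ", "ready."]:
        yield tok

def stream_to_user(token_stream, emit):
    # Emit each token immediately; measure time-to-first-token.
    start = time.monotonic()
    ttft = None
    for tok in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start
        emit(tok)  # e.g. write to an SSE/websocket connection
    return ttft
```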


Minimal End-to-End Example

Building a "summarize text" API. Four iterations from naive to engineered.

Iteration 1 — Naive

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def summarize(text: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{"role": "user", "content": f"Summarize this: {text}"}]
    )
    return response.content[0].text

Problems: no length constraint on input or output, no format specification, no validation, no error handling.

Iteration 2 — Add Structure

SYSTEM_PROMPT = """You are a document summarizer.
Return a JSON object with this exact structure:
{"summary": "<summary text>", "word_count": <integer>, "key_points": ["<point>", ...]}
The summary must be 50–150 words. Include 3–5 key points.
Do not add any text outside the JSON object."""

def summarize(text: str) -> str:
    if len(text) > 8000:
        text = text[:8000]  # crude truncation; improve with chunking
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": text}]
    )
    return response.content[0].text

Iteration 3 — Add Validation

import json

def parse_and_validate(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("Model returned non-JSON output")
    
    required = {"summary", "word_count", "key_points"}
    if not required.issubset(data.keys()):
        raise ValueError(f"Missing fields: {required - data.keys()}")
    if not (50 <= data["word_count"] <= 200):  # prompt asks for 50–150; allow slack
        raise ValueError(f"word_count out of range: {data['word_count']}")
    if not (2 <= len(data["key_points"]) <= 6):
        raise ValueError(f"key_points count invalid: {len(data['key_points'])}")
    
    return data

Iteration 4 — Add Retry

def summarize(text: str, max_attempts: int = 3) -> dict:
    last_error = None
    for attempt in range(max_attempts):
        try:
            raw = call_model(text)
            return parse_and_validate(raw)
        except ValueError as e:
            last_error = e
            # On retry, reinforce the JSON constraint in the user turn
            text = text + "\n\nIMPORTANT: Return only the JSON object. No other text."
    raise RuntimeError(f"Failed after {max_attempts} attempts: {last_error}")

The evolution from iteration 1 to iteration 4 is AI engineering. None of it touched the model. All of it makes the system reliable.


Failure Case Walkthrough

Scenario: A content moderation feature classifies user submissions as safe, review, or remove. Input: a short-form text post. Output: a JSON object with a classification field.

What happens:

A user submits: "I can't believe they let him walk free. Disgusting."

The model returns:

This post expresses strong negative sentiment and could be referring to a legal case, 
a sports event, or a political situation. Classification: review.

{"classification": "review", "reason": "ambiguous context"}

Your postprocessing expects raw JSON. It receives prose followed by JSON. json.loads() throws. The request returns a 500. The user's post is neither moderated nor surfaced for human review — it falls into a gap.

Tracing the failure:

  • The prompt asked for JSON but did not prohibit prose preambles.
  • The model added an explanation before the JSON, which is a natural behavior.
  • Postprocessing assumed clean JSON output without validating.
  • No retry logic attempted recovery.

Fix:

  1. Add to the system prompt: "Return only the JSON object. Do not include any text before or after the JSON."
  2. Use the API's JSON mode if available (forces syntactically valid JSON output).
  3. In postprocessing, attempt to extract a JSON object from the response before failing: re.search(r'\{.*\}', raw, re.DOTALL).
  4. On validation failure, retry with the instruction reinforced in the user turn.
  5. If all retries fail, route to a human review queue rather than returning a 500.

Failures of this type — format inconsistency due to model verbosity — are among the most common in production AI systems. The fix is never "the model should just do what I said." The fix is defensive postprocessing.
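Putting the defensive-postprocessing fix into code: try a clean parse first, then salvage an embedded JSON object from the prose. A sketch using the regex approach from the fix list:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Try a clean parse first, then salvage a JSON object from prose."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        return json.loads(match.group(0))  # may still raise; caller retries
    raise ValueError("no JSON object found in model output")
```

Against the failure case above, the clean parse fails on the prose preamble, the regex finds the trailing object, and the classification is recovered instead of returning a 500.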


Key Design Principles

Treat the LLM as an unreliable component. The same way you would not call an external payment API and assume it always succeeds, do not assume the model always returns what you asked for. Design for failure from the first line of code.

Always validate outputs. No model output should reach downstream systems, databases, or users without passing structural and semantic checks. Validation is not optional for edge cases — it is the default path for every response.

Design for failure first. Before building the happy path, define: what is the fallback? What does the user see if the model call fails? What happens if validation fails three times in a row? A system with no answers to these questions is not production-ready.

Measure everything. Evaluation pass rate, cost per request, latency p50/p95/p99, retry rate, fallback rate. If you cannot observe these numbers, you cannot improve the system. Set up logging and metrics from day one.

Iterate with the feedback loop. Every failure in production is a new test case. Every user complaint is a signal. Every evaluation failure is an improvement opportunity. The discipline is in closing the loop: observe → analyze → improve → evaluate → ship.


What to Learn Next

This article establishes the mental model. The remaining depth is in the individual layers:

Prompt engineering (deep): Structured prompt design, chain-of-thought elicitation, tool use, system prompt architecture. The difference between a prompt that works sometimes and one that works reliably.

Structured outputs: JSON schema enforcement, function calling, typed response parsing. Making the model's output consumable by software.

Retrieval-Augmented Generation (RAG): Injecting external knowledge into the model's context at inference time. How to design the retrieval pipeline, chunking strategies, embedding models, and reranking.

Evaluation pipelines: Building automated evaluation infrastructure, LLM-as-judge setups, creating and maintaining eval datasets, tracking evaluation metrics over time in CI.

Each of these is a production engineering problem with the same structure: understand the failure mode, build the layer, measure the result.


Appendix

Glossary

Tokens — The unit of input and output for LLMs. Not words; roughly 0.75 words per token in English. "The quick brown fox" is 4 words and 4–5 tokens. Pricing and context limits are measured in tokens, and inference latency scales with token count.

Temperature — A parameter controlling how much randomness is applied to the model's token sampling. Temperature 0 = most deterministic (model picks highest-probability token each time). Temperature 1 = full distribution sampling. Higher temperature = more creative and variable; lower temperature = more consistent and conservative.

Context window — The maximum number of tokens the model can process in a single request (input + output combined). Claude 3.5's context window is 200k tokens. Exceeding it requires chunking, summarization, or truncation strategies.

Embeddings — Dense vector representations of text. Semantically similar text produces vectors that are close in vector space. Used for retrieval: convert documents to embeddings, store them, convert a query to an embedding, find the closest stored vectors, retrieve the corresponding documents.
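"Close in vector space" usually means cosine similarity. A minimal sketch of similarity plus top-k retrieval over stored vectors (in practice a vector database does this at scale):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 = same direction, 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, top_k=1):
    # Rank stored document vectors by similarity to the query vector.
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_k]
```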

Hallucination — The model generates content that is factually incorrect or not supported by the input, stated with the same confidence as correct content. Not a bug; an intrinsic property of probabilistic text generation. Mitigated (not eliminated) by grounding the model in retrieved context, output validation, and constrained prompts.

RAG (Retrieval-Augmented Generation) — A pattern where relevant documents are retrieved and injected into the model's context before generation. Allows the model to answer questions about information it was not trained on, with lower hallucination risk than relying on parametric memory.

Latency — The time from when a request is sent to when the response is received. For streaming responses, time-to-first-token is the more relevant metric for perceived responsiveness.


Common Mistakes Checklist

Before shipping any AI feature, verify:

  • [ ] All model outputs are validated before use
  • [ ] Retry logic distinguishes transient failures from persistent ones
  • [ ] A fallback path exists for when all retries fail
  • [ ] Prompts are versioned and stored in the repository
  • [ ] An evaluation dataset exists with at least 20 representative test cases
  • [ ] Pass rate on the eval dataset is above a defined threshold
  • [ ] Input length is bounded to prevent unexpected cost/latency
  • [ ] Output length is bounded with max_tokens
  • [ ] Logging captures inputs, outputs, validation failures, and retry counts
  • [ ] Cost per request is estimated and tracked
  • [ ] Latency p95 is measured under realistic load

Minimal Code Reference

These patterns are language-agnostic. Adapt to your stack.

Prompt template with injection:

SYSTEM:
You are a {role}. {task_description}
Output format: {format_spec}
Constraints: {constraints}

USER:
{user_input}

Context:
{injected_context}

Validate-or-retry loop:

result = None
for attempt in 1..MAX_ATTEMPTS:
    raw_output = call_model(prompt)
    result = validate(raw_output)
    if result.valid:
        break
    prompt = add_retry_instruction(prompt, result.error)

if result == None or not result.valid:
    return fallback_response()

return result.data

Evaluation runner:

pass_count = 0
for test_case in eval_dataset:
    output = system.run(test_case.input)
    if all(check(output) for check in test_case.checks):
        pass_count += 1

pass_rate = pass_count / len(eval_dataset)
assert pass_rate >= PASS_THRESHOLD, f"Eval failed: {pass_rate:.1%}"

Cost estimation:

estimated_input_tokens = len(system_prompt_tokens) + len(context_tokens) + len(user_input_tokens)
estimated_output_tokens = max_tokens_setting
estimated_cost_per_request = (estimated_input_tokens * INPUT_PRICE) + (estimated_output_tokens * OUTPUT_PRICE)
monthly_cost_estimate = estimated_cost_per_request * expected_monthly_requests

Run this calculation before you write the prompt. Adjust model choice, context size, and caching strategy accordingly.