
Prompt Engineering That Actually Matters

The gap between a prompt that works sometimes and one that works reliably. Structured prompt design, system vs user roles, output schemas, and using examples — with concrete before/after comparisons on the same task.

The Problem With Most Prompt Advice

Most prompt engineering advice is about talking to a chatbot. "Be specific," "give it context," "ask step by step." That is fine for personal use. It is not engineering.

Production prompts are software artifacts. They control the behavior of a non-deterministic component that runs at scale. A change to the prompt is a code change with a potential blast radius across all live requests. Treating it as anything else leads to the failure mode described in article one: a developer tweaks phrasing to fix one user's complaint, quietly breaks format compliance for 12% of inputs, and ships the regression undetected.

This article is about prompt design as an engineering discipline — the structural decisions that make the difference between "sometimes works" and "reliably works."

We are building on the summarizer from article two. The examples use that context, but the patterns apply universally.


System Prompts vs User Prompts: More Than a Convention

Every major LLM API distinguishes between the system role and the user role. This is not cosmetic.

System prompt: Defines persistent context, identity, behavioral constraints, and output format requirements. Sent once per session (or per request in stateless APIs). The model treats this as its operating instructions.

User prompt: The per-request input. What the user (or your application) is asking for in this specific call.

The practical engineering distinction: put everything that does not change between requests in the system prompt. Format requirements, role definition, behavioral constraints, output schema, examples of correct responses. These belong in the system prompt because:

  1. They do not need to repeat across conversation turns.
  2. In APIs that support prompt caching (Anthropic's cache control, OpenAI's prompt caching), the system prompt can be cached at the token level — you pay for it once, not on every call.
  3. The model's attention on behavioral constraints is higher when they appear in the system role than when buried in a long user message.

The wrong way:

// BAD: format instructions mixed with user input
messages: [{
  role: "user",
  content: `Please summarize the following text. Return JSON with summary, wordCount, and keyPoints fields. The summary should be under 150 words. Include 3 key points. Text: ${userText}`
}]

The right way:

// GOOD: format instructions in system; input in user
system: `You are a document summarizer.

Return ONLY valid JSON. No other text.
Schema:
{
  "summary": "<string, max 150 words>",
  "wordCount": <integer>,
  "keyPoints": ["<string>", "<string>", "<string>"]
}

Rules:
- Summary contains only information from the source document.
- keyPoints is always an array of exactly 3 strings.
- Do not include any text before or after the JSON object.`,

messages: [{
  role: "user",
  content: userText
}]

The second version is cleaner for the model, testable in isolation (you can change the system prompt and test it against a fixed set of user messages), and cacheable.
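The caching benefit is concrete enough to sketch. Below is the approximate request shape when using Anthropic's prompt caching, where the stable system prompt is marked cacheable and the per-request user message is not. The model id is illustrative, and field names should be verified against the current Messages API documentation.

```typescript
const SYSTEM_PROMPT = `You are a document summarizer. ...`; // full system prompt here

// Build the request body: the system block carries cache_control so its
// tokens can be cached across calls; the user message changes per request.
function buildRequest(userText: string) {
  return {
    model: "claude-sonnet-4", // illustrative model id
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userText }],
  };
}
```

Only the stable system block benefits from caching; per-request user content never does, which is one more reason to keep unchanging instructions out of the user message.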


Prompt Structure That Scales

Think of a production system prompt as having four layers. Each layer serves a different purpose.

┌────────────────────────────────────────────┐
│  1. Role & Identity                        │  Who you are
├────────────────────────────────────────────┤
│  2. Task Description                       │  What you do
├────────────────────────────────────────────┤
│  3. Constraints & Edge Cases               │  How you behave in specific situations
├────────────────────────────────────────────┤
│  4. Output Schema                          │  Exactly what you return
└────────────────────────────────────────────┘
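These layers can be assembled mechanically, which keeps each one reviewable on its own. A minimal sketch (the helper and its field names are my own, not from any library):

```typescript
interface PromptLayers {
  role: string;          // 1. Role & Identity
  task: string;          // 2. Task Description
  constraints: string[]; // 3. Constraints & Edge Cases
  outputSchema: string;  // 4. Output Schema
}

// Join the four layers into one system prompt, in the order above.
function buildSystemPrompt(p: PromptLayers): string {
  return [
    p.role,
    `Task: ${p.task}`,
    `Constraints:\n${p.constraints.map(c => `- ${c}`).join("\n")}`,
    `Output schema (return this exact structure):\n${p.outputSchema}`,
  ].join("\n\n");
}
```

Keeping layers as separate fields also makes prompt diffs readable: a reviewer sees immediately whether a change touched the constraints or the schema.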

Identity is not just flavor text. It shapes the model's sense of its responsibilities and affects tone, verbosity, and caution level. "You are a JSON-only data extraction engine" produces different output characteristics than "You are a helpful assistant that summarizes documents" even when the task description is identical.

Task description states what you are doing in this call. Keep it to one or two sentences. If you need more, you are probably combining two tasks into one prompt — split them.

Constraints are where most of the reliability engineering lives. Common constraint categories:

  • Coverage constraints: "Only use information present in the source document. Do not infer or speculate."
  • Format constraints: "Do not include markdown. Do not wrap JSON in code fences."
  • Scope constraints: "If the document is not in English, return {\"error\": \"unsupported language\"}."
  • Edge case handling: "If the document has no identifiable key points, return \"keyPoints\": []."

Output schema comes last, stated precisely. Include the type of every field. Include range constraints (min/max length, integer vs float). If the schema is complex, include an example object.

Here is the summarizer prompt structured this way:

You are a document summarization engine. You produce structured JSON output only.

Task: Summarize the document in the user message.

Constraints:
- Summaries must be factually grounded in the source document only. Do not add, infer, or speculate.
- If the document is fewer than 50 words, return {"error": "document too short"}.
- Do not include any text before or after the JSON object.
- Do not use markdown code fences.
- keyPoints is always an array. If no distinct points exist, return [].

Output schema (return this exact structure):
{
  "summary": "<string: 50–150 words>",
  "wordCount": <integer: word count of summary field>,
  "keyPoints": ["<string>", ...]
}
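A schema this precise can also be checked mechanically on the client side. A minimal runtime guard for the contract above, with no schema library (the function name is mine):

```typescript
type SummaryResult =
  | { summary: string; wordCount: number; keyPoints: string[] }
  | { error: string };

// True if the parsed output matches the contract above: either an
// error object, or a 50-150 word summary with an integer wordCount
// and an array of string keyPoints.
function isValidSummary(r: unknown): r is SummaryResult {
  if (typeof r !== "object" || r === null) return false;
  const o = r as Record<string, unknown>;
  if (typeof o.error === "string") return true; // error-object branch
  if (typeof o.summary !== "string") return false;
  const words = o.summary.trim().split(/\s+/).length;
  return (
    words >= 50 && words <= 150 &&
    Number.isInteger(o.wordCount) &&
    Array.isArray(o.keyPoints) &&
    o.keyPoints.every(k => typeof k === "string")
  );
}
```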

Comparing Three Prompt Styles

The same task, three different approaches. The task: extract the deadline, assignee, and priority from a project update message.

Input:

Client call at 3pm today confirmed: John is taking point on the dashboard migration, 
needs to be production-ready by end of Q1. Mark it high priority.

Expected output structure:

{ "deadline": "end of Q1", "assignee": "John", "priority": "high" }

Style 1 — Vague

Extract the deadline, assignee, and priority from this message.

Actual model output (tested on 20 inputs):

  • 15/20 return correct JSON
  • 3/20 return prose: "The deadline is end of Q1, assigned to John, priority is high."
  • 2/20 return JSON but with extra fields added ("notes": "dashboard migration")

Problems:

  • No format constraint → model freely returns prose
  • No schema → model invents field names
  • No edge case handling → what happens when there is no deadline? The model invented "deadline": "not specified" on one input and omitted the field on another

Style 2 — Schema Only

Return JSON with keys: deadline, assignee, priority.
Extract the deadline, assignee, and priority from this message.

Actual output (20 inputs):

  • 18/20 return valid JSON with correct keys
  • 1/20 returns JSON wrapped in markdown fences
  • 1/20 returns null for deadline even though a deadline exists in the message

Progress: Format compliance improved. Still missing edge case handling and exact value constraints.


Style 3 — Structured (Role + Task + Constraints + Schema)

You are a structured data extraction engine. You extract fields from text and return JSON.

Task: Extract deadline, assignee, and priority from the message.

Constraints:
- Return only the JSON object. No other text, no markdown.
- If a field cannot be determined from the text, use null.
- priority must be one of: "low", "medium", "high", or null.
- deadline must be a string exactly as stated in the message, or null.
- assignee must be the person's name as stated in the message, or null.

Output:
{"deadline": <string|null>, "assignee": <string|null>, "priority": "low"|"medium"|"high"|null}

Actual output (20 inputs):

  • 20/20 return valid JSON
  • 20/20 correct keys
  • 19/20 correct values
  • 1 failure: model normalized "end of Q1" to "March 31" (not what we asked for — added an explicit constraint in the next revision)

The delta between style 1 and style 3 is not cleverness. It is structure. The model is not smarter in style 3. It has fewer degrees of freedom and explicit behavior specified for the cases that previously caused failures.
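Those reduced degrees of freedom can be enforced a second time at parse time, so an output that slips past the prompt still cannot slip past your code. A sketch (types and names are mine):

```typescript
type Priority = "low" | "medium" | "high" | null;

interface Extraction {
  deadline: string | null;
  assignee: string | null;
  priority: Priority;
}

const PRIORITIES = new Set(["low", "medium", "high"]);

// Reject outputs that violate the contract even when JSON parsing succeeds:
// wrong types for deadline/assignee, or a priority outside the enum.
function parseExtraction(raw: string): Extraction | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const o = parsed as Record<string, unknown>;
  const okStr = (v: unknown) => v === null || typeof v === "string";
  if (!okStr(o.deadline) || !okStr(o.assignee)) return null;
  if (o.priority !== null && !PRIORITIES.has(o.priority as string)) return null;
  return o as unknown as Extraction;
}
```

The prompt and the parser encode the same contract; when they agree, a failure in either layer is caught before it reaches downstream code.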


Using Examples (Few-Shot Prompting)

Few-shot examples are the highest-leverage technique for complex output formats. They teach the model the expected response pattern by demonstration rather than instruction.

When to use few-shot:

  • Output format is complex or unusual
  • The task requires nuanced judgment (sentiment classification, tone matching)
  • Instruction-only prompts produce 80–90% compliance but not higher
  • You need specific reasoning behavior (chain-of-thought)

When NOT to use few-shot:

  • Simple structured extraction where a clear schema is sufficient
  • Long examples push important instructions out of the model's attention window
  • Examples add token cost on every request without improving output quality

Rule of thumb: add examples only after measuring that they improve your evaluation pass rate. Do not add them by default.

Example placement in the prompt:

System:
You are a document tagging engine. Label documents with relevant topic tags.

Return JSON:
{"tags": ["<tag>", ...], "confidence": <0.0–1.0>}

Rules:
- Return 1–5 tags.
- Tags are lowercase, hyphenated where multi-word (e.g., "machine-learning").
- confidence reflects how clearly the document belongs to these topics.

Examples:

Input: "We deployed a new Kubernetes cluster in us-east-1 using EKS..."
Output: {"tags": ["kubernetes", "aws", "infrastructure", "devops"], "confidence": 0.92}

Input: "The Q3 revenue figures show a 12% increase over forecast..."
Output: {"tags": ["finance", "reporting", "quarterly-results"], "confidence": 0.88}

Input: "Our team has been exploring mindfulness practices for focus..."
Output: {"tags": ["productivity", "wellness"], "confidence": 0.71}

The examples do three things: demonstrate the tag naming convention (lowercase, hyphenated), show the confidence calibration range (not always 0.95+), and implicitly demonstrate that low-confidence inputs get fewer tags.
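One practical habit: keep examples as data rather than hard-coded strings, so adding or removing one is a one-line change you can measure against the eval suite. A sketch (helper and type names are mine):

```typescript
interface FewShot {
  input: string;
  output: Record<string, unknown>;
}

// Render few-shot pairs into the "Examples:" section of a system prompt.
function renderExamples(examples: FewShot[]): string {
  const body = examples
    .map(e => `Input: ${JSON.stringify(e.input)}\nOutput: ${JSON.stringify(e.output)}`)
    .join("\n\n");
  return `Examples:\n\n${body}`;
}
```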


Chain-of-Thought: When You Need It, When You Don't

Chain-of-thought (CoT) is a technique where you instruct the model to reason through a problem before producing the final answer. It significantly improves accuracy on tasks requiring multi-step reasoning.

When CoT helps:

  • Classification tasks where the decision is not obvious from surface features
  • Tasks with multiple conditions that must all be checked
  • Extraction tasks where the correct answer requires interpreting ambiguous language

When CoT does not help (and costs you):

  • Simple format transformations — CoT adds output tokens with no quality benefit
  • Tasks where the model is already near ceiling accuracy with a direct prompt
  • Latency-sensitive paths — CoT increases output tokens and proportionally increases inference time

Implementing CoT for extraction:

You are an extraction engine. Reason carefully, then return your final answer.

For each field:
1. Identify the relevant span in the message.
2. Determine if it clearly maps to the field or is ambiguous.
3. Set the value accordingly.

After reasoning, return the JSON on a line that starts with "RESULT:".

Example:
Message: "Sarah to finish the report by Friday"
Reasoning:
- deadline: "by Friday" → clear, value = "Friday"
- assignee: "Sarah" → clear, value = "Sarah"  
- priority: not mentioned → value = null
RESULT: {"deadline": "Friday", "assignee": "Sarah", "priority": null}

Note the output format: the RESULT: prefix lets your postprocessing reliably extract just the JSON even when CoT reasoning precedes it. This is cleaner than regex-scraping a bare JSON object from reasoning text.
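A sketch of that postprocessing step (the function is mine; it scans from the end of the response, so reasoning text that happens to mention RESULT: earlier is ignored):

```typescript
// Extract the JSON payload from a CoT response that ends with a
// "RESULT:" line. Returns null if the marker or JSON is missing/invalid.
function extractResult(response: string): unknown {
  const line = response
    .split("\n")
    .reverse()
    .find(l => l.trimStart().startsWith("RESULT:"));
  if (!line) return null;
  const payload = line.slice(line.indexOf("RESULT:") + "RESULT:".length).trim();
  try {
    return JSON.parse(payload);
  } catch {
    return null;
  }
}
```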

The tradeoff: CoT doubles or triples your output tokens on average. For a task that runs at 100K requests/day, measure the quality improvement against the token cost before using it in production.
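To make "measure against the token cost" concrete, here is the back-of-envelope arithmetic. Every number below is an assumption for illustration, not a real provider price or a measured token count:

```typescript
// Back-of-envelope daily cost of CoT at scale. All figures ILLUSTRATIVE.
const REQUESTS_PER_DAY = 100_000;
const OUTPUT_TOKENS_DIRECT = 60;   // direct answer (assumed)
const OUTPUT_TOKENS_COT = 180;     // ~3x with reasoning (assumed)
const PRICE_PER_1M_OUTPUT = 10;    // USD per 1M output tokens, hypothetical

function dailyOutputCost(tokensPerRequest: number): number {
  return (REQUESTS_PER_DAY * tokensPerRequest / 1_000_000) * PRICE_PER_1M_OUTPUT;
}

const extraPerDay =
  dailyOutputCost(OUTPUT_TOKENS_COT) - dailyOutputCost(OUTPUT_TOKENS_DIRECT);
// extraPerDay → 120 (USD/day), purely from the CoT output tokens
```

If the eval pass rate does not improve by enough to justify that recurring cost, CoT belongs only on the inputs that need it, not on every request.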


Prompt Versioning and Regression Testing

Prompts change. Requirements change. You discover a failure mode and add a constraint. Each change is a potential regression.

Minimum viable prompt versioning:

// prompts/summarize.ts
export const SUMMARIZE_V3 = {
  version: "3.0",
  system: `You are a document summarization engine...`, // full prompt
  updatedAt: "2025-01-22",
  changelog: "Added edge case handling for short documents (<50 words)"
};

Store prompts as versioned constants in your codebase. Do not construct them dynamically from concatenated strings scattered across files — that makes tracking changes impossible.

Before each prompt change, run your evaluation suite:

// eval/summarize.eval.ts
const TEST_CASES = [
  {
    input: "The new product launches next week...", // 80 words
    checks: [
      (r: SummaryResult) => r.wordCount >= 20 && r.wordCount <= 150,
      (r: SummaryResult) => r.keyPoints.length >= 2,
      (r: SummaryResult) => !r.summary.includes("product launches next week") // no verbatim copy
    ]
  },
  {
    input: "Short.", // edge case: very short document
    checks: [
      (r: any) => r.error === "document too short"
    ]
  },
  // ... 18 more cases
];

async function runEval(prompt: typeof SUMMARIZE_V3): Promise<number> {
  let passed = 0;
  for (const tc of TEST_CASES) {
    try {
      const result = await summarize(tc.input, prompt);
      if (tc.checks.every(check => check(result))) passed++;
    } catch { /* request or parse error counts as a failure */ }
  }
  return passed / TEST_CASES.length;
}

You do not need a framework for this. Twenty test cases, a loop, and a pass rate. Run it before and after any prompt change. If the rate drops, the change is a regression regardless of whether it fixed the specific case you were targeting.


The Failure Case: Vague Prompts → Bad Outputs

Scenario: A classifier receives support tickets and routes them to billing, technical, or general teams. The prompt:

Classify this support ticket into a category. Categories: billing, technical, general.
Return the category name.

What breaks:

After a week in production, analysis of logs shows:

  • 73% of responses are "billing" or "technical" with no further formatting — correct
  • 14% are "Category: billing" or "The category is: technical" — downstream parsing fails
  • 8% are "This appears to be a billing issue" — no category extraction possible
  • 5% are "billing/technical" — ambiguous, not a valid category

Most responses are technically correct (right category) but arrive in four different formats.

Fix — add the structural constraints:

You are a support ticket classifier. Classify tickets into exactly one category.
Valid categories: "billing", "technical", "general"

Return ONLY the category string — no other text, no punctuation, no explanation.

If the ticket fits multiple categories, choose the primary one.
If the ticket cannot be classified, return "general".

Examples:
"My invoice is wrong" → billing
"The app crashes on login" → technical
"I want to change my email" → general

After fix: 97% single-correct-category responses across 200 test inputs. The six failures were all ambiguous edge cases where even human classifiers disagreed — not a prompt engineering failure, a task definition problem.

The improvement came entirely from:

  1. Specifying exactly what to return ("ONLY the category string")
  2. Adding explicit handling for ambiguous cases ("choose the primary one")
  3. Adding three examples that demonstrated the expected output form
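Even with the fixed prompt, a thin normalization layer in code keeps the remaining stragglers from reaching downstream routing. A sketch (names are mine):

```typescript
const CATEGORIES = new Set(["billing", "technical", "general"]);

// Normalize a classifier reply: accept only exact category strings
// (after trimming and lowercasing), route everything else to "general"
// and flag it so it can be logged for review.
function normalizeCategory(reply: string): { category: string; exact: boolean } {
  const cleaned = reply.trim().toLowerCase();
  if (CATEGORIES.has(cleaned)) return { category: cleaned, exact: true };
  return { category: "general", exact: false };
}
```

The `exact: false` flag matters: silently swallowing malformed replies hides the failure rate the prompt fix was supposed to eliminate.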

What to Learn Next

You now have structured prompts that produce reliable outputs. The next layer: what happens when reliable outputs are not enough? In article four, we look at JSON schema enforcement at the API level, validation pipelines that go beyond parsing, and retry strategies that are smarter than "try again."

The pattern from article two (retry with modified prompt) was the simplest version. Article four is the complete reliability system.


Quick Reference: Prompt Design Checklist

Before shipping any prompt to production, verify:

  • [ ] Role/identity defined in the system prompt
  • [ ] Task description is one to two sentences
  • [ ] All possible output formats are explicitly specified
  • [ ] Edge cases are handled with explicit constraints (null fields, empty arrays, error objects)
  • [ ] Examples are included for complex formats or nuanced tasks
  • [ ] Prompt is stored as a versioned constant in the repository
  • [ ] Evaluation suite passes at ≥95% before deploying the prompt
  • [ ] Each example in the prompt is a real case from your domain, not a generic illustration