Cost & Latency: The First Production Pain

Token usage analysis, model selection, caching strategies, and the math that decides whether your AI feature is economically viable at scale.

When Cost Stops Being Invisible

During development, cost is not a real constraint. You run a few hundred requests. The bill is $2–8. You note that the model is fast enough and move on.

In production, cost becomes a design constraint that affects every architectural decision. At 100K requests/day with an average of 1,200 tokens per call, you are moving 120 million tokens per day, roughly 3.6 billion tokens per month. At GPT-4o pricing ($2.50/1M input, $10/1M output), a 70/30 input/output split costs approximately $17,100/month. That number needs to be on your design spreadsheet before you write the first prompt.

Article one established cost as a first-class concern. This article is the engineering. Measure it, control it, optimize it.


Token Cost Model

Every LLM API bills on tokens. Understand the model before optimizing it.

Input tokens: Everything in the prompt — system prompt, conversation history, injected context, user input. Billed on every request.

Output tokens: The model's response. Billed separately, typically at 4–5× the input token rate. Minimize output tokens more aggressively than input tokens (on a per-token basis they cost more, but you usually have more control over them via max_tokens).

Per-request cost formula:

cost_per_request = (input_tokens × input_price_per_token)
                 + (output_tokens × output_price_per_token)

Token estimation:

function estimateTokens(text: string): number {
  // English: ~1 token per 4 characters (rough but reliable to within 20%)
  return Math.ceil(text.length / 4);
}

function estimateRequestCost(
  systemPrompt: string,
  context: string,
  userInput: string,
  maxOutputTokens: number,
  inputPricePerMillion: number, // e.g., 0.15 for gpt-4o-mini
  outputPricePerMillion: number // e.g., 0.60 for gpt-4o-mini
): {
  inputTokens: number;
  outputTokens: number;
  estimatedCostUsd: number;
} {
  const inputTokens =
    estimateTokens(systemPrompt) +
    estimateTokens(context) +
    estimateTokens(userInput);

  // Output estimation: assume you'll use ~60% of max_tokens on average
  const outputTokens = Math.ceil(maxOutputTokens * 0.6);

  const estimatedCostUsd =
    (inputTokens / 1_000_000) * inputPricePerMillion +
    (outputTokens / 1_000_000) * outputPricePerMillion;

  return { inputTokens, outputTokens, estimatedCostUsd };
}

Run this estimation for every prompt you write, with realistic context sizes. The system prompt is paid on every request — a 3,000-token system prompt at 1M requests/month is 3 billion input tokens, before any user content.


Measuring Actual Cost Per Request

Estimation gets you in the right order of magnitude. Measurement tells you what you are actually spending.

Log token usage from every model response:

interface RequestLog {
  requestId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  latencyMs: number;
  timestamp: string;
  feature: string;
}

function logModelRequest(
  response: ModelResponse,
  model: string,
  feature: string,
  startMs: number
): void {
  const pricing = MODEL_PRICING[model];
  const costUsd =
    (response.usage.input_tokens / 1_000_000) * pricing.inputPer1M +
    (response.usage.output_tokens / 1_000_000) * pricing.outputPer1M;

  const log: RequestLog = {
    requestId: response.id,
    model,
    inputTokens: response.usage.input_tokens,
    outputTokens: response.usage.output_tokens,
    costUsd,
    latencyMs: Date.now() - startMs,
    timestamp: new Date().toISOString(),
    feature,
  };

  logger.info(log);
  metrics.histogram("llm.cost_usd", costUsd, { feature, model });
  metrics.histogram("llm.input_tokens", response.usage.input_tokens, { feature });
  metrics.histogram("llm.output_tokens", response.usage.output_tokens, { feature });
}

The feature tag is critical. You need cost broken down by feature, not just total. "Summarization" and "chat assistant" have fundamentally different cost profiles. Aggregate numbers hide which features are expensive.

Key metrics to track daily:

  • cost_usd p50 / p95 / p99 per feature
  • cost_usd total per day with 7-day trend
  • input_tokens per request (drift here means context is growing)
  • output_tokens per request (drift here means model verbosity is increasing)
  • cost_per_active_user (business metric — are you scaling linearly?)
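The percentile lines and the per-user metric fall straight out of the logged costUsd values. A minimal sketch, with hypothetical helpers rather than calls to any particular metrics library:

```typescript
// Nearest-rank percentile over per-request costs (or latencies).
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, Math.min(sorted.length - 1, rank - 1))];
}

// Business metric: total daily spend divided by distinct active users.
function costPerActiveUser(dailyCostsUsd: number[], activeUsers: number): number {
  const total = dailyCostsUsd.reduce((sum, c) => sum + c, 0);
  return activeUsers > 0 ? total / activeUsers : 0;
}
```

If cost_per_active_user grows while the user count is flat, per-request cost is drifting; the token histograms tell you whether the drift is input or output.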

Model Selection

The cheapest model that meets your quality bar is the correct model. Using a more expensive model is not safer — it is over-engineering with a dollar cost.

Current reference pricing (approximate, check provider for current rates):

| Model | Input $/1M | Output $/1M | Relative Cost |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | 1× |
| Claude Haiku 3.5 | $0.80 | $4.00 | 4× |
| GPT-4o | $2.50 | $10.00 | 14× |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 20× |
| Claude Opus 4 | $15.00 | $75.00 | 100× |

The 100× spread between the cheapest and most expensive models is not a performance difference — it is a capability ceiling difference. For most tasks, the cheap model is within 5–10% quality of the expensive one. Your evaluation suite from article eight measures this directly.

Decision process:

  1. Start with the cheapest model that seems plausible for your task.
  2. Run your evaluation suite (article eight). Measure pass rate.
  3. If pass rate is above your threshold: ship the cheap model.
  4. If pass rate is below threshold: identify which cases fail and why.
  5. If failures are due to reasoning complexity: try the next tier up.
  6. If failures are due to prompt quality: fix the prompt first, then re-evaluate.

Step 6 is the commonly skipped one. Teams reach for a more expensive model when the current prompt is already fixable at lower cost.
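The decision process translates directly into a loop over model tiers. A sketch, where runEvalSuite is a hypothetical wrapper around the article-eight evaluation suite and the tier labels are illustrative, ordered cheapest-first per the pricing table:

```typescript
// Illustrative tier labels, cheapest first.
const MODEL_TIERS = ["gpt-4o-mini", "gpt-4o", "claude-opus-4"];

// Hypothetical: runs the evaluation suite for a model, returns pass rate in [0, 1].
type EvalRunner = (model: string) => Promise<number>;

async function selectCheapestPassingModel(
  runEvalSuite: EvalRunner,
  threshold: number
): Promise<{ model: string; passRate: number } | null> {
  for (const model of MODEL_TIERS) {
    const passRate = await runEvalSuite(model);
    if (passRate >= threshold) return { model, passRate };
    // Before escalating in real use, inspect the failures: prompt-quality
    // failures mean fixing the prompt and re-running, not moving up a tier.
  }
  return null; // No tier passes: rework the task or the eval suite.
}
```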


Latency Sources

Latency in LLM systems comes from multiple stages:

Request arrives
      │
      ├─ [Preprocessing: ~10-50ms]   Prompt construction, context retrieval (if RAG)
      │
      ├─ [Model Inference: ~500ms – 5s]  The actual model call
      │   ├─ Time-to-first-token: ~200–500ms (proportional to input length)
      │   └─ Generation time: scales with output length
      │
      ├─ [Postprocessing: ~5-20ms]   Parsing, validation
      │
      └─ [Retry overhead: +N × model inference time]   If retries trigger

What you control:

  • Input length → affects time-to-first-token
  • Output length → set via max_tokens; directly controls generation time
  • Retrieval latency → embedding + vector search (article seven)
  • Retry rate → improving prompt reduces retries
  • Streaming → changes perceived latency without changing actual latency

Latency measurement:

interface LatencyBreakdown {
  preprocessMs: number;
  modelMs: number;
  postprocessMs: number;
  totalMs: number;
}

async function callWithLatencyTracking(
  input: string
): Promise<{ result: unknown; latency: LatencyBreakdown }> {
  const t0 = Date.now();

  const { prompt, context } = await buildPrompt(input);
  const t1 = Date.now();

  const rawResponse = await callModel(prompt, context);
  const t2 = Date.now();

  const result = parseAndValidate(rawResponse);
  const t3 = Date.now();

  return {
    result,
    latency: {
      preprocessMs: t1 - t0,
      modelMs: t2 - t1,
      postprocessMs: t3 - t2,
      totalMs: t3 - t0,
    },
  };
}

Profile this under realistic load before setting SLA commitments. The p95 under light load is not the p95 under 50 concurrent requests.
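A load profile can be approximated by firing concurrent requests through the tracked call path. A minimal sketch: profileUnderLoad is a hypothetical harness, pointed at callWithLatencyTracking or any function with a compatible return shape.

```typescript
// Fire `concurrency` requests at once and report the p95 of total latency.
async function profileUnderLoad(
  call: (input: string) => Promise<{ latency: { totalMs: number } }>,
  concurrency: number,
  input: string
): Promise<{ p95Ms: number }> {
  const results = await Promise.all(
    Array.from({ length: concurrency }, () => call(input))
  );
  const sorted = results.map((r) => r.latency.totalMs).sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1);
  return { p95Ms: sorted[idx] };
}
```

Run it at several concurrency levels (1, 10, 50) and compare; the spread between those p95 values is the number your SLA has to survive.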


Caching Strategies

Caching is the highest-leverage cost optimization before you touch model selection or prompt compression. A cache hit costs $0.00.

Exact Prompt Cache

Hash the full prompt (SHA-256). Store the result. On cache hit, return immediately.

import { createHash } from "crypto";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);

async function withCache<T>(
  cacheKey: string,
  fn: () => Promise<T>,
  ttlSeconds: number = 3600
): Promise<{ result: T; cached: boolean }> {
  const cached = await redis.get(cacheKey);
  if (cached) {
    return { result: JSON.parse(cached) as T, cached: true };
  }

  const result = await fn();
  await redis.setex(cacheKey, ttlSeconds, JSON.stringify(result));
  return { result, cached: false };
}

function promptHash(systemPrompt: string, userInput: string): string {
  return createHash("sha256")
    .update(systemPrompt + "|" + userInput)
    .digest("hex");
}

// Usage
const cacheKey = `llm:exact:${promptHash(SYSTEM_PROMPT, userText)}`;
const { result, cached } = await withCache(cacheKey, () => callModel(SYSTEM_PROMPT, userText));

metrics.increment("llm.cache", { hit: cached ? "true" : "false" });

Cache hit rates by use case:

  • Document analysis with pre-defined questions: 60–80%
  • FAQ chatbot with limited question set: 40–60%
  • Creative generation: < 5% (inputs are unique)
  • Structured extraction on unique documents: < 10%

TTL selection:

  • Static knowledge (documentation, product specs): 24 hours
  • Semi-dynamic content: 1–4 hours
  • Real-time business data: no caching or very short TTL (< 15 minutes)
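Encoding the TTL guidance as a helper keeps the choice explicit at each call site. A sketch; the 2-hour semi-dynamic value is an arbitrary pick from the 1–4 hour band:

```typescript
type ContentClass = "static" | "semi_dynamic" | "realtime";

// TTLs mirroring the guidance above; 0 means "do not cache".
function ttlSecondsFor(contentClass: ContentClass): number {
  switch (contentClass) {
    case "static":
      return 24 * 3600;
    case "semi_dynamic":
      return 2 * 3600;
    case "realtime":
      return 0;
  }
}
```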

Provider-Level Prompt Caching

Anthropic and OpenAI both offer server-side prompt caching. If your prompt prefix is longer than the minimum cacheable size (1,024 tokens for both providers on most models) and is identical across requests, the provider caches it and discounts subsequent calls (≈90% reduction on cached input tokens with Anthropic).

// Anthropic: mark cacheable sections with cache_control
const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" }, // Cache this prefix
    },
  ],
  messages: [{ role: "user", content: userInput }],
});

For a 2,000-token system prompt called 1M times per month: without caching, input cost is 2B tokens × $3/1M = $6,000. With provider caching saving 90%, input cost drops to ≈$600. That is a $5,400/month saving from a single annotation.


Output Length Control

Every token the model generates costs money and adds latency. Output length is controlled by max_tokens and by your prompt.

Setting max_tokens correctly:

function getMaxTokens(task: "summarize" | "classify" | "extract" | "generate"): number {
  const limits: Record<string, number> = {
    summarize: 400,   // 150-word summary + JSON overhead
    classify:  50,    // Single label + confidence
    extract:   256,   // Structured extraction fields
    generate:  1024,  // Longer creative/generative tasks
  };
  return limits[task] ?? 512;
}

Set max_tokens to 20–30% above the maximum output you legitimately need, not arbitrarily high. A max_tokens: 4096 on a task whose correct output is 50 tokens wastes nothing if the model respects the length constraints in your prompt — but it signals to the model that long outputs are acceptable, which increases verbosity in practice.

Prompt-level output length control:

Already covered in article three (prompt structure). Reiterate: "Return only the JSON object" and explicit word count constraints in the prompt reduce output tokens measurably. Measure the actual output token count before and after this constraint. Typical reduction: 20–35%.
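One way to take that measurement: run the same sample of inputs through each prompt variant and average the reported output tokens. A sketch, assuming callModel returns provider-style usage counts; avgOutputTokens is a hypothetical helper:

```typescript
// Average output tokens for one prompt variant over a sample of inputs.
async function avgOutputTokens(
  callModel: (
    prompt: string,
    input: string
  ) => Promise<{ usage: { output_tokens: number } }>,
  prompt: string,
  samples: string[]
): Promise<number> {
  let total = 0;
  for (const sample of samples) {
    const response = await callModel(prompt, sample);
    total += response.usage.output_tokens;
  }
  return total / samples.length;
}
```

Run it once with the length constraint in the prompt and once without; the delta is your measured reduction, not a guess.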


The Failure Walkthrough: High Cost / Slow Responses

Scenario: A document analysis feature launches. Week one: cost is $180/day. Week three: cost is $890/day with no significant increase in users. P95 latency went from 2.3 seconds to 7.1 seconds.

Diagnosis — cost growth:

Pull the input_tokens distribution from logs. Compare week one and week three. Result: median input tokens grew from 1,800 to 4,700. The context injection that previously retrieved 3 documents is now retrieving 10 after someone increased the default topK in the RAG pipeline. Context tokens more than doubled; combined with the longer outputs diagnosed below, cost grew nearly fivefold (input tokens dominate at roughly 70% of spend).

Fix: Revert topK to 3. Add an explicit bounds check on topK in the retrieval function with a maximum of 7. Add input_tokens to the daily cost alert.
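The bounds check can be as small as a clamp at the entry point of the retrieval function. A sketch; TOP_K_MAX mirrors the maximum of 7 in the fix:

```typescript
const TOP_K_MAX = 7;

// Clamp topK so a config change cannot silently balloon the injected context.
function clampTopK(requested: number): number {
  if (!Number.isFinite(requested) || requested < 1) return 1;
  return Math.min(Math.floor(requested), TOP_K_MAX);
}
```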

Diagnosis — latency growth:

Pull modelMs distribution. P95 went from 1.9s to 5.8s. The model is generating longer outputs. Check output_tokens distribution — median went from 280 to 680. A prompt change three weeks ago removed the explicit word count constraint. The model infers that long responses are appropriate.

Fix: Restore explicit output length constraint in prompt. Drop max_tokens from 2048 to 512. Re-run evaluation suite to confirm quality unchanged. Deploy.

Both failures were invisible without the logging infrastructure. A cost dashboard without input_tokens and output_tokens per-feature breakdown would have shown "cost is up" with no diagnosis path.


Monthly Cost Budget Template

Before any AI feature ships, fill this in:

Feature: [name]
Expected requests/day: [N]
Expected requests/month: [N × 30]

System prompt tokens: [X]
Average context tokens: [Y]  (if RAG: K × avg_chunk_size)
Average user input tokens: [Z]
Total input tokens/request: [X + Y + Z]

Expected output tokens/request: [W] (based on max_tokens × 0.6)

Input cost/month: (X+Y+Z) × requests/month × input_price/1M
Output cost/month: W × requests/month × output_price/1M
Total estimated cost/month: [sum]

Cache hit rate estimate: [%]
Estimated cost with caching: [total × (1 - hit_rate)]

Fill this in for the model you plan to use. If the number is uncomfortable, run the same calculation on the next tier down and compare evaluation pass rates. The cheaper model is usually worth a test before assuming the expensive one is required.
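The template mechanizes into a function so the comparison across models is a one-line change. A sketch reusing the same 60%-of-max_tokens assumption from the estimation code earlier:

```typescript
interface BudgetInput {
  requestsPerDay: number;
  systemPromptTokens: number;
  avgContextTokens: number;
  avgUserInputTokens: number;
  maxOutputTokens: number;
  inputPricePerMillion: number;
  outputPricePerMillion: number;
  cacheHitRate: number; // 0..1
}

function monthlyBudget(b: BudgetInput): { totalUsd: number; withCachingUsd: number } {
  const requestsPerMonth = b.requestsPerDay * 30;
  const inputTokens =
    b.systemPromptTokens + b.avgContextTokens + b.avgUserInputTokens;
  // Same assumption as the estimation code: ~60% of max_tokens used on average.
  const outputTokens = Math.ceil(b.maxOutputTokens * 0.6);

  const totalUsd =
    ((inputTokens * requestsPerMonth) / 1_000_000) * b.inputPricePerMillion +
    ((outputTokens * requestsPerMonth) / 1_000_000) * b.outputPricePerMillion;

  return { totalUsd, withCachingUsd: totalUsd * (1 - b.cacheHitRate) };
}
```

Run it once per candidate model with that model's pricing; the output pair is the last two lines of the template.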