Cheap → Expensive LLM Routing: How to Cut AI Costs by 70%

Learn how to implement a routing layer to dispatch LLM requests to the cheapest capable model, reducing costs by up to 70% without sacrificing quality.


The Cost Curve Nobody Warns You About

GPT-4o sits at roughly $5–15 per million tokens depending on input/output ratio. GPT-3.5-turbo, Mistral-7B on a self-hosted instance, or Haiku on Anthropic's API cost 10–50x less for equivalent token volume. If you're routing everything to the strongest model by default, you're not building AI infrastructure — you're burning a budget.

The failure mode is predictable: you prototype with GPT-4, it works great, you ship it, and then usage scales. At 500K requests/day with an average of 800 tokens per call, you're looking at $3,000–10,000/day for tasks where a $0.20/1M token model would have been entirely sufficient. The unit economics never pencil out, and the fix requires architectural work you should have done from day one.

Here's what makes the cost curve non-linear: most production LLM workloads are highly skewed. Based on real traffic data from document processing pipelines and conversational assistants, typically 60–80% of incoming prompts are genuinely simple — reformatting, summarization under 500 tokens, classification, FAQ lookups, templated generation. Sending those through GPT-4 is like running your static file server on a GPU cluster.


The Routing Layer

The fix is an adaptive routing layer that sits between your application and your model pool. It's not complicated in principle, but the devil is in the implementation details: classifier accuracy, fallback logic, and confidence scoring all need to be done right or you end up with degraded output quality that erases any cost savings.

Architecture overview:

User Request
     │
     ▼
┌──────────┐
│  Cache   │──── Hit ──────────────────────────────► Response
└──────────┘
     │ Miss
     ▼
┌──────────────┐
│  Classifier  │  (heuristic / embedding / ML)
└──────────────┘
     │
     ├── Simple ──► Model A (cheap: Haiku, GPT-3.5, Mistral-7B)
     │                    │
     │              Validate response
     │                    │
     │              Confidence low?──► Escalate to Model B
     │
     ├── Medium ──► Model B (mid: Sonnet, GPT-4o-mini)
     │
     └── Complex ──► Model C (expensive: GPT-4o, Opus, etc.)
                          │
                     Validate response
                          │
                       Return

The router has three responsibilities: classify the request, dispatch to the cheapest model that can handle it, and validate the output before it exits the system. Each of those steps can fail in interesting ways — we'll cover all of them.


Step 1: Classifying Prompt Complexity

This is where most implementations either over-engineer early or under-invest and get bitten. You have three options with very different cost/accuracy profiles.

Heuristic-based Classification

Fast, near-zero latency, zero cost. Usually good enough to get 70–80% routing accuracy without any model calls.

interface ComplexitySignals {
  tokenEstimate: number;
  hasCodeBlock: boolean;
  domainKeywords: string[];
  questionDepth: number; // nested sub-questions
}

function classifyHeuristic(prompt: string): "simple" | "medium" | "complex" {
  const tokens = estimateTokens(prompt); // rough: chars / 4
  const hasCode = /```|def |function |class |SELECT |FROM /.test(prompt);
  const complexKeywords = [
    "analyze", "design", "architect", "optimize", "debug",
    "compare", "explain why", "tradeoffs", "implement"
  ];
  const simpleKeywords = [
    "summarize", "translate", "format", "list", "what is", "define"
  ];

  const complexScore = complexKeywords.filter(k =>
    prompt.toLowerCase().includes(k)
  ).length;

  const simpleScore = simpleKeywords.filter(k =>
    prompt.toLowerCase().includes(k)
  ).length;

  if (tokens > 2000 || hasCode || complexScore >= 2) return "complex";
  if (complexScore === 1 || tokens > 500) return "medium";
  return "simple";
}

The pitfall here: keyword matching is brittle. "What is the best way to design a microservices architecture?" hits both "what is" (simple) and "design" (complex). You need to weight signals, not just count them. Also, prompt length alone is a weak signal — a 2,000-token RAG context with a simple question at the end is still simple.
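One way to weight signals rather than count them is a simple linear score. This is a sketch: the weights, keyword lists, and thresholds are illustrative and should be tuned against labeled production traffic.

```typescript
type Complexity = "simple" | "medium" | "complex";

// Illustrative weights and keyword lists; tune against labeled traffic.
function classifyWeighted(prompt: string): Complexity {
  const lower = prompt.toLowerCase();
  const tokens = Math.ceil(prompt.length / 4); // rough estimate
  let score = 0;

  // Strong signals carry more weight than weak ones
  if (/```|def |function |class |SELECT |FROM /.test(prompt)) score += 3;
  for (const k of ["design", "architect", "optimize", "debug", "tradeoffs"]) {
    if (lower.includes(k)) score += 2;
  }
  for (const k of ["summarize", "translate", "format", "define", "what is"]) {
    if (lower.includes(k)) score -= 2;
  }
  // Length is a weak signal: cap its contribution so a long RAG context
  // with a simple question at the end isn't misrouted on size alone
  score += Math.min(tokens / 1000, 2);

  if (score >= 3) return "complex";
  if (score >= 1) return "medium";
  return "simple";
}
```

With this scoring, the ambiguous example lands on "medium" instead of flip-flopping: the strong "design" signal and the weak "what is" pattern partially cancel, and the mid tier absorbs the uncertainty.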

Embedding-based Classification

Compute cosine similarity between the incoming prompt and a labeled dataset of "known complex" and "known simple" prompts. This is more robust than keywords and doesn't require retraining when you add new patterns.

async function classifyEmbedding(prompt: string): Promise<"simple" | "complex"> {
  const embedding = await embedModel.embed(prompt); // small/fast model

  const simTopK = await vectorDB.query(embedding, {
    namespace: "simple_prompts",
    topK: 5
  });
  const cplxTopK = await vectorDB.query(embedding, {
    namespace: "complex_prompts",
    topK: 5
  });

  const simScore = average(simTopK.map(r => r.score));
  const cplxScore = average(cplxTopK.map(r => r.score));

  return cplxScore > simScore + THRESHOLD ? "complex" : "simple";
}

The key is the vector store. Maintain two namespaces — simple and complex — populated from your own production logs after a labeling pass. 500–1000 labeled examples is usually enough to get meaningful signal.

Cost consideration: text-embedding-3-small costs $0.02/1M tokens. A 500-token prompt embedding costs ~$0.00001. This is negligible but still more expensive than zero-cost heuristics, so reserve it for cases where heuristics return ambiguous results (medium confidence range).
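A hybrid dispatcher along those lines might look like this sketch. The classifiers are injected as parameters so it composes with the versions above; note the embedding tie-breaker never downgrades below the heuristic's floor.

```typescript
type Label = "simple" | "medium" | "complex";

// Free heuristic first; paid embedding lookup only for the ambiguous band.
async function classifyHybrid(
  prompt: string,
  heuristic: (p: string) => Label,
  embedding: (p: string) => Promise<"simple" | "complex">
): Promise<Label> {
  const first = heuristic(prompt);
  if (first !== "medium") return first; // confident at the extremes

  const refined = await embedding(prompt); // ~$0.00001 per lookup
  // Escalate on "complex", but never downgrade an ambiguous prompt to cheap
  return refined === "complex" ? "complex" : "medium";
}
```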

ML Classifier

A fine-tuned classifier (BERT, DistilBERT, or even a simple logistic regression over TF-IDF features) trained on labeled production data. You get F1 scores of 0.90+ with enough data.

The catch: you need labeled data, which means you need to run without a classifier first, collect logs, and label a ground truth set. Most teams skip this and regret it at scale. The minimum viable approach: log every routing decision, track fallback rate (the implicit label that the cheap model failed), and after ~10K requests you have enough signal to train a basic binary classifier.

Don't start here. Start with heuristics, instrument everything, then layer in ML.
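A minimal shape for that log (field names are illustrative); the fallback outcome doubles as a weak training label:

```typescript
// Each routing decision becomes a future training example. The fallback
// outcome is the implicit label: a "simple" prediction that had to
// escalate was actually not simple.
interface RoutingLogRecord {
  promptHash: string;       // hash, not raw prompt, to avoid storing PII
  predictedLabel: "simple" | "medium" | "complex";
  finalTier: number;        // 0 = cheap succeeded, 1 = mid, 2 = expensive
  validationPassed: boolean;
  latencyMs: number;
}

// Weak binary label derived from the observed outcome
function weakLabel(r: RoutingLogRecord): "simple" | "complex" {
  return r.finalTier === 0 && r.validationPassed ? "simple" : "complex";
}
```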


Step 2: Model Routing Strategy

Basic Two-Tier Routing

const MODEL_TIERS = {
  simple:  "claude-3-haiku",     // or "gpt-3.5-turbo", "mistral-7b"
  medium:  "claude-3-5-sonnet",  // or "gpt-4o-mini"
  complex: "claude-3-opus"       // or "gpt-4o"
} as const;

function pickModel(complexity: ComplexityLevel): ModelConfig {
  return {
    model: MODEL_TIERS[complexity],
    maxTokens: TOKEN_LIMITS[complexity],
    temperature: complexity === "simple" ? 0.3 : 0.7
  };
}

Probabilistic Routing

Don't make routing binary. If your classifier returns a confidence of 0.62 for "simple", treat that as probabilistic. Hard cutoffs at 0.5 create edge cases right at the boundary that cause high fallback rates for a specific subset of prompts.

function pickModelProbabilistic(
  complexity: "simple" | "medium" | "complex",
  confidence: number
): ModelConfig {
  // If we're not sure, use the next tier up
  if (confidence < 0.75 && complexity === "simple") {
    return pickModel("medium");
  }
  if (confidence < 0.80 && complexity === "medium") {
    return pickModel("complex");
  }
  return pickModel(complexity);
}

This is critical in production. At 10M requests/month, even a 5% miscategorization that sends simple prompts to expensive models adds up fast. But the reverse is worse: sending complex queries to cheap models that fail silently.

Multi-Stage Waterfall

For latency-tolerant workloads (async document processing, batch summarization), a waterfall approach minimizes cost:

  1. Try cheap model
  2. Validate output
  3. If invalid/low confidence → retry with mid-tier
  4. If still failing → escalate to expensive

The problem with waterfalls is cumulative latency. If cheap model takes 1.2s and fails, mid-tier takes 1.8s and also fails, you're at 3s before the expensive model even starts. For user-facing interactive flows, the cheap-first default is wrong. Use waterfall for async jobs only.
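One way to make that rule operational is a per-request deadline guard that skips the waterfall whenever the worst-case escalation chain can't fit. The latency figures below are illustrative p50 estimates, not measured values:

```typescript
// Illustrative per-tier latency estimates; measure your own
const EXPECTED_MS = { cheap: 1200, medium: 1800, expensive: 2500 };

// Only take the cheap-first waterfall when the worst case (every tier
// fails and we still run the expensive model) fits the caller's deadline
function canAffordWaterfall(deadlineMs: number): boolean {
  const worstCase =
    EXPECTED_MS.cheap + EXPECTED_MS.medium + EXPECTED_MS.expensive;
  return deadlineMs >= worstCase;
}
```

An async document job with a 30s budget takes the waterfall; an interactive request with a 3s budget goes straight to whichever tier the classifier picked.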


Step 3: Confidence Scoring

The hard problem: small models lie. They'll produce syntactically perfect JSON with semantically wrong answers, or hallucinate with high apparent fluency. You need a validation layer that doesn't rely on the model self-reporting failure.

Logprob-based Confidence

When available (OpenAI, some open-source models), logprobs give you per-token probability which you can aggregate into a response-level confidence score:

async function callWithLogprobs(model: string, prompt: string) {
  const response = await openai.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    logprobs: true,
    top_logprobs: 1
  });

  const tokens = response.choices[0].logprobs!.content!;
  const avgLogprob =
    tokens.map(t => t.logprob).reduce((a, b) => a + b, 0) / tokens.length;

  // Convert log probability to linear: e^avgLogprob
  const confidence = Math.exp(avgLogprob); // 0–1 range

  return {
    content: response.choices[0].message.content,
    confidence
  };
}

Caveat: logprobs measure token-level probability, not factual correctness. A model can be very "confident" (high logprobs) while being wrong. Use this as one signal, not the only signal.

Self-Evaluation Prompting

A meta-prompt that asks the model to rate its own output. More expensive (doubles your token cost for the cheap model call) but sometimes necessary:

const evalPrompt = `
You just produced this output:
---
${modelOutput}
---
On a scale of 1–10, how confident are you in this answer's accuracy and completeness?
Respond with ONLY a JSON object: {"score": N, "reason": "..."}
`;

const evalResponse = await callCheapModel(evalPrompt);
const { score } = JSON.parse(evalResponse);

The problem: models are overconfident. A model that returns a score of 7/10 might be wrong 40% of the time. Calibrate this by comparing self-reported scores against your ground truth labels. If your cheap model's self-reported 7/10 answers are correct 85% of the time, 7 is your escalation threshold. If they're correct 60% of the time, raise the threshold to 9.
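That calibration step can be captured as a lookup from self-reported score to empirically measured accuracy. The numbers below are made up for illustration; build the table from your own ground-truth labels.

```typescript
// Calibration table: self-reported score -> measured accuracy at that score
const CALIBRATION: Record<number, number> = {
  5: 0.40, 6: 0.55, 7: 0.85, 8: 0.92, 9: 0.96, 10: 0.97
};

// Escalate unless the *calibrated* accuracy at this score meets the target
function shouldEscalate(selfScore: number, targetAccuracy = 0.9): boolean {
  const empirical = CALIBRATION[selfScore] ?? 0; // unseen score: escalate
  return empirical < targetAccuracy;
}
```

With this table, a self-reported 7/10 (85% measured accuracy) still escalates against a 90% target, while an 8/10 does not.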

Schema + Rule Validation

The most reliable signal for structured outputs. If you're expecting JSON, validate against your schema. If you're expecting a numeric answer, validate the range. These are deterministic checks that don't add latency or cost:

function validateStructuredOutput(
  output: string,
  schema: JSONSchema
): ValidationResult {
  try {
    const parsed = JSON.parse(output);
    const valid = ajv.validate(schema, parsed);
    return {
      valid,
      errors: ajv.errors,
      confidence: valid ? 1.0 : 0.0
    };
  } catch {
    return { valid: false, errors: ["JSON parse failure"], confidence: 0.0 };
  }
}

Combine all three signals into a composite score with weights tuned to your use case. Schema validation failure should always trigger escalation regardless of other signals.
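A minimal composite might look like this sketch. The 50/50 weights are illustrative; the schema check acts as the hard veto described above.

```typescript
// Weighted blend of the signals; schema failure overrides everything
function compositeConfidence(
  logprobConf: number,   // 0-1, from aggregated token logprobs
  selfEvalScore: number, // 1-10, from the self-evaluation prompt
  schemaValid: boolean
): number {
  if (!schemaValid) return 0; // always escalate on schema failure
  return 0.5 * logprobConf + 0.5 * (selfEvalScore / 10);
}
```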


Step 4: Fallback Mechanism

The naive fallback: call cheap model, if it fails, call expensive model. The failure mode: you can end up in a retry loop that costs more than just calling the expensive model first.

Structured Fallback with Circuit Breaking

async function callWithFallback(
  prompt: string,
  initialModel: ModelConfig,
  options: FallbackOptions
): Promise<LLMResponse> {
  const chain: ModelConfig[] = buildFallbackChain(initialModel);
  let lastError: Error | null = null;

  for (let i = 0; i < chain.length; i++) {
    const model = chain[i];

    // Circuit breaker: if expensive model is degraded, fail fast
    if (circuitBreaker.isOpen(model.id)) {
      continue;
    }

    try {
      const response = await callModel(model, prompt);
      const validation = await validate(response, options.schema);

      if (validation.confidence >= options.minConfidence) {
        metrics.record("fallback_tier", i); // Track which tier succeeded
        return response;
      }

      // Augment prompt on retry (not just retry blindly)
      prompt = augmentPromptForRetry(prompt, response, validation.errors);

    } catch (err) {
      lastError = err as Error;
      circuitBreaker.recordFailure(model.id);
    }
  }

  throw new Error(`All fallback tiers exhausted. Last error: ${lastError?.message}`);
}

function buildFallbackChain(initial: ModelConfig): ModelConfig[] {
  const allTiers = [CHEAP_MODEL, MEDIUM_MODEL, EXPENSIVE_MODEL];
  const startIdx = allTiers.findIndex(m => m.id === initial.id);
  return allTiers.slice(startIdx); // Only escalate, never downgrade
}

Key points:

  • Augment, don't just retry: When a cheap model fails, the retry prompt should include context about what went wrong. "Your previous answer didn't include the required JSON schema. Try again, this time strictly following this format: ..."
  • Max retry threshold: Two escalations maximum. cheap → medium → expensive → fail. Never loop at the same tier.
  • Circuit breaker: If GPT-4 has been returning 500s for 2 minutes, don't queue fallback requests into it. Fail fast and return a graceful error.
  • Track tier distribution: Logging fallback_tier=2 (hit expensive model) at a rate above 15% means your classifier is misconfigured.
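The `augmentPromptForRetry` helper referenced in the fallback code could be as simple as this sketch:

```typescript
// Fold the validation errors and the rejected answer into the retry
// prompt so the next attempt knows exactly what to fix
function augmentPromptForRetry(
  original: string,
  previousOutput: string,
  errors: string[]
): string {
  return [
    original,
    "---",
    "Your previous answer was rejected for these reasons:",
    ...errors.map(e => `- ${e}`),
    "Previous answer (do not repeat its mistakes):",
    previousOutput,
    "Answer again, strictly fixing the issues above."
  ].join("\n");
}
```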

Step 5: Cost vs. Latency Tradeoffs

There's no free lunch here. The three strategies:

Cheap-First Sequential

Default for most workloads. Lower cost, higher p99 latency because fallback adds cumulative time.

  • Best for: async pipelines, background processing, non-interactive UX
  • Latency profile: p50: 1.2s, p99: 4.8s (when fallback triggers)
  • Cost: ~40–70% of naive all-expensive routing

Direct Expensive

Call the right model immediately, no routing overhead.

  • Best for: hard real-time requirements (under 500ms), mission-critical outputs
  • Latency profile: p50: 0.8s, p99: 2.1s
  • Cost: baseline (1x)

Parallel Race Strategy

Fire cheap and expensive models simultaneously, return whichever finishes first and validates successfully. Cancel the other.

async function raceModels(prompt: string): Promise<LLMResponse> {
  // Promises aren't cancellable, so give each call an AbortController
  // and abort the loser (assumes callModel accepts an AbortSignal)
  const cheapAbort = new AbortController();
  const expensiveAbort = new AbortController();

  const cheapCall = callModel(CHEAP_MODEL, prompt, cheapAbort.signal)
    .then(r => validateOrThrow(r));

  const expensiveCall = callModel(EXPENSIVE_MODEL, prompt, expensiveAbort.signal)
    .then(r => validateOrThrow(r));

  try {
    // Return the first response that validates successfully
    const result = await Promise.any([cheapCall, expensiveCall]);

    // Abort the losing call so you stop paying for its tokens
    if (result.source === "cheap") {
      expensiveAbort.abort();
    } else {
      cheapAbort.abort();
    }

    return result;
  } catch {
    throw new Error("Both models failed");
  }
}

  • Best for: latency-critical interactive features where you can absorb ~2x cost on average
  • Latency profile: p50: 0.7s (cheap wins), p99: 1.0s (expensive wins)
  • Cost: ~1.4–1.8x naive (you sometimes pay for both calls)

The race strategy only makes sense when: (a) your p99 latency requirement is tight, (b) cheap model wins often enough (>50% of requests), and (c) you can afford to occasionally pay for two model calls. At high volume, the math usually doesn't work — you're better off accepting occasional fallback latency.


Step 6: Caching

The biggest overlooked cost saver. Before your router even runs, a cache hit costs nothing.

Exact Prompt Caching

Hash the prompt (SHA-256 is fine), store the response. Works for templated generation, FAQ-style queries, and repeated analytical tasks. In a document processing pipeline with 10K unique documents but 50 standard analysis templates, you'll see 60–70% cache hit rates.

async function withCache<T>(
  key: string,
  fn: () => Promise<T>,
  ttlSeconds = 3600
): Promise<T> {
  const cached = await redis.get(key);
  if (cached) {
    metrics.increment("cache.hit");
    return JSON.parse(cached) as T;
  }

  metrics.increment("cache.miss");
  const result = await fn();
  await redis.setex(key, ttlSeconds, JSON.stringify(result));
  return result;
}

// Usage
const cacheKey = `llm:${sha256(prompt)}`;
return withCache(cacheKey, () => router.route(prompt));

Semantic Cache

Embed the prompt and query a vector store. If a semantically similar prompt was answered before (cosine similarity > 0.97), return the cached response.

async function semanticCacheGet(
  prompt: string,
  threshold = 0.97
): Promise<CachedResponse | null> {
  const embedding = await embed(prompt);
  const results = await vectorDB.query(embedding, { topK: 1 });

  if (results[0]?.score >= threshold) {
    metrics.increment("semantic_cache.hit");
    return results[0].metadata.response;
  }
  return null;
}

The threshold matters enormously. At 0.95, you'll get false positives — semantically similar but meaningfully different prompts returning wrong cached answers. At 0.99, your hit rate drops to near-zero. 0.97 is a reasonable default; tune it based on your domain.

Stale cache risk: LLM responses can become stale if your underlying data changes (e.g., a cached answer about a product that's been discontinued). Add TTLs proportional to data freshness requirements. For static knowledge (coding questions, math), a 24-hour TTL is fine. For anything business-domain specific, 1–4 hours max.
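For the emergency invalidation path, one cheap pattern is key versioning: bump a namespace version and every existing entry stops matching, with no mass deletion. This is a sketch over any key-value cache; old keys simply age out via their TTLs.

```typescript
// Bumping the version makes all previously written keys unreachable
let cacheVersion = 1;

function versionedKey(promptHash: string): string {
  return `llm:v${cacheVersion}:${promptHash}`;
}

// Call when the prompt template or knowledge base changes
function invalidateAllCachedResponses(): void {
  cacheVersion += 1;
}
```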


Production Implementation

Putting it together in a complete flow:

// Core routing pipeline
async function routeRequest(
  prompt: string,
  userId: string,
  options: RoutingOptions = {}
): Promise<LLMResponse> {

  const requestId = generateId();
  const startTime = Date.now();

  try {
    // Layer 1: Exact cache
    const exactCacheKey = `llm:exact:${sha256(prompt)}`;
    const exactCached = await cache.get(exactCacheKey);
    if (exactCached) {
      // Cached value was stored as JSON; parse before spreading
      return { ...JSON.parse(exactCached), cached: true, tier: "cache" };
    }

    // Layer 2: Semantic cache
    const semanticCached = await semanticCacheGet(prompt);
    if (semanticCached) {
      return { ...semanticCached, cached: true, tier: "semantic_cache" };
    }

    // Layer 3: Classify
    const complexity = await classify(prompt);
    const model = pickModelProbabilistic(complexity.label, complexity.confidence);

    // Layer 4: Route and validate
    const response = await callWithFallback(prompt, model, {
      schema: options.expectedSchema,
      minConfidence: options.minConfidence ?? 0.80,
      maxRetries: 2
    });

    // Layer 5: Cache result
    await cache.setex(exactCacheKey, options.cacheTtl ?? 3600,
      JSON.stringify(response));

    // Layer 6: Log routing decision for tuning
    await logger.logDecision({
      requestId,
      userId,
      prompt: prompt.substring(0, 200), // Don't log full PII-containing prompts
      complexity: complexity.label,
      classifierConfidence: complexity.confidence,
      modelUsed: response.model,
      fallbackCount: response.fallbackCount,
      latencyMs: Date.now() - startTime,
      cost: estimateCost(response)
    });

    return response;

  } catch (err) {
    metrics.increment("routing.error");
    throw err;
  }
}

Async Queue for Heavy Tasks

For jobs that can tolerate latency (document analysis, batch enrichment), don't block the request thread:

// Enqueue async job
async function enqueueHeavyTask(
  payload: TaskPayload
): Promise<{ jobId: string }> {
  const jobId = await queue.push("llm_tasks", {
    ...payload,
    routing: { preferAsync: true, timeout: 30_000 }
  });
  return { jobId };
}

// Worker
queue.process("llm_tasks", async (job) => {
  const result = await routeRequest(job.data.prompt, job.data.userId, {
    minConfidence: 0.85,
    cacheTtl: 7200
  });
  await db.storeResult(job.data.resultKey, result);
  await webhooks.notify(job.data.callbackUrl, { jobId: job.id, result });
});

Observability and Iteration

You cannot tune a routing system without metrics. The minimum viable dashboard:

Cost metrics:

  • cost_per_request by model tier (track daily moving average)
  • cost_per_user_cohort (some users reliably hit expensive tiers)
  • fallback_rate — if this exceeds 15–20%, your classifier needs work

Quality metrics:

  • user_regeneration_rate — if users are clicking "regenerate" more on cheap-model responses, you're under-routing
  • schema_validation_failure_rate by tier — high failure on cheap model is a signal to raise the threshold
  • explicit_complaint_rate — correlated with routing tier

Classifier metrics:

  • Distribution of complexity_label over time (drift detection)
  • classifier_confidence histogram — a bimodal distribution (lots of 0.6 and 0.95, nothing in between) is healthy; a wide flat distribution means your classifier is weak

Iteration loop:

  1. Weekly: review fallback rate by complexity bucket
  2. Monthly: label 500 random requests, measure classifier accuracy against ground truth
  3. Quarterly: retrain classifier on accumulated labels

If your fallback rate for "simple" prompts is above 10%, your classifier is too aggressive — you're sending complex prompts to cheap models. If it's below 2%, you're being too conservative and paying for unnecessary expensive model calls.
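Those two thresholds are easy to encode as an automated health check on the weekly review (a sketch using the numbers above):

```typescript
// simpleFallbackRate: fraction of "simple"-routed prompts that escalated
function classifierHealth(
  simpleFallbackRate: number
): "too_aggressive" | "too_conservative" | "healthy" {
  if (simpleFallbackRate > 0.10) return "too_aggressive";   // under-routing
  if (simpleFallbackRate < 0.02) return "too_conservative"; // over-paying
  return "healthy";
}
```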


Common Mistakes

Over-engineering the classifier before you have data. The first version should be pure heuristics. Build the logging infrastructure first, collect 4–6 weeks of data, then invest in ML. Teams that skip to ML classifiers day one spend weeks on infrastructure and ship nothing.

No evaluation loop. Routing without measurement is guessing. If you're not tracking per-tier accuracy and fallback rate, you have no idea if the system is working. Instrument everything from day one.

Trusting LLM self-confidence. The self-evaluation prompt ("rate your confidence 1–10") is a useful signal but not a reliable one. Models are systematically overconfident on domains where they're weakest. Use it in combination with structural validation, not as a primary gate.

Ignoring cumulative latency. A waterfall that adds 2–3 seconds on fallback may be acceptable in isolation, but at p99 it creates timeouts and UX degradation. Profile your fallback path under realistic load before deploying.

Routing to cheapest model for latency-critical paths. Some applications (voice interfaces, real-time autocomplete) cannot absorb fallback latency. These should bypass the routing layer entirely.

Not invalidating the cache correctly. If your prompt template changes or your knowledge base updates, stale semantic cache entries will return wrong answers. Tie cache TTL to data freshness and build a cache invalidation path for emergency use.


Impact and Cost Math

A realistic distribution after a well-tuned router:

| Tier            | % of Requests | Avg Cost/1M Tokens | Daily Calls (1M total) |
|-----------------|---------------|--------------------|------------------------|
| Cache hit       | 25%           | $0.00              | 250,000                |
| Cheap model     | 50%           | $0.20              | 500,000                |
| Medium model    | 17%           | $1.50              | 170,000                |
| Expensive model | 8%            | $10.00             | 80,000                 |

At 800 average tokens per call:

  • Without routing (all expensive): 1M × 0.8K × $10.00/1M = $8,000/day
  • With routing: at 800 tokens/call, per-call costs are $0.00016 (cheap), $0.0012 (medium), and $0.008 (expensive), so (500K × $0.00016) + (170K × $0.0012) + (80K × $0.008) = $80 + $204 + $640 = ~$924/day
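The same arithmetic, spelled out in code (rates in $ per 1M tokens, shares and volumes as above):

```typescript
// Daily cost for 1M calls at 800 tokens each
const CALLS = 1_000_000;
const TOKENS_PER_CALL = 800;

function dailyCost(tiers: { share: number; ratePerMTok: number }[]): number {
  return tiers.reduce(
    (sum, t) =>
      sum + CALLS * t.share * TOKENS_PER_CALL * (t.ratePerMTok / 1_000_000),
    0
  );
}

const routed = dailyCost([
  { share: 0.25, ratePerMTok: 0 },     // cache hits
  { share: 0.50, ratePerMTok: 0.20 },  // cheap
  { share: 0.17, ratePerMTok: 1.50 },  // medium
  { share: 0.08, ratePerMTok: 10.00 }, // expensive
]);
const naive = dailyCost([{ share: 1.0, ratePerMTok: 10.00 }]);
// routed ≈ $924/day, naive = $8,000/day
```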

That's an 88% cost reduction in this scenario. Even conservative assumptions (50% cheap, 30% medium, 20% expensive) give you 60–70% reduction over naive routing.

The 25% cache hit rate is achievable in conversational assistants and document processing workloads. In one-shot analytical tools, expect 5–15%.


When Not to Use This

Mission-critical output with no tolerance for error. Medical diagnosis support, legal document drafting, financial risk assessment — anywhere a wrong answer has serious consequences. Route directly to your best model, full stop. The cost savings don't justify the risk surface.

Low-volume applications. Below ~50K requests/month, the engineering cost of building and maintaining a routing system exceeds the cost savings. The crossover point depends on your model mix, but a $500/month LLM bill rarely justifies a routing layer.

Ultra-low latency requirements. If your p99 SLA is under 300ms, you can't afford classification overhead or waterfall fallback. Profile carefully — even a fast heuristic classifier adds 2–5ms, and if that pushes you over budget, the routing layer is the wrong tool.

Highly specialized domains with little training data. If every prompt in your system requires deep domain reasoning (advanced scientific analysis, complex legal interpretation), the "60–80% are simple" assumption breaks down. Your actual cheap-model routing rate might be 10–15%, and you spend more on classifier infrastructure than you save on model costs.