
Building Your First Real RAG System

Chunking strategies, top-k tuning, context window management, and the noise problem — how to move from a retrieval pipeline that sometimes works to one that works reliably.

From Pipeline to Production

Article six built the store-retrieve-inject pipeline. You have documents indexed, a similarity search running, and an LLM generating answers with retrieved context. End-to-end, it works. The issue: "works" here means "produces a plausible answer on inputs I tested manually." Which is exactly what iteration 1 of the summarizer looked like in article two before you added validation and retries.

RAG systems fail silently. The model does not return a 500 when retrieval is poor; it generates a confident answer from bad context, and there is no error to catch programmatically. You find out when users stop trusting the system, or when an evaluation run surfaces the problem. This article covers the engineering that closes that gap.

Three problems to solve: chunking (how you store documents affects retrieval quality), top-k tuning (how many documents to retrieve), and noise filtering (how to prevent irrelevant context from degrading output).


Why Chunking Matters

In article six, you stored each document as a single unit with one embedding. The embedding is a compressed representation of the entire document's semantic content. For a 50-sentence document, it is the average of all 50 sentences' meaning.

The problem: a query about sentence 37 may not retrieve this document because the embedding reflects the aggregate, diluted content of all 50 sentences. Sentence 37's specific meaning is drowned out.

Chunking splits documents into smaller pieces before embedding. Each chunk has its own focused embedding. A query about sentence 37 can now retrieve the chunk containing it directly.
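A toy example with 3-dimensional vectors makes the dilution concrete (real embeddings have hundreds of dimensions, and these numbers are purely illustrative):

```typescript
// Cosine similarity between two vectors
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Pretend sentence 37 points one way and the other 49 sentences another
const sentence37 = [1, 0, 0];

// Whole-document embedding: the mean of all 50 sentence vectors
const docMean = [
  (1 * 1 + 49 * 0) / 50, // 0.02
  (1 * 0 + 49 * 1) / 50, // 0.98
  0,
];

const query = [0.9, 0.1, 0]; // a query about sentence 37

cosine(query, sentence37); // ≈ 0.99 — a chunk-level embedding matches
cosine(query, docMean);    // ≈ 0.13 — the document-level embedding barely does
```

The specific sentence's direction is almost entirely averaged away; a chunk containing only that sentence keeps it intact.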

The engineering question: what is the right chunk size?

Fixed-Size Chunking

The simplest implementation. Split every document into chunks of N tokens, with M tokens of overlap between adjacent chunks.

function chunkByTokens(
  text: string,
  chunkSize: number = 500,
  overlap: number = 50
): string[] {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  // Rough token estimate: 4 chars per token
  const charsPerChunk = chunkSize * 4;
  const overlapChars = overlap * 4;
  const chunks: string[] = [];

  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + charsPerChunk, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // avoid a trailing chunk that is pure overlap
    start += charsPerChunk - overlapChars;
  }

  return chunks;
}

The overlap is not optional. Without overlap, a sentence split across two chunk boundaries exists in neither chunk's embedding with its full context. 50–100 token overlap ensures boundary content appears in at least one chunk in full.

Default parameters: 500 tokens per chunk, 50 token overlap. This is widely cited and broadly reasonable. It does not mean it is right for your data.

When fixed-size chunking fails:

A technical documentation article has three sections: an introductory paragraph, a code example, and a summary paragraph. With fixed-size chunks, the code example starts mid-chunk and ends in the next one. A query about the code retrieves one chunk containing lines 1–15 and another containing lines 10–25, with the 50-token overlap duplicated between them. Neither chunk contains the full algorithm.

This is why semantic chunking exists.

Semantic Chunking

Chunk at natural semantic boundaries: paragraphs, sections, headings. Preserve the structure the document's author considered meaningful.

function chunkBySections(document: string): string[] {
  // Split on double newlines (paragraphs) or markdown headings
  const sections = document
    .split(/\n\n+|(?=^#{1,3} )/m)
    .map((s) => s.trim())
    .filter((s) => s.length > 50); // Remove stubs

  // Merge sections that are too short (under ~100 tokens)
  const MIN_CHUNK_CHARS = 400;
  const chunks: string[] = [];
  let current = "";

  for (const section of sections) {
    if (current.length + section.length < MIN_CHUNK_CHARS) {
      current += (current ? "\n\n" : "") + section;
    } else {
      if (current) chunks.push(current);
      current = section;
    }
  }
  if (current) chunks.push(current);

  return chunks;
}

For structured documents (markdown, HTML, PDFs with heading hierarchy), semantic chunking consistently outperforms fixed-size. For unstructured documents (raw text, OCR output, email bodies), fixed-size with overlap is more reliable.

Practical decision: use semantic chunking for content you control (documentation, knowledge bases, reports). Use fixed-size for content ingested from outside sources where structure is inconsistent.
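One way to encode that decision is a small heuristic that checks whether a document carries usable structure before picking a strategy. The signals and thresholds below are illustrative assumptions, not established rules:

```typescript
// Heuristic: does the document have markdown-style structure worth preserving?
function pickChunkingStrategy(document: string): "semantic" | "fixed" {
  const lines = document.split("\n");
  const headings = lines.filter((l) => /^#{1,3} /.test(l)).length;
  const paragraphBreaks = (document.match(/\n\n+/g) ?? []).length;

  // Enough headings or paragraph breaks → trust the author's structure
  if (headings >= 2 || paragraphBreaks >= 5) return "semantic";
  return "fixed"; // unstructured text: fixed-size with overlap is safer
}
```

A dispatcher can then route to chunkBySections or chunkByTokens accordingly, so mixed corpora get the right treatment per document.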

Chunk Metadata

Every chunk should carry metadata about its source. This serves two purposes: metadata filtering during retrieval, and source attribution in responses.

interface DocumentChunk {
  id: string;
  content: string;
  embedding?: number[];
  metadata: {
    documentId: string;
    documentTitle: string;
    chunkIndex: number;
    totalChunks: number;
    section?: string;
    category?: string;
    createdAt: string;
  };
}

async function indexDocumentWithChunks(
  documentId: string,
  title: string,
  content: string,
  category: string
): Promise<void> {
  const chunks = chunkBySections(content);

  const chunkObjects: DocumentChunk[] = chunks.map((chunk, i) => ({
    id: `${documentId}-chunk-${i}`,
    content: chunk,
    metadata: {
      documentId,
      documentTitle: title,
      chunkIndex: i,
      totalChunks: chunks.length,
      category,
      createdAt: new Date().toISOString(),
    },
  }));

  // Embed and insert
  const embeddings = await Promise.all(chunkObjects.map((c) => embed(c.content)));

  for (let i = 0; i < chunkObjects.length; i++) {
    await db.query(
      `INSERT INTO document_chunks (id, content, embedding, metadata)
       VALUES ($1, $2, $3::vector, $4)
       ON CONFLICT (id) DO UPDATE SET content = $2, embedding = $3::vector, metadata = $4`,
      [
        chunkObjects[i].id,
        chunkObjects[i].content,
        JSON.stringify(embeddings[i]),
        JSON.stringify(chunkObjects[i].metadata),
      ]
    );
  }
}

The ON CONFLICT DO UPDATE is important: re-indexing a document when its content changes should update existing chunks, not create duplicates.
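The upsert covers chunks that still exist, but if a re-indexed document shrinks from, say, eight chunks to five, chunks 5–7 from the old version linger in the index. A sketch of one way to find them, using the `${documentId}-chunk-${i}` id convention above (the actual DELETE would be a separate db.query):

```typescript
// Given the old and new chunk counts for a document, list the leftover
// chunk ids that a re-index should delete from the store.
function staleChunkIds(
  documentId: string,
  oldTotalChunks: number,
  newTotalChunks: number
): string[] {
  const stale: string[] = [];
  for (let i = newTotalChunks; i < oldTotalChunks; i++) {
    stale.push(`${documentId}-chunk-${i}`);
  }
  return stale;
}
```

indexDocumentWithChunks could read the previous totalChunks from existing metadata, then issue one DELETE for these ids after the upserts.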


Top-K Tuning

K=5 in article six was not derived from anything. It was a starting point. Here is how to tune it.

What K controls:

  • Higher K → more context → higher chance of including the relevant chunk → longer prompt → more noise
  • Lower K → less context → lower chance of including relevant chunk → shorter prompt → less noise

These are in tension. The right K minimizes noise while maximizing relevant coverage.

Step 1: Measure recall at K.

Take a labeled test set: 50 questions where you know the correct answer and which chunk contains it. Retrieve at various K values:

interface RecallTest {
  question: string;
  relevantChunkIds: string[]; // ground truth
}

async function measureRecall(
  testCases: RecallTest[],
  kValues: number[]
): Promise<Map<number, number>> {
  const results = new Map<number, number>();

  for (const k of kValues) {
    let hits = 0;
    for (const tc of testCases) {
      const retrieved = await retrieve(tc.question, k, 0.70);
      const retrievedIds = new Set(retrieved.map((r) => r.id));
      if (tc.relevantChunkIds.some((id) => retrievedIds.has(id))) {
        hits++;
      }
    }
    results.set(k, hits / testCases.length);
  }

  return results;
}

Run this for K ∈ {1, 3, 5, 10, 15, 20}. You will typically see:

| K  | Recall |
|----|--------|
| 1  | 0.62   |
| 3  | 0.78   |
| 5  | 0.86   |
| 10 | 0.91   |
| 15 | 0.92   |
| 20 | 0.93   |

Recall flattens after a point — adding more chunks retrieves the same relevant documents but adds increasingly irrelevant ones. The inflection point (here, around K=5 to K=10) is your target range.
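That eyeballed inflection point can also be computed from the map measureRecall returns: pick the smallest K after which the marginal recall gain drops below a threshold. The 0.02 default here is an arbitrary illustration, not a standard:

```typescript
// Smallest K where the next step up in K buys less than minGain recall.
function pickK(recallByK: Map<number, number>, minGain: number = 0.02): number {
  const entries = [...recallByK.entries()].sort((a, b) => a[0] - b[0]);
  for (let i = 0; i < entries.length - 1; i++) {
    const gain = entries[i + 1][1] - entries[i][1];
    if (gain < minGain) return entries[i][0];
  }
  return entries[entries.length - 1][0]; // recall never flattened in this range
}
```

On the table above this returns K=10; combined with the answer-quality measurement in the next step, the final choice may land lower.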

Step 2: Measure answer quality vs. context noise.

Recall measures whether the right chunk was retrieved. It does not measure whether the model used it correctly. Too many chunks dilute the context. Run your evaluation suite (introduced properly in article eight) across K values and compare model answer quality:

K=5: recall 0.86, answer quality 0.82
K=10: recall 0.91, answer quality 0.80  ← more recall, worse answers
K=7: recall 0.89, answer quality 0.84  ← best tradeoff

This pattern — where increasing K improves recall but degrades answer quality — is common. The model is not magic; it gets confused by more context, especially when later items are less relevant. Find the K where the tradeoff is best for your domain.


Context Window Management

A common oversight: if your documents are large and K is not small, you can blow past the context window.

Before every model call, estimate total context size:

function estimateTokens(text: string): number {
  // Approximation: 1 token ≈ 4 characters in English
  return Math.ceil(text.length / 4);
}

function buildContext(
  documents: RetrievedDocument[],
  systemPrompt: string,
  query: string,
  maxContextTokens: number = 6000
): string {
  let totalTokens = estimateTokens(systemPrompt) + estimateTokens(query) + 500; // buffer
  const includedDocs: string[] = [];

  // Sort by similarity descending (most relevant first)
  const sorted = [...documents].sort((a, b) => b.similarity - a.similarity);

  for (const doc of sorted) {
    const docTokens = estimateTokens(doc.content);
    if (totalTokens + docTokens > maxContextTokens) break;
    includedDocs.push(doc.content);
    totalTokens += docTokens;
  }

  return includedDocs.join("\n\n---\n\n");
}

This builds context greedily from most-relevant to least-relevant, stopping when the token budget is exhausted. The most relevant chunks always fit; less relevant ones are dropped first.

Context position matters. Research on LLM attention shows models pay more attention to content near the beginning and end of context than in the middle — the "lost in the middle" problem. Put the most relevant chunk first. If you have very long context, consider placing the most critical chunk at the top AND repeating a summary of it at the bottom.
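A sketch of that ordering, given chunks already sorted by similarity descending: strongest chunk first, second strongest last, the rest in the middle. Whether this sandwich actually helps depends on the model and context length, so measure before adopting it:

```typescript
// Reorder chunks so the two strongest land at the edges of the prompt,
// where "lost in the middle" research suggests attention is highest.
function sandwichOrder<T>(sortedBySimilarity: T[]): T[] {
  if (sortedBySimilarity.length <= 2) return sortedBySimilarity;
  const [first, second, ...rest] = sortedBySimilarity;
  return [first, ...rest, second];
}
```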


Filtering and Reranking

Similarity search is not always enough. Two refinements that significantly improve precision.

Metadata Filtering

Before similarity search, filter to only documents matching category, date range, or other criteria:

SELECT id, content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM document_chunks
WHERE metadata->>'category' = $2       -- filter first
  AND 1 - (embedding <=> $1::vector) >= $3
ORDER BY embedding <=> $1::vector
LIMIT $4;

In TypeScript:

async function retrieveFiltered(
  query: string,
  filter: { category?: string; documentId?: string },
  topK: number = 7,
  minSimilarity: number = 0.75
): Promise<RetrievedDocument[]> {
  const queryEmbedding = await embed(query);

  const conditions: string[] = ["1 - (embedding <=> $1::vector) >= $2"];
  const params: unknown[] = [JSON.stringify(queryEmbedding), minSimilarity];
  let paramIdx = 3;

  if (filter.category) {
    conditions.push(`metadata->>'category' = $${paramIdx++}`);
    params.push(filter.category);
  }
  if (filter.documentId) {
    conditions.push(`metadata->>'documentId' = $${paramIdx++}`);
    params.push(filter.documentId);
  }

  const { rows } = await db.query(
    `SELECT id, content, metadata, 1 - (embedding <=> $1::vector) AS similarity
     FROM document_chunks
     WHERE ${conditions.join(" AND ")}
     ORDER BY embedding <=> $1::vector
     LIMIT $${paramIdx}`,
    [...params, topK]
  );

  return rows as RetrievedDocument[];
}

Metadata filtering is the highest-leverage improvement to retrieval precision in most real deployments. "Only search product documentation, not general knowledge" eliminates entire categories of irrelevant retrieval without any ML.

Reranking

After retrieval, rerank chunks using a cross-encoder model that jointly attends to the query and each chunk — unlike bi-encoder embedding similarity, which embeds them independently.

Cross-encoder reranking is expensive (one model call per chunk). It is worth it for high-stakes retrieval where precision matters more than latency.

A lightweight alternative: keyword overlap as a reranking signal.

function rerank(
  query: string,
  documents: RetrievedDocument[]
): RetrievedDocument[] {
  const queryTerms = new Set(
    query.toLowerCase().split(/\W+/).filter((w) => w.length > 3)
  );
  if (queryTerms.size === 0) return documents; // nothing to boost against

  return documents
    .map((doc) => {
      // Unique terms only, so a repeated word does not inflate the overlap
      const docTerms = new Set(doc.content.toLowerCase().split(/\W+/));
      const overlap = [...queryTerms].filter((t) => docTerms.has(t)).length;
      const boostScore = overlap / queryTerms.size; // 0–1 range
      return {
        ...doc,
        similarity: doc.similarity * 0.7 + boostScore * 0.3, // blended score
      };
    })
    .sort((a, b) => b.similarity - a.similarity);
}

This blended score combines embedding similarity (70%) with keyword overlap (30%). Simple, zero-latency, and measurably improves precision for keyword-heavy queries where exact term matching matters (proper nouns, product names, technical terms).
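A quick arithmetic check with made-up scores shows why this helps keyword-heavy queries: full keyword overlap can lift a slightly weaker embedding match past a semantically closer but term-free chunk.

```typescript
// Illustrative numbers only: sim = embedding similarity, overlapFrac = keyword overlap
const blended = (sim: number, overlapFrac: number) => sim * 0.7 + overlapFrac * 0.3;

const exactTermChunk = blended(0.80, 1.0);    // 0.80 * 0.7 + 1.0 * 0.3 = 0.86
const semanticOnlyChunk = blended(0.84, 0.0); // 0.84 * 0.7 + 0.0 * 0.3 = 0.588
// exactTermChunk now ranks above semanticOnlyChunk
```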


The Failure Walkthrough: Noisy and Redundant Chunks

Scenario: A customer support RAG system over 200 product documentation articles. Users ask questions; the system answers from the docs.

Observed behavior after launch:

  1. Query: "How do I reset my password?" — correct answer retrieved and returned. Good.
  2. Query: "What are the API rate limits?" — retrieves three chunks from the rate limits doc, plus two generic API overview chunks. The answer correctly describes rate limits but pads it with generic overview content the user did not ask for. Quality is degraded.
  3. Query: "Does the product support HIPAA compliance?" — returns a confident "yes" from a chunk in a 2-year-old version of the docs. Current docs have a different answer.

Root causes:

Case 2: Fixed-size chunking split the rate limits section across three chunks with 50-token overlap. All three are retrieved (they are all highly similar to the query). They are mostly redundant. The duplicated context wastes tokens and dilutes the model's attention.

Fix: Switch to semantic chunking so the rate limits section is a single chunk. Add a deduplication step that reduces redundant chunks from the same document:

function deduplicate(
  documents: RetrievedDocument[],
  maxPerSource: number = 2
): RetrievedDocument[] {
  const countBySource = new Map<string, number>();
  return documents.filter((doc) => {
    const sourceId = doc.metadata.documentId as string;
    const count = countBySource.get(sourceId) ?? 0;
    if (count >= maxPerSource) return false;
    countBySource.set(sourceId, count + 1);
    return true;
  });
}

Case 3: Stale chunk retrieved because the 2-year-old doc was never re-indexed after update. No TTL on the index, no re-indexing pipeline.

Fix: add an updatedAt field to document metadata and stamp it on every re-index. Add a validation step to the retrieval pipeline:

// Flag chunks older than 90 days for review
const STALENESS_THRESHOLD_MS = 90 * 24 * 60 * 60 * 1000;

function filterStale(
  documents: RetrievedDocument[],
  warnOnly = true
): RetrievedDocument[] {
  return documents.filter((doc) => {
    // Prefer updatedAt; fall back to createdAt for chunks indexed before the field existed
    const timestamp = (doc.metadata.updatedAt ?? doc.metadata.createdAt) as string;
    const age = Date.now() - new Date(timestamp).getTime();
    if (age > STALENESS_THRESHOLD_MS) {
      logger.warn({ event: "stale_chunk_retrieved", chunkId: doc.id, ageMs: age });
      return !warnOnly; // If warnOnly, still include but log it
    }
    return true;
  });
}

And re-index documents when they are updated — treat the indexing pipeline as a write path, not a one-time import.


What is Next

Your RAG system now retrieves relevant content reliably (most of the time). But how do you know it is reliable? "Most of the time" is not a metric. Article eight builds the evaluation infrastructure: how to define what correct looks like, how to measure it systematically, and how to catch regressions before users do.