
Intro to RAG: Making AI Use Your Data

Embeddings, vector databases, the store-retrieve-inject pipeline, and the first real failure mode: irrelevant retrieval. What RAG is, why you need it, and how to build a version that actually works.

The Problem RAG Solves

Article five established why context grounding matters: the model hallucinates when it lacks relevant information in its context. The mitigations there assume the relevant information can be injected directly — you have one document, one product spec, one known context. What happens when your knowledge base is a hundred documents? Ten thousand?

You cannot fit ten thousand documents into one prompt. The context window has hard limits. Even if it did not, long contexts degrade model attention — relevant information buried in a 200K-token context receives less reliable attention than information near the top.

Retrieval-Augmented Generation (RAG) solves this with a different architecture: store all your documents, retrieve only the relevant ones at query time, inject them into the prompt. The model always gets a focused, relevant context rather than everything.

This article builds the foundation: what embeddings are, how vector databases work, and how to wire them together into a working pipeline. Article seven covers the engineering details — chunking strategies, top-k tuning, and filtering — that determine whether the pipeline actually retrieves the right content.


Embeddings: How Similarity Is Computed

An embedding is a dense vector representation of text. A sentence is converted into a list of floating-point numbers, typically 1,536 or 3,072 dimensions depending on the model. The key property: semantically similar text produces mathematically close vectors.

"The customer cancelled their subscription." and "A user terminated their account." share no exact words but describe the same concept. Their embeddings will be geometrically close. "The stock price fell 4% on earnings day." is on a completely different topic; its embedding will be far from both.

This geometric proximity is what enables semantic search. You convert a query to an embedding, find stored embeddings that are close to it, and return the corresponding documents. No keyword matching. No index tuning. The closeness of meaning determines relevance.

Generating an embedding:

import OpenAI from "openai";

const client = new OpenAI();

async function embed(text: string): Promise<number[]> {
  const response = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding; // Array of 1536 floats
}

text-embedding-3-small costs $0.02 per million tokens — approximately $0.00002 per 1,000-token document (roughly 4,000 characters of English text). Embedding your entire knowledge base costs almost nothing. Embedding each incoming query at inference time adds negligible cost.
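As a sanity check on that arithmetic, a minimal cost estimator (estimateEmbeddingCost is a hypothetical helper; the 4-characters-per-token ratio is a rough rule of thumb for English text):

```typescript
// Rough embedding cost for a corpus, at text-embedding-3-small's published
// rate of $0.02 per million tokens. Assumes ~4 characters per token, a
// common rule of thumb for English text.
const COST_PER_MILLION_TOKENS = 0.02; // USD

function estimateEmbeddingCost(totalCharacters: number): number {
  const approxTokens = totalCharacters / 4;
  return (approxTokens / 1_000_000) * COST_PER_MILLION_TOKENS;
}

// 10,000 documents of ~1,000 characters each: ~2.5M tokens, about $0.05.
console.log(estimateEmbeddingCost(10_000 * 1_000));
```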

Measuring similarity:

Cosine similarity is standard. It measures the angle between two vectors, ignoring magnitude. The range is -1 to 1; in practice scores for text embeddings cluster above 0, because embeddings of unrelated texts tend to be roughly orthogonal rather than pointing in opposite directions.

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

Score of 1.0: identical vectors. Scores near the top of the range: near-duplicates or close paraphrases. Score below 0.75: probably different topics. These thresholds vary by embedding model and domain; calibrate against your actual data.
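A quick sanity check on toy vectors (cosineSimilarity repeated from above so the snippet runs standalone):

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

console.log(cosineSimilarity([1, 0], [1, 0]));  // same direction: 1
console.log(cosineSimilarity([1, 0], [0, 1]));  // orthogonal: 0
console.log(cosineSimilarity([1, 0], [-1, 0])); // opposite: -1
console.log(cosineSimilarity([3, 0], [1, 0]));  // magnitude ignored: 1
```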


Vector Databases: The Retrieval Layer

Brute-force cosine similarity across 10,000 documents is feasible (milliseconds on modern hardware). Across 10 million documents, it is not. Vector databases solve this with approximate nearest neighbor (ANN) algorithms that make the search sublinear.
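For intuition, the brute-force search that ANN indexes approximate is just a scored scan. A minimal sketch (bruteForceTopK is a hypothetical name; cosineSimilarity repeated so the snippet is self-contained):

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Score every stored vector against the query, sort, take the top k.
// O(n) work per query: fine for thousands of vectors, too slow for millions.
function bruteForceTopK(
  query: number[],
  vectors: Array<{ id: string; embedding: number[] }>,
  k: number
): Array<{ id: string; similarity: number }> {
  return vectors
    .map((v) => ({ id: v.id, similarity: cosineSimilarity(query, v.embedding) }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, k);
}
```

A vector database does the same ranking, but over an index structure that avoids scoring every row.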

For development and small production deployments, you have two practical options:

Option 1: pgvector (PostgreSQL extension)

If you already run PostgreSQL, pgvector adds a vector type and ANN index. No new infrastructure, no new operational burden.

-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table: documents and their embeddings
CREATE TABLE documents (
  id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content   TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  metadata  JSONB
);

-- ANN index (IVFFlat: good for datasets under ~1M rows)
CREATE INDEX documents_embedding_idx
  ON documents USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Query:

SELECT id, content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5;

The <=> operator computes cosine distance. 1 - cosine_distance = cosine_similarity, so the query returns scores in [-1, 1] (in practice close to [0, 1] for text embeddings).

Option 2: Purpose-built vector DB (Pinecone, Qdrant, Weaviate)

Better indexing for large corpora, built-in namespace management, managed infrastructure. Worth the operational cost above ~1M vectors.

For this article, pgvector is sufficient. The patterns are identical regardless of which vector store you use.


Building the Store-Retrieve-Inject Pipeline

Three stages. Each is a distinct component with its own concerns.

Stage 1: Store (Indexing Pipeline)

This runs offline (or on document creation) — not on every query.

import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });

async function indexDocument(
  content: string,
  metadata: Record<string, unknown>
): Promise<string> {
  const embedding = await embed(content);

  const { rows } = await db.query(
    `INSERT INTO documents (content, embedding, metadata)
     VALUES ($1, $2::vector, $3)
     RETURNING id`,
    [content, JSON.stringify(embedding), JSON.stringify(metadata)]
  );

  return rows[0].id as string;
}

For bulk indexing:

async function indexDocumentBatch(
  documents: Array<{ content: string; metadata: Record<string, unknown> }>
): Promise<void> {
  // Embed in parallel, up to 20 at a time
  const BATCH_SIZE = 20;

  for (let i = 0; i < documents.length; i += BATCH_SIZE) {
    const batch = documents.slice(i, i + BATCH_SIZE);
    const embeddings = await Promise.all(batch.map((d) => embed(d.content)));

    // Insert batch in a single transaction
    const client = await db.connect();
    try {
      await client.query("BEGIN");
      for (let j = 0; j < batch.length; j++) {
        await client.query(
          `INSERT INTO documents (content, embedding, metadata) VALUES ($1, $2::vector, $3)`,
          [batch[j].content, JSON.stringify(embeddings[j]), JSON.stringify(batch[j].metadata)]
        );
      }
      await client.query("COMMIT");
    } catch (err) {
      await client.query("ROLLBACK");
      throw err;
    } finally {
      client.release();
    }
  }
}

Batching matters for large corpora. Embedding 10,000 documents one at a time with await embed(doc) inside a loop takes roughly twenty times as long as embedding them 20 at a time with Promise.all, because per-request latency dominates the total time.
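The batching pattern above can be factored into a small reusable helper (mapInBatches is a hypothetical name, not part of any library):

```typescript
// Run an async function over a list, at most `batchSize` calls in flight at
// once. Each batch completes fully before the next starts.
async function mapInBatches<T, R>(
  items: T[],
  batchSize: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(fn))));
  }
  return results;
}
```

With it, the embedding step of indexDocumentBatch reduces to `const embeddings = await mapInBatches(documents, 20, (d) => embed(d.content));`.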

Stage 2: Retrieve

Runs on every query, before the model call.

interface RetrievedDocument {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
  similarity: number;
}

async function retrieve(
  query: string,
  topK: number = 5,
  minSimilarity: number = 0.75
): Promise<RetrievedDocument[]> {
  const queryEmbedding = await embed(query);

  const { rows } = await db.query<{
    id: string;
    content: string;
    metadata: Record<string, unknown>;
    similarity: number;
  }>(
    `SELECT id, content, metadata,
            1 - (embedding <=> $1::vector) AS similarity
     FROM documents
     WHERE 1 - (embedding <=> $1::vector) >= $2
     ORDER BY embedding <=> $1::vector
     LIMIT $3`,
    [JSON.stringify(queryEmbedding), minSimilarity, topK]
  );

  return rows;
}

Two parameters to tune: topK (how many documents to retrieve) and minSimilarity (the floor below which documents are excluded even if they are among the top-K). Article seven covers how to tune these. For now: topK=5 and minSimilarity=0.75 are reasonable starting points.

Stage 3: Inject

Take retrieved documents, format them, inject into the model's context.

function formatContext(documents: RetrievedDocument[]): string {
  if (documents.length === 0) {
    return "(No relevant documents found in knowledge base)";
  }

  return documents
    .map((doc, i) =>
      `[Document ${i + 1}] (relevance: ${(doc.similarity * 100).toFixed(0)}%)\n${doc.content}`
    )
    .join("\n\n---\n\n");
}

async function answerWithRAG(query: string): Promise<string> {
  const documents = await retrieve(query);
  const context = formatContext(documents);

  const response = await callModel({
    system: `You are a knowledge base assistant. Answer the user's question using ONLY the documents provided below. 
If the answer is not found in the documents, say "I cannot find this information in the knowledge base."
Do not use outside knowledge.

KNOWLEDGE BASE:
${context}`,
    user: query,
  });

  return response;
}

The formatted context includes relevance scores. This is optional — some teams strip them to simplify the prompt. But including them gives the model a signal about which documents are more reliable, which can improve output quality on multi-document queries.


The First Failure Mode: Irrelevant Retrieval

The pipeline above works when retrieval works. The failure mode: retrieval returns documents that are syntactically similar but semantically wrong for the actual question.

Example:

Knowledge base contains product documentation. Query: "Can I export data to CSV?"

Retrieved documents (top 3):

  1. "Our export feature supports multiple formats including PDF and Excel." (similarity: 0.81)
  2. "Data can be exported from the analytics dashboard." (similarity: 0.79)
  3. "CSV files are commonly used for data interchange." (similarity: 0.77)

The third document is from a generic FAQ about data formats — completely unrelated to the user's question about the product's export capability. Its embedding is close to the query embedding because it mentions both "CSV" and "data."

The model, given this context, produces an answer that synthesizes documents 1 and 2 correctly but may also draw on document 3 to add information about CSV — which is technically in the injected context but not about the product.

Why this happens:

Embedding similarity captures semantic proximity but not the specific information relationship the user is asking about. "CSV files are used for data interchange" is genuinely semantically close to "Can I export data to CSV?" but answers a different question.

Mitigations (preview — article seven covers these in depth):

  1. Metadata filtering: Tag documents with source type, category, or product area. Filter retrieval to only documents from the relevant category before computing similarity. "Answer only comes from product_documentation category" cuts out generic FAQ noise.

  2. Minimum similarity threshold: Increasing minSimilarity from 0.75 to 0.82 often eliminates loose matches without losing genuinely relevant ones. Tune against a labeled test set.

  3. Better chunking: Document 3 above exists because the FAQ document was chunked into sections of arbitrary size, and a generic section about CSV formats ended up as a standalone chunk. Better chunking (keeping conceptually related content together) reduces this.
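As a preview of mitigation 1, the category filter can also be applied in application code after retrieval (a sketch; filterRetrieved and the category metadata key are assumptions, and in practice pushing the filter into the SQL query is cheaper):

```typescript
interface RetrievedDocument {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
  similarity: number;
}

// Keep only documents tagged with the allowed category, then re-apply a
// stricter similarity floor. In production, push the category filter into
// the SQL WHERE clause (metadata->>'category' = ...) so the database does it
// before ranking.
function filterRetrieved(
  docs: RetrievedDocument[],
  allowedCategory: string,
  minSimilarity: number
): RetrievedDocument[] {
  return docs.filter(
    (d) => d.metadata.category === allowedCategory && d.similarity >= minSimilarity
  );
}
```

Applied to the three documents above, tagging the first two as product_documentation and the third as faq would drop the generic CSV chunk entirely.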


Small Document Search System: End-to-End

A complete working example — a documentation search API that takes a query and returns an answer grounded in stored docs.

// index.ts
import express from "express";

const app = express();
app.use(express.json());

// Endpoint: index a document
app.post("/documents", async (req, res) => {
  const { content, category, title } = req.body;

  if (typeof content !== "string" || content.trim().length === 0) {
    return res.status(400).json({ error: "content is required" });
  }

  try {
    const id = await indexDocument(content, { category, title });
    return res.status(201).json({ id });
  } catch (err) {
    return res.status(500).json({ error: "Indexing failed" });
  }
});

// Endpoint: query the knowledge base
app.post("/query", async (req, res) => {
  const { question } = req.body;

  if (typeof question !== "string" || question.trim().length === 0) {
    return res.status(400).json({ error: "question is required" });
  }

  const startMs = Date.now();

  try {
    // Retrieve once and build the prompt here rather than calling
    // answerWithRAG, which would embed the query and hit the database a
    // second time.
    const documents = await retrieve(question, 5, 0.75);
    const context = formatContext(documents);
    const answer = await callModel({
      system: `You are a knowledge base assistant. Answer the user's question using ONLY the documents provided below.
If the answer is not found in the documents, say "I cannot find this information in the knowledge base."
Do not use outside knowledge.

KNOWLEDGE BASE:
${context}`,
      user: question,
    });

    console.log(JSON.stringify({
      event: "query.success",
      docsRetrieved: documents.length,
      topSimilarity: documents[0]?.similarity ?? 0,
      latencyMs: Date.now() - startMs,
    }));

    return res.json({
      answer,
      sources: documents.map((d) => ({
        title: d.metadata.title,
        similarity: d.similarity,
      })),
    });
  } catch (err) {
    return res.status(500).json({ error: "Query failed" });
  }
});

app.listen(3001);

What to log on every query:

  • docsRetrieved: if this is consistently 0 or 1, your similarity threshold may be too high
  • topSimilarity: if consistently below 0.80, either the knowledge base lacks coverage or chunking is poor
  • Latency split: retrieval time vs embedding time vs model time — helps identify bottlenecks
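One way to capture the latency split is a small timing wrapper (timed is a hypothetical helper, not a library function):

```typescript
// Time a single async step so retrieval, embedding, and model latency can be
// logged as separate fields.
async function timed<T>(fn: () => Promise<T>): Promise<{ result: T; ms: number }> {
  const start = Date.now();
  const result = await fn();
  return { result, ms: Date.now() - start };
}
```

In the /query handler this might look like `const { result: documents, ms: retrievalMs } = await timed(() => retrieve(question));` with retrievalMs added to the structured log line.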

What Article Seven Covers

This pipeline has three weaknesses you have traded for simplicity:

  1. Chunking: You are storing entire documents as single vectors. A long document has one embedding that averages over all its content. A query about section 3 may not retrieve the document because the embedding reflects sections 1, 2, 3, and 4 equally. Chunking splits documents into smaller pieces before embedding — article seven covers the tradeoffs.

  2. Top-K tuning: K=5 is not always right. Too few and you miss relevant context. Too many and you dilute the focused context the model needs. Article seven covers how to measure and tune this.

  3. Relevance noise: The irrelevant retrieval problem above. Article seven covers filtering strategies, reranking, and semantic similarity thresholds in detail.
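To make weakness 1 concrete, the naive fixed-size chunker that article seven improves on might look like this (a sketch; the size and overlap values are arbitrary assumptions):

```typescript
// Split text into fixed-size character chunks with overlap, so content that
// straddles a boundary appears whole in at least one chunk. Naive on purpose:
// it ignores sentence and section boundaries.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk would then be embedded and stored individually, so a query about one section matches that section's vector rather than a whole-document average.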

The pipeline you have built here is real. It will work. Article seven makes it work reliably.