Building production-grade Retrieval-Augmented Generation systems
Retrieval-Augmented Generation (RAG) has become the dominant approach for building knowledge-intensive AI applications. By combining the reasoning capabilities of large language models with the accuracy of retrieved information, RAG systems can answer questions about specific documents, products, or domains without the limitations of a frozen knowledge base.
This article covers the complete architecture of production RAG systems: from document ingestion through chunking, embedding, retrieval, and generation. We'll examine the engineering decisions that separate toy demonstrations from systems that handle millions of queries reliably.
A RAG system consists of two distinct pipelines: indexing (offline) and retrieval/generation (online). The indexing pipeline processes documents once and creates the searchable vector store. The retrieval pipeline handles user queries in real-time, finding relevant documents and generating responses.
```
INDEXING PIPELINE (Offline)
Documents → Loader → Chunker → Embedder → Vector Database
                        ↓
                  Metadata Store

RETRIEVAL PIPELINE (Online)
User Query → Embedder → Vector Search → Re-ranker → LLM → Response
                              ↓
                       Metadata Filter
```
The first challenge in any RAG system is converting raw documents into searchable chunks. This process, called chunking or segmentation, dramatically affects retrieval quality.
There's no universally optimal chunk size; different approaches suit different use cases.
For most applications, a chunk size of 512-1024 tokens with 50-150 token overlaps works well. The overlap ensures context isn't lost at chunk boundaries. Smaller chunks (256-512 tokens) work better for precise factual queries; larger chunks (1024+) preserve more context for complex questions.
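A minimal sketch of that default, assuming token-based chunking with the tiktoken tokenizer (any tokenizer would work; the function name and defaults are illustrative):

```python
# Minimal sketch: fixed-size chunking with token overlap.
# Assumes the tiktoken tokenizer; chunk_size/overlap defaults are illustrative.
import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    """Split text into chunks of `chunk_size` tokens that overlap by `overlap` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end of the document
    return chunks
```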
Every chunk should carry metadata that enables filtering and provenance tracking:
```json
{
  "chunk_id": "doc_123_chunk_7",
  "document_id": "doc_123",
  "document_title": "Q3 2024 Financial Report",
  "page_number": 7,
  "chunk_start_char": 15420,
  "chunk_end_char": 16280,
  "section_heading": "Revenue Analysis",
  "created_at": "2024-10-15"
}
```
This metadata enables powerful filtering: "Answer this question using only pages 5-10 of Q3 reports" or "Prioritize content from the most recent quarterly reports."
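As a sketch of what such a filtered query can look like, here is a ChromaDB-style example; the collection contents, metadata values, and query are hypothetical, and the `where` syntax shown is ChromaDB's:

```python
# Hypothetical example: a metadata-filtered query with ChromaDB's `where` clause.
import chromadb

client = chromadb.Client()  # in-memory client, fine for a demo
collection = client.get_or_create_collection("reports")

collection.add(
    ids=["doc_123_chunk_7", "doc_118_chunk_2"],
    documents=[
        "Revenue grew 12% quarter over quarter, driven by enterprise sales.",
        "The office relocation was completed in September.",
    ],
    metadatas=[
        {"document_title": "Q3 2024 Financial Report", "page_number": 7},
        {"document_title": "Facilities Update", "page_number": 2},
    ],
)

results = collection.query(
    query_texts=["How did revenue change quarter over quarter?"],
    n_results=2,
    where={  # only chunks from pages 5-10 of the Q3 report are considered
        "$and": [
            {"document_title": {"$eq": "Q3 2024 Financial Report"}},
            {"page_number": {"$gte": 5}},
            {"page_number": {"$lte": 10}},
        ]
    },
)
print(results["documents"])
```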
The choice of embedding model fundamentally determines retrieval quality. Embedding models convert text into dense vectors that capture semantic meaning—texts with similar meanings cluster together in vector space.
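A minimal sketch of this idea using the sentence-transformers library; the small checkpoint named here is chosen only for illustration, while production systems typically use one of the larger models compared in the table below:

```python
# Minimal sketch: embed a few texts and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small public checkpoint, for illustration
texts = [
    "How do I reset my password?",
    "Steps to recover access to your account",
    "Quarterly revenue grew 12%",
]
embeddings = model.encode(texts, normalize_embeddings=True)

# Related texts land close together in vector space, unrelated ones do not.
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low
```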
| Model | Dimensions | MTEB Benchmark | Context Length | Languages |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (or reduced) | 64.6% | 8191 | English+ |
| Cohere embed-english-v3.0 | 1024 | 63.8% | 512 | English |
| BGE-large-zh-v1.5 | 1024 | 64.1% | 512 | Chinese |
| E5-mistral-7b-instruct | 4096 | 66.6% | 4096 | Multilingual |
| Nomic-embed-text-v1.5 | 768 | 62.3% | 8192 | English |
The MTEB (Massive Text Embedding Benchmark) covers 58 datasets spanning retrieval, clustering, classification, and several other task types. The gap between the top performers is small (roughly four points) but can translate to significant differences in production retrieval quality.
Modern embedding models like OpenAI's text-embedding-3 support dimensionality reduction through truncation: the full 3072-dimensional embedding can be shortened while preserving most of its performance, trading a little accuracy for substantial memory and speed savings.
```
Full embedding (3072 dimensions): 100% performance
Truncated to 1024 dimensions:     ~99% performance, 67% memory reduction
Truncated to 256 dimensions:      ~98% performance, 92% memory reduction
```
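A sketch of how truncation works mechanically, using plain NumPy on a stand-in vector; with the text-embedding-3 models the same effect is available directly through the API's `dimensions` parameter:

```python
# Sketch: reduce embedding dimensionality by truncating and re-normalizing.
import numpy as np

def truncate_embedding(vec: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions and re-normalize to unit length."""
    truncated = vec[:k]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072)           # stand-in for a full 3072-dim embedding
full = full / np.linalg.norm(full)
small = truncate_embedding(full, 256)  # 256 dims ≈ 92% less memory per vector
```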
Vector databases store embeddings and enable fast similarity search. The market has exploded with options, each with different trade-offs.
| Database | Index Type | QPS (1M vectors) | P99 Latency | Managed Cloud |
|---|---|---|---|---|
| Pinecone | HNSW | ~1000 | ~50ms | Yes |
| Weaviate | HNSW + IVF | ~800 | ~60ms | Yes |
| ChromaDB | HNSW (IVF optional) | ~400 | ~80ms | Local only |
| Milvus | HNSW, IVF, PQ | ~2000 | ~30ms | Yes |
| Qdrant | HNSW | ~1500 | ~40ms | Both |
Pinecone offers managed infrastructure with strong consistency guarantees, making it popular for production applications where reliability trumps cost. Weaviate provides built-in hybrid search (combining vector and keyword matching) which simplifies architecture. ChromaDB is excellent for local development and prototypes—it's embedded and requires no separate service, making it the easiest to get started with.
Under the hood, vector databases rely on Approximate Nearest Neighbor (ANN) index structures such as HNSW, IVF, and product quantization (PQ), which trade a small amount of recall for large gains in query speed.
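As a concrete illustration, here is a small sketch that builds an HNSW index directly with the hnswlib library (an HNSW implementation similar to what several of these systems use internally); the vectors are random and the parameters are illustrative rather than tuned:

```python
# Sketch: build and query an HNSW index directly with hnswlib.
import hnswlib
import numpy as np

dim, num_vectors = 384, 10_000
vectors = np.random.rand(num_vectors, dim).astype(np.float32)  # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_vectors))
index.set_ef(50)  # higher ef -> better recall, slower queries

query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)  # approximate top-5 neighbors
```

Raising `ef` or `M` improves recall at the cost of query latency and memory, which is the same trade-off the managed databases expose through their index settings.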
Retrieval systems return the top-K most similar chunks. Higher K means better recall (more chances of including relevant information) but more noise and higher LLM processing costs.
```
K=3:  ~75% recall, low latency, minimal LLM context
K=10: ~90% recall, moderate latency, more context
K=50: ~97% recall, higher latency, significant context overhead
```
The right K depends on your use case. Factual Q&A typically needs K=3-5; complex analytical questions benefit from K=10-20. Beyond K=50, the additional chunks often don't help and may dilute relevant information.
Pure vector search can miss exact keyword matches, such as product codes, error identifiers, or rare names, that semantic similarity overlooks. Hybrid search combines keyword matching (BM25 or TF-IDF) with vector similarity:
hybrid_score = α × semantic_similarity + (1-α) × keyword_relevance
The α parameter (typically 0.5-0.7) controls the balance. Weaviate and Qdrant have native hybrid search support; other systems require running separate keyword and vector searches and merging results.
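A minimal sketch of this fusion, assuming BM25 scores from the rank_bm25 package and semantic scores that your vector search has already produced (the documents and score values here are hypothetical):

```python
# Sketch: hybrid score fusion of BM25 keyword relevance and semantic similarity.
import numpy as np
from rank_bm25 import BM25Okapi

docs = [
    "Revenue grew 12% quarter over quarter, driven by enterprise sales.",
    "Error code E-1042 indicates a storage timeout during indexing.",
    "The quarterly report covers revenue, margins, and headcount.",
]
query = "revenue growth last quarter"

# Keyword relevance from BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
keyword_scores = np.asarray(bm25.get_scores(query.lower().split()))

# Semantic similarities as returned by your vector search (hypothetical values).
semantic_scores = np.array([0.82, 0.31, 0.74])

def min_max(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.6  # weight on the semantic component
hybrid_score = alpha * min_max(semantic_scores) + (1 - alpha) * min_max(keyword_scores)
print(hybrid_score.argsort()[::-1])  # document indices, best first
```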
User queries are often poorly phrased for retrieval; rewriting or expanding the query before embedding it can noticeably improve results.
After initial retrieval, a re-ranker can dramatically improve result quality. Cross-encoders (like Cohere's rerank-3 or BGE-reranker) compute full pairwise relevance scores between the query and each retrieved chunk.
```
Initial retrieval (fast, ~80% recall)
  → top 50 chunks via vector search
Re-ranking (slower, ~95% precision)
  → cross-encoder scores all 50 chunks and keeps the top 10
Final context to LLM
  → the 10 re-ranked chunks
```
Cross-encoders are slower than vector search but more accurate, because they compute full attention between the query and each document rather than comparing two independently compressed embeddings. The combination of fast vector retrieval followed by slower re-ranking is often optimal.
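A sketch of this two-stage pattern using a cross-encoder from the sentence-transformers library; the checkpoint name is a commonly used public model, and the query and candidate chunks are hypothetical:

```python
# Sketch: re-rank retrieved chunks with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How did revenue change in Q3?"
candidates = [
    "Revenue grew 12% in Q3, driven by enterprise contracts.",
    "The office relocation was completed in September.",
    "Q3 operating margin improved by two percentage points.",
]

# Score every (query, chunk) pair with full cross-attention, then sort.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])  # keep only the best chunks for the LLM context
```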
The final step assembles the retrieved chunks into a prompt for the LLM, with each chunk clearly delimited and attributed to its source so the model can ground and cite its answer.
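One possible way to assemble that prompt, sketched below; the template wording and chunk fields (matching the metadata schema above) are illustrative, not a fixed standard:

```python
# Sketch: assemble retrieved chunks into a grounded, citeable prompt.
def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[{i + 1}] ({c['document_title']}, p.{c['page_number']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [1], [2], ... and say so if the answer is not in them.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How did revenue change in Q3?",
    [{"document_title": "Q3 2024 Financial Report", "page_number": 7,
      "text": "Revenue grew 12% quarter over quarter."}],
)
```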
RAG systems are notoriously difficult to evaluate because quality depends on both the retrieval and the generation stage. The RAGAS framework (Retrieval-Augmented Generation Assessment) provides automated, LLM-based metrics such as faithfulness, answer relevance, and context precision. For production, integrate retrieval metrics into your monitoring dashboard to catch embedding drift or index staleness.
Vector search is typically not the bottleneck—embedding generation and LLM inference are. Profile your pipeline to identify where latency actually lives.
General-purpose embeddings trained on web text may not capture your domain's terminology. Consider fine-tuning embeddings on domain-specific data for specialized applications.
When source documents update, the vector store must be updated. Implement change detection and incremental re-indexing to avoid serving outdated information.
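A minimal sketch of change detection by content hashing; the JSON file used here is a stand-in for whatever metadata store you already run, and re-embedding and upserting the changed documents is left to your indexing pipeline:

```python
# Sketch: detect changed documents by content hash before re-indexing.
import hashlib
import json
from pathlib import Path

HASH_FILE = Path("doc_hashes.json")  # stand-in for your metadata store

def load_hashes() -> dict:
    return json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}

def changed_documents(docs: dict[str, str]) -> list[str]:
    """Return the IDs of documents whose content differs from the last indexed version."""
    previous = load_hashes()
    current = {doc_id: hashlib.sha256(text.encode()).hexdigest() for doc_id, text in docs.items()}
    HASH_FILE.write_text(json.dumps(current))
    return [doc_id for doc_id, digest in current.items() if previous.get(doc_id) != digest]

# Only the changed documents need to be re-chunked, re-embedded, and upserted.
to_reindex = changed_documents({"doc_123": "Q3 revenue analysis, latest version."})
```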
Building a production RAG system requires decisions at every layer: how to chunk, which embedding model to use, which vector database fits your scale and consistency requirements, how to combine retrieval strategies, and how to evaluate end-to-end quality.
The good news: each layer is independently solvable. Start with a simple architecture (ChromaDB + OpenAI embeddings + top-5 retrieval), measure where quality suffers, and add complexity only where needed. Most applications don't need hybrid search, re-ranking, or custom embeddings to achieve 90th percentile quality.