RAG System Design

Building production-grade Retrieval-Augmented Generation systems

Published: January 2026 | Reading Time: 16 minutes | Category: AI & Machine Learning

[Figure: data flow visualization of the RAG architecture]

Retrieval-Augmented Generation (RAG) has become the dominant approach for building knowledge-intensive AI applications. By combining the reasoning capabilities of large language models with the accuracy of retrieved information, RAG systems can answer questions about specific documents, products, or domains without the limitations of a frozen knowledge base.

This article covers the complete architecture of production RAG systems: from document ingestion through chunking, embedding, retrieval, and generation. We'll examine the engineering decisions that separate toy demonstrations from systems that handle millions of queries reliably.

The RAG Architecture Overview

A RAG system consists of two distinct pipelines: indexing (offline) and retrieval/generation (online). The indexing pipeline processes documents once and creates the searchable vector store. The retrieval pipeline handles user queries in real-time, finding relevant documents and generating responses.

INDEXING PIPELINE (Offline)
Documents → Loader → Chunker → Embedder → Vector Database
            ↓
         Metadata Store

RETRIEVAL PIPELINE (Online)
User Query → Embedder → Vector Search → Re-ranker → LLM → Response
                ↓
           Metadata Filter
    

Document Processing and Chunking

The first challenge in any RAG system is converting raw documents into searchable chunks. This process, called chunking or segmentation, dramatically affects retrieval quality.

Chunking Strategies

There is no universally optimal chunk size; different strategies suit different use cases.

For most applications, a chunk size of 512-1024 tokens with 50-150 token overlaps works well. The overlap ensures context isn't lost at chunk boundaries. Smaller chunks (256-512 tokens) work better for precise factual queries; larger chunks (1024+) preserve more context for complex questions.
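As a concrete sketch of the fixed-size approach above, here is a minimal chunker with overlap. It approximates tokens by whitespace-separated words for simplicity; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks, approximating tokens by words."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window reached the end
            break
    return chunks
```

With the defaults, consecutive chunks share their last/first 100 words, so a fact straddling a boundary still appears intact in at least one chunk.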

Chunking Benchmark: In a 2024 study by Anyscale, semantic chunking improved retrieval accuracy by 18% over fixed-size chunking on a Q&A dataset, but increased indexing time by 4x. When indexing throughput matters (large corpora, frequent re-indexing), fixed-size chunking may be the practical choice.

Metadata Extraction

Every chunk should carry metadata that enables filtering and provenance tracking:

{
  "chunk_id": "doc_123_chunk_7",
  "document_id": "doc_123",
  "document_title": "Q3 2024 Financial Report",
  "page_number": 7,
  "chunk_start_char": 15420,
  "chunk_end_char": 16280,
  "section_heading": "Revenue Analysis",
  "created_at": "2024-10-15"
}
    

This metadata enables powerful filtering: "Answer this question using only pages 5-10 of Q3 reports" or "Prioritize content from the most recent quarterly reports."

Embedding Models

The choice of embedding model fundamentally determines retrieval quality. Embedding models convert text into dense vectors that capture semantic meaning—texts with similar meanings cluster together in vector space.
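The "cluster together" claim is usually measured with cosine similarity. A toy illustration with hand-written 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, produced by a model rather than by hand):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": semantically close texts point in similar directions.
cat     = [0.90, 0.10, 0.00]
kitten  = [0.85, 0.20, 0.05]
invoice = [0.00, 0.10, 0.95]
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice)
```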

Current State-of-the-Art Embeddings

| Model | Dimensions | MTEB Benchmark | Context Length | Languages |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (or reduced) | 64.6% | 8191 | English+ |
| Cohere embed-english-v3.0 | 1024 | 63.8% | 512 | English |
| BGE-large-zh-v1.5 | 1024 | 64.1% | 512 | Multilingual |
| E5-mistral-7b-ms | 1024 | 66.6% | 4096 | Multilingual |
| Nomic-embed-text-v1.5 | 768 | 62.3% | 8192 | English |

The MTEB (Massive Text Embedding Benchmark) covers 58 datasets across retrieval, clustering, and classification tasks. The gap between top performers is small (~4%) but can translate to significant differences in production retrieval quality.

Matryoshka Representation Learning

Modern embedding models like OpenAI's text-embedding-3 support "dimensionality reduction through truncation." The full 3072-dimensional embedding can be truncated to smaller sizes while preserving most performance. This allows trading accuracy for memory/speed savings.

Full embedding: 3072 dimensions → 100% performance
Truncated to 256: ~98% performance → 92% memory reduction
Truncated to 1024: ~99% performance → 67% memory reduction
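Truncation itself is mechanical: keep the leading components, then re-normalize so cosine similarity stays meaningful. A minimal sketch (the 4-d vector stands in for a 3072-d embedding):

```python
import math

def truncate_and_renormalize(vec: list[float], dims: int) -> list[float]:
    """Matryoshka-style truncation: keep the first `dims` components,
    then rescale to unit length so downstream cosine scores are comparable."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]   # stand-in for a full-size embedding
small = truncate_and_renormalize(full, 2)
```

This works only for models trained with Matryoshka-style objectives, which pack the most important information into the leading dimensions; truncating an ordinary embedding this way degrades quality much faster.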
    

Vector Database Options

Vector databases store embeddings and enable fast similarity search. The market has exploded with options, each with different trade-offs.

| Database | Index Type | QPS (1M vectors) | P99 Latency | Managed Cloud |
|---|---|---|---|---|
| Pinecone | HNSW | ~1000 | ~50ms | Yes |
| Weaviate | HNSW + IVF | ~800 | ~60ms | Yes |
| ChromaDB | HNSW (IVF optional) | ~400 | ~80ms | Local only |
| Milvus | HNSW, IVF, PQ | ~2000 | ~30ms | Yes |
| Qdrant | HNSW | ~1500 | ~40ms | Both |

Pinecone offers managed infrastructure with strong consistency guarantees, making it popular for production applications where reliability trumps cost. Weaviate provides built-in hybrid search (combining vector and keyword matching) which simplifies architecture. ChromaDB is excellent for local development and prototypes—it's embedded and requires no separate service, making it the easiest to get started with.

Indexing Algorithms: HNSW, IVF, and PQ

Under the hood, vector databases use Approximate Nearest Neighbor (ANN) algorithms:

HNSW (Hierarchical Navigable Small World): a multi-layer proximity graph; queries descend from sparse upper layers to a dense bottom layer. High recall and fast queries, but memory-hungry.

IVF (Inverted File): partitions vectors into clusters via k-means and searches only the clusters nearest the query. Lower memory than HNSW, with a tunable recall/speed trade-off (how many clusters to probe).

PQ (Product Quantization): compresses vectors by splitting them into sub-vectors and quantizing each against a small codebook. Large memory savings at some accuracy cost; often combined with IVF.

Retrieval Strategies

Top-K and Recall Trade-offs

Retrieval systems return the top-K most similar chunks. Higher K means better recall (more chances of including relevant information) but more noise and higher LLM processing costs.

K=3:  ~75% recall, low latency, minimal LLM context
K=10: ~90% recall, moderate latency, more context
K=50: ~97% recall, higher latency, significant context overhead
    

The right K depends on your use case. Factual Q&A typically needs K=3-5; complex analytical questions benefit from K=10-20. Beyond K=50, the additional chunks often don't help and may dilute relevant information.
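Conceptually, top-K selection over a small index is just "score everything, keep the K best." A brute-force sketch using dot-product similarity (real systems replace the exhaustive scan with the ANN indexes discussed above):

```python
import heapq

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the k chunk ids with the highest dot-product similarity."""
    scored = ((sum(q * v for q, v in zip(query_vec, vec)), cid)
              for cid, vec in index.items())
    return [cid for _, cid in heapq.nlargest(k, scored)]
```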

Hybrid Search

Pure vector search can miss exact matches (product codes, proper names, rare terms) that embeddings fail to capture. Hybrid search combines keyword matching (BM25 or TF-IDF) with vector similarity:

hybrid_score = α × semantic_similarity + (1-α) × keyword_relevance
    

The α parameter (typically 0.5-0.7) controls the balance. Weaviate and Qdrant have native hybrid search support; other systems require running separate keyword and vector searches and merging results.
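A minimal sketch of the blend above. It assumes raw BM25 scores are first min-max normalized into 0..1 so they are commensurable with cosine similarity; the normalization scheme is an assumption here (Reciprocal Rank Fusion is a common alternative that avoids score normalization entirely):

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale raw keyword scores into 0..1 so they can be blended with cosine scores."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {k: (v - lo) / span for k, v in scores.items()}

def hybrid_score(semantic: float, keyword: float, alpha: float = 0.6) -> float:
    """Blend a 0..1 vector-similarity score with a 0..1 normalized keyword score."""
    return alpha * semantic + (1 - alpha) * keyword
```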

Query Expansion and Reformulation

User queries are often poorly phrased for retrieval. Common techniques to improve them:

Query rewriting: use an LLM to restate the query in retrieval-friendly terms (expand abbreviations, drop filler).

Multi-query expansion: generate several paraphrases, retrieve for each, and merge the results.

HyDE (Hypothetical Document Embeddings): generate a hypothetical answer and embed that instead, since answers often sit closer to relevant chunks in vector space than questions do.

Re-ranking

After initial retrieval, a re-ranker can dramatically improve result quality. Cross-encoders (like Cohere's rerank-3 or BGE-reranker) compute full pairwise relevance scores between the query and each retrieved chunk.

Initial Retrieval (fast, ~80% recall):
  Top 50 chunks via vector search

Re-ranking (slower, ~95% precision):
  Cross-encoder scores all 50 chunks
  Return Top 10

Final Context to LLM:
  Top 10 re-ranked chunks
    

Cross-encoders are slower than vector search but more accurate because they compute full attention between query and document (rather than compressed embeddings). The combination of fast vector retrieval + slow re-ranking is often optimal.
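The two-stage pattern above can be sketched as follows. `cross_encoder_score` is a placeholder for whatever reranker you call (Cohere's rerank API, a local BGE cross-encoder); the brute-force first stage stands in for ANN search:

```python
def retrieve_and_rerank(query: str, index: dict[str, list[float]],
                        embed, cross_encoder_score,
                        first_stage_k: int = 50, final_k: int = 10) -> list[str]:
    """Stage 1: cheap vector retrieval. Stage 2: expensive cross-encoder rerank."""
    qvec = embed(query)
    # Stage 1: keep the first_stage_k chunks with highest dot-product similarity.
    candidates = sorted(index.items(),
                        key=lambda kv: sum(q * v for q, v in zip(qvec, kv[1])),
                        reverse=True)[:first_stage_k]
    # Stage 2: score each candidate text against the full query, keep the best.
    reranked = sorted(candidates,
                      key=lambda kv: cross_encoder_score(query, kv[0]),
                      reverse=True)
    return [text for text, _ in reranked[:final_k]]
```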

Generation: Context Assembly

The final step assembles retrieved chunks into a prompt for the LLM. Best practices:

Keep provenance: include each chunk's title and page number so the model can cite its sources.

Order deliberately: LLMs attend most to the beginning and end of the context window, so place the strongest chunks there rather than burying them in the middle.

Deduplicate: near-identical chunks waste context budget and can bias the answer.

Constrain the model: instruct it to answer only from the provided context and to say so when the context is insufficient.
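A minimal context-assembly sketch; the prompt wording and the `[n]` citation format are illustrative, not a fixed convention:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble retrieved chunks into a grounded prompt with source tags."""
    context = "\n\n".join(
        f"[{i + 1}] ({c['document_title']}, p.{c['page_number']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer using ONLY the sources below. Cite sources as [n]. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```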

Evaluation and Monitoring

RAG systems are notoriously difficult to evaluate. Key metrics:

Retrieval: recall@k and precision@k against a labeled set of relevant chunks, plus mean reciprocal rank (MRR) when ordering matters.

Generation: faithfulness (is every claim supported by the retrieved context?), answer relevance, and context precision (how much of the retrieved context was actually useful).

For automated evaluation, the RAGAS framework (Retrieval-Augmented Generation Assessment) provides LLM-judged metrics for faithfulness, answer relevance, and context precision. For production, integrate retrieval metrics into your monitoring dashboard to catch embedding drift or index staleness.
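Retrieval-side metrics need no LLM judge: given labeled relevant chunks per query, recall@k is a few lines. A sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for cid in retrieved[:k] if cid in relevant)
    return hits / len(relevant)
```

Tracking this over a fixed query set in production gives an early warning when index updates or embedding changes quietly degrade retrieval.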

Common Pitfalls

Vector DB as Bottleneck

Vector search is typically not the bottleneck—embedding generation and LLM inference are. Profile your pipeline to identify where latency actually lives.

Embedding Domain Mismatch

General-purpose embeddings trained on web text may not capture your domain's terminology. Consider fine-tuning embeddings on domain-specific data for specialized applications.

Stale Indexes

When source documents update, the vector store must be updated. Implement change detection and incremental re-indexing to avoid serving outdated information.
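One simple change-detection scheme, assuming you store a content hash per indexed document: rehash the current corpus, re-index documents whose hash changed or is new, and delete vectors for documents that disappeared.

```python
import hashlib

def detect_changes(current: dict[str, str],
                   indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (doc ids to (re)index, doc ids to delete from the vector store)."""
    new_hashes = {doc_id: hashlib.sha256(text.encode()).hexdigest()
                  for doc_id, text in current.items()}
    to_index = [doc_id for doc_id, h in new_hashes.items()
                if indexed_hashes.get(doc_id) != h]       # changed or brand new
    to_delete = [doc_id for doc_id in indexed_hashes
                 if doc_id not in new_hashes]             # removed at the source
    return to_index, to_delete
```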

Conclusion

Building a production RAG system requires decisions at every layer: how to chunk, which embedding model to use, which vector database fits your scale and consistency requirements, how to combine retrieval strategies, and how to evaluate end-to-end quality.

The good news: each layer is independently solvable. Start with a simple architecture (ChromaDB + OpenAI embeddings + top-5 retrieval), measure where quality suffers, and add complexity only where needed. Most applications don't need hybrid search, re-ranking, or custom embeddings to reach strong quality on the large majority of queries.