Building production-grade Retrieval-Augmented Generation systems
Retrieval-Augmented Generation (RAG) has become the dominant approach for building knowledge-intensive AI applications. By combining the reasoning capabilities of large language models with the accuracy of retrieved information, RAG systems can answer questions about specific documents, products, or domains without the limitations of a frozen knowledge base.
This article covers the complete architecture of production RAG systems: from document ingestion through chunking, embedding, retrieval, and generation. We'll examine the engineering decisions that separate toy demonstrations from systems that handle millions of queries reliably.
A RAG system consists of two distinct pipelines: indexing (offline) and retrieval/generation (online). The indexing pipeline processes documents once and creates the searchable vector store. The retrieval pipeline handles user queries in real-time, finding relevant documents and generating responses.
```
INDEXING PIPELINE (Offline)
Documents → Loader → Chunker → Embedder → Vector Database
                        ↓
                  Metadata Store

RETRIEVAL PIPELINE (Online)
User Query → Embedder → Vector Search → Re-ranker → LLM → Response
                              ↓
                       Metadata Filter
```
The first challenge in any RAG system is converting raw documents into searchable chunks. This process, called chunking or segmentation, dramatically affects retrieval quality.
There's no universally optimal chunk size; different approaches suit different use cases.
For most applications, a chunk size of 512-1024 tokens with 50-150 token overlaps works well. The overlap ensures context isn't lost at chunk boundaries. Smaller chunks (256-512 tokens) work better for precise factual queries; larger chunks (1024+) preserve more context for complex questions.
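A minimal sketch of that default, assuming token-based chunking with the tiktoken tokenizer (any tokenizer would work; the function name and defaults are illustrative):

```python
# Minimal sketch: fixed-size chunking with token overlap.
# Assumes the tiktoken tokenizer; chunk_size/overlap defaults are illustrative.
import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    """Split text into chunks of `chunk_size` tokens that overlap by `overlap` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end of the document
    return chunks
```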
Every chunk should carry metadata that enables filtering and provenance tracking:
```json
{
  "chunk_id": "doc_123_chunk_7",
  "document_id": "doc_123",
  "document_title": "Q3 2024 Financial Report",
  "page_number": 7,
  "chunk_start_char": 15420,
  "chunk_end_char": 16280,
  "section_heading": "Revenue Analysis",
  "created_at": "2024-10-15"
}
```
This metadata enables powerful filtering: "Answer this question using only pages 5-10 of Q3 reports" or "Prioritize content from the most recent quarterly reports."
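As a sketch of what such a filtered query can look like, here is a ChromaDB-style example; the collection contents, metadata values, and query are hypothetical, and the `where` syntax shown is ChromaDB's:

```python
# Hypothetical example: a metadata-filtered query with ChromaDB's `where` clause.
import chromadb

client = chromadb.Client()  # in-memory client, fine for a demo
collection = client.get_or_create_collection("reports")

collection.add(
    ids=["doc_123_chunk_7", "doc_118_chunk_2"],
    documents=[
        "Revenue grew 12% quarter over quarter, driven by enterprise sales.",
        "The office relocation was completed in September.",
    ],
    metadatas=[
        {"document_title": "Q3 2024 Financial Report", "page_number": 7},
        {"document_title": "Facilities Update", "page_number": 2},
    ],
)

results = collection.query(
    query_texts=["How did revenue change quarter over quarter?"],
    n_results=2,
    where={  # only chunks from pages 5-10 of the Q3 report are considered
        "$and": [
            {"document_title": {"$eq": "Q3 2024 Financial Report"}},
            {"page_number": {"$gte": 5}},
            {"page_number": {"$lte": 10}},
        ]
    },
)
print(results["documents"])
```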
The choice of embedding model fundamentally determines retrieval quality. Embedding models convert text into dense vectors that capture semantic meaning—texts with similar meanings cluster together in vector space.
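A minimal sketch of this idea using the sentence-transformers library; the small checkpoint named here is chosen only for illustration, while production systems typically use one of the larger models compared in the table below:

```python
# Minimal sketch: embed a few texts and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small public checkpoint, for illustration
texts = [
    "How do I reset my password?",
    "Steps to recover access to your account",
    "Quarterly revenue grew 12%",
]
embeddings = model.encode(texts, normalize_embeddings=True)

# Related texts land close together in vector space, unrelated ones do not.
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low
```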
| Model | Dimensions | MTEB Benchmark | Context Length | Languages |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (or reduced) | 64.6% | 8191 | English+ |
| Cohere embed-english-v3.0 | 1024 | 63.8% | 512 | English |
| BGE-large-zh-v1.5 | 1024 | 64.1% | 512 | Chinese |
| E5-mistral-7b-instruct | 4096 | 66.6% | 4096 | Multilingual |
| Nomic-embed-text-v1.5 | 768 | 62.3% | 8192 | English |
The MTEB (Massive Text Embedding Benchmark) covers 58 datasets spanning retrieval, clustering, classification, and several other task types. The gap between the top performers is small (roughly four points) but can translate to significant differences in production retrieval quality.
Modern embedding models like OpenAI's text-embedding-3 support dimensionality reduction through truncation: the full 3072-dimensional embedding can be shortened while preserving most of its performance, trading a little accuracy for substantial memory and speed savings.
```
Full embedding (3072 dimensions): 100% performance
Truncated to 1024 dimensions:     ~99% performance, 67% memory reduction
Truncated to 256 dimensions:      ~98% performance, 92% memory reduction
```
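A sketch of how truncation works mechanically, using plain NumPy on a stand-in vector; with the text-embedding-3 models the same effect is available directly through the API's `dimensions` parameter:

```python
# Sketch: reduce embedding dimensionality by truncating and re-normalizing.
import numpy as np

def truncate_embedding(vec: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions and re-normalize to unit length."""
    truncated = vec[:k]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072)           # stand-in for a full 3072-dim embedding
full = full / np.linalg.norm(full)
small = truncate_embedding(full, 256)  # 256 dims ≈ 92% less memory per vector
```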
Vector databases store embeddings and enable fast similarity search. The market has exploded with options, each with different trade-offs.
| Database | Index Type | QPS (1M vectors) | P99 Latency | Managed Cloud |
|---|---|---|---|---|
| Pinecone | HNSW | ~1000 | ~50ms | Yes |
| Weaviate | HNSW + IVF | ~800 | ~60ms | Yes |
| ChromaDB | HNSW (IVF optional) | ~400 | ~80ms | Local only |
| Milvus | HNSW, IVF, PQ | ~2000 | ~30ms | Yes |
| Qdrant | HNSW | ~1500 | ~40ms | Both |
Pinecone offers managed infrastructure with strong consistency guarantees, making it popular for production applications where reliability trumps cost. Weaviate provides built-in hybrid search (combining vector and keyword matching) which simplifies architecture. ChromaDB is excellent for local development and prototypes—it's embedded and requires no separate service, making it the easiest to get started with.
Under the hood, vector databases rely on Approximate Nearest Neighbor (ANN) index structures such as HNSW, IVF, and product quantization (PQ), which trade a small amount of recall for large gains in query speed.
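As a concrete illustration, here is a small sketch that builds an HNSW index directly with the hnswlib library (an HNSW implementation similar to what several of these systems use internally); the vectors are random and the parameters are illustrative rather than tuned:

```python
# Sketch: build and query an HNSW index directly with hnswlib.
import hnswlib
import numpy as np

dim, num_vectors = 384, 10_000
vectors = np.random.rand(num_vectors, dim).astype(np.float32)  # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_vectors))
index.set_ef(50)  # higher ef -> better recall, slower queries

query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)  # approximate top-5 neighbors
```

Raising `ef` or `M` improves recall at the cost of query latency and memory, which is the same trade-off the managed databases expose through their index settings.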
Retrieval systems return the top-K most similar chunks. Higher K means better recall (more chances of including relevant information) but more noise and higher LLM processing costs.
```
K=3:  ~75% recall, low latency, minimal LLM context
K=10: ~90% recall, moderate latency, more context
K=50: ~97% recall, higher latency, significant context overhead
```
The right K depends on your use case. Factual Q&A typically needs K=3-5; complex analytical questions benefit from K=10-20. Beyond K=50, the additional chunks often don't help and may dilute relevant information.
Pure vector search can miss exact keyword matches, such as product codes, error identifiers, or rare names, that semantic similarity overlooks. Hybrid search combines keyword matching (BM25 or TF-IDF) with vector similarity:
hybrid_score = α × semantic_similarity + (1-α) × keyword_relevance
The α parameter (typically 0.5-0.7) controls the balance. Weaviate and Qdrant have native hybrid search support; other systems require running separate keyword and vector searches and merging results.
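A minimal sketch of this fusion, assuming BM25 scores from the rank_bm25 package and semantic scores that your vector search has already produced (the documents and score values here are hypothetical):

```python
# Sketch: hybrid score fusion of BM25 keyword relevance and semantic similarity.
import numpy as np
from rank_bm25 import BM25Okapi

docs = [
    "Revenue grew 12% quarter over quarter, driven by enterprise sales.",
    "Error code E-1042 indicates a storage timeout during indexing.",
    "The quarterly report covers revenue, margins, and headcount.",
]
query = "revenue growth last quarter"

# Keyword relevance from BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
keyword_scores = np.asarray(bm25.get_scores(query.lower().split()))

# Semantic similarities as returned by your vector search (hypothetical values).
semantic_scores = np.array([0.82, 0.31, 0.74])

def min_max(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.6  # weight on the semantic component
hybrid_score = alpha * min_max(semantic_scores) + (1 - alpha) * min_max(keyword_scores)
print(hybrid_score.argsort()[::-1])  # document indices, best first
```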
User queries are often poorly phrased for retrieval; rewriting or expanding the query before embedding it can noticeably improve results.
After initial retrieval, a re-ranker can dramatically improve result quality. Cross-encoders (like Cohere's rerank-3 or BGE-reranker) compute full pairwise relevance scores between the query and each retrieved chunk.
```
Initial retrieval (fast, ~80% recall)
  → top 50 chunks via vector search
Re-ranking (slower, ~95% precision)
  → cross-encoder scores all 50 chunks and keeps the top 10
Final context to LLM
  → the 10 re-ranked chunks
```
Cross-encoders are slower than vector search but more accurate, because they compute full attention between the query and each document rather than comparing two independently compressed embeddings. The combination of fast vector retrieval followed by slower re-ranking is often optimal.
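A sketch of this two-stage pattern using a cross-encoder from the sentence-transformers library; the checkpoint name is a commonly used public model, and the query and candidate chunks are hypothetical:

```python
# Sketch: re-rank retrieved chunks with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How did revenue change in Q3?"
candidates = [
    "Revenue grew 12% in Q3, driven by enterprise contracts.",
    "The office relocation was completed in September.",
    "Q3 operating margin improved by two percentage points.",
]

# Score every (query, chunk) pair with full cross-attention, then sort.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])  # keep only the best chunks for the LLM context
```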
The final step assembles the retrieved chunks into a prompt for the LLM, with each chunk clearly delimited and attributed to its source so the model can ground and cite its answer.
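One possible way to assemble that prompt, sketched below; the template wording and chunk fields (matching the metadata schema above) are illustrative, not a fixed standard:

```python
# Sketch: assemble retrieved chunks into a grounded, citeable prompt.
def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[{i + 1}] ({c['document_title']}, p.{c['page_number']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [1], [2], ... and say so if the answer is not in them.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How did revenue change in Q3?",
    [{"document_title": "Q3 2024 Financial Report", "page_number": 7,
      "text": "Revenue grew 12% quarter over quarter."}],
)
```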
RAG systems are notoriously difficult to evaluate because quality depends on both the retrieval and the generation stage. The RAGAS framework (Retrieval-Augmented Generation Assessment) provides automated, LLM-based metrics such as faithfulness, answer relevance, and context precision. For production, integrate retrieval metrics into your monitoring dashboard to catch embedding drift or index staleness.
Vector search is typically not the bottleneck—embedding generation and LLM inference are. Profile your pipeline to identify where latency actually lives.
General-purpose embeddings trained on web text may not capture your domain's terminology. Consider fine-tuning embeddings on domain-specific data for specialized applications.
When source documents update, the vector store must be updated. Implement change detection and incremental re-indexing to avoid serving outdated information.
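A minimal sketch of change detection by content hashing; the JSON file used here is a stand-in for whatever metadata store you already run, and re-embedding and upserting the changed documents is left to your indexing pipeline:

```python
# Sketch: detect changed documents by content hash before re-indexing.
import hashlib
import json
from pathlib import Path

HASH_FILE = Path("doc_hashes.json")  # stand-in for your metadata store

def load_hashes() -> dict:
    return json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}

def changed_documents(docs: dict[str, str]) -> list[str]:
    """Return the IDs of documents whose content differs from the last indexed version."""
    previous = load_hashes()
    current = {doc_id: hashlib.sha256(text.encode()).hexdigest() for doc_id, text in docs.items()}
    HASH_FILE.write_text(json.dumps(current))
    return [doc_id for doc_id, digest in current.items() if previous.get(doc_id) != digest]

# Only the changed documents need to be re-chunked, re-embedded, and upserted.
to_reindex = changed_documents({"doc_123": "Q3 revenue analysis, latest version."})
```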
Building a production RAG system requires decisions at every layer: how to chunk, which embedding model to use, which vector database fits your scale and consistency requirements, how to combine retrieval strategies, and how to evaluate end-to-end quality.
The good news: each layer is independently solvable. Start with a simple architecture (ChromaDB + OpenAI embeddings + top-5 retrieval), measure where quality suffers, and add complexity only where needed. Most applications don't need hybrid search, re-ranking, or custom embeddings to achieve 90th percentile quality.