Advanced RAG pipelines, multi-hop retrieval, and self-RAG approaches
While basic RAG systems—retrieve a few chunks, stuff them in context, generate—work surprisingly well, production systems often require more sophisticated approaches. This article explores advanced RAG techniques: query understanding and reformulation, multi-hop reasoning, hybrid search optimization, and the emerging self-RAG paradigm.
A mature RAG pipeline isn't a linear retrieve-generate flow. It's a decision system that determines the best strategy based on query characteristics.
                QUERY ANALYSIS
                      ↓
                QUERY ROUTING ─────────────────┐
                      ↓                        ↓
            [Simple Retrieval]     [Complex Decomposition]
                      ↓                        ↓
            [Basic Generation]      [Multi-hop Retrieval]
                      ↓                        ↓
                 [Reranking] ────────→ [Synthesis Generation]
                                               ↓
                                           RESPONSE
Not every query needs the same retrieval strategy. Before retrieval begins, classify the query type (for example, a simple factual lookup versus an analytical or comparative question) and use that classification to choose the rest of the pipeline.
A lightweight classifier (or even keyword matching for simpler systems) can route queries appropriately. Factual queries might use aggressive top-K retrieval with exact keyword matching; analytical queries might retrieve broadly and rely on the LLM to synthesize.
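As a rough sketch, routing can start as nothing more than a keyword heuristic. The patterns and the strategy dictionaries below are illustrative assumptions, not a fixed taxonomy:

```python
import re

# Illustrative query categories; real systems tune these to their own domain.
FACTUAL = re.compile(r"^(who|what|when|where|which)\b", re.IGNORECASE)
ANALYTICAL = re.compile(r"\b(compare|explain why|analyze|trend|impact)\b", re.IGNORECASE)

def classify_query(query: str) -> str:
    """Cheap keyword-based routing; default to 'analytical' when unsure."""
    if ANALYTICAL.search(query):
        return "analytical"   # broad retrieval + decomposition
    if FACTUAL.match(query):
        return "factual"      # aggressive top-k with keyword matching
    return "analytical"

def route(query: str) -> dict:
    """Map the query class to an (illustrative) retrieval configuration."""
    if classify_query(query) == "factual":
        return {"top_k": 5, "use_decomposition": False, "use_reranker": True}
    return {"top_k": 20, "use_decomposition": True, "use_reranker": True}

print(route("Who founded Pixar?"))
print(route("Compare Apple and Microsoft revenue growth."))
```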
Instead of retrieving directly against the user's query, HyDE uses an LLM to generate a hypothetical relevant passage, then retrieves against that passage. The hypothesis "focuses" the retrieval by providing more semantic context.
User Query: "What caused the 2008 financial crisis?"
Step 1: LLM generates hypothetical answer:
"Hypothetical: The 2008 financial crisis was caused by the collapse
of the subprime mortgage market, excessive leverage in the financial
system, failures in regulatory oversight, and the housing bubble burst..."
Step 2: Retrieve against the hypothetical passage
Step 3: Use actual retrieved documents to generate final answer
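A minimal sketch of this flow, assuming placeholder llm_complete, embed, and index objects that stand in for whatever LLM client, embedding model, and vector store the pipeline already uses:

```python
def hyde_retrieve(query, llm_complete, embed, index, top_k=5):
    """HyDE: retrieve against a hypothetical answer instead of the raw query.

    llm_complete, embed, and index are placeholders for the existing LLM
    client, embedding model, and vector store.
    """
    # Step 1: generate a hypothetical passage that *answers* the query.
    hypothesis = llm_complete(
        f"Write a short passage that answers the question:\n{query}"
    )
    # Step 2: embed the hypothesis (not the query) and search the index.
    candidates = index.search(embed(hypothesis), top_k=top_k)
    # Step 3: the real retrieved documents feed the final generation step.
    return candidates
```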
In its benchmarks, HyDE shows roughly a 10-15% improvement on complex queries, though the improvement is smaller for factual queries where the user's query is already well-specified.
Complex questions often contain multiple implicit sub-questions:
Original: "Compare the revenue growth of Apple and Microsoft over the
last 5 years, and explain what drove the differences."
Decomposed:
1. "Apple annual revenue 2019-2024"
2. "Microsoft annual revenue 2019-2024"
3. "Factors affecting Apple revenue growth 2019-2024"
4. "Factors affecting Microsoft revenue growth 2019-2024"
Each sub-query is retrieved independently, and the results are combined for synthesis. This improves recall for multi-faceted questions but increases latency.
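One way to implement this, assuming a placeholder llm_complete client that returns a JSON list of strings and a retrieve(query, top_k) function returning documents with an "id" field (both are assumptions, not a specific library's API):

```python
import json

def decompose_and_retrieve(query, llm_complete, retrieve, top_k=5):
    """Split a multi-faceted question into sub-queries and retrieve each one."""
    prompt = (
        "Break the following question into 2-4 independent search queries. "
        f"Return a JSON list of strings.\nQuestion: {query}"
    )
    sub_queries = json.loads(llm_complete(prompt))
    # Retrieve per sub-query, then deduplicate by document id before synthesis.
    seen, combined = set(), []
    for sq in sub_queries:
        for doc in retrieve(sq, top_k=top_k):
            if doc["id"] not in seen:
                seen.add(doc["id"])
                combined.append(doc)
    return combined
```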
Hybrid search combines semantic (vector) search with keyword (BM25) search. The combination is powerful because the two methods fail in different ways: vector search captures paraphrases and semantic similarity but can miss rare terms, identifiers, and acronyms, while BM25 matches those exact tokens but misses rephrasings.
BM25 (Best Matching 25) is a probabilistic ranking function used in traditional information retrieval. It improves over simple TF-IDF by saturating term frequency and normalizing by document length:
BM25(D, Q) = Σ IDF(qi) × (tf(qi, D) × (k₁ + 1)) / (tf(qi, D) + k₁ × (1 - b + b × |D|/avgdl))
Where:
tf = term frequency in document
IDF = inverse document frequency
k₁ = saturation parameter (typically 1.2-2.0)
b = length normalization (typically 0.75)
|D| = document length
avgdl = average document length in corpus
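A direct, self-contained implementation of the formula above (using the common +1-smoothed IDF variant) over a pre-tokenized corpus:

```python
import math
from collections import Counter

def bm25_scores(query_terms, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in a tokenized corpus against query_terms."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency per term, used for IDF.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for q in query_terms:
            if q not in tf:
                continue
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1)
            denom = tf[q] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[q] * (k1 + 1) / denom
        scores.append(score)
    return scores

corpus = [["apple", "revenue", "grew"], ["microsoft", "cloud", "revenue"]]
print(bm25_scores(["revenue", "apple"], corpus))
```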
Once you have separate vector and BM25 scores, how do you combine them?
Reciprocal Rank Fusion (RRF):
RRF_score(d) = Σ 1 / (k + rank_i(d))
Where k = 60 (typical), rank_i = position in result list from system i
Simple and robust; doesn't require score normalization.
Normalized Score Fusion:
Combined(d) = α × norm(vector_score) + (1-α) × norm(BM25_score)
α = 0.5 to 0.7 typically; requires score normalization across systems.
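Both fusion strategies fit in a few lines. The sketch below assumes ranked lists of document ids as input for RRF and per-document score dictionaries for normalized fusion:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def normalized_fuse(vector_scores, bm25_scores, alpha=0.6):
    """Min-max normalize each score dict, then take a weighted sum."""
    def norm(s):
        lo, hi = min(s.values()), max(s.values())
        return {d: (v - lo) / (hi - lo or 1.0) for d, v in s.items()}
    v, b = norm(vector_scores), norm(bm25_scores)
    docs = set(v) | set(b)
    combined = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(combined, key=combined.get, reverse=True)

print(rrf_fuse([["d1", "d2", "d3"], ["d2", "d1", "d4"]]))
```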
A SIGIR 2023 study of fusion methods found RRF to be more robust to score distribution differences, while normalized fusion performs better when systems have calibrated confidence scores.
Some questions require reasoning across multiple retrieved contexts. This is "multi-hop" retrieval, so called because the answer requires a chain of retrieval and reasoning hops, each building on the previous one.
At each step, retrieve based on the question plus previous retrieved context:
Step 1: Retrieve docs relevant to "What companies did John own in 2020?"
→ [Doc A: John's tech investments, Doc B: John's stock portfolio]
Step 2: Retrieve based on "What companies did John own in 2020?" + context from Doc A
→ [Doc C: John's 2020 tax records showing stake in Acme Corp]
Step 3: Retrieve based on "What companies did John own in 2020?" + [Doc A, Doc C]
→ [Doc D: Acme Corp acquisition by Google]
Each iteration adds relevant context, enabling the final generation to reason across multiple hops.
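A sketch of this iterative loop, assuming placeholder retrieve and llm_complete functions; the stopping rule (no new documents, or max_hops reached) is an illustrative choice:

```python
def iterative_retrieve(question, retrieve, llm_complete, max_hops=3, top_k=3):
    """Retrieve, then re-query using what was just found, for up to max_hops."""
    context = []
    query = question
    for _ in range(max_hops):
        docs = retrieve(query, top_k=top_k)
        new_docs = [d for d in docs if d not in context]
        if not new_docs:
            break  # nothing new found; stop hopping
        context.extend(new_docs)
        # Ask the LLM what is still missing, and use that as the next query.
        query = llm_complete(
            f"Question: {question}\nKnown so far: {context}\n"
            "What follow-up search query would help answer the question? "
            "Reply with the query only."
        )
    return context
```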
Multi-hop retrieval can follow two paradigms: retrieve everything potentially relevant up front and then reason over it, or interleave retrieval with intermediate reasoning steps.
The IRCoT paper (Interleaving Retrieval with Chain-of-Thought; Trivedi et al., 2023) demonstrated that interleaving retrieval with reasoning steps outperforms pure retrieve-then-reason approaches by 15-20% on multi-hop benchmarks.
For questions involving entity relationships, a graph-structured index can dramatically improve retrieval. Instead of retrieving flat document chunks, you retrieve through an entity knowledge graph:
Question: "Who was the CEO of the company that acquired Pixar?"
Graph traversal:
"Pixar" → (acquired_by) → "Disney" → (has_CEO) → "Bob Iger"
Hybrid approaches combine knowledge graphs for entity resolution with vector search for open-domain content. LangChain and LlamaIndex both support graph-augmented retrieval.
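For illustration, here is a toy in-memory graph and a fixed-relation traversal; real systems build the graph via entity and relation extraction and fall back to vector search when an edge is missing. The graph structure and relation names below are illustrative:

```python
# Toy entity graph: {subject: {relation: object}}.
GRAPH = {
    "Pixar": {"acquired_by": "Disney"},
    "Disney": {"has_CEO": "Bob Iger"},
}

def traverse(start, relations, graph=GRAPH):
    """Follow a fixed chain of relations from a starting entity."""
    node = start
    for rel in relations:
        node = graph.get(node, {}).get(rel)
        if node is None:
            return None  # missing edge: fall back to vector search
    return node

print(traverse("Pixar", ["acquired_by", "has_CEO"]))  # -> "Bob Iger"
```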
Self-RAG (Asai et al., 2023) represents a paradigm shift: instead of a separate retrieval component, the LLM itself decides when to retrieve and evaluates the quality of retrieved content.
Self-RAG fine-tunes the base model to generate special tokens that control the retrieval process:
Self-RAG Generation:
"Question: What causes seasons on Earth?
[RETRIEVE]
Retrieved: [Document about Earth's axial tilt causing seasons]
[ISREL] The document is relevant (tilt causes seasons)
[ISSUP] The response will be supported by the document
The primary cause of seasons on Earth is its axial tilt of
approximately 23.5 degrees relative to its orbital plane around
the Sun.
[ISSUP] This claim is fully supported by the retrieved document
[ISUSE] The response is directly useful to the question asked.
"
| Task | Standard RAG | Self-RAG | Improvement |
|---|---|---|---|
| Trivia (PopQA) | 75.2% | 78.4% | +3.2% |
| Multi-hop (2WikiMultiHopQA) | 41.6% | 47.8% | +6.2% |
| Long-form (ELI5) | 52.3% | 56.1% | +3.8% |
| Factuality (TruthfulQA) | 58.2% | 63.7% | +5.5% |
Self-RAG shows consistent improvements across tasks, with larger gains on multi-hop and factuality tasks. The key advantage is adaptive retrieval—the model retrieves more for difficult questions and less for simple ones.
After initial retrieval (fast but approximate), cross-encoders perform precise relevance scoring by running the full attention mechanism between query and each candidate document.
Stage 1 - Vector Search (fast):
Query embedding → Top 100 candidates via ANN
Stage 2 - Cross-Encoder Reranking (precise):
For each candidate:
CrossEncoder(query, document) → relevance_score
Return Top 10 by cross-encoder score
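As a sketch, stage 2 can use an off-the-shelf cross-encoder from the sentence-transformers library; the model name below is only an example, and any query-document cross-encoder can be substituted:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_n=10):
    """Stage 2: rescore vector-search candidates with a cross-encoder."""
    # Example model; load once and reuse in a real pipeline.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```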
Managed rerankers such as Cohere's rerank-3 model offer strong off-the-shelf reranking quality. In benchmarks, adding cross-encoder reranking improves NDCG@10 by 15-25% over vector search alone.
For specialized domains, training a domain-specific ranker can outperform generic cross-encoders. LambdaMART and LightGBM-based learning-to-rank models combine multiple features, such as BM25 score, vector similarity, document recency, and click-through signals.
More retrieved context isn't always better. Relevant information can be diluted by irrelevant chunks, and very long contexts can cause the model to miss important details (the "lost in the middle" problem).
Instead of passing raw retrieved chunks, compress them to extract only relevant information:
Original chunks (2,000 tokens total):
"The quarterly report shows Q3 revenue of $45.2 billion, up 8%
year-over-year. The growth was primarily driven by strong
performance in the Services segment, which grew 14%..."
Compressed context (300 tokens):
"Q3 revenue: $45.2B (+8% YoY). Main driver: Services segment
(+14%). iPhone sales: $42.6B. Geographic breakdown: Americas
42%, Europe 25%, Greater China 18%."
LLM-based compressors (RECOMP, RecSum) extract relevant facts while filtering noise. This is especially valuable when the retrieved documents contain a lot of preamble or tangentially related content.
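A minimal sketch of compression in this spirit, assuming a placeholder llm_complete client; note that RECOMP itself uses trained compressors rather than a raw prompt:

```python
def compress_context(query, chunks, llm_complete, max_words=150):
    """Ask an LLM to keep only the query-relevant facts from retrieved chunks."""
    joined = "\n---\n".join(chunks)
    prompt = (
        f"Question: {query}\n\n"
        f"Source passages:\n{joined}\n\n"
        f"Extract only the facts needed to answer the question, "
        f"in at most {max_words} words. Drop everything else."
    )
    return llm_complete(prompt)
```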
A lighter approach: identify the specific sentences most relevant to the query and pass only those. This preserves exact phrasing from source documents, which can be important for citations.
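A sketch of this extractive variant using cosine similarity, assuming a placeholder embed function that maps a list of strings to vectors:

```python
import numpy as np

def select_sentences(query, sentences, embed, top_n=5):
    """Keep only the sentences most similar to the query."""
    q_vec = np.asarray(embed([query])[0])
    s_vecs = np.asarray(embed(sentences))
    sims = s_vecs @ q_vec / (np.linalg.norm(s_vecs, axis=1) * np.linalg.norm(q_vec))
    keep = np.argsort(sims)[::-1][:top_n]
    return [sentences[i] for i in sorted(keep)]  # preserve original order
```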
No retrieval system is perfect. Robust RAG systems handle failures gracefully: when retrieval returns weak or contradictory context, they fall back to broader retrieval, ask a clarifying question, or state that the answer is not in the knowledge base rather than generating an unsupported one.
Beyond simple accuracy, advanced RAG requires evaluation across multiple dimensions:
| Metric | What It Measures | How to Measure |
|---|---|---|
| Retrieval Recall | % of relevant docs retrieved | Ground truth annotations |
| Faithfulness | Response matches retrieved content | LLM-based evaluation |
| Answer Relevance | Response addresses the question | LLM-based or embedding similarity |
| Citation Accuracy | Claims attributed to correct sources | Human or automated annotation |
| Context Precision | Relevant info ranked highly | Position-weighted recall |
The RAGAS framework (Es et al., 2023) provides automated LLM-based evaluation for these metrics, enabling rapid iteration without extensive human annotation.
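To illustrate the flavor of LLM-based evaluation without depending on any particular library version, here is a crude faithfulness scorer in the RAGAS spirit, assuming a placeholder llm_complete client; the real framework implements this and the other metrics far more carefully:

```python
def faithfulness_score(answer, contexts, llm_complete):
    """Fraction of the answer's claims that are supported by retrieved context."""
    context_text = "\n".join(contexts)
    claims = [
        c.strip() for c in llm_complete(
            f"List each factual claim in the following answer, one per line:\n{answer}"
        ).splitlines() if c.strip()
    ]
    supported = 0
    for claim in claims:
        verdict = llm_complete(
            f"Context:\n{context_text}\n\nClaim: {claim}\n"
            "Is the claim supported by the context? Answer yes or no."
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / max(len(claims), 1)
```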
Advanced RAG techniques transform toy demonstrations into production-grade systems. Query classification and routing enable efficient processing for different query types. Multi-hop retrieval and graph-based approaches handle complex reasoning questions. Self-RAG provides adaptive retrieval without separate components.
The key insight: there's no one-size-fits-all RAG architecture. The best systems analyze query characteristics and apply the appropriate retrieval strategy—lightweight for simple factual queries, sophisticated for multi-hop reasoning. This adaptive approach maximizes both quality and efficiency.