Advanced RAG — Reranking, Hybrid Search and Evaluation
Reranking retrieved chunks, hybrid dense-sparse search, RAG evaluation metrics, and the patterns that separate production RAG from toy RAG.
Module 52 built a working RAG pipeline. This module explains why it fails in production and how to fix every failure mode systematically.
Naive RAG — embed query, retrieve top-k chunks by cosine similarity, inject into LLM prompt — works well in demos and poorly in production. The problems are consistent: semantic search alone misses exact keyword matches that users expect. The top retrieved chunks are often related to the query but do not actually answer it. Evaluation is absent — you do not know if the system is getting better or worse as you iterate.
A Razorpay knowledge base assistant built with naive RAG will struggle with queries like "what is error code 400?" — semantic search finds chunks about general payment errors (semantically similar) but misses the chunk that contains exactly "400" (keyword match). It will struggle with queries that require synthesising across multiple chunks. It will hallucinate when the retrieved chunks are tangentially related but do not contain the answer. And the team will have no objective way to know which of these failures occurs most often.
A good research librarian does two things a bad one does not. First: when you ask "find me information about UPI payment limits," they search both the subject index (semantic) and the keyword catalogue (exact match) — not just one. Second: after gathering candidates, they skim each one to pick the three most directly relevant — they rerank. A naive RAG pipeline skips both steps. Hybrid search is the librarian searching two catalogues. Reranking is the librarian reading before recommending.
Adding a reranker alone typically improves end-to-end RAG quality by 10–25% with minimal engineering effort. It is the single highest-leverage improvement you can make to a naive RAG system.
Hybrid search — dense semantic + sparse keyword, combined with RRF
Dense retrieval (embedding similarity) excels at semantic matching — it finds chunks about "payment declined" when the query is "transaction rejected." But it fails at exact keyword matching — "error code BAD_REQUEST_ERROR" might retrieve irrelevant chunks because the semantic embedding averages across all words. Sparse retrieval (BM25) excels at exact term matching but misses synonyms and paraphrases. Hybrid search combines both signals.
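The fusion step can be sketched in a few lines. The RRF formula is the standard one, score(d) = Σ 1/(k + rank_i) with k = 60; the document ids and the two ranked lists below are hypothetical stand-ins for real dense and BM25 results.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids. RRF score(d) = sum over lists of
    1 / (k + rank of d in that list); k=60 is the conventional constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for "what is error code 400?", best first:
dense = ["doc_payment_errors", "doc_400_bad_request", "doc_refunds"]         # embeddings
sparse = ["doc_400_bad_request", "doc_api_reference", "doc_payment_errors"]  # BM25

fused = reciprocal_rank_fusion([dense, sparse])
# The doc ranked well by BOTH retrievers rises to the top.
```

Because RRF uses only ranks, never raw scores, cosine similarities and BM25 scores never have to be put on a common scale, which is why the fusion step needs no tuning.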
Cross-encoder reranking — score every chunk against the query precisely
Bi-encoder retrieval (embedding similarity) is fast because query and document are encoded independently — you embed the query once and compare to pre-computed document embeddings. But this independence is also a weakness: the model cannot consider the specific interaction between a query word and a document word. A cross-encoder takes both query and document as a single input and computes a relevance score from their full interaction — much more accurate, but too slow to use on every document in the corpus.
The solution is a two-stage pipeline: use bi-encoder retrieval to quickly narrow down to top-100 candidates, then use a cross-encoder to precisely rerank those 100 candidates to find the true top-3. The cross-encoder only runs on 100 documents per query, not millions, so the extra latency is acceptable.
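A minimal sketch of that two-stage shape, with toy scorers standing in for the real models: stage 1 uses plain word overlap in place of bi-encoder ANN search, and stage 2 uses bigram overlap in place of a real cross-encoder (in practice you would call something like `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict(pairs)` from sentence-transformers). The corpus and query are illustrative.

```python
def bi_encoder_retrieve(query, corpus, top_n=100):
    # Stage 1 stand-in: fast, independent scoring. In production this is
    # an approximate-nearest-neighbour search over pre-computed embeddings.
    q = set(query.lower().split())
    return sorted(corpus,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:top_n]

def cross_encoder_score(query, doc):
    # Stage 2 stand-in: scores the (query, doc) pair JOINTLY, here via
    # shared bigrams, a signal that independent bag-of-words scoring misses.
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = {(d[i], d[i + 1]) for i in range(len(d) - 1)}
    return sum((q[i], q[i + 1]) in doc_bigrams for i in range(len(q) - 1))

def two_stage_search(query, corpus, top_k=3):
    candidates = bi_encoder_retrieve(query, corpus)  # cheap and wide
    return sorted(candidates,
                  key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_k]              # precise and narrow

corpus = [
    "General payment errors and how to debug declined transactions",
    "Error code 400 BAD_REQUEST_ERROR means the request payload failed validation",
    "Refund processing timelines for settled payments",
]
results = two_stage_search("what is error code 400", corpus, top_k=1)
```

The structure, not the toy scorers, is the point: the expensive joint scorer only ever sees the small candidate set the cheap scorer produced.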
HyDE, parent-child chunking, and query decomposition
Three more techniques that consistently improve RAG quality beyond hybrid search and reranking. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query and embeds that instead of the query — producing richer query embeddings that match document style. Parent-child chunking indexes small chunks for precision but retrieves their larger parent for context. Query decomposition breaks complex questions into sub-questions that are each easier to answer individually.
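Of the three, parent-child chunking is the most mechanical, so here is a minimal sketch of it. The toy word-overlap scorer stands in for real embedding retrieval, and the class name, chunk size, and documents are all illustrative.

```python
class ParentChildIndex:
    """Index small chunks for matching; return the full parent for context."""

    def __init__(self, child_size=8):
        self.child_size = child_size   # words per child chunk (illustrative)
        self.children = []             # (child_text, parent_id) pairs
        self.parents = {}              # parent_id -> full parent text

    def add(self, parent_id, parent_text):
        self.parents[parent_id] = parent_text
        words = parent_text.split()
        for i in range(0, len(words), self.child_size):
            self.children.append((" ".join(words[i:i + self.child_size]), parent_id))

    def retrieve(self, query, top_k=1):
        # Toy word-overlap scorer standing in for embedding similarity.
        q = set(query.lower().split())
        ranked = sorted(self.children,
                        key=lambda c: len(q & set(c[0].lower().split())),
                        reverse=True)
        seen, results = set(), []
        for _, parent_id in ranked:        # several children share a parent
            if parent_id not in seen:
                seen.add(parent_id)
                results.append(self.parents[parent_id])
            if len(results) == top_k:
                break
        return results

index = ParentChildIndex()
index.add("upi-limits", "UPI transactions are capped at 1 lakh per transaction "
          "for most banks. Some categories like capital markets allow higher "
          "limits up to 5 lakh.")
index.add("refunds", "Refunds are processed within 5 to 7 working days after initiation.")
hits = index.retrieve("UPI transaction limit")
```

The match happens against a small, precise child chunk, but the LLM receives the whole parent section, including sentences the child alone would have dropped.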
RAG evaluation — faithfulness, answer relevance, and context recall
Without evaluation, RAG iteration is guesswork. You make a change — better chunking, different embedding model, added reranking — and you have no objective measure of whether it helped. Three metrics cover the full RAG pipeline end to end. Faithfulness measures whether the answer is grounded in the context. Answer relevance measures whether the answer addresses the question. Context recall measures whether the retrieved chunks contain the answer.
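A crude lexical sketch of two of these metrics, using content-word overlap as a stand-in for the LLM judge that real evaluators (for example the Ragas library) use. The threshold, claims, and context strings are illustrative.

```python
def supported(claim, context, threshold=0.6):
    # Crude lexical proxy: a claim counts as supported if most of its
    # content words (length > 3) appear in the context. Real evaluators
    # such as Ragas use an LLM judge for this step instead.
    words = [w for w in claim.lower().split() if len(w) > 3]
    if not words:
        return True
    ctx = context.lower()
    return sum(w in ctx for w in words) / len(words) >= threshold

def faithfulness(answer_claims, context):
    # Fraction of the answer's claims grounded in the retrieved context.
    return sum(supported(c, context) for c in answer_claims) / len(answer_claims)

def context_recall(reference_claims, context):
    # Fraction of reference-answer claims present in the retrieved chunks.
    return sum(supported(c, context) for c in reference_claims) / len(reference_claims)

context = "Error code 400 BAD_REQUEST_ERROR means the request payload failed validation."
answer_claims = [
    "Error 400 means the request payload failed validation",  # grounded
    "It is retried automatically after five minutes",         # hallucinated
]
faith = faithfulness(answer_claims, context)    # 0.5: one claim unsupported
recall = context_recall(["400 means the payload failed validation"], context)  # 1.0
```

The split matters more than the numbers: high recall with low faithfulness points at the prompt, low recall points at retrieval.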
Production RAG pipeline — all components integrated
Every common advanced RAG mistake — explained and fixed
You can build production RAG. The final module of Section 10 covers the complete AI agent architecture.
Advanced RAG gives your agent access to a knowledge base. Module 68 — the final module of the Generative AI section — covers the complete production agent: planning across multiple steps, calling real APIs, maintaining memory across turns, handling failures gracefully, and the architectural patterns used at companies like Razorpay, Flipkart, and Swiggy to build internal AI tools that handle thousands of requests per day.
LLMs that plan, use tools, and execute multi-step tasks autonomously. ReAct, tool calling, memory, and production agent architecture patterns.
🎯 Key Takeaways
- ✓ Naive RAG (embed → cosine similarity → top-k) fails on exact keyword queries, returns tangentially relevant chunks, and has no quality measurement. The three systematic fixes are hybrid search (dense + sparse + RRF), cross-encoder reranking, and evaluation metrics that measure each failure mode independently.
- ✓ Hybrid search combines dense retrieval (semantic similarity via embeddings) and sparse retrieval (BM25 keyword matching) using Reciprocal Rank Fusion. RRF score = Σ 1/(k + rank_i) with k=60. No tuning required. Best improvement comes on queries with specific technical terms (error codes, product names, API parameters) that semantic search misses.
- ✓ Cross-encoder reranking is the single highest-leverage improvement to any RAG system. Two-stage pipeline: bi-encoder retrieves top-100 candidates fast (~10ms), cross-encoder scores each (query, chunk) pair precisely (~200ms for 100 docs). The cross-encoder sees both query and chunk simultaneously — much more accurate than independent embeddings.
- ✓ Three advanced retrieval patterns: HyDE (embed a hypothetical answer instead of the query — matches document style better for short queries), parent-child chunking (index small precise chunks, return their full parent for LLM context), query decomposition (split complex multi-part questions into sub-questions, retrieve and answer each separately).
- ✓ Three RAG evaluation metrics: faithfulness (are all answer claims supported by retrieved context — measures hallucination), answer relevance (does the answer address the question — measures off-topic responses), context recall (do retrieved chunks contain the reference answer — measures retrieval quality). Low context recall means fix retrieval. Low faithfulness means fix the LLM prompt.
- ✓ When context recall is high but faithfulness is low, the retrieval is working but the LLM is ignoring the context. Fix the grounding instruction: "Answer ONLY using the numbered context. Say I do not have that information if the answer is not there." Verify by injecting a deliberate false fact into context and checking that the LLM reports it. Always supplement automated metrics with monthly human evaluation on a 50-100 query sample.
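The grounding instruction and the false-fact canary from the last takeaway can be wired up like this; the prompt wording, the chunk texts, and the deliberately false "Tuesdays" fact are all illustrative.

```python
def build_grounded_prompt(question, chunks):
    # Numbered context plus an explicit grounding instruction.
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer ONLY using the numbered context below. "
        "Say 'I do not have that information' if the answer is not there.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Canary check: slip a deliberately false fact into the context, send the
# prompt to your LLM, and verify the answer repeats the false fact. That
# proves the model reads the context rather than its parametric memory.
chunks = [
    "UPI transfers are capped at 1 lakh per transaction.",
    "The Razorpay dashboard is only available on Tuesdays.",  # false on purpose
]
prompt = build_grounded_prompt("When is the dashboard available?", chunks)
```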
Discussion
Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.