RAG — Retrieval-Augmented Generation
Vector databases, semantic search, chunking strategies, and the full RAG pipeline from document to answer. Build a Razorpay knowledge base Q&A system.
Fine-tuning teaches a model new behaviour. RAG gives a model access to documents it has never seen — without any training at all.
A customer asks Razorpay's support bot: "What is the settlement cycle for international payments?" The LLM does not know — the answer is specific to Razorpay's current policy, which changes quarterly and was never in the training data. Fine-tuning would require retraining every time the policy changes. That is expensive, slow, and impractical.
RAG solves this differently. Before answering, it retrieves the most relevant sections from Razorpay's documentation. Those sections are injected into the LLM's context window alongside the question. The LLM answers from the retrieved context — not from its weights. Update the documentation and the answers update instantly. No retraining. No fine-tuning.
RAG is now the standard architecture for any application that needs an LLM to answer questions about private, recent, or frequently-updated information. Swiggy's internal tool answering HR policy questions, Flipkart's product Q&A bot, CRED's financial terms assistant — all are RAG systems.
Think of an open-book exam versus a closed-book exam. Fine-tuning is memorising everything before the exam — it works until the syllabus changes. RAG is the open-book exam — you bring the textbook and look up answers during the test. The student (the LLM) still needs to be smart enough to find and synthesise the right information, but they do not need to memorise every fact in advance.
The retrieval step is critical — bringing the wrong textbook chapters into context produces wrong answers even with a perfect LLM. Most RAG failures are retrieval failures, not generation failures.
Two phases — indexing (offline) and retrieval+generation (online)
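The two phases can be sketched end to end. This is a minimal outline under stated assumptions: the `embed` function below is a toy bag-of-words stand-in for a real embedding model (such as a sentence-transformers model), and the chunk texts are invented examples, not actual Razorpay policy.

```python
import re
import numpy as np

chunks = [
    "International settlements follow a T+7 cycle.",
    "Domestic settlements follow a T+2 cycle.",
    "Refunds are processed within 5-7 working days.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9+\-]+", text.lower())

# Toy embedding: bag-of-words counts over the corpus vocabulary, unit-normalised.
# A real system would call an embedding model here instead.
VOCAB = sorted({w for c in chunks for w in tokenize(c)})

def embed(text):
    v = np.array([tokenize(text).count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

# --- Phase 1: indexing (offline, rerun only when documents change) ---
index = np.stack([embed(c) for c in chunks])  # one row per chunk

# --- Phase 2: retrieval + generation (online, on every request) ---
question = "What is the settlement cycle for international payments?"
scores = index @ embed(question)          # cosine similarity (unit vectors)
top_k = np.argsort(scores)[::-1][:2]      # indices of the 2 most similar chunks
context = "\n".join(chunks[i] for i in top_k)
# `context` and `question` would now go into a grounded LLM prompt.
print(context)
```

The key structural point: phase 1 runs once per document update, phase 2 runs per request, and the only thing connecting them is the vector index.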
Chunking — how you split documents determines retrieval quality
Chunking is the single biggest lever in RAG quality. Too small: each chunk lacks context — the retrieved snippet is meaningless without surrounding text. Too large: the relevant sentence is buried in noise — the LLM hallucinates because it cannot find the answer in a 2000-token wall of text. The goal: each chunk should be semantically self-contained and contain exactly one answerable concept.
Fixed-size chunking: split every N characters or tokens, with 10-20% overlap between consecutive chunks. Simple, but blind to semantic boundaries.
Recursive character splitting: split on paragraphs first, then sentences, then words — trying to preserve semantic units. The practical default.
Semantic chunking: embed consecutive sentences and split where the embedding distance jumps, indicating a topic change.
Structure-based chunking: use headings, sections, and document structure to define chunks. Each section becomes one chunk.
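The first two strategies can be sketched in a few lines. These are hypothetical minimal versions for illustration, not the implementations from any particular library, and the sample document is invented:

```python
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size chunking: slide a window of `size` chars with `overlap`."""
    step = size - overlap  # assumes overlap < size
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def recursive_chunks(text: str, size: int = 200,
                     seps: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Recursive splitting: try the coarsest separator first, and recurse
    with finer separators on any piece that is still too large.
    (Sketch only: separators are dropped rather than re-attached.)"""
    if len(text) <= size or not seps:
        return [text]
    chunks, buf = [], ""
    for part in text.split(seps[0]):
        candidate = (buf + seps[0] + part) if buf else part
        if len(candidate) <= size:
            buf = candidate
        else:
            if buf:
                chunks.append(buf)
            if len(part) > size:
                # this piece alone exceeds `size`: fall back to finer separators
                chunks.extend(recursive_chunks(part, size, seps[1:]))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks

doc = ("Settlement cycles differ by payment type.\n\n"
       "Domestic payments settle on a T+2 cycle. "
       "International payments settle on a T+7 cycle.")
print(fixed_size_chunks(doc, size=60, overlap=10))
print(recursive_chunks(doc, size=80))
```

Note how the recursive version keeps the first paragraph intact as one chunk, while fixed-size slicing cuts through the middle of sentences — the trade-off the text above describes.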
FAISS, Chroma, and Pinecone — which vector database to use
A vector database stores embedding vectors and supports approximate nearest neighbour (ANN) search — finding the k most similar vectors to a query vector in milliseconds, even across millions of documents. Nearly every RAG system uses one.
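What a vector database computes can be shown with brute-force exact search over random toy vectors; FAISS, Chroma, and Pinecone answer the same question with ANN index structures instead of a linear scan. A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 stored "document" embeddings, dimension 128, unit-normalised.
docs = rng.normal(size=(10_000, 128))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query vector: a lightly perturbed copy of document 42, so we know
# in advance which stored vector should rank first.
query = docs[42] + 0.01 * rng.normal(size=128)
query /= np.linalg.norm(query)

# Exact k-nearest-neighbour search: one dot product per stored vector.
# This is O(n) per query; vector databases build index structures
# (e.g. IVF or HNSW in FAISS) to answer approximately in sub-linear time.
k = 5
scores = docs @ query
top_k = np.argsort(scores)[::-1][:k]
print(top_k)
```

For a few thousand chunks the brute-force scan above is usually fast enough; the index structures start paying off at millions of vectors.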
Complete RAG system — Razorpay knowledge base Q&A
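A compact end-to-end sketch of a Razorpay-style knowledge-base pipeline, under stated assumptions: `_embed` is a toy bag-of-words model standing in for a real embedding model, the final LLM call is left as a comment, and the chunk texts are invented examples rather than actual Razorpay policy:

```python
import re
import numpy as np

def tokenize(text):
    return re.findall(r"[a-z0-9+\-]+", text.lower())

class RAGPipeline:
    def __init__(self, chunks):
        self.chunks = chunks
        self.vocab = sorted({w for c in chunks for w in tokenize(c)})
        # Indexing phase: embed every chunk once, up front.
        self.index = np.stack([self._embed(c) for c in chunks])

    def _embed(self, text):
        # Toy embedding; a real system would call an embedding model here.
        v = np.array([tokenize(text).count(w) for w in self.vocab], dtype=float)
        n = np.linalg.norm(v)
        return v / n if n else v

    def retrieve(self, question, k=2):
        scores = self.index @ self._embed(question)
        return [self.chunks[i] for i in np.argsort(scores)[::-1][:k]]

    def build_prompt(self, question, k=2):
        # Number the retrieved chunks so the LLM can cite its sources as [n].
        sources = self.retrieve(question, k)
        context = "\n".join(f"[{i}] {c}" for i, c in enumerate(sources, 1))
        return (
            "Answer ONLY using the context below, citing sources as [n]. "
            "If the answer is not in the context, say you do not have that "
            f"information.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
        )

kb = [
    "International payments settle on a T+7 cycle.",
    "Domestic payments settle on a T+2 cycle.",
    "Refunds are processed within 5-7 working days.",
]
rag = RAGPipeline(kb)
prompt = rag.build_prompt("What is the settlement cycle for international payments?")
print(prompt)
# The prompt would now be sent to a chat model with temperature=0.
```

Updating the knowledge base means rebuilding `rag` — no model weights are touched, which is the entire point of RAG.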
Grounding instruction prevents hallucination. Without it, the LLM will blend retrieved context with its own (potentially wrong) training knowledge.
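A minimal grounded prompt template, as one reasonable phrasing rather than a canonical one:

```python
def grounded_prompt(context: str, question: str) -> str:
    # The first two sentences are the grounding instruction: they scope the
    # LLM to the retrieved context and give it an explicit "I don't know" path.
    return (
        "Answer ONLY using the context below. If the answer is not in the "
        "context, say you do not have that information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

p = grounded_prompt("International payments settle on a T+7 cycle.",
                    "What is the settlement cycle for international payments?")
print(p)
```

Both halves matter: "ONLY using the context" blocks the blend with training knowledge, and the explicit refusal path stops the model from guessing when retrieval came back empty.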
RAG with OpenAI, Groq, and local models — production patterns
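One production pattern that keeps the pipeline portable across OpenAI, Groq, and local models is to inject the completion call as a function, so the provider can be swapped without touching retrieval logic. A sketch with a stub provider (the real SDK calls mentioned in the docstring are assumptions, not verified code):

```python
from typing import Callable

def answer(question: str,
           retrieve: Callable[[str], list[str]],
           complete: Callable[[str], str]) -> str:
    """RAG generation step with the LLM call injected as `complete`.

    In production `complete` would wrap a provider SDK call with
    temperature=0 (OpenAI, Groq, or a local model behind an HTTP API);
    here it is whatever callable the caller supplies.
    """
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer ONLY using the context below. If the answer is not in the "
        "context, say you do not have that information.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)

# Stub provider for testing: echoes the first context line it was given.
def fake_complete(prompt: str) -> str:
    return prompt.split("Context:\n")[1].splitlines()[0]

reply = answer(
    "What is the settlement cycle for international payments?",
    retrieve=lambda q: ["International payments settle on a T+7 cycle."],
    complete=fake_complete,
)
print(reply)
```

The same injection point is where retries, timeouts, and fallback to a second provider usually live.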
Every common RAG mistake — explained and fixed
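One of those mistakes, an index-time/query-time normalisation mismatch, can be demonstrated in a few lines with toy 2-D vectors:

```python
import numpy as np

# Two stored document vectors: doc 0 points exactly along the query's
# direction; doc 1 points slightly off it but has a much larger magnitude.
docs = np.array([[1.0, 0.0],
                 [10.0, 1.0]])
query = np.array([1.0, 0.0])

# Mistake: scoring with a raw dot product on unnormalised vectors.
# Magnitude dominates, so the slightly-off but longer doc 1 wins.
wrong_scores = docs @ query

# Fix: normalise BOTH sides (the same way at index time and query time),
# which turns the dot product into cosine similarity.
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
right_scores = docs_n @ query_n

print(np.argmax(wrong_scores), np.argmax(right_scores))  # → 1 0
```

The same class of silent failure occurs when the index and the query use different embedding models: the search still runs and returns results, they are just the wrong ones.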
You can build a RAG system. Next: get better answers by engineering better prompts.
RAG handles the retrieval problem — getting relevant context into the LLM's window. But the quality of the generated answer also depends heavily on how the prompt is structured. Zero-shot, few-shot, chain-of-thought, ReAct — each prompting pattern consistently improves LLM outputs for different task types. Module 53 covers the patterns that actually work in production with real before/after examples.
🎯 Key Takeaways
- ✓ RAG gives an LLM access to documents it has never seen without any training. The pipeline has two phases: indexing (chunk documents → embed → store in vector DB, run once) and querying (embed question → vector search → retrieve top-k chunks → inject into LLM prompt → generate answer, runs on every request).
- ✓ Chunking is the single biggest lever in RAG quality. Fixed-size chunking is simple but breaks semantic boundaries. Recursive character splitting is the practical default. Semantic chunking (split where embedding similarity drops) produces the best retrieval quality. Use 500–1000 tokens per chunk with 10–20% overlap to prevent key information from being split.
- ✓ FAISS is the standard in-memory vector library for small-to-medium datasets. Chroma adds metadata filtering and automatic persistence. Pinecone is managed cloud for production scale. Always use the same embedding model and normalisation at index time and query time — mismatches silently produce wrong retrieval results.
- ✓ The RAG prompt must include a strong grounding instruction: "Answer ONLY using the context below. If the answer is not in the context, say you do not have that information." Without this the LLM blends retrieved context with its own training knowledge and hallucinates. Set temperature=0 for factual Q&A.
- ✓ Most RAG failures are retrieval failures, not generation failures. If the LLM gives wrong answers, check what was retrieved first — print the top-k chunks. HyDE (Hypothetical Document Embeddings) improves retrieval for short or ambiguous queries: generate a hypothetical answer first, embed that, use it as the search vector.
- ✓ Add citation tracking in production: number the retrieved chunks in the prompt and ask the LLM to cite which sources it used in its answer. This makes hallucination visible — if the LLM cites source [3] but source [3] does not contain the claimed fact, it hallucinated. Enables automatic fact-checking post-generation.