RAG (Retrieval-Augmented Generation) Learning Notes
1. What Is RAG
RAG (Retrieval-Augmented Generation) is a technical paradigm that combines information retrieval with large language model generation. The core idea is straightforward: before letting an LLM answer a question, first retrieve relevant content from an external knowledge base, feed that retrieved content as context to the model, and then have the model generate a response grounded in that real data.
1.1 Why We Need RAG
LLMs have several inherent limitations:
| Problem | Description |
|---|---|
| Knowledge cutoff | Training data has a time limit — the model can’t answer questions about recent events |
| Hallucination | The model can “fabricate” content that sounds plausible but is wrong |
| Missing private data | Internal company documents and proprietary knowledge are completely unknown to the model |
| Poor traceability | The model can’t tell users where an answer came from |
RAG mitigates these issues through external retrieval: responses can draw on current data and cite the passages they are based on.
1.2 RAG vs Fine-tuning vs Prompt Engineering
| Dimension | RAG | Fine-tuning | Prompt Engineering |
|---|---|---|---|
| Knowledge updates | Just update the knowledge base | Requires retraining | Depends on context window |
| Cost | Medium (retrieval system + inference) | High (training compute) | Low |
| Use cases | Knowledge-intensive Q&A | Style/format adaptation | Simple task guidance |
| Explainability | High (traceable sources) | Low | Medium |
| Data privacy | Private data stays in your own knowledge base (retrieved passages are still sent to the LLM at query time) | Data baked into model weights | Depends on implementation |
In practice, all three approaches are often used together.
2. Basic RAG Architecture
2.1 Standard Pipeline
User question
2.2 Two Main Phases
Offline Phase (Indexing)
Transform raw data into a searchable knowledge base:
Raw documents → Chunking → Embedding → Store in vector store
Online Phase (Querying)
The real-time pipeline when a user asks a question:
User question → Embed question → Vector similarity search → Context assembly → LLM generation
3. Core Components in Detail
3.1 Document Chunking
The chunking strategy directly affects retrieval quality.
| Strategy | Description | Use Case |
|---|---|---|
| Fixed-length chunking | Split by character/token count with overlap | General purpose |
| Paragraph/section chunking | Leverage document structure (headings, line breaks) | Structured documents |
| Semantic chunking | Dynamically determine split points based on semantic similarity | Long text, mixed content |
| Recursive chunking | Recursively split by a hierarchy of separators | General purpose; LangChain’s recommended splitter for generic text |
Key parameters:
- chunk_size: Size of each chunk, typically 256–1024 tokens
- chunk_overlap: Overlap between adjacent chunks, typically 10%–20% of chunk_size
Example: LangChain recursive character text splitter
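To make the recursive strategy concrete, here is a minimal, dependency-free sketch. It is not LangChain's implementation (there the class is `RecursiveCharacterTextSplitter`): the function name `recursive_split`, the separator list, and the character-based `chunk_size` are assumptions for illustration, and overlap is left out to keep it short.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer ones
    (and finally a hard cut) for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        # No usable separator left: hard cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    finer = separators[separators.index(sep) + 1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing small pieces together
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # This piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks
```

Paragraph breaks are preferred as split points, so short paragraphs stay whole and only oversized ones are broken at line or word boundaries.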
Rule of thumb: chunks that are too large → imprecise retrieval with too much noise; chunks that are too small → lost context and incomplete answers. You’ll need to tune based on your actual data.
3.2 Embedding
Convert text into fixed-dimensional dense vectors for similarity computation.
Common embedding models:
| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small/large | OpenAI | 1536/3072 | Strong general performance |
| bge-large-zh | BAAI | 1024 | Excellent Chinese language performance |
| GTE (General Text Embedding) | Alibaba | 768/1024 | Multilingual |
| E5-mistral-7b | Microsoft | 4096 | High quality, expensive inference |
| Cohere Embed v3 | Cohere | 1024 | Multilingual, supports compression |
Example: OpenAI Embedding
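A minimal sketch of an embedding call, assuming the v1-style OpenAI Python SDK, where `client.embeddings.create(model=..., input=...)` returns a response whose `data` items carry `embedding` and `index` fields. The wrapper name `embed_texts` is hypothetical, and the client is passed in so the function can be exercised without network access; check the call against the SDK version you have installed.

```python
def embed_texts(client, texts, model="text-embedding-3-small"):
    """Return one embedding vector (list of floats) per input text."""
    response = client.embeddings.create(model=model, input=texts)
    # Sort by index so the output order always matches the input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Usage (requires the `openai` package and an API key in OPENAI_API_KEY):
# from openai import OpenAI
# vectors = embed_texts(OpenAI(), ["What is RAG?", "How do I chunk documents?"])
# len(vectors[0]) is the embedding dimension of the chosen model
```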
3.3 Vector Store
A specialized database for storing and retrieving vectors.
| Database | Type | Notes |
|---|---|---|
| Chroma | Embedded | Lightweight, great for prototyping |
| FAISS | Library (Meta) | High performance, in-memory, suitable for small to medium scale |
| Milvus | Distributed | Production-grade, supports billions of vectors |
| Pinecone | Cloud service | Fully managed, zero ops |
| Weaviate | Standalone service | Supports hybrid search (vector + keyword) |
| Qdrant | Standalone service | Rust implementation, high performance |
| pgvector | PostgreSQL extension | Reuses the PostgreSQL ecosystem |
Similarity metrics:
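- Cosine Similarity: Most common; measures directional similarity and ignores vector magnitude
- Euclidean Distance (L2): Measures straight-line distance between vectors; sensitive to vector magnitude
- Inner Product (dot product): Equivalent to cosine similarity only when vectors are L2-normalized; otherwise it also reflects vector magnitude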
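The relationship between the three metrics is easy to check by hand. The sketch below uses plain Python lists as vectors (the helper names are illustrative) and two vectors that point in the same direction but differ in length.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (l2_norm(a) * l2_norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit length."""
    n = l2_norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [6.0, 8.0]  # same direction, different magnitude

print(cosine_similarity(a, b))           # direction only: maximal similarity
print(euclidean_distance(a, b))          # nonzero, because the lengths differ
print(dot(a, b))                         # grows with magnitude
print(dot(normalize(a), normalize(b)))   # matches cosine after normalization
```

This is why many vector stores let you use inner product as a cheaper stand-in for cosine similarity, provided the embeddings are normalized first.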
3.4 Retrieval Strategies
Basic Retrieval
Dense Retrieval: Embed the query and find nearest neighbors in the vector store.
Pseudocode: basic vector retrieval
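As a runnable stand-in for that pseudocode, here is a brute-force nearest-neighbor search over an in-memory dictionary of vectors. The names `top_k` and `doc_vecs` are hypothetical, and the exhaustive scan is an assumption made for clarity; real vector stores use approximate indexes instead of scoring every vector.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, score) pairs for the k vectors most similar to the query."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]

# Toy 2-dimensional "embeddings" keyed by document id
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for doc_id, score in top_k([1.0, 0.1], store, k=2):
    print(doc_id, round(score, 3))
```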
Advanced Retrieval
Hybrid Search: Combine vector retrieval and keyword retrieval (BM25) to get the best of both.
Query → Vector retrieval → Results A
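One common way to merge a vector result list with a keyword (BM25) result list is Reciprocal Rank Fusion (RRF). These notes do not say which fusion method is intended, so treat RRF as an assumption; the constant `k=60` is the value commonly used with it, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["d1", "d2", "d3"]  # Results A: dense retrieval
bm25_results = ["d3", "d1", "d4"]    # Results B: keyword retrieval
print(reciprocal_rank_fusion([vector_results, bm25_results]))
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.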
Multi-route retrieval + re-ranking (retrieve and re-rank):
Query → Multiple retrieval strategies → Candidate set → Cross-Encoder re-ranking → Refined results
Re-ranking models (e.g., bge-reranker, Cohere Rerank) score each query-document pair jointly, which typically improves precision over embedding similarity alone at the cost of extra latency.
3.5 Context Assembly and Generation
Assemble retrieved content into the Prompt and hand it to the LLM to generate a response.
A common Prompt template:
You are a knowledge Q&A assistant. Answer the user's question based on the reference material below.
Key points:
- Explicitly instruct the model to “answer based on the reference material” to reduce hallucination
- Add an instruction like “say you don’t know if you don’t know”
- Control the total amount of injected context to avoid exceeding the model’s context window limit
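The three points above can be sketched as a small prompt builder. Only the first sentence of the template comes from these notes; the rest of the wording, the function name `build_prompt`, and the character budget (a rough proxy for a token budget) are assumptions.

```python
def build_prompt(question, chunks, max_context_chars=2000):
    """Assemble a grounded prompt from retrieved chunks, within a size budget.

    Chunks are assumed to be ordered best-first; lower-ranked chunks are
    dropped once the budget is used up.
    """
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break
        selected.append(entry)
        used += len(entry)
    context = "\n".join(selected)
    return (
        "You are a knowledge Q&A assistant. Answer the user's question based on "
        "the reference material below.\n"
        "If the reference material does not contain the answer, say you don't know.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("What is the refund window?", ["Refunds are accepted within 30 days."]))
```

Numbering the chunks also gives the model something to cite, which helps with traceability.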
4. Advanced RAG Techniques
4.1 Query Rewriting and Expansion
A user’s raw question is often not the optimal retrieval query.
| Technique | Description |
|---|---|
| Query Rewriting | Have the LLM rewrite the question to make it more retrieval-friendly |
| HyDE (Hypothetical Document Embeddings) | Have the LLM generate a hypothetical answer first, then use that answer’s embedding for retrieval |
| Multi-Query | Generate several rephrasings or sub-questions from one question, retrieve for each separately, then merge and deduplicate the results |
| Step-back Prompting | Have the model first raise a broader, more general question to gather background knowledge |
Example: HyDE
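The control flow of HyDE is short enough to sketch directly. The three callables (`generate`, `embed`, `search`) are placeholders for your LLM, embedding model, and vector store, and the prompt wording is an assumption.

```python
def hyde_retrieve(question, generate, embed, search, k=3):
    """HyDE: search with the embedding of a hypothetical answer.

    generate(prompt) -> str    LLM call that writes a plausible answer
    embed(text) -> vector      embedding model
    search(vector, k) -> list  nearest-neighbor lookup in the vector store
    """
    hypothetical = generate(
        f"Write a short passage that answers the question: {question}"
    )
    # The hypothetical passage may contain wrong facts; it is used only as a
    # retrieval query and is never shown to the user.
    return search(embed(hypothetical), k)
```

The idea is that a made-up answer tends to look more like the real answer passages than the question does, so its embedding lands closer to them.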
4.2 GraphRAG
Microsoft’s GraphRAG approach enhances retrieval with a knowledge graph.
Process:
- Extract entities and relationships from documents to build a knowledge graph
- Apply community detection on the graph to generate summaries at different levels
- At query time, leverage both vector retrieval and the graph structure for reasoning
Advantage: excels at answering global questions that require cross-document reasoning (e.g., “summarize the main points across all documents”).
4.3 Adaptive RAG and Self-RAG
Let the model autonomously decide whether retrieval is needed:
User question → LLM decides if retrieval is needed
The Self-RAG paper introduces Reflection Tokens that let the model self-evaluate retrieval quality and response reliability during generation.
4.4 Multi-hop RAG
Handle complex questions that require multi-step reasoning:
Question: "Is the founder of Company A still working at Company B?"
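A rough sketch of the loop behind multi-hop retrieval: each hop chooses its next query based on the evidence gathered so far. All three callables are placeholders (in practice `plan_next_query` and `answer` would be LLM calls), and `max_hops` is an assumed safety limit.

```python
def multi_hop_answer(question, plan_next_query, retrieve, answer, max_hops=3):
    """Retrieve iteratively until the planner decides it has enough evidence.

    plan_next_query(question, evidence) -> str or None (None means "done")
    retrieve(query) -> list of passages
    answer(question, evidence) -> str
    """
    evidence = []
    for _ in range(max_hops):
        query = plan_next_query(question, evidence)
        if query is None:
            break
        evidence.extend(retrieve(query))
    return answer(question, evidence)
```

For the question above, a first hop might look up who founded Company A, and a second hop would then ask where that person works now.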
4.5 Agentic RAG
Use RAG as one of an Agent’s tools, with the Agent autonomously planning retrieval strategies.
User question → Agent thinks
Tool definition example:
tools = [...]
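A sketch of what such a tool list might look like, assuming the JSON-schema function format used by OpenAI-style chat APIs. The tool name `search_knowledge_base`, its parameters, the toy corpus, and the `dispatch` helper are all hypothetical.

```python
import json

def search_knowledge_base(query: str, top_k: int = 3) -> list:
    """Hypothetical retrieval tool; a real one would query the vector store."""
    corpus = [
        "Refund policy: refunds are accepted within 30 days.",
        "Shipping usually takes 5 business days.",
    ]
    hits = [passage for passage in corpus if query.lower() in passage.lower()]
    return hits[:top_k]

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Retrieve passages relevant to a query from the knowledge base.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "top_k": {"type": "integer", "description": "Number of passages to return"},
                },
                "required": ["query"],
            },
        },
    }
]

# Maps the tool name chosen by the model to the Python function that runs it.
TOOL_REGISTRY = {"search_knowledge_base": search_knowledge_base}

def dispatch(tool_name: str, arguments_json: str):
    """Execute a tool call given its name and JSON-encoded arguments."""
    return TOOL_REGISTRY[tool_name](**json.loads(arguments_json))

print(dispatch("search_knowledge_base", '{"query": "refund", "top_k": 1}'))
```

The agent sees only the schema in `tools`; your code is responsible for actually running the call and feeding the result back into the conversation.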
5. Evaluation Framework
5.1 Retrieval Quality Metrics
| Metric | Description |
|---|---|
| Precision@K | Fraction of the top-K results that are relevant |
| Recall@K | Fraction of all relevant documents that were retrieved |
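| MRR | Mean Reciprocal Rank: the average over queries of 1/rank of the first relevant result |
| nDCG | Normalized Discounted Cumulative Gain: rewards relevant results ranked near the top, normalized by the score of an ideal ranking |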
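The four metrics in the table can be computed in a few lines. This sketch assumes binary relevance (a document is either relevant or not) and uses made-up document ids; graded-relevance nDCG would replace the gain of 1 with a relevance grade.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(retrieved[:k], start=1)
        if d in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d1", "d7", "d2"]  # system output, best first
relevant = {"d1", "d2"}               # ground-truth labels
print(precision_at_k(retrieved, relevant, 3), recall_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant), round(ndcg_at_k(retrieved, relevant, 4), 3))
```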
5.2 Generation Quality Evaluation
RAGAS is a widely used framework for evaluating RAG systems:
| Metric | Description |
|---|---|
| Faithfulness | Whether the response is consistent with retrieved content (no hallucination) |
| Answer Relevance | Whether the response actually addresses the question |
| Context Precision | Fraction of retrieved content that is useful |
| Context Recall | Whether all information needed for the answer was retrieved |
Example: RAGAS evaluation
5.3 End-to-End Evaluation
Ultimately, evaluation has to come back to the actual business scenario:
- Human evaluation: sample scoring for accuracy, completeness, and fluency
- LLM-as-Judge: use a stronger model to evaluate the output of a weaker model
- A/B testing: compare user satisfaction between different RAG configurations in production
6. Engineering Best Practices
6.1 Data Processing
- Data cleaning: Remove garbled text, duplicates, and formatting noise
- Metadata tagging: Add source, timestamp, category, and other metadata to each chunk to enable filtered retrieval
- Multi-format parsing: PDF, Word, HTML, Markdown, tables, and images (OCR) each have their own parsing approaches
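Metadata tagging pays off at query time, when it lets you narrow the candidate set before or after vector search. A minimal sketch, with a hypothetical `Chunk` type and made-up field names and values:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]

chunks = [
    Chunk("Annual leave is 15 days.", {"source": "hr_handbook.pdf", "category": "hr", "year": 2024}),
    Chunk("Annual leave is 12 days.", {"source": "hr_handbook_old.pdf", "category": "hr", "year": 2021}),
    Chunk("VPN setup guide.", {"source": "it_wiki.html", "category": "it", "year": 2024}),
]

for chunk in filter_chunks(chunks, category="hr", year=2024):
    print(chunk.metadata["source"], "->", chunk.text)
```

Most vector stores expose an equivalent filter argument on their search call, so in practice the filtering happens inside the store instead of in application code.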
6.2 Performance Optimization
| Optimization Area | Methods |
|---|---|
| Retrieval latency | Vector index optimization (HNSW, IVF), cache hot queries |
| Recall | Multi-route retrieval, query expansion, increase top_k |
| Precision | Re-ranking, add metadata filtering |
| Context utilization | Compress irrelevant content, inject only key passages |
| Cost | Embedding caching, small model for coarse ranking + large model for fine ranking |
6.3 Common Issues and Solutions
| Issue | Root Cause | Solution |
|---|---|---|
| Relevant content not retrieved | Chunks too granular / poor embedding quality | Adjust chunking strategy, swap embedding model |
| Retrieved but response is wrong | Too much noise in context | Reduce injected context, add re-ranking |
| Response hallucination | Model ignores context and generates freely | Strengthen Prompt instructions, lower temperature |
| Incomplete response | Information scattered across multiple chunks | Multi-hop retrieval, increase chunk_size |
| High latency | Double overhead from retrieval + LLM | Async retrieval, streaming output, caching |
6.4 Tech Stack Reference
Prototype / fast start:
- Embedding: OpenAI text-embedding-3-small
- Vector store: Chroma / FAISS
- Framework: LangChain / LlamaIndex
Production deployment:
- Embedding: BGE / GTE (self-hostable, data stays on-premise)
- Vector store: Milvus / Weaviate / pgvector
- Framework: LlamaIndex (more flexible) or a custom-built pipeline
- Re-ranking: bge-reranker-v2-m3 / Cohere Rerank
7. Framework Comparison
| Framework | Language | Highlights | Best For |
|---|---|---|---|
| LangChain | Python/JS | Broadest ecosystem, rich modules, quick to learn | Rapid prototyping, general RAG |
| LlamaIndex | Python/TS | RAG-focused, rich data connectors | Data-intensive applications |
| Haystack | Python | Production-oriented, clean Pipeline design | Enterprise search Q&A |
| Semantic Kernel | C#/Python/Java | Microsoft product, tight Azure ecosystem integration | .NET stack projects |
| DSPy | Python | Programmatic Prompt optimization, auto-tuning | Academic research, Prompt engineering |
8. Learning Resources
Papers
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Original RAG paper, Facebook AI, 2020)
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023)
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization (GraphRAG paper, Microsoft, 2024)
Hands-on Projects
- LangChain RAG templates: the official repo provides reference implementations for multiple RAG architectures
- LlamaIndex official examples: covers RAG patterns from basic to advanced
- RAGAS: RAG evaluation framework with complete evaluation examples included
These notes map out the knowledge landscape of RAG from foundational concepts to advanced practice. The essence of RAG is “have the LLM look things up before answering” — simple in concept, but every step (chunking, embedding, retrieval, generation, evaluation) has a wealth of engineering nuance. I’d recommend starting with the simplest vector-retrieval RAG, then gradually introducing more advanced techniques as you hit concrete problems. Don’t jump straight into GraphRAG + Agent + Multi-hop from day one — excessive complexity is often why RAG projects fail.