RAG (Retrieval-Augmented Generation) Learning Notes

1. What Is RAG

RAG (Retrieval-Augmented Generation) is a technical paradigm that combines information retrieval with large language model generation. The core idea is straightforward: before letting an LLM answer a question, first retrieve relevant content from an external knowledge base, feed that retrieved content as context to the model, and then have the model generate a response grounded in that real data.

1.1 Why We Need RAG

LLMs have several inherent limitations:

| Problem | Description |
|---|---|
| Knowledge cutoff | Training data has a time limit; the model can’t answer questions about recent events |
| Hallucination | The model can “fabricate” content that sounds plausible but is wrong |
| Missing private data | Internal company documents and proprietary knowledge are completely unknown to the model |
| Poor traceability | The model can’t tell users where an answer came from |

RAG mitigates these issues by grounding responses in retrieved external content, which makes answers easier to verify and keeps them current as the knowledge base is updated.

1.2 RAG vs Fine-tuning vs Prompt Engineering

| Dimension | RAG | Fine-tuning | Prompt Engineering |
|---|---|---|---|
| Knowledge updates | Just update the knowledge base | Requires retraining | Depends on context window |
| Cost | Medium (retrieval system + inference) | High (training compute) | Low |
| Use cases | Knowledge-intensive Q&A | Style/format adaptation | Simple task guidance |
| Explainability | High (traceable sources) | Low | Medium |
| Data privacy | Private data stays in-house | Data baked into model weights | Depends on implementation |

In practice, all three approaches are often used together.

2. Basic RAG Architecture

2.1 Standard Pipeline

User question
    ↓
Query processing (rewriting, expansion)
    ↓
Retrieval: find relevant document chunks from the knowledge base
    ↓
Context assembly: inject retrieved results into the Prompt
    ↓
LLM generates a response
    ↓
Output (with source citations)

2.2 Two Main Phases

Offline Phase (Indexing)

Transform raw data into a searchable knowledge base:

Raw documents → Chunking → Embedding → Store in vector store

Online Phase (Querying)

The real-time pipeline when a user asks a question:

User question → Embed question → Vector similarity search → Context assembly → LLM generation
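
The two phases can be sketched end to end with a toy in-memory index. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `build_index`, `retrieve`, and the sample documents are invented for this example.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Offline phase: chunk (here, one chunk per document) -> embed -> store
def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

# Online phase: embed question -> similarity search
def retrieve(index: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

index = build_index([
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
    "Shipping to Europe takes 5 to 7 business days.",
])
question = "How many days do refunds take?"
context = retrieve(index, question, top_k=1)

# Context assembly: the resulting prompt would then be sent to the LLM.
prompt = "Reference material:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
```

A real system swaps `toy_embed` for an embedding model and the list for a vector store, but the control flow stays the same.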

3. Core Components in Detail

3.1 Document Chunking

The chunking strategy directly affects retrieval quality.

| Strategy | Description | Use Case |
|---|---|---|
| Fixed-length chunking | Split by character/token count with overlap | General purpose |
| Paragraph/section chunking | Leverage document structure (headings, line breaks) | Structured documents |
| Semantic chunking | Dynamically determine split points based on semantic similarity | Long text, mixed content |
| Recursive chunking | Recursively split by a hierarchy of separators | General purpose; LangChain’s recommended splitter for generic text |

Key parameters:

  • chunk_size: Size of each chunk, typically 256–1024 tokens
  • chunk_overlap: Overlap between adjacent chunks, typically 10%–20% of chunk_size
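
# LangChain recursive character text splitter example
# (older LangChain releases expose this class as langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # measured in characters by default, not tokens
    chunk_overlap=50,  # 10% of chunk_size
    separators=["\n\n", "\n", "。", ".", " ", ""],  # tried in order, coarsest first
)
# `document` is assumed to be a str holding the raw text to split
chunks = splitter.split_text(document)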

Rule of thumb: chunks that are too large → imprecise retrieval with too much noise; chunks that are too small → lost context and incomplete answers. You’ll need to tune based on your actual data.
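
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-length chunking. It counts characters rather than tokens for simplicity, and `chunk_fixed` is a hypothetical helper, not a library function.

```python
def chunk_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_fixed(sample, chunk_size=10, chunk_overlap=2)
# Each piece starts with the last 2 characters of the previous one.
```

A larger overlap reduces the chance that a sentence is cut in half at a boundary, at the cost of storing and embedding more duplicated text.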

3.2 Embedding

Convert text into fixed-dimensional dense vectors for similarity computation.

Common embedding models:

| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small/large | OpenAI | 1536/3072 | Strong general performance |
| bge-large-zh | BAAI | 1024 | Excellent Chinese language performance |
| GTE (General Text Embedding) | Alibaba | 768/1024 | Multilingual |
| E5-mistral-7b | Microsoft | 4096 | High quality, expensive inference |
| Cohere Embed v3 | Cohere | 1024 | Multilingual, supports compression |

# OpenAI Embedding example
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="RAG is retrieval-augmented generation",
)
embedding = response.data[0].embedding  # 1536-dimensional vector

3.3 Vector Store

A specialized database for storing and retrieving vectors.

| Database | Type | Notes |
|---|---|---|
| Chroma | Embedded | Lightweight, great for prototyping |
| FAISS | Library (Meta) | High performance, in-memory, suitable for small to medium scale |
| Milvus | Distributed | Production-grade, supports billions of vectors |
| Pinecone | Cloud service | Fully managed, zero ops |
| Weaviate | Standalone service | Supports hybrid search (vector + keyword) |
| Qdrant | Standalone service | Rust implementation, high performance |
| pgvector | PostgreSQL extension | Reuses the PostgreSQL ecosystem |

Similarity metrics:

  • Cosine Similarity: Most common; measures directional similarity
  • Euclidean Distance (L2): Measures absolute distance
  • Inner Product: Equals cosine similarity when vectors are L2-normalized; on unnormalized vectors it also depends on vector magnitude
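
The relationship between the three metrics is easy to check numerically. The sketch below uses plain Python lists; the helper names and the two sample vectors are made up for illustration.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a: list[float]) -> float:
    return math.sqrt(dot(a, a))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (l2_norm(a) * l2_norm(b))

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a: list[float]) -> list[float]:
    n = l2_norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

cos_ab = cosine_similarity(a, b)            # direction only, ignores length
dist_ab = euclidean_distance(a, b)          # nonzero because the lengths differ
ip_raw = dot(a, b)                          # grows with vector length
ip_unit = dot(normalize(a), normalize(b))   # matches cosine after normalization
```

This is why many setups normalize embeddings once at indexing time and then use the cheaper inner product at query time.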

3.4 Retrieval Strategies

Basic Retrieval

Dense Retrieval: Embed the query and find nearest neighbors in the vector store.

# Pseudocode: basic vector retrieval
query_embedding = embed(user_query)
results = vector_db.search(query_embedding, top_k=5)

Advanced Retrieval

Hybrid Search: Combine vector retrieval and keyword retrieval (BM25) to get the best of both.

Query ──→ Vector retrieval → Results A
  └──→ BM25 retrieval → Results B
          → Fusion ranking (RRF/weighted) → Final results
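
The fusion step can be illustrated with Reciprocal Rank Fusion (RRF), which needs only the rank positions from each retriever, not their raw scores. This is a minimal sketch: the document IDs are invented, and `k=60` is the commonly used smoothing constant rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_results = ["doc_a", "doc_b", "doc_c"]  # Results A
bm25_results = ["doc_b", "doc_d", "doc_a"]    # Results B
fused = reciprocal_rank_fusion([vector_results, bm25_results])
```

Documents that appear in both lists (`doc_a`, `doc_b`) end up ahead of documents found by only one retriever, which is the behavior hybrid search is after.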

Multi-route retrieval + re-ranking (retrieve and re-rank):

Query → Multiple retrieval strategies → Candidate set → Cross-Encoder re-ranking → Refined results

Re-ranking models (e.g., bge-reranker, Cohere Rerank) score each query-document pair directly. This is slower than vector search, so it is applied only to the candidate set, but it usually improves precision noticeably.

3.5 Context Assembly and Generation

Assemble retrieved content into the Prompt and hand it to the LLM to generate a response.

A common Prompt template:

You are a knowledge Q&A assistant. Answer the user's question based on the reference material below.
If the reference material does not contain relevant information, say so honestly — do not make things up.

## Reference Material
{retrieved_contexts}

## User Question
{user_query}

## Answer

Key points:

  • Explicitly instruct the model to “answer based on the reference material” to reduce hallucination
  • Add an instruction like “say you don’t know if you don’t know”
  • Control the total amount of injected context to avoid exceeding the model’s context window limit
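
One way to apply these points is to fill the template from the top-ranked chunks until a size budget is reached. The sketch below is an assumption-laden illustration: `build_prompt` is a hypothetical helper, the budget is counted in characters as a rough proxy for tokens, and the `[n]` labels are added so the model can cite sources.

```python
PROMPT_TEMPLATE = """You are a knowledge Q&A assistant. Answer the user's question based on the reference material below.
If the reference material does not contain relevant information, say so honestly; do not make things up.

## Reference Material
{retrieved_contexts}

## User Question
{user_query}

## Answer
"""

def build_prompt(user_query: str, chunks: list[str], max_context_chars: int = 2000) -> str:
    """Fill the template with as many top-ranked chunks as fit in the budget."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # chunks are assumed sorted by relevance, so stop at the first overflow
        selected.append(entry)
        used += len(entry)
    return PROMPT_TEMPLATE.format(
        retrieved_contexts="\n".join(selected),
        user_query=user_query,
    )

prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 14 days.", "x" * 5000],
    max_context_chars=100,
)
```

In production the budget would be computed with the model's tokenizer instead of `len`, since context limits are defined in tokens.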

4. Advanced RAG Techniques

4.1 Query Rewriting and Expansion

A user’s raw question is often not the optimal retrieval query.

| Technique | Description |
|---|---|
| Query Rewriting | Have the LLM rewrite the question to make it more retrieval-friendly |
| HyDE (Hypothetical Document Embeddings) | Have the LLM generate a hypothetical answer first, then use that answer’s embedding for retrieval |
| Multi-Query | Generate several variants or sub-questions from one question, retrieve for each separately, then merge the results |
| Step-back Prompting | Have the model first raise a broader, more general question to gather background knowledge |

# HyDE example (llm, embed, and vector_db are placeholders for your own clients)
hypothetical_answer = llm.generate(f"Please answer: {user_query}")
# Retrieve using the hypothetical answer's embedding, not the original question
results = vector_db.search(embed(hypothetical_answer), top_k=5)
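
The Multi-Query technique from the table can be sketched in the same spirit. Everything here is illustrative: `generate_variants` stands in for an LLM call, `search` for a vector store query, and the fake index exists only so the example runs on its own.

```python
from typing import Callable

def multi_query_retrieve(
    user_query: str,
    generate_variants: Callable[[str], list[str]],
    search: Callable[[str, int], list[str]],
    top_k: int = 5,
) -> list[str]:
    """Retrieve for the original query plus each variant, then merge without duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    for query in [user_query, *generate_variants(user_query)]:
        for doc_id in search(query, top_k):
            if doc_id not in seen:  # keep the first occurrence only
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Hypothetical stand-ins so the sketch runs without an LLM or a vector store.
def fake_variants(query: str) -> list[str]:
    return [query + " policy", query + " deadline"]

FAKE_INDEX = {
    "refund": ["doc_1", "doc_2"],
    "refund policy": ["doc_2", "doc_3"],
    "refund deadline": ["doc_4"],
}

def fake_search(query: str, top_k: int) -> list[str]:
    return FAKE_INDEX.get(query, [])[:top_k]

docs = multi_query_retrieve("refund", fake_variants, fake_search)
```

Simple de-duplication keeps first-seen order; a fusion method such as RRF is a common alternative when the merged list needs a principled ranking.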

4.2 GraphRAG

Microsoft’s GraphRAG approach enhances retrieval with a knowledge graph.

Process:

  1. Extract entities and relationships from documents to build a knowledge graph
  2. Apply community detection on the graph to generate summaries at different levels
  3. At query time, leverage both vector retrieval and the graph structure for reasoning

Advantage: excels at answering global questions that require cross-document reasoning (e.g., “summarize the main points across all documents”).

4.3 Adaptive RAG (Self-RAG)

Let the model autonomously decide whether retrieval is needed:

User question → LLM decides if retrieval is needed
├── Not needed → Answer directly
└── Needed → Retrieve → Assess relevance of retrieved results
    ├── Relevant → Generate response
    └── Not relevant → Switch strategy and re-retrieve

The Self-RAG paper introduces Reflection Tokens that let the model self-evaluate retrieval quality and response reliability during generation.
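
The decision tree above can be approximated with ordinary control flow around the model. Note that this is not Self-RAG's reflection-token mechanism, which is trained into the model; it is a simpler external loop, and every callable below is a hypothetical stub standing in for an LLM call or a retriever.

```python
from typing import Callable

def adaptive_answer(
    question: str,
    needs_retrieval: Callable[[str], bool],
    retrievers: list[Callable[[str], list[str]]],
    is_relevant: Callable[[str, list[str]], bool],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Answer directly when possible; otherwise try retrievers in order until one is judged relevant."""
    if not needs_retrieval(question):
        return generate(question, [])
    for retrieve in retrievers:  # "switch strategy" = move on to the next retriever
        docs = retrieve(question)
        if is_relevant(question, docs):
            return generate(question, docs)
    # No strategy produced relevant context: decline rather than guess.
    return "I could not find relevant information."

# Hypothetical stubs so the sketch runs without a model.
def needs_retrieval(q: str) -> bool:
    return "policy" in q

def vector_retriever(q: str) -> list[str]:
    return []  # simulates a miss

def keyword_retriever(q: str) -> list[str]:
    return ["Refunds are accepted within 14 days."]

def is_relevant(q: str, docs: list[str]) -> bool:
    return len(docs) > 0

def generate(q: str, docs: list[str]) -> str:
    return f"answer from {len(docs)} docs"

strategies = [vector_retriever, keyword_retriever]
direct = adaptive_answer("What is 2 + 2?", needs_retrieval, strategies, is_relevant, generate)
grounded = adaptive_answer("What is the refund policy?", needs_retrieval, strategies, is_relevant, generate)
```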

4.4 Multi-hop RAG

Handle complex questions that require multi-step reasoning:

Question: "Is the founder of Company A still working at Company B?"
  → Hop 1: Retrieve who founded Company A → "John Smith"
  → Hop 2: Retrieve whether John Smith works at Company B → Get the answer
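
The key property of multi-hop retrieval is that the second query cannot be written until the first one has been answered. A toy sketch of the example above, with a hypothetical dictionary standing in for the retrieval step of each hop:

```python
from typing import Optional

# Hypothetical fact store; a real system would run retrieval plus an LLM extraction per hop.
FACTS = {
    "founder of Company A": "John Smith",
    "employer of John Smith": "Company B",
}

def lookup(query: str) -> Optional[str]:
    return FACTS.get(query)

def founder_works_at(company: str, employer: str) -> Optional[bool]:
    """Two-hop lookup: resolve the founder first, then check that person's employer."""
    founder = lookup(f"founder of {company}")    # hop 1
    if founder is None:
        return None  # cannot continue without the intermediate answer
    current = lookup(f"employer of {founder}")   # hop 2 is built from hop 1's result
    if current is None:
        return None
    return current == employer

answer = founder_works_at("Company A", "Company B")
```

Returning `None` when a hop fails keeps "unknown" distinct from "no", which matters when the final response must say that the information was not found.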

4.5 Agentic RAG

Use RAG as one of an Agent’s tools, with the Agent autonomously planning retrieval strategies.

User question → Agent thinks
  → Decides which retrieval tool to call (vector store / search engine / database)
  → Analyzes results, decides whether another retrieval round is needed
  → Synthesizes all information to generate the final response

Tool definition example:

tools = [
    {
        "name": "search_internal_docs",
        "description": "Search the company's internal document library",
        "parameters": {"query": "str", "top_k": "int"}
    },
    {
        "name": "search_web",
        "description": "Search the internet for the latest information",
        "parameters": {"query": "str"}
    },
    {
        "name": "query_database",
        "description": "Query a structured database",
        "parameters": {"sql": "str"}
    }
]
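
Tool definitions only describe what the Agent may call; something still has to route each call to real code. The following is a minimal dispatch sketch under stated assumptions: the handlers return canned strings, the `{"name": ..., "arguments": {...}}` call shape is an assumption rather than any particular framework's format, and `query_database` is left out for brevity.

```python
from typing import Any, Callable

# Hypothetical handlers; real ones would call a vector store or a search API.
def search_internal_docs(query: str, top_k: int = 3) -> list[str]:
    return [f"internal result {i} for {query!r}" for i in range(1, top_k + 1)]

def search_web(query: str) -> list[str]:
    return [f"web result for {query!r}"]

HANDLERS: dict[str, Callable[..., Any]] = {
    "search_internal_docs": search_internal_docs,
    "search_web": search_web,
}

def dispatch(tool_call: dict) -> Any:
    """Run the handler named in a tool call of the form {"name": ..., "arguments": {...}}."""
    name = tool_call["name"]
    if name not in HANDLERS:
        raise ValueError(f"unknown tool: {name}")
    return HANDLERS[name](**tool_call["arguments"])

result = dispatch({
    "name": "search_internal_docs",
    "arguments": {"query": "vacation policy", "top_k": 2},
})
```

Rejecting unknown tool names explicitly is a small safeguard against the model inventing a tool that was never registered.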

5. Evaluation Framework

5.1 Retrieval Quality Metrics

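| Metric | Description |
|---|---|
| Precision@K | Fraction of the top-K results that are relevant |
| Recall@K | Fraction of all relevant documents that appear in the top-K results |
| MRR | Mean reciprocal rank: the reciprocal rank of the first relevant result, averaged over queries |
| nDCG | Normalized discounted cumulative gain: rewards relevant results more the higher they are ranked |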
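
These metrics are short enough to compute by hand for a single query. The sketch below assumes binary relevance (a document is either relevant or not), and the document IDs are invented; MRR is the mean of `reciprocal_rank` across all evaluation queries.

```python
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0  # no relevant result was retrieved

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(retrieved[:k], start=1)
        if d in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}

p = precision_at_k(retrieved, relevant, k=4)  # 2 of the 4 retrieved are relevant
r = recall_at_k(retrieved, relevant, k=4)     # 2 of the 3 relevant were retrieved
rr = reciprocal_rank(retrieved, relevant)     # first relevant result is at rank 2
n = ndcg_at_k(retrieved, relevant, k=4)
```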

5.2 Generation Quality Evaluation

RAGAS is a widely used framework for evaluating RAG systems:

| Metric | Description |
|---|---|
| Faithfulness | Whether the response is consistent with retrieved content (no hallucination) |
| Answer Relevance | Whether the response actually addresses the question |
| Context Precision | Fraction of retrieved content that is useful |
| Context Recall | Whether all information needed for the answer was retrieved |
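
# RAGAS evaluation example
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# `eval_dataset` is assumed to hold the questions, generated answers, and retrieved
# contexts to score; the expected schema differs between RAGAS versions.
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)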

5.3 End-to-End Evaluation

Ultimately, evaluation has to come back to the actual business scenario:

  • Human evaluation: sample scoring for accuracy, completeness, and fluency
  • LLM-as-Judge: use a stronger model to evaluate the output of a weaker model
  • A/B testing: compare user satisfaction between different RAG configurations in production

6. Engineering Best Practices

6.1 Data Processing

  • Data cleaning: Remove garbled text, duplicates, and formatting noise
  • Metadata tagging: Add source, timestamp, category, and other metadata to each chunk to enable filtered retrieval
  • Multi-format parsing: PDF, Word, HTML, Markdown, tables, and images (OCR) each have their own parsing approaches
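
Metadata tagging pays off at query time, when chunks can be filtered before or alongside the vector search. A minimal sketch, assuming a hypothetical `Chunk` record and made-up sample data; most vector stores offer an equivalent filter in their own query syntax.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_chunks(
    chunks: list[Chunk],
    category: Optional[str] = None,
    not_before: Optional[date] = None,
) -> list[Chunk]:
    """Keep chunks that match the category and are at least as recent as not_before."""
    kept = []
    for chunk in chunks:
        meta = chunk.metadata
        if category is not None and meta.get("category") != category:
            continue
        # Chunks without a timestamp are treated as oldest and filtered out.
        if not_before is not None and meta.get("timestamp", date.min) < not_before:
            continue
        kept.append(chunk)
    return kept

chunks = [
    Chunk("2022 travel policy", {"source": "hr/travel.pdf", "timestamp": date(2022, 1, 10), "category": "hr"}),
    Chunk("2024 travel policy", {"source": "hr/travel_v2.pdf", "timestamp": date(2024, 3, 1), "category": "hr"}),
    Chunk("API rate limits", {"source": "eng/api.md", "timestamp": date(2024, 5, 20), "category": "engineering"}),
]
recent_hr = filter_chunks(chunks, category="hr", not_before=date(2023, 1, 1))
```

The `source` field is what later makes source citations in the final answer possible.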

6.2 Performance Optimization

| Optimization Area | Methods |
|---|---|
| Retrieval latency | Vector index optimization (HNSW, IVF), cache hot queries |
| Recall | Multi-route retrieval, query expansion, increase top_k |
| Precision | Re-ranking, add metadata filtering |
| Context utilization | Compress irrelevant content, inject only key passages |
| Cost | Embedding caching, small model for coarse ranking + large model for fine ranking |
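
Embedding caching is one of the simpler optimizations to add. The sketch below keys an in-memory cache on a hash of the text; `CachedEmbedder` and `fake_embed` are hypothetical names, and a production setup would more likely back the cache with a persistent store.

```python
import hashlib
from typing import Callable

class CachedEmbedder:
    """Wrap an embedding function so repeated texts are embedded only once."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.misses = 0  # number of calls that reached the underlying model

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

# Hypothetical stand-in for a paid embedding API call.
def fake_embed(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]

embedder = CachedEmbedder(fake_embed)
first = embedder.embed("what is the refund policy")
second = embedder.embed("what is the refund policy")  # served from the cache
third = embedder.embed("how do I reset my password")
```

The cache must be invalidated whenever the embedding model changes, since vectors from different models are not comparable.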

6.3 Common Issues and Solutions

| Issue | Root Cause | Solution |
|---|---|---|
| Relevant content not retrieved | Chunks too granular / poor embedding quality | Adjust chunking strategy, swap embedding model |
| Retrieved but response is wrong | Too much noise in context | Reduce injected context, add re-ranking |
| Response hallucination | Model ignores context and generates freely | Strengthen Prompt instructions, lower temperature |
| Incomplete response | Information scattered across multiple chunks | Multi-hop retrieval, increase chunk_size |
| High latency | Double overhead from retrieval + LLM | Async retrieval, streaming output, caching |

6.4 Tech Stack Reference

Prototype / fast start:

  • Embedding: OpenAI text-embedding-3-small
  • Vector store: Chroma / FAISS
  • Framework: LangChain / LlamaIndex

Production deployment:

  • Embedding: BGE / GTE (self-hostable, data stays on-premise)
  • Vector store: Milvus / Weaviate / pgvector
  • Framework: LlamaIndex (more flexible) or a custom-built pipeline
  • Re-ranking: bge-reranker-v2-m3 / Cohere Rerank

7. Framework Comparison

| Framework | Language | Highlights | Best For |
|---|---|---|---|
| LangChain | Python/JS | Broadest ecosystem, rich modules, quick to learn | Rapid prototyping, general RAG |
| LlamaIndex | Python/TS | RAG-focused, rich data connectors | Data-intensive applications |
| Haystack | Python | Production-oriented, clean Pipeline design | Enterprise search Q&A |
| Semantic Kernel | C#/Python/Java | Microsoft product, tight Azure ecosystem integration | .NET stack projects |
| DSPy | Python | Programmatic Prompt optimization, auto-tuning | Academic research, Prompt engineering |

8. Learning Resources

Papers

Hands-on Projects

  • LangChain RAG templates: the official repo provides reference implementations for multiple RAG architectures
  • LlamaIndex official examples: covers RAG patterns from basic to advanced
  • RAGAS: RAG evaluation framework with complete evaluation examples included

These notes map out the knowledge landscape of RAG from foundational concepts to advanced practice. The essence of RAG is “have the LLM look things up before answering” — simple in concept, but every step (chunking, embedding, retrieval, generation, evaluation) has a wealth of engineering nuance. I’d recommend starting with the simplest vector-retrieval RAG, then gradually introducing more advanced techniques as you hit concrete problems. Don’t jump straight into GraphRAG + Agent + Multi-hop from day one — excessive complexity is often why RAG projects fail.