RAG (Retrieval-Augmented Generation) Learning Notes

1. What Is RAG

RAG (Retrieval-Augmented Generation) is a technical paradigm that combines information retrieval with large language model generation. The core idea is straightforward: before letting an LLM answer a question, first retrieve relevant content from an external knowledge base, feed that retrieved content as context to the model, and then have the model generate a response grounded in that real data.

1.1 Why We Need RAG

LLMs have several inherent limitations:

Problem	Description
Knowledge cutoff	Training data has a time limit — the model can’t answer questions about recent events
Hallucination	The model can “fabricate” content that sounds plausible but is wrong
Missing private data	Internal company documents and proprietary knowledge are completely unknown to the model
Poor traceability	The model can’t tell users where an answer came from

RAG mitigates these issues by retrieving from an external knowledge base at query time, which keeps responses current and lets users verify them against cited sources.

1.2 RAG vs Fine-tuning vs Prompt Engineering

Dimension	RAG	Fine-tuning	Prompt Engineering
Knowledge updates	Just update the knowledge base	Requires retraining	Depends on context window
Cost	Medium (retrieval system + inference)	High (training compute)	Low
Use cases	Knowledge-intensive Q&A	Style/format adaptation	Simple task guidance
Explainability	High (traceable sources)	Low	Medium
Data privacy	Private data stays in-house	Data baked into model weights	Depends on implementation

In practice, all three approaches are often used together.

2. Basic RAG Architecture

2.1 Standard Pipeline

User question
  ↓
Query processing (rewriting, expansion)
  ↓
Retrieval — find relevant document chunks from the knowledge base
  ↓
Context assembly — inject retrieved results into the Prompt
  ↓
LLM generates a response
  ↓
Output (with source citations)

2.2 Two Main Phases

Offline Phase (Indexing)

Transform raw data into a searchable knowledge base:

Raw documents → Chunking → Embedding → Store in vector store

Online Phase (Querying)

The real-time pipeline when a user asks a question:

User question → Embed question → Vector similarity search → Context assembly → LLM generation
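
The two phases can be sketched end to end with a toy example. This is a minimal sketch, not a production design: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and the final LLM call is left out.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Offline phase: chunk (here, one chunk per string) -> embed -> store
chunks = [
    "RAG retrieves documents before the model generates an answer",
    "Fine-tuning updates model weights with new training data",
    "Vector stores index embeddings for similarity search",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Online phase: embed the question -> similarity search -> context assembly
def retrieve(question: str, top_k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def assemble_context(question: str, top_k: int = 2) -> str:
    """Build the text that would be handed to the LLM (generation itself is omitted)."""
    context = "\n".join(retrieve(question, top_k))
    return f"Reference material:\n{context}\n\nQuestion: {question}"
```

Swapping `embed` for a real model and the list for a vector store changes the components but not the shape of the pipeline.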

3. Core Components in Detail

3.1 Document Chunking

The chunking strategy directly affects retrieval quality.

Strategy	Description	Use Case
Fixed-length chunking	Split by character/token count with overlap	General purpose
Paragraph/section chunking	Leverage document structure (headings, line breaks)	Structured documents
Semantic chunking	Dynamically determine split points based on semantic similarity	Long text, mixed content
Recursive chunking	Recursively split by a hierarchy of separators	General purpose; LangChain’s default strategy

Key parameters:

chunk_size: Size of each chunk, typically 256–1024 tokens
chunk_overlap: Overlap between adjacent chunks, typically 10%–20% of chunk_size

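# LangChain recursive character text splitter example
# (recent LangChain releases ship the splitter in the langchain-text-splitters package)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # measured in characters by default, not tokens
    chunk_overlap=50,    # 10% of chunk_size
    separators=["\n\n", "\n", "。", ".", " ", ""]  # tried in order, coarsest first
)
chunks = splitter.split_text(document)  # document: the raw text as a single str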

Rule of thumb: chunks that are too large → imprecise retrieval with too much noise; chunks that are too small → lost context and incomplete answers. You’ll need to tune based on your actual data.
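
To see how chunk_size and chunk_overlap interact, here is a minimal fixed-length chunker in plain Python. It is a sketch that counts characters rather than tokens (an assumption made to keep it dependency-free); the function name `chunk_text` is illustrative.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

example = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
```

With chunk_size=4 and chunk_overlap=1 the window advances 3 characters at a time, so each chunk repeats the last character of the previous one.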

3.2 Embedding

Convert text into fixed-dimensional dense vectors for similarity computation.

Common embedding models:

Model	Provider	Dimensions	Notes
text-embedding-3-small/large	OpenAI	1536/3072	Strong general performance
bge-large-zh	BAAI	1024	Excellent Chinese language performance
GTE (General Text Embedding)	Alibaba	768/1024	Multilingual
E5-mistral-7b	Microsoft	4096	High quality, expensive inference
Cohere Embed v3	Cohere	1024	Multilingual, supports compression

# OpenAI Embedding example
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="RAG is retrieval-augmented generation"
)
embedding = response.data[0].embedding  # 1536-dimensional vector

3.3 Vector Store

A specialized database for storing and retrieving vectors.

Database	Type	Notes
Chroma	Embedded	Lightweight, great for prototyping
FAISS	Library (Meta)	High performance, in-memory, suitable for small to medium scale
Milvus	Distributed	Production-grade, supports billions of vectors
Pinecone	Cloud service	Fully managed, zero ops
Weaviate	Standalone service	Supports hybrid search (vector + keyword)
Qdrant	Standalone service	Rust implementation, high performance
pgvector	PostgreSQL extension	Reuses the PostgreSQL ecosystem

Similarity metrics:

Cosine Similarity: Most common; measures directional similarity
Euclidean Distance (L2): Measures absolute distance
Inner Product: Equivalent to cosine similarity when vectors are L2-normalized; on unnormalized vectors it also rewards larger magnitudes
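
The three metrics are easy to compare on a small example. The sketch below uses plain Python lists as vectors (the helper names are illustrative) and checks that the inner product of normalized vectors matches cosine similarity.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a: list[float]) -> float:
    return math.sqrt(dot(a, a))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (l2_norm(a) * l2_norm(b))

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    n = l2_norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)                 # direction only
l2_ab = euclidean_distance(a, b)                 # absolute distance
ip_normalized = dot(normalize(a), normalize(b))  # matches cos_ab
```

This is why many vector stores let you index normalized vectors with inner product as a cheaper substitute for cosine.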

3.4 Retrieval Strategies

Basic Retrieval

Dense Retrieval: Embed the query and find nearest neighbors in the vector store.

# Pseudocode: basic vector retrieval
query_embedding = embed(user_query)
results = vector_db.search(query_embedding, top_k=5)

Advanced Retrieval

Hybrid Search: Combine vector retrieval and keyword retrieval (BM25) to get the best of both.

Query ──→ Vector retrieval  → Results A
     └──→ BM25 retrieval    → Results B
                            → Fusion ranking (RRF/weighted) → Final results
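
The fusion step can be sketched with Reciprocal Rank Fusion (RRF), which needs only the rank positions from each retriever and not their raw scores. This is a minimal sketch: the document IDs are made up, and k=60 is the commonly used default constant rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_results = ["d1", "d2", "d3"]  # Results A (hypothetical IDs)
bm25_results = ["d3", "d1", "d4"]    # Results B (hypothetical IDs)
fused = reciprocal_rank_fusion([vector_results, bm25_results])
```

Documents that appear in both lists (d1, d3) rise above documents found by only one retriever.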

Multi-recall + Re-ranking (Retrieve & Re-rank):

Query → Multiple retrieval strategies → Candidate set → Cross-Encoder re-ranking → Refined results

Re-ranking models (e.g., bge-reranker, Cohere Rerank) are typically cross-encoders: they score each query-document pair jointly, which is slower than vector search but usually more precise, so they are applied only to the small candidate set.
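
The two-stage structure can be sketched with a pluggable scoring function. In the sketch below, `overlap_score` is a toy stand-in for a cross-encoder (an assumption for illustration only); in practice `score_fn` would call a re-ranking model.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n highest-scoring candidates."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words that also appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

# Candidate set as it might come back from a first-stage retriever
candidates = [
    "cats sleep a lot",
    "how to reset a password",
    "password reset policy for admins",
]
refined = rerank("reset password policy", candidates, overlap_score, top_n=2)
```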

3.5 Context Assembly and Generation

Assemble retrieved content into the Prompt and hand it to the LLM to generate a response.

A common Prompt template:

You are a knowledge Q&A assistant. Answer the user's question based on the reference material below.
If the reference material does not contain relevant information, say so honestly — do not make things up.

## Reference Material
{retrieved_contexts}

## User Question
{user_query}

## Answer

Key points:

Explicitly instruct the model to “answer based on the reference material” to reduce hallucination
Add an instruction like “say you don’t know if you don’t know”
Control the total amount of injected context to avoid exceeding the model’s context window limit
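
The points above can be sketched as a small prompt builder. It is a minimal sketch: it uses a character budget as a rough proxy for a token limit (an assumption; a real system would count tokens with the model's tokenizer), and the `max_context_chars` parameter and `[n]` numbering are illustrative choices.

```python
PROMPT_TEMPLATE = """You are a knowledge Q&A assistant. Answer the user's question based on the reference material below.
If the reference material does not contain relevant information, say so honestly and do not make things up.

## Reference Material
{retrieved_contexts}

## User Question
{user_query}

## Answer
"""

def build_prompt(user_query: str, contexts: list[str], max_context_chars: int = 2000) -> str:
    """Inject contexts in rank order until the character budget is used up."""
    selected = []
    used = 0
    for i, ctx in enumerate(contexts, start=1):
        entry = f"[{i}] {ctx}"  # numbered so the answer can cite its sources
        if used + len(entry) > max_context_chars:
            break  # stop before overflowing the budget
        selected.append(entry)
        used += len(entry)
    return PROMPT_TEMPLATE.format(
        retrieved_contexts="\n".join(selected),
        user_query=user_query,
    )
```

Because contexts are added in rank order, the budget drops the lowest-ranked results first.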

4. Advanced RAG Techniques

4.1 Query Rewriting and Expansion

A user’s raw question is often not the optimal retrieval query.

Technique	Description
Query Rewriting	Have the LLM rewrite the question to make it more retrieval-friendly
HyDE (Hypothetical Document Embeddings)	Have the LLM generate a hypothetical answer first, then use that answer’s embedding for retrieval
Multi-Query	Generate several rephrasings or sub-questions from one question, retrieve for each separately, then merge the results
Step-back Prompting	Have the model first raise a broader, more general question to gather background knowledge

# HyDE example
hypothetical_answer = llm.generate(f"Please answer: {user_query}")
# Retrieve using the hypothetical answer's embedding, not the original question
results = vector_db.search(embed(hypothetical_answer), top_k=5)

4.2 GraphRAG

Microsoft’s GraphRAG approach enhances retrieval with a knowledge graph.

Process:

Extract entities and relationships from documents to build a knowledge graph
Apply community detection on the graph to generate summaries at different levels
At query time, leverage both vector retrieval and the graph structure for reasoning

Advantage: excels at answering global questions that require cross-document reasoning (e.g., “summarize the main points across all documents”).

4.3 Adaptive RAG (Self-RAG)

Let the model autonomously decide whether retrieval is needed:

User question → LLM decides if retrieval is needed
                  ├── Not needed → Answer directly
                  └── Needed → Retrieve → Assess relevance of retrieved results
                                    ├── Relevant → Generate response
                                    └── Not relevant → Switch strategy and re-retrieve

The Self-RAG paper introduces Reflection Tokens that let the model self-evaluate retrieval quality and response reliability during generation.
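
The decision flow in the diagram can be sketched as plain control flow. This sketch only mirrors the branching; it does not implement Self-RAG's reflection tokens. All four callables (`needs_retrieval`, `retrievers`, `is_relevant`, `generate`) are hypothetical hooks that would be backed by an LLM or a retriever in a real system.

```python
from typing import Callable, Sequence

def adaptive_answer(
    question: str,
    needs_retrieval: Callable[[str], bool],
    retrievers: Sequence[Callable[[str], list[str]]],
    is_relevant: Callable[[str, str], bool],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Answer directly, or try retrieval strategies in order until one returns relevant docs."""
    if not needs_retrieval(question):
        return generate(question, [])  # not needed: answer directly

    for retrieve in retrievers:
        docs = [d for d in retrieve(question) if is_relevant(question, d)]
        if docs:
            return generate(question, docs)  # relevant: generate a grounded response
        # not relevant: fall through to the next strategy

    # No strategy found relevant material; generate with empty context so the
    # prompt can instruct the model to say it does not know.
    return generate(question, [])
```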

4.4 Multi-hop RAG

Handle complex questions that require multi-step reasoning:

Question: "Is the founder of Company A still working at Company B?"
  → Hop 1: Retrieve who founded Company A → "John Smith"
  → Hop 2: Retrieve whether John Smith works at Company B → Get the answer
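
The two hops can be sketched with a toy fact table standing in for retrieval. The `facts` dictionary and the `lookup` helper are hypothetical; the point is that the second query can only be formed after the first hop returns.

```python
from typing import Optional

# Hypothetical knowledge base: each key plays the role of a retrieval query
facts = {
    "founder of Company A": "John Smith",
    "employer of John Smith": "Company B",
}

def lookup(query: str) -> Optional[str]:
    """Stand-in for one retrieval call."""
    return facts.get(query)

def founder_works_at(company: str, employer: str) -> Optional[bool]:
    """Return True/False if both hops succeed, or None if a hop finds nothing."""
    founder = lookup(f"founder of {company}")          # Hop 1
    if founder is None:
        return None
    current_employer = lookup(f"employer of {founder}")  # Hop 2 depends on Hop 1's result
    if current_employer is None:
        return None
    return current_employer == employer
```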

4.5 Agentic RAG

Use RAG as one of an Agent’s tools, with the Agent autonomously planning retrieval strategies.

User question → Agent thinks
               → Decides which retrieval tool to call (vector store / search engine / database)
               → Analyzes results, decides whether another retrieval round is needed
               → Synthesizes all information to generate the final response

Tool definition example:

tools = [
    {
        "name": "search_internal_docs",
        "description": "Search the company's internal document library",
        "parameters": {"query": "str", "top_k": "int"}
    },
    {
        "name": "search_web",
        "description": "Search the internet for the latest information",
        "parameters": {"query": "str"}
    },
    {
        "name": "query_database",
        "description": "Query a structured database",
        "parameters": {"sql": "str"}
    }
]
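
Once the Agent picks a tool, something has to route the call to real code. Below is a minimal dispatcher sketch, assuming the Agent emits a dict with "name" and "arguments" keys (a hypothetical format; actual function-calling APIs differ in detail). The tool bodies are stubs.

```python
from typing import Any, Callable

def search_internal_docs(query: str, top_k: int = 3) -> list[str]:
    """Stub: a real implementation would query the vector store."""
    return [f"internal doc {i} about {query}" for i in range(1, top_k + 1)]

def search_web(query: str) -> list[str]:
    """Stub: a real implementation would call a search engine API."""
    return [f"web result about {query}"]

# Maps tool names from the tool definitions to Python callables
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {
    "search_internal_docs": search_internal_docs,
    "search_web": search_web,
}

def dispatch(tool_call: dict[str, Any]) -> Any:
    """Look up the requested tool and call it with the supplied arguments."""
    name = tool_call["name"]
    if name not in TOOL_REGISTRY:
        raise ValueError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**tool_call.get("arguments", {}))
```

Rejecting unknown tool names explicitly keeps a hallucinated tool call from failing silently.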

5. Evaluation Framework

5.1 Retrieval Quality Metrics

Metric	Description
Precision@K	Fraction of the top-K results that are relevant
Recall@K	Fraction of all relevant documents that were retrieved
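MRR	Mean reciprocal rank: the reciprocal rank of the first relevant result, averaged over all queries
nDCG	Normalized discounted cumulative gain: rewards placing relevant results near the top of the ranking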
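
These metrics are short enough to compute by hand. The sketch below assumes binary relevance (a document is either relevant or not) and uses made-up document IDs; function names are illustrative.

```python
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average the reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(retrieved[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
```

For this example the first relevant document sits at rank 2, so Precision@3 is 1/3, Recall@3 is 1/2, and the reciprocal rank is 1/2.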

5.2 Generation Quality Evaluation

RAGAS is the mainstream framework for evaluating RAG systems:

Metric	Description
Faithfulness	Whether the response is consistent with the retrieved content (no hallucination)
Answer Relevance	Whether the response actually addresses the question
Context Precision	Fraction of retrieved content that is useful
Context Recall	Whether all information needed for the answer was retrieved

# RAGAS evaluation example
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
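    dataset=eval_dataset,  # prepared beforehand: questions, answers, retrieved contexts (schema depends on the RAGAS version)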
    metrics=[faithfulness, answer_relevancy, context_precision]
)

5.3 End-to-End Evaluation

Ultimately, evaluation has to come back to the actual business scenario:

Human evaluation: sample scoring for accuracy, completeness, and fluency
LLM-as-Judge: use a stronger model to evaluate the output of a weaker model
A/B testing: compare user satisfaction between different RAG configurations in production

6. Engineering Best Practices

6.1 Data Processing

Data cleaning: Remove garbled text, duplicates, and formatting noise
Metadata tagging: Add source, timestamp, category, and other metadata to each chunk to enable filtered retrieval
Multi-format parsing: PDF, Word, HTML, Markdown, tables, and images (OCR) each have their own parsing approaches
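
Metadata pays off at query time, when it narrows the candidate set before or alongside vector search. Here is a minimal sketch of filtered retrieval over in-memory chunks; the field names (`source`, `category`, `updated`) and the sample data are assumptions for illustration, and real vector stores expose their own filter syntax.

```python
from datetime import date
from typing import Optional

# Each chunk carries its text plus metadata attached during indexing (sample data)
corpus = [
    {"text": "Refunds are processed within 14 days.",
     "metadata": {"source": "policy.pdf", "category": "billing", "updated": date(2024, 3, 1)}},
    {"text": "Old refund window was 30 days.",
     "metadata": {"source": "policy_2021.pdf", "category": "billing", "updated": date(2021, 6, 1)}},
    {"text": "Reset your password from the login page.",
     "metadata": {"source": "faq.html", "category": "account", "updated": date(2024, 1, 15)}},
]

def filter_chunks(
    chunks: list[dict],
    category: Optional[str] = None,
    updated_after: Optional[date] = None,
) -> list[dict]:
    """Keep only chunks whose metadata satisfies every filter that was supplied."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if category is not None and meta["category"] != category:
            continue
        if updated_after is not None and meta["updated"] <= updated_after:
            continue
        kept.append(chunk)
    return kept

current_billing = filter_chunks(corpus, category="billing", updated_after=date(2023, 1, 1))
```

Filtering on the timestamp keeps the outdated refund policy from ever reaching the prompt.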

6.2 Performance Optimization

Optimization Area	Methods
Retrieval latency	Vector index optimization (HNSW, IVF), cache hot queries
Recall	Multi-recall, query expansion, increase top_k
Precision	Re-ranking, add metadata filtering
Context utilization	Compress irrelevant content, inject only key passages
Cost	Embedding caching, small model for coarse ranking + large model for fine ranking
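
Embedding caching is one of the cheaper wins in the table. A minimal sketch using the standard library's `functools.lru_cache` follows; `expensive_embed` is a hypothetical placeholder for a paid or slow embedding call, and the call counter exists only to show the cache working.

```python
from functools import lru_cache

stats = {"calls": 0}  # counts how often the underlying model is actually invoked

def expensive_embed(text: str) -> tuple[float, ...]:
    """Placeholder for a real embedding call (API request or local model inference)."""
    stats["calls"] += 1
    return tuple(float(ord(c)) for c in text[:4])  # dummy vector, not a real embedding

@lru_cache(maxsize=10_000)
def cached_embed(text: str) -> tuple[float, ...]:
    """Return the cached vector for text, calling the model only on a cache miss."""
    return expensive_embed(text)
```

The vector is returned as a tuple so that cached values are immutable; an in-process cache like this is lost on restart, so a shared store is a common next step for multi-worker deployments.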

6.3 Common Issues and Solutions

Issue	Root Cause	Solution
Relevant content not retrieved	Chunks too granular / poor embedding quality	Adjust chunking strategy, swap embedding model
Retrieved but response is wrong	Too much noise in context	Reduce injected context, add re-ranking
Response hallucination	Model ignores context and generates freely	Strengthen Prompt instructions, lower temperature
Incomplete response	Information scattered across multiple chunks	Multi-hop retrieval, increase chunk_size
High latency	Double overhead from retrieval + LLM	Async retrieval, streaming output, caching

6.4 Tech Stack Reference

Prototype / fast start:

Embedding: OpenAI text-embedding-3-small
Vector store: Chroma / FAISS
Framework: LangChain / LlamaIndex

Production deployment:

Embedding: BGE / GTE (self-hostable, data stays on-premise)
Vector store: Milvus / Weaviate / pgvector
Framework: LlamaIndex (more flexible) or a custom-built pipeline
Re-ranking: bge-reranker-v2-m3 / Cohere Rerank

7. Framework Comparison

Framework	Language	Highlights	Best For
LangChain	Python/JS	Broadest ecosystem, rich modules, quick to learn	Rapid prototyping, general RAG
LlamaIndex	Python	RAG-focused, rich data connectors	Data-intensive applications
Haystack	Python	Production-oriented, clean Pipeline design	Enterprise search Q&A
Semantic Kernel	C#/Python/Java	Microsoft product, tight Azure ecosystem integration	.NET stack projects
DSPy	Python	Programmatic Prompt optimization, auto-tuning	Academic research, Prompt engineering

8. Learning Resources

Papers

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Original RAG paper, Facebook AI, 2020)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
From Local to Global: A Graph RAG Approach to Query-Focused Summarization (GraphRAG paper, Microsoft, 2024)

Hands-on Projects

LangChain RAG templates: the official repo provides reference implementations for multiple RAG architectures
LlamaIndex official examples: covers RAG patterns from basic to advanced
RAGAS: RAG evaluation framework with complete evaluation examples included

These notes map out the knowledge landscape of RAG from foundational concepts to advanced practice. The essence of RAG is “have the LLM look things up before answering” — simple in concept, but every step (chunking, embedding, retrieval, generation, evaluation) has a wealth of engineering nuance. I’d recommend starting with the simplest vector-retrieval RAG, then gradually introducing more advanced techniques as you hit concrete problems. Don’t jump straight into GraphRAG + Agent + Multi-hop from day one — excessive complexity is often why RAG projects fail.
