# RAG System Development (Retrieval-Augmented Generation)
RAG (Retrieval-Augmented Generation) is an architecture where a language model accesses an external knowledge store when generating answers. Instead of relying solely on knowledge embedded in weights during pretraining, the model receives relevant context at inference time. This enables working with current data, corporate documents, and specialized knowledge bases without expensive fine-tuning.
## Basic RAG Architecture
```
User → Query
     ↓
Embedding Model
     ↓
Vector Search (Top-K)
     ↓
Retrieved Chunks + Query
     ↓
LLM
     ↓
Answer
```
Components:
- Indexing pipeline: document loading, chunking, embedding, vector database storage
- Retrieval: query vectorization, nearest neighbor search
- Generation: passing context + query to LLM
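At its core, the retrieval step is nearest-neighbor search over embedding vectors. A minimal sketch with toy three-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the texts and vectors here are invented purely for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    # index: list of (chunk_text, embedding) pairs.
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "embeddings" standing in for a real embedding model's output.
index = [
    ("warranty is 24 months", [0.9, 0.1, 0.0]),
    ("office hours 9-18", [0.1, 0.9, 0.0]),
    ("returns within 14 days", [0.7, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

A vector database performs the same ranking, but with approximate-nearest-neighbor structures (HNSW and similar) so it scales to millions of chunks.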
## RAG System Stack
| Component | Options |
|---|---|
| Embedding Model | OpenAI text-embedding-3-large, Cohere Embed v3, BGE-M3, E5-large, Nomic Embed |
| Vector Database | Pinecone, Weaviate, Qdrant, ChromaDB, pgvector, Milvus |
| LLM | GPT-4o, Claude 3.5 Sonnet, Llama 3.1, Mistral |
| Orchestrator | LangChain, LlamaIndex, custom implementation |
| Reranker | Cohere Rerank, BGE-Reranker, FlashRank |
## Indexing Pipeline
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Document loading
loader = PyPDFDirectoryLoader("./docs/")
documents = loader.load()

# Chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(documents)

# Embedding and storage
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="corporate-docs",
    force_recreate=True,
)
```
## Query Response Pipeline
```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate

template = """You are an assistant that answers strictly based on the provided context.
If the answer is not in the context, say "Information not found in knowledge base".
Always indicate the source (document name and section).

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Retrieval + Generation
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance — reduces duplication
    search_kwargs={"k": 5, "fetch_k": 20},
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is the warranty service period?"})
```
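The `search_type="mmr"` option balances relevance to the query against redundancy among already-selected chunks: it fetches `fetch_k` candidates and greedily keeps `k` diverse ones. A simplified sketch of the greedy MMR loop over precomputed similarities (a conceptual illustration, not LangChain's internal implementation):

```python
def mmr(query_sim, pairwise_sim, k=3, lam=0.5):
    # Greedy Maximal Marginal Relevance over precomputed similarities.
    # query_sim: doc_id -> similarity to the query.
    # pairwise_sim: (id_a, id_b) -> similarity between two docs.
    selected = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(d):
            # Penalize similarity to anything already selected.
            redundancy = max(
                (pairwise_sim.get((d, s), pairwise_sim.get((s, d), 0.0))
                 for s in selected),
                default=0.0,
            )
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# "a" and "b" are near-duplicates; MMR keeps one and adds the diverse "c".
query_sim = {"a": 0.9, "b": 0.85, "c": 0.4}
pairwise_sim = {("a", "b"): 0.95, ("a", "c"): 0.1, ("b", "c"): 0.1}
picked = mmr(query_sim, pairwise_sim, k=2, lam=0.5)
```

With plain similarity search, the near-duplicate "b" would occupy the second slot; MMR trades a little relevance for coverage.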
## Practical Case: RAG for an Insurance Company
Task: an assistant for processing customer inquiries — searching across insurance contracts, payment rules, and precedent decisions (12,000 documents, ~2M pages).
Key Solutions:
- Embedding: BGE-M3 (multilingual, works well with Russian, free to self-host), dimension 1024.
- Chunking: hybrid strategy — split on structural boundaries (contract sections) instead of fixed size; chunk size 200–600 tokens.
- Reranking: CrossEncoder after vector search; Top-50 candidates → Top-5 after rerank. +18% to faithfulness.
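The reranking stage can be sketched as a second-pass sort over the vector-search candidates. The `word_overlap` scorer below is a toy stand-in for a real cross-encoder (in production, something like `CrossEncoder.predict` from sentence-transformers scores each query-chunk pair jointly); candidate texts are invented for illustration:

```python
def rerank(query, candidates, score_fn, top_n=5):
    # candidates: chunks returned by the vector search stage (e.g. top-50).
    # score_fn: (query, chunk) -> relevance score; a cross-encoder in production.
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

# Toy scorer: counts shared words. A real cross-encoder reads both texts
# jointly through a transformer and is far more accurate.
def word_overlap(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [
    "payout rules for property insurance",
    "annual report 2021",
    "insurance payout denied: missing documents",
    "office relocation notice",
]
top = rerank("insurance payout rules", candidates, word_overlap, top_n=2)
```

The pattern is always the same: a cheap, recall-oriented first stage over millions of chunks, then an expensive, precision-oriented scorer over a few dozen.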
Metrics (RAGAS):
| Metric | Before rerank | After rerank |
|---|---|---|
| Context Precision | 0.68 | 0.84 |
| Context Recall | 0.71 | 0.79 |
| Faithfulness | 0.74 | 0.91 |
| Answer Relevancy | 0.81 | 0.89 |
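As a rough illustration of what Context Precision measures (relevant chunks ranked early score higher), here is a simplified hand computation over binary relevance labels. RAGAS itself derives these labels with an LLM judge, so this shows the idea, not the exact implementation:

```python
def context_precision(relevance):
    # relevance: binary labels for retrieved chunks, in rank order.
    # Averages precision@k over the positions of relevant chunks, so a
    # relevant chunk buried behind irrelevant ones drags the score down.
    total = sum(relevance)
    if total == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k
    return score / total

perfect = context_precision([1, 1, 0])   # relevant chunks first
buried = context_precision([0, 1, 1])    # irrelevant chunk ranked first
```

This is why reranking moves Context Precision the most in the table above: it changes the order of chunks, not their content.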
## Chunk Size: How to Choose
- Small chunks (128–256 tokens): high retrieval accuracy, but may lack full context for answer. Good for FAQs and short facts.
- Medium chunks (512–1024 tokens): balanced approach. Optimal for most tasks.
- Large chunks (1024–2048 tokens): capture more context, but reduce retrieval precision. For documents with long interdependent sections.
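The interplay of chunk size and overlap can be sketched with a simple sliding-window chunker over a pre-tokenized document (a toy stand-in for `RecursiveCharacterTextSplitter`, which additionally respects separator boundaries):

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    # Slide a window of chunk_size tokens, stepping by chunk_size - overlap,
    # so consecutive chunks share `overlap` tokens of context.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = ["tok"] * 1000  # stand-in for a real tokenized document
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
```

Overlap exists so a sentence falling on a chunk boundary still appears intact in at least one chunk; larger overlap improves boundary recall but inflates the index.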
The Parent Document Retriever pattern resolves this dilemma: index small chunks for precise search, but return their larger parent chunks to the LLM.
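A minimal sketch of the pattern, using a hypothetical word-overlap score in place of vector similarity: small child chunks are what gets searched, but the full parent document is what reaches the LLM.

```python
def build_index(parents, child_size=8):
    # Split each parent into small child chunks; each child remembers its parent.
    index = []
    for pid, text in parents.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            index.append((" ".join(words[i:i + child_size]), pid))
    return index

def retrieve_parents(query, index, parents, k=1):
    # Rank the small chunks (word overlap stands in for vector similarity),
    # then return the corresponding full parent documents, deduplicated.
    q = set(query.lower().split())
    scored = sorted(index, key=lambda c: len(q & set(c[0].lower().split())),
                    reverse=True)
    seen, result = set(), []
    for _, pid in scored:
        if pid not in seen:
            seen.add(pid)
            result.append(parents[pid])
        if len(result) == k:
            break
    return result

parents = {
    "sec1": "The warranty period is 24 months from the date of purchase "
            "and covers manufacturing defects.",
    "sec2": "Claims are processed within 10 business days after all "
            "documents are received.",
}
index = build_index(parents)
answer_context = retrieve_parents("warranty period", index, parents, k=1)
```

LangChain ships this as `ParentDocumentRetriever`; the sketch above only shows the child-to-parent indirection at its core.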
## RAG System Development Timeline
- Prototype (basic RAG): 1–2 weeks
- Production-ready system with quality evaluation: 4–8 weeks
- Advanced RAG (hybrid search, reranking, evaluation): 8–14 weeks