RAG System Development (Retrieval-Augmented Generation)

We design and deploy artificial intelligence systems, from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work in real business settings, not just in the lab.

RAG (Retrieval-Augmented Generation) is an architecture where a language model accesses an external knowledge store when generating answers. Instead of relying solely on knowledge embedded in weights during pretraining, the model receives relevant context at inference time. This enables working with current data, corporate documents, and specialized knowledge bases without expensive fine-tuning.

Basic RAG Architecture

User → Query
         ↓
    Embedding Model
         ↓
    Vector Search (Top-K)
         ↓
Retrieved Chunks + Query
         ↓
        LLM
         ↓
       Answer

Components:

  • Indexing pipeline: document loading, chunking, embedding, vector database storage
  • Retrieval: query vectorization, nearest neighbor search
  • Generation: passing context + query to LLM
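The three components above can be sketched end to end without any libraries. In this illustration, a toy bag-of-words "embedder" and cosine similarity stand in for a real embedding model and vector database; the chunk texts and query are made up for the example:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing: chunk documents and "embed" each chunk
chunks = [
    "Warranty service period is 24 months from purchase.",
    "Returns are accepted within 14 days.",
    "Support is available on weekdays from 9 to 18.",
]
index = [(c, embed(c)) for c in chunks]

# 2. Retrieval: vectorize the query, take the top-k nearest chunks
query = "what is the warranty period"
q_vec = embed(query)
top_k = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:2]

# 3. Generation: assemble retrieved context + query into an LLM prompt
context = "\n".join(c for c, _ in top_k)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

A production system replaces `embed` with a real model, the list with a vector database, and the final string with an LLM call, but the data flow is exactly this.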

RAG System Stack

Component       | Options
Embedding Model | OpenAI text-embedding-3-large, Cohere Embed v3, BGE-M3, E5-large, Nomic Embed
Vector Database | Pinecone, Weaviate, Qdrant, ChromaDB, pgvector, Milvus
LLM             | GPT-4o, Claude 3.5 Sonnet, Llama 3.1, Mistral
Orchestrator    | LangChain, LlamaIndex, custom implementation
Reranker        | Cohere Rerank, BGE-Reranker, FlashRank

Indexing Pipeline

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Document loading
loader = PyPDFDirectoryLoader("./docs/")
documents = loader.load()

# Chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(documents)

# Embedding and storage
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="corporate-docs",
    force_recreate=True,
)

Query Response Pipeline

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate

template = """You are an assistant that answers strictly based on the provided context.
If the answer is not in the context, say "Information not found in knowledge base".
Always indicate the source (document name and section).

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Retrieval + Generation
retriever = vectorstore.as_retriever(
    search_type="mmr",   # Maximum Marginal Relevance — reduces duplication
    search_kwargs={"k": 5, "fetch_k": 20}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is the warranty service period?"})
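The `search_type="mmr"` setting above trades pure relevance for diversity: each next chunk is chosen by relevance to the query minus its similarity to chunks already selected. A dependency-free sketch of that selection step (the λ weight and the toy similarity values are illustrative):

```python
def mmr_select(query_sim, doc_sims, k=5, lam=0.5):
    """query_sim[i]: similarity of candidate i to the query.
    doc_sims[i][j]: similarity between candidates i and j."""
    selected, remaining = [], list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize candidates similar to what is already selected.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidate 1 is nearly a duplicate of candidate 0, so MMR skips it.
query_sim = [0.9, 0.85, 0.5]
doc_sims = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_select(query_sim, doc_sims, k=2))  # → [0, 2]
```

With plain top-k search the result would be [0, 1]; MMR picks the less redundant candidate 2 instead.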

Practical Case: RAG for Insurance Company

Task: assistant for processing customer inquiries — searching insurance contracts, payment rules, precedent decisions (12,000 documents, ~2M pages).

Key Solutions:

Embedding: BGE-M3 (multilingual, strong Russian-language performance, free to self-host). Embedding dimension: 1024.

Chunking: hybrid strategy — structural boundaries (contract sections) instead of fixed size. Chunk size 200–600 tokens.
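The hybrid strategy can be sketched as: split on structural markers first, and fall back to fixed-size splitting only for oversized sections. The heading regex and the word-based token estimate below are simplifying assumptions, not the production implementation:

```python
import re

def structural_chunks(text, max_tokens=600):
    # Split on contract-style numbered section headings, e.g. "1. ..." or "2.3 ..."
    sections = re.split(r"\n(?=\d+(?:\.\d+)*\.?\s)", text)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        # Fall back to fixed-size splitting only when a section is too long.
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

doc = "1. Scope\nThis contract covers property insurance.\n2. Payments\nClaims are paid within 30 days."
print(structural_chunks(doc))
```

Each chunk then aligns with a semantic unit of the contract, which is why retrieval precision improves over fixed-size windows.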

Reranking: a CrossEncoder applied after vector search. Top-50 candidates → Top-5 after rerank, yielding +18% faithfulness.
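The rerank step itself is simple once a scorer is available: score every (query, chunk) pair and keep the best. In production the scorer would be a cross-encoder model such as BGE-Reranker; here a dummy word-overlap scorer illustrates the flow:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-order vector-search candidates with a cross-encoder-style scorer.
    score_fn(query, chunk) -> float; higher means more relevant."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

# Dummy scorer: word overlap with the query. A real cross-encoder
# jointly encodes the (query, chunk) pair and outputs a relevance score.
def overlap_score(query, chunk):
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split()))

candidates = ["payment deadline rules", "office address", "claim payment procedure"]
print(rerank("payment rules", candidates, overlap_score, top_n=2))
```

The expensive pairwise scoring runs only on the 50 candidates the cheap vector search returned, which is what makes cross-encoders affordable at this stage.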

Metrics (RAGAS):

Metric            | Before rerank | After rerank
Context Precision | 0.68          | 0.84
Context Recall    | 0.71          | 0.79
Faithfulness      | 0.74          | 0.91
Answer Relevancy  | 0.81          | 0.89

Chunk Size: How to Choose

  • Small chunks (128–256 tokens): high retrieval accuracy, but may lack the full context needed for an answer. Good for FAQs and short facts.
  • Medium chunks (512–1024 tokens): balanced approach. Optimal for most tasks.
  • Large chunks (1024–2048 tokens): capture more context, but reduce retrieval precision. For documents with long interdependent sections.

Parent Document Retriever — a solution to this dilemma: index small chunks for search, but return their larger parent chunks to the LLM.
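LangChain ships this pattern as ParentDocumentRetriever; the idea in miniature is that every small chunk keeps a pointer to its parent, search runs over the small chunks, and the parent text is what reaches the LLM. A hand-rolled sketch with naive word-overlap search standing in for vector search (documents and query are made up):

```python
parents = {
    "doc1": "Section 4. Warranty. The warranty period is 24 months. "
            "Extensions require written approval.",
}
# Index small chunks, each linked to its parent document.
child_index = [
    ("The warranty period is 24 months.", "doc1"),
    ("Extensions require written approval.", "doc1"),
]

def retrieve_parent(query):
    # Search over the small chunks (here: word overlap), return the parent text.
    q = set(query.lower().split())
    best_chunk, parent_id = max(
        child_index,
        key=lambda item: len(q & set(item[0].lower().split())),
    )
    return parents[parent_id]

print(retrieve_parent("what is the warranty period"))
```

The precise small chunk wins the search, while the LLM still sees the surrounding section it needs to answer well.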

RAG System Development Timeline

  • Prototype (basic RAG): 1–2 weeks
  • Production-ready system with quality evaluation: 4–8 weeks
  • Advanced RAG (hybrid search, reranking, evaluation): 8–14 weeks