RAG Pipeline Architecture Design
The architecture of a RAG pipeline determines the quality, scalability, and cost of the entire system. Basic RAG "works" in a day, but a production-ready system with reliable retrieval, monitoring, and managed costs requires careful design.
Components of a Modern RAG Pipeline
┌─────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ Sources → Loaders → Parsers → Chunkers → Embedder │
│ → Metadata Extractor → Vector Store │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ RETRIEVAL PIPELINE │
│ Query → Query Transformer → Multi-Index Search │
│ → Reranker → Context Assembler │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ GENERATION PIPELINE │
│ Context + Query → Prompt Builder → LLM │
│ → Response Validator → User │
└─────────────────────────────────────────────────────┘
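Of the three pipelines above, generation gets no code in this section; its first and last steps can be sketched in a few lines. `build_prompt` and `validate_response` are hypothetical helpers standing in for a real prompt builder and response validator:

```python
# Minimal sketch of the generation stage: assemble retrieved chunks into a
# prompt, then validate the model's answer before returning it to the user.
def build_prompt(chunks: list[str], query: str) -> str:
    """Numbered context blocks let the model cite sources as [1], [2], ..."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def validate_response(answer: str) -> bool:
    """Reject empty answers and answers that cite no source at all."""
    return bool(answer.strip()) and "[" in answer
```

A real validator would also check that every cited `[n]` actually matches a context block, and could route failures back into retrieval instead of surfacing them to the user.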
Ingestion Pipeline: Architectural Solutions
Document Loaders: the choice of loader is critical for quality. PDFs with tables require pdfplumber or LlamaParse, not PyPDF2. For Word documents use python-docx; for HTML, BeautifulSoup with custom cleanup rules.
from llama_parse import LlamaParse
from langchain_community.document_loaders import (
    PyPDFLoader, UnstructuredWordDocumentLoader,
    ConfluenceLoader, NotionDBLoader
)

# For complex PDFs (tables, columns, images)
parser = LlamaParse(
    api_key="...",
    result_type="markdown",  # Preserves table structure
    language="en",
)

# Configurable loading pipeline: pick a loader by file extension
LOADERS = {
    ".pdf": lambda path: parser.load_data(path),  # Use the configured parser
    ".docx": lambda path: UnstructuredWordDocumentLoader(path).load(),
    ".html": lambda path: custom_html_loader(path),
}
Metadata enrichment: attaching structured metadata to each chunk is critical for filtering and attribution:
def enrich_chunk_metadata(chunk, source_doc):
    """Adds structured metadata to a chunk"""
    chunk.metadata.update({
        "source": source_doc.metadata.get("source"),
        "page": source_doc.metadata.get("page"),
        "doc_type": detect_doc_type(source_doc),  # "contract", "regulation", "faq"
        "department": extract_department(source_doc),
        "date": extract_date(source_doc),
        "version": extract_version(source_doc),
        "chunk_index": chunk.metadata.get("chunk_index"),
        "parent_chunk_id": chunk.metadata.get("parent_id"),
    })
    return chunk
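Once chunks carry this metadata, retrieval can be constrained before similarity search. Here is a minimal client-side sketch of the idea; in production the filtering is pushed down into the vector store (e.g. as Qdrant payload filters), and `filter_chunks` is purely illustrative:

```python
def filter_chunks(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches every given criterion.

    chunks: dicts with a "metadata" mapping, as produced by enrichment.
    criteria: exact-match constraints, e.g. doc_type="faq", department="hr".
    """
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in criteria.items())
    ]
```

For example, `filter_chunks(chunks, doc_type="contract", department="legal")` narrows the candidate set before (or after) vector search, which both improves precision and makes source attribution trivial.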
Retrieval Pipeline: Strategies
Sparse + Dense Hybrid Search:
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Hybrid search in Qdrant: BM25-style sparse + embedding dense, fused with RRF
def hybrid_search(query: str, top_k: int = 10) -> list:
    # Dense embedding
    dense_vector = embedder.embed_query(query)
    # Sparse (BM25-style) via SPLADE or FastEmbed
    sparse_vector = sparse_encoder.encode(query)
    results = client.query_points(
        collection_name="docs",
        prefetch=[
            models.Prefetch(query=dense_vector, using="dense", limit=30),
            models.Prefetch(
                query=models.SparseVector(indices=sparse_vector.indices,
                                          values=sparse_vector.values),
                using="sparse", limit=30),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
        limit=top_k,
    )
    return results.points
Reranking Pipeline:
from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2")

def rerank_results(query: str, candidates: list[str]) -> list[str]:
    rerank_request = RerankRequest(
        query=query,
        passages=[{"id": i, "text": c} for i, c in enumerate(candidates)]
    )
    results = ranker.rerank(rerank_request)
    # Sort by score descending and keep the top 5
    ranked = sorted(results, key=lambda x: -x["score"])
    return [candidates[r["id"]] for r in ranked[:5]]
Query Transformation: Improving the Query Before Search
A poorly formulated query means poor retrieval. Common query transformations:
import json

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Multi-Query: generate 3 paraphrases of the query
def multi_query_transform(original_query: str) -> list[str]:
    response = llm.invoke(f"""Generate 3 different paraphrases of the following question.
Each variant should search for the same information but using different words.
Return a JSON list of strings.
Question: {original_query}""")
    queries = json.loads(response.content)
    return [original_query] + queries  # Original + 3 paraphrases

# Step-back prompting: abstract to a more general question
def step_back_transform(specific_query: str) -> str:
    response = llm.invoke(f"""Formulate a more general question whose answer
would help answer the specific question: "{specific_query}"
Return only the question, without explanations.""")
    return response.content
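The transformed queries then need to be fanned out to the retriever and their results merged. A sketch of this step, assuming the retriever returns dicts with `id` and `score` keys (`fan_out_retrieve` is a hypothetical helper; deduplication keeps the best score per unique chunk):

```python
def fan_out_retrieve(queries: list[str], retrieve, top_k: int = 5) -> list[dict]:
    """Run each query variant through the retriever and merge the results,
    keeping the highest score seen for each unique chunk id."""
    best: dict[str, dict] = {}
    for q in queries:
        for r in retrieve(q):
            prev = best.get(r["id"])
            if prev is None or r["score"] > prev["score"]:
                best[r["id"]] = r
    # Merged candidates, best-first
    return sorted(best.values(), key=lambda r: -r["score"])[:top_k]
```

Scores from different query variants are only roughly comparable; a more robust merge uses rank-based fusion (RRF) instead of raw scores.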
Multi-Index Architecture
For systems with heterogeneous data sources, separate indexes are more effective than a single shared collection:
class MultiIndexRAG:
    def __init__(self):
        self.indexes = {
            "contracts": QdrantRetriever(collection="contracts"),
            "regulations": QdrantRetriever(collection="regulations"),
            "faq": QdrantRetriever(collection="faq"),
            "procedures": QdrantRetriever(collection="procedures"),
        }
        self.router = QueryRouter()  # Query classifier

    def retrieve(self, query: str, top_k: int = 5) -> list:
        # Determine which indexes are relevant to this query
        relevant_indexes = self.router.route(query)
        # Search each relevant index (can be parallelized)
        all_results = []
        for index_name in relevant_indexes:
            results = self.indexes[index_name].retrieve(query, k=top_k)
            for r in results:
                r.metadata["source_index"] = index_name
            all_results.extend(results)
        # Rerank the merged candidates by their text content
        by_text = {r.page_content: r for r in all_results}
        top_texts = rerank_results(query, list(by_text))
        return [by_text[t] for t in top_texts[:top_k]]
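`QueryRouter` is used above but not defined. A toy rule-based sketch follows; production routers more often use an LLM classifier or embedding similarity over index descriptions, and the keyword lists here are purely illustrative:

```python
class QueryRouter:
    """Toy rule-based router: map keyword hits to index names."""

    RULES = {
        "contracts": ("contract", "agreement", "clause"),
        "regulations": ("regulation", "compliance", "policy"),
        "faq": ("how do i", "what is", "can i"),
        "procedures": ("procedure", "process", "step"),
    }

    def route(self, query: str) -> list[str]:
        q = query.lower()
        hits = [name for name, keywords in self.RULES.items()
                if any(k in q for k in keywords)]
        # No keyword match: fall back to searching every index
        return hits or list(self.RULES)
```

The fallback matters: a router that silently drops queries it cannot classify is a common source of "retrieval found nothing" bugs.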
Monitoring Retrieval Quality
# Tracing each query for analysis
import time

from opentelemetry import trace

def traced_retrieval(query: str, span_name: str = "rag_retrieval"):
    with trace.get_tracer(__name__).start_as_current_span(span_name) as span:
        start_time = time.time()
        results = retriever.retrieve(query)
        latency = time.time() - start_time
        span.set_attributes({
            "query.length": len(query),
            "results.count": len(results),
            "results.top_score": results[0].score if results else 0,
            "retrieval.latency_ms": latency * 1000,
        })
        return results
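Tracing captures latency and scores, but retrieval quality itself is best measured offline against a labeled evaluation set. A minimal sketch, assuming each query is labeled with a single known-relevant chunk id (`hit_rate_and_mrr` is a hypothetical helper computing hit rate@k and MRR@k):

```python
def hit_rate_and_mrr(eval_set, retrieve, k: int = 5):
    """Offline retrieval metrics over a labeled evaluation set.

    eval_set: list of (query, relevant_chunk_id) pairs.
    retrieve: function mapping a query to a ranked list of chunk ids.
    Returns (hit_rate@k, MRR@k).
    """
    hits, rr_sum = 0, 0.0
    for query, relevant_id in eval_set:
        ranked = retrieve(query)[:k]
        if relevant_id in ranked:
            hits += 1
            rr_sum += 1.0 / (ranked.index(relevant_id) + 1)  # reciprocal rank
    n = len(eval_set)
    return hits / n, rr_sum / n
```

Tracking these two numbers per release catches retrieval regressions (a changed chunker, a new embedding model) before users do.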
Design and Development Timeline
- Architecture design: 1 week
- Basic ingestion pipeline: 1–2 weeks
- Advanced retrieval (hybrid search, reranking): 2–3 weeks
- Evaluation framework: 1–2 weeks
- Production hardening: 1–2 weeks
- Total: 6–10 weeks