LLM: Fine-tuning, RAG, Agents, and Production Deployment
An API call to GPT-4 or Claude 3.5 Sonnet is not a solution to the problem; it's a tool. A requirement like "build a ChatGPT-like system on our data" masks a spectrum of real tasks, from prompt tuning to training a 70B-parameter model. Where your task sits depends on the data, the latency requirements, the budget, and how critical confidentiality is.
Let's break down each layer of the stack separately.
RAG: Where It Usually Breaks and Why
RAG (Retrieval-Augmented Generation) is architecturally simple: find relevant documents, put them in the context, and let the model answer. In practice it breaks down in several places.
Chunking without overlap. A classic mistake: chunk_size=512, overlap=0. If the answer spans a chunk boundary, retrieval finds neither chunk with sufficient confidence. Solution: 15-25% overlap, and consider sentence-aware splitting via spaCy or NLTK instead of naive character splitting.
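A minimal sketch of sentence-aware chunking with overlap, using a naive regex sentence splitter as a stand-in for spaCy/NLTK (the parameters are illustrative, not tuned):

```python
import re

def chunk_sentences(text, max_chars=512, overlap_sents=2):
    """Split text into chunks of whole sentences, carrying the last
    few sentences of each chunk into the next one as overlap."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sent in sentences:
        if current and sum(len(s) for s in current) + len(sent) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]  # overlap: reuse trailing sentences
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks share their boundary sentences, an answer that straddles a boundary is fully contained in at least one chunk.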
Poor embedding model. text-embedding-ada-002 is a good general-purpose embedder, but it loses to specialized models on legal or medical texts. E5-large-v2, BGE-M3, or sentence-transformers fine-tuned on domain data deliver significantly better retrieval quality — the difference can be 15-25% on Recall@5.
No re-ranking. Vector search optimizes for speed, not relevance. Cross-encoder re-ranking (ms-marco-MiniLM-L-6-v2, bge-reranker-large) after initial retrieval substantially improves top-3 accuracy at an acceptable latency cost (+50-150 ms). This is often more impactful than upgrading the embedding model.
Hybrid search. Dense vectors alone perform poorly on exact-match queries: names, SKUs, codes. BM25 (sparse) finds exact matches well but misses semantics. Fusing both via RRF (Reciprocal Rank Fusion) is a good compromise. Qdrant, Weaviate, and pgvector 0.7+ support hybrid search natively.
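RRF itself fits in a few lines. A sketch that fuses a dense and a sparse ranking of document IDs (k=60 is the commonly used constant from the original RRF paper):

```python
def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank_d),
    then sort documents by the fused score."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked high in both lists wins even if neither ranks it first:
print(rrf_fuse(["a", "b", "c"], ["c", "a", "d"]))
```

Vector databases with native hybrid search do this fusion server-side; the sketch shows what happens under the hood.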
A typical production RAG architecture for a corporate knowledge base: documents → preprocessing (PyMuPDF, Unstructured) → chunking → embedding (BGE-M3) → Qdrant → hybrid search → cross-encoder re-ranking → context → LLM (vLLM or OpenAI API) → answer with sources.
Fine-tuning: When Prompt Engineering Is Not Enough
Prompt engineering solves 70% of LLM adaptation tasks; the remaining 30% require fine-tuning. Signs you need it: the model ignores a specific output format despite detailed instructions; the task requires deep knowledge of specialized vocabulary (medicine, law, engineering); or you need a significant cost reduction by replacing a large model with a specialized smaller one.
LoRA and QLoRA are the standard for supervised fine-tuning. LoRA adds trainable low-rank matrices to attention layers without modifying the base weights. A typical config for Llama-3 8B: r=64, lora_alpha=128, target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]. Trainable parameters come to ~0.8% of the 8B total, so training fits on a single A100 40GB.
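As a sanity check on the "under 1% trainable" figure, a back-of-the-envelope count using the published Llama-3 8B shapes (32 layers, hidden size 4096, GQA with 8 KV heads of dim 128):

```python
# Each LoRA-adapted matrix W (d_in x d_out) gets A (d_in x r) and B (r x d_out).
hidden = 4096   # model dimension (q_proj/o_proj are hidden x hidden)
kv_dim = 1024   # k_proj/v_proj output dim: 8 KV heads x 128 (GQA)
layers = 32
r = 64          # LoRA rank

def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)  # params in A plus params in B

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * layers
print(total, total / 8e9)  # ~55M adapter params, well under 1% of 8B
```

This lands in the same ballpark as the ~0.8% figure above; the exact fraction depends on which modules you target and the model's true parameter count.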
QLoRA adds 4-bit quantization of the base model: load it in NF4 via bitsandbytes and train only the LoRA adapters in bf16. This allows fine-tuning a 70B model on two A100 40GB GPUs, though training speed drops roughly 2x versus full bf16.
DPO instead of RLHF. Direct Preference Optimization is a simpler alternative to RLHF for aligning a model to a style or to preferences. It needs (chosen, rejected) pairs instead of scalar reward signals. trl (Transformer Reinforcement Learning) from Hugging Face ships a ready-made DPOTrainer — an implementation takes tens of lines.
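The DPO objective itself is compact. A toy standalone version of the per-pair loss that trl's DPOTrainer optimizes (inputs are sequence log-probabilities under the policy and the frozen reference model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair:
    L = -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy prefers the chosen answer more strongly than the reference does, the margin is positive and the loss drops below log 2; in real training these log-probs come from the model's token logits summed over the completion.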
A common fine-tuning mistake. A 500-example dataset, 5 epochs of training, validation loss 0.8 — looks fine. But on the test set the model has degraded on general instructions. Cause: catastrophic forgetting. Solution: mix in 10-20% general instruction-following examples (e.g., from Alpaca or FLAN) to avoid destroying the model's original capabilities.
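A small sketch of that mixing step, assuming examples are dicts in your chat format (function and parameter names here are illustrative):

```python
import random

def mix_datasets(domain_examples, general_examples, general_frac=0.15, seed=0):
    """Blend general instruction data into a domain fine-tuning set so that
    general_frac of the FINAL dataset is general data (the 10-20% rule)."""
    n_general = round(len(domain_examples) * general_frac / (1 - general_frac))
    rng = random.Random(seed)
    sampled = rng.sample(general_examples, min(n_general, len(general_examples)))
    mixed = domain_examples + sampled
    rng.shuffle(mixed)  # interleave so every batch sees both distributions
    return mixed
```

Shuffling matters: appending the general data as a trailing block would just move the forgetting problem to the end of training.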
Prompt Engineering and Structured Outputs
Prompt engineering is not just "write a good prompt." It's systematic work with format, few-shot examples, chain-of-thought, and context management.
For tasks requiring structured output (JSON, entity extraction, classification), use function calling / tool use (OpenAI, Claude, Mistral) or constrained generation via Outlines or Guidance. This guarantees the output format without regex postprocessing.
Structured outputs via response_format={"type": "json_schema", ...} in the OpenAI API are the most reliable option for production, where downstream systems expect a specific schema.
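A hedged sketch of such a payload for the OpenAI Chat Completions API; the "ticket" schema with its two fields is an invented example, not part of the API:

```python
import json

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket",
        "strict": True,  # enforce exact schema adherence
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": ["bug", "feature", "question"]},
                "priority": {"type": "integer", "minimum": 1, "maximum": 5},
            },
            "required": ["category", "priority"],
            "additionalProperties": False,
        },
    },
}
print(json.dumps(response_format, indent=2))
```

Passed as the response_format argument of a chat completion request, this makes the model return JSON matching the schema, so downstream code can parse it without defensive regexes.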
Prompt evaluation is separate work. Build an eval dataset of 50-200 real examples with ground truth; run automatic metrics (ROUGE, BERTScore for open-ended answers; accuracy for classification), plus LLM-as-judge for qualitative evaluation.
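For the classification case, the eval harness can be as simple as this sketch (predict would wrap your actual LLM call; the names are illustrative):

```python
def evaluate_classifier(predict, eval_set):
    """Accuracy of a str -> str prediction function over
    (input_text, ground_truth_label) pairs, ignoring case and whitespace."""
    correct = sum(
        predict(text).strip().lower() == label.strip().lower()
        for text, label in eval_set
    )
    return correct / len(eval_set)
```

Running this on every prompt revision turns "the new prompt feels better" into a number you can track.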
Multi-Agent Systems
Agents are LLMs with access to tools: web search, code execution, API requests, database queries. Key patterns:
ReAct (Reason + Act). The model reasons → selects a tool → observes the result → reasons again. LangChain and LlamaIndex implement this pattern out of the box. For production, add tool timeouts and a maximum step limit.
Multi-agent orchestration. Multiple specialized agents with a coordinator agent on top. Example: coordinator → researcher (search + summarization) + coder (code generation and execution) + critic (verification). Use AutoGen (Microsoft), CrewAI, or a custom implementation via LangGraph.
A practical note. Agent systems are non-deterministic. Production-readiness means mandatory guardrails (output validation, step/cost limits), logging of every step, and a human-in-the-loop option for critical actions.
vLLM and Production LLM Deployment
For serving proprietary or open-source models under load, vLLM is the first choice.
PagedAttention. vLLM's key innovation: the KV cache is managed like virtual memory in an OS, without fragmentation. This enables serving parallel requests with different context lengths without extra memory copying. Result: 2-4x higher throughput versus naive HuggingFace Transformers inference.
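To see why cache management dominates, a quick arithmetic sketch of per-token KV-cache memory for Llama-3 8B in bf16 (using the published shapes: 32 layers, GQA with 8 KV heads of dim 128):

```python
layers = 32
kv_heads = 8          # GQA: far fewer KV heads than the 32 query heads
head_dim = 128
bytes_per_value = 2   # bf16

# Per token, per layer: one K and one V vector of kv_heads * head_dim each.
per_token = layers * 2 * kv_heads * head_dim * bytes_per_value
print(per_token // 1024, "KiB per token")                 # 128 KiB
print(per_token * 8192 / 2**30, "GiB per 8k-token request")  # 1 GiB
```

At ~1 GiB of cache per 8k-token request, dozens of concurrent requests would exhaust an 80 GB GPU if each reserved its maximum context up front; paging the cache in small blocks is what lets vLLM pack them tightly.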
Continuous batching. Requests are added to the batch as they arrive instead of waiting for a full batch. This reduces latency under uneven load.
Typical numbers on an A100 80GB for Llama-3 8B (bf16): throughput of 400-600 output tokens/s, P50 latency 200-400 ms, P99 latency 600-900 ms at concurrency 64. For 70B on two A100 80GB with tensor parallelism: 80-120 output tokens/s, P99 latency 1.5-2.5 s.
Quantization via AWQ or GPTQ shrinks weight memory roughly 4x (16-bit → 4-bit) with quality degradation within 1-3% on most benchmarks. On an A10G (24GB) this allows running a 13B model where only a 7B fits unquantized.
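The A10G example follows directly from weight-memory arithmetic (weights only; the KV cache and activations need headroom on top):

```python
def weight_gb(params_billion, bits):
    """Memory for model weights alone, in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(13, 16))  # 26.0 GB: 13B in bf16 does not fit in 24 GB
print(weight_gb(13, 4))   # 6.5 GB: 4-bit 13B leaves room for the KV cache
print(weight_gb(7, 16))   # 14.0 GB: roughly the largest bf16 model for an A10G
```

In practice quantized checkpoints carry some overhead (scales, zero-points, unquantized embeddings), so real footprints run slightly above these figures.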
Deployment monitoring. Log latency (P50/P95/P99), throughput (tokens/s), queue depth, and cache hit rate. Grafana + Prometheus is the standard stack; vLLM exports metrics in Prometheus format natively.
Base Model Selection
| Model | Parameters | Strengths | Context |
|---|---|---|---|
| Llama-3.1 8B | 8B | Quality/speed balance | 128k |
| Llama-3.1 70B | 70B | Complex reasoning | 128k |
| Mistral 7B / Mixtral 8x7B | 7B / 47B | Efficiency per size | 32k |
| Qwen2.5 72B | 72B | Code, multilingual | 128k |
| Gemma 2 27B | 27B | Open license | 8k |
Fine-tuning an 8B model suffices for most tasks. A 70B is needed when the task requires deep reasoning or the baseline 8B doesn't reach the required quality even after fine-tuning.
Project Workflow
Task audit. Formalize exactly what the model should do and collect 100+ real examples as an eval dataset. Without an eval you can't measure progress.
Baseline via prompt engineering. Test the OpenAI/Anthropic API with a well-tuned system prompt. This is often sufficient; if not, you see the concrete gap and understand what needs to change.
RAG or fine-tuning. If the problem is knowledge of specific documents — RAG. If the problem is style, format, or specialized vocabulary — fine-tuning. Often you need both.
Training and validation. Prepare the dataset, run training with tracking in W&B, and evaluate on a holdout set and on real user queries.
Deployment and monitoring. vLLM on your own infrastructure, or managed inference (Together, Replicate, Modal). Set up alerts on latency and quality.
Timelines: a basic RAG prototype — 1-2 weeks. Fine-tuning with customer data — 3-6 weeks (including data prep). A production system with monitoring and retraining — 2-4 months.