Fine-Tuning Open-Source LLMs for Client Tasks
Fine-tuning an open-source language model is the most flexible path to obtaining a specialized AI tool with complete control over data and infrastructure. Unlike API models (GPT-4o, Claude), you own the weights, can deploy the model on-premise, scale inference without per-token fees, and adapt the architecture to specific requirements.
Choosing a Base Model for the Task
Base model selection is critical: a wrong choice leads to rework during the iteration phase.
| Task Class | Recommended Models | Rationale |
|---|---|---|
| Classification, NER, structured output | Llama 3.1 8B, Mistral 7B, Phi-4-mini | Quality sufficient, fast inference |
| Russian text generation | Qwen2.5-7B/14B, Llama 3.1 8B | Strong multilingual support |
| Programming, SQL, code review | Qwen2.5-Coder-32B, DeepSeek-Coder-V2, Phi-4 | Specialized code models |
| Complex reasoning, analysis | DeepSeek-R1-Distill-32B, Llama 3.1 70B | High reasoning, instruction-following |
| Edge/offline/mobile | Phi-4-mini, Qwen2.5-3B, Llama 3.2 3B | Small size, quantizable |
| Multimodal tasks | Llama 3.2-Vision, Qwen2-VL, InternVL | Native image support |
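For project scoping, the table can be folded into a small lookup helper. A minimal sketch: `TASK_TO_MODELS` and `recommend_base_model` are illustrative names, and the entries simply mirror the recommendations above.

```python
# Illustrative encoding of the recommendations table; names are hypothetical.
TASK_TO_MODELS = {
    "classification": ["Llama 3.1 8B", "Mistral 7B", "Phi-4-mini"],
    "russian_generation": ["Qwen2.5-7B/14B", "Llama 3.1 8B"],
    "code": ["Qwen2.5-Coder-32B", "DeepSeek-Coder-V2", "Phi-4"],
    "reasoning": ["DeepSeek-R1-Distill-32B", "Llama 3.1 70B"],
    "edge": ["Phi-4-mini", "Qwen2.5-3B", "Llama 3.2 3B"],
    "multimodal": ["Llama 3.2-Vision", "Qwen2-VL", "InternVL"],
}

def recommend_base_model(task_class: str) -> str:
    """Return the first (default) recommendation for a task class."""
    if task_class not in TASK_TO_MODELS:
        raise ValueError(f"Unknown task class: {task_class!r}")
    return TASK_TO_MODELS[task_class][0]
```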
Architecture of a Typical Fine-Tuning Project
```
Phase 1: Task and Data Audit (1–2 weeks)
├── Formalize task (classification/generation/extraction)
├── Inventory existing data
├── Assess required volume and quality
└── Choose base model and training method
```
```
Phase 2: Data Preparation (2–6 weeks)
├── Collect and aggregate sources
├── Clean (duplicates, noise, PII)
├── Label (manual/synthetic/combined)
├── Format to chat template
└── Train/val/test split (80/10/10)
```
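The 80/10/10 split that closes Phase 2 can be sketched in a few lines. `split_dataset` is a hypothetical helper; the fixed seed keeps the split reproducible across runs.

```python
import random

def split_dataset(examples: list, seed: int = 42,
                  ratios: tuple = (0.8, 0.1, 0.1)) -> tuple:
    """Shuffle and split examples into train/val/test by the given ratios."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    data = examples[:]
    rng.shuffle(data)
    n_train = int(len(data) * ratios[0])
    n_val = int(len(data) * ratios[1])
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]  # remainder goes to test
    return train, val, test
```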
```
Phase 3: Training (1–4 weeks)
├── Baseline evaluation of base model
├── First LoRA/QLoRA run with defaults
├── Analyze training/val loss curves
├── Hyperparameter tuning
└── Full Fine-Tuning if needed
```
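The LoRA runs in Phase 3 rest on a simple idea: freeze the base weight matrix W and train a low-rank update, so the effective weight is W + (alpha/r)·B·A. A toy numpy sketch of that merge step; the dimensions and the r=8, alpha=16 values are illustrative defaults, not prescriptions.

```python
import numpy as np

d, k, r, alpha = 64, 64, 8, 16    # toy dimensions; r and alpha are common defaults
W = np.random.randn(d, k)          # frozen base weight
A = np.random.randn(r, k) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # B starts at zero, so training begins exactly at W

def merged_weight(W, A, B, alpha, r):
    """Effective weight after LoRA: W + (alpha / r) * B @ A."""
    return W + (alpha / r) * (B @ A)

# With B = 0 the merged weight equals the base weight.
assert np.allclose(merged_weight(W, A, B, alpha, r), W)
```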
```
Phase 4: Evaluation and Iterations (1–3 weeks)
├── Automatic metrics (F1, BLEU, ROUGE, accuracy)
├── LLM-as-judge (GPT-4o or another strong model as evaluator)
├── Human evaluation of sample
└── Failure case analysis → data refinement
```
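The automatic metrics in Phase 4 are short enough to implement from scratch when no eval framework is in place. A minimal sketch of binary F1; the function name and the `positive` label convention are assumptions.

```python
def f1_score(y_true: list, y_pred: list, positive: str = "yes") -> float:
    """Binary F1 for one positive class, as in the Phase 4 automatic metrics."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```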
```
Phase 5: Deployment and Monitoring (1–2 weeks)
├── Quantization (optional)
├── Deployment via vLLM/TGI
├── Monitoring setup
└── A/B test vs baseline
```
Synthetic Data Generation via a Strong Model
Common scenario: the client has no labeled data but does have unstructured sources (documents, regulations, FAQs). Use GPT-4o or Claude to auto-generate training pairs:
```python
from openai import OpenAI
import json

client = OpenAI()

def generate_training_example(document_chunk: str, num_examples: int = 5) -> list:
    """Generate question-answer training pairs from a document fragment."""
    prompt = f"""You are an expert in creating datasets for language model training.
Based on the document fragment below, create {num_examples} "question-answer" pairs.
Questions should be diverse: factual, analytical, practical.
Answers must be accurate and based only on the document text.

Document:
{document_chunk}

Return a JSON object: {{"pairs": [{{"question": "...", "answer": "..."}}]}}"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        # json_object mode requires a single JSON object in the response,
        # so the prompt asks for {"pairs": [...]} rather than a bare array.
        response_format={"type": "json_object"},
        temperature=0.7,
    )
    return json.loads(response.choices[0].message.content)["pairs"]
```
Important: synthetic data requires manual verification (at least 10–15% of examples) for quality control. GPT-4o hallucinations that slip into synthetic data will enter training and degrade model quality.
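Drawing that verification sample can itself be scripted. A minimal sketch, assuming the generated pairs are held in a list; `sample_for_review` is a hypothetical name.

```python
import random

def sample_for_review(examples: list, fraction: float = 0.15, seed: int = 0) -> list:
    """Draw a random sample (default 15%) of synthetic pairs for expert review."""
    rng = random.Random(seed)          # fixed seed so the review set is reproducible
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)
```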
Practical Case: Telemedicine Specialization
Task: assistant for primary care physicians — differential diagnosis from patient complaints, exam recommendations, ICD-10 code selection.
Source data:
- 450 clinical cases with conclusions (from medical system, anonymized)
- Clinical guidelines from RF Ministry of Health for 12 conditions (PDF, 3200 pages)
- ICD-10 reference
Strategy:
- Convert clinical guidelines into chunks
- Synthetic generation of 3200 examples via GPT-4o (complaints → diagnostics)
- Verify 15% sample with practicing physicians
- Fine-tune Qwen2.5-14B (best Russian-language quality for medical terminology among the candidates)
Results (after 4 epochs QLoRA, r=32):
- Top-3 accuracy for ICD-10: 71% → 89%
- Exam recommendation completeness (recall vs expert): 0.62 → 0.84
- Hallucination rate (invented drugs/procedures): 24% → 6%
- Latency (vLLM, A100): 1.8s per request
Production Quality Monitoring
After deployment, set up a continuous monitoring system:
```python
import mlflow

# Log prediction-level metrics for drift analysis; the values below
# (avg_len, refusal_rate, ...) are assumed to be computed upstream.
with mlflow.start_run():
    mlflow.log_metrics({
        "avg_response_length": avg_len,
        "refusal_rate": refusal_rate,
        "latency_p95": latency_p95,
        "user_rating_avg": rating_avg,
    })
```
Signs of model degradation: an increased refusal rate, declining user ratings, and a higher escalation rate in downstream systems.
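These degradation signs translate directly into threshold checks against a post-launch baseline. A minimal sketch: the metric keys match the monitoring snippet above, while the function name and tolerance values are illustrative assumptions.

```python
def degradation_alerts(current: dict, baseline: dict,
                       refusal_tol: float = 0.05, rating_tol: float = 0.3) -> list:
    """Compare live metrics against a post-launch baseline; tolerances are illustrative."""
    alerts = []
    if current["refusal_rate"] > baseline["refusal_rate"] + refusal_tol:
        alerts.append("refusal_rate increased")
    if current["user_rating_avg"] < baseline["user_rating_avg"] - rating_tol:
        alerts.append("user rating dropped")
    return alerts
```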
Infrastructure Requirements
| Method | Model | GPU | VRAM | Training Time |
|---|---|---|---|---|
| QLoRA | 7B | 1×A100 40GB | 18 GB | 2–6h |
| QLoRA | 14B | 1×A100 80GB | 35 GB | 4–12h |
| QLoRA | 70B | 2×A100 80GB | 90 GB | 12–36h |
| Full FT | 7B | 4×A100 40GB | 120 GB | 8–24h |
| Full FT | 70B | 8×H100 80GB | 560 GB | 48–120h |
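A sanity check on the table's VRAM column: the weight footprint alone at a given bit width follows from simple arithmetic, and the gap between it and the table values is taken up by activations, gradients, adapter/optimizer state, and the KV cache. A rough sketch (`quantized_weights_gb` is an illustrative name):

```python
def quantized_weights_gb(params_billion: float, bits: int = 4) -> float:
    """Memory for model weights alone at a given bit width (1 GB = 1e9 bytes).
    Training VRAM (see table) is substantially higher than this floor."""
    return params_billion * 1e9 * bits / 8 / 1e9
```

For example, a 7B model at 4-bit occupies 3.5 GB as weights, against the 18 GB the table shows for a full QLoRA training run.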
Full Cycle Timeline
Minimal project (ready data, simple task): 3–5 weeks. Typical project (data prep from scratch): 8–14 weeks. Complex project (specialized domain, iterative labeling): 16–24 weeks.