Evaluating fine-tuned model quality (benchmarks, BLEU, ROUGE, perplexity)
Quality evaluation is a mandatory step after each fine-tuning iteration. Without a structured set of metrics, it is impossible to tell whether the model actually improved, where exactly it fails, and when to stop training. Proper evaluation saves time on unnecessary iterations and prevents deploying a degraded model.
Metric evaluation hierarchy
Level 1: Automatic metrics. Fast, cheap, computed without human involvement; provide rough estimates.
Level 2: LLM-as-judge. A strong model (GPT-4o, Claude 3.5 Sonnet) evaluates the tested model's answers. With a well-designed prompt it correlates well with human judgment.
Level 3: Human evaluation. The gold standard, but expensive. Use it for final validation and for calibrating the lower levels.
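The three levels can be wired together as a gate: each level runs only if the cheaper one passes. A minimal sketch, where the thresholds and the judge stub are illustrative assumptions rather than values prescribed by any framework:

```python
# Gating sketch: each evaluation level runs only if the cheaper one passes.
# Thresholds and the judge stub are illustrative assumptions.

def evaluate_checkpoint(auto_scores: dict, judge_fn=None) -> str:
    """Return how far a checkpoint got through the evaluation hierarchy."""
    # Level 1: automatic metrics reject obviously degraded checkpoints.
    if auto_scores["rouge_l"] < 0.5 or auto_scores["domain_ppl"] > 15.0:
        return "rejected_at_level_1"
    # Level 2: LLM-as-judge on a sample (stubbed; plug in a real judge call).
    judge_overall = judge_fn() if judge_fn else 4.0
    if judge_overall < 3.5:
        return "rejected_at_level_2"
    # Level 3: only survivors reach (expensive) human evaluation.
    return "ready_for_human_eval"
```

The point of the gate is economic: automatic metrics cost seconds, the judge costs dollars, and humans cost days, so each level should filter before the next one spends.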
Metrics for text generation tasks
BLEU (Bilingual Evaluation Understudy):
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
references = [[ref.split()] for ref in reference_list]
hypotheses = [hyp.split() for hyp in hypothesis_list]
bleu_4 = corpus_bleu(
    references, hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform weights over 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,
)
BLEU measures n-gram overlap between generated and reference text. Range 0–1 (or 0–100). Good for translation, summarization, structured generation. Poor for open generation with multiple correct variants.
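To make the n-gram overlap concrete, here is the clipped n-gram precision that BLEU geometrically averages over n = 1..4 (full BLEU additionally applies a brevity penalty); the example sentences are made up:

```python
from collections import Counter

def ngram_precision(reference: list[str], hypothesis: list[str], n: int) -> float:
    """Clipped n-gram precision: hypothesis n-grams that also occur in the
    reference, with each n-gram counted at most as often as in the reference."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp_ngrams = [tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1)]
    if not hyp_ngrams:
        return 0.0
    matches = Counter(hyp_ngrams) & ref_counts  # clip by reference counts
    return sum(matches.values()) / len(hyp_ngrams)

ref = "the contract is governed by german law".split()
hyp = "the contract is subject to german law".split()
p1 = ngram_precision(ref, hyp, 1)  # 5 of 7 unigrams survive clipping
p2 = ngram_precision(ref, hyp, 2)  # 3 of 6 bigrams match
```

A single substituted phrase ("governed by" vs "subject to") costs two unigrams but three bigrams, which is why higher-order n-grams penalize word-order and phrasing changes more sharply.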
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, hypothesis)
# scores['rouge1'].fmeasure, scores['rouge2'].fmeasure, scores['rougeL'].fmeasure
- ROUGE-1: unigram overlap
- ROUGE-2: bigram overlap
- ROUGE-L: longest common subsequence (considers order)
ROUGE is better than BLEU for summarization tasks.
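Since ROUGE-L is built on the longest common subsequence, a small sketch of the underlying LCS computation shows how word order enters the score (the example sentences are made up):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length, the quantity behind ROUGE-L.
    Unlike n-grams, the matched words need not be contiguous, only in order."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

ref = "the court dismissed the appeal".split()
hyp = "the appeal was dismissed by the court".split()
lcs = lcs_length(ref, hyp)      # "the ... dismissed ... the" -> 3
recall = lcs / len(ref)         # ROUGE-L recall component
precision = lcs / len(hyp)      # ROUGE-L precision component
```

Even though the hypothesis contains every content word of the reference, the passive-voice reordering caps the LCS at 3 of 5 reference tokens, which is exactly the order sensitivity ROUGE-1 would miss.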
METEOR matches stems and synonyms in addition to exact tokens, which makes it more robust than BLEU for morphologically rich languages such as Russian:
from nltk.translate.meteor_score import meteor_score
# requires nltk.download('wordnet') on first use
score = meteor_score([reference.split()], hypothesis.split())
Perplexity: model confidence metric
Perplexity measures how "surprised" the model is by test data:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def compute_perplexity(model, tokenizer, texts: list[str]) -> float:
    total_loss = 0.0
    total_tokens = 0
    model.eval()
    with torch.no_grad():
        for text in texts:
            encodings = tokenizer(text, return_tensors="pt").to(model.device)
            # With labels == input_ids, the model returns the mean cross-entropy loss
            outputs = model(**encodings, labels=encodings["input_ids"])
            n_tokens = encodings["input_ids"].shape[1]
            total_loss += outputs.loss.item() * n_tokens
            total_tokens += n_tokens
    avg_loss = total_loss / total_tokens
    return torch.exp(torch.tensor(avg_loss)).item()

# Usage
ppl = compute_perplexity(model, tokenizer, test_texts)
print(f"Perplexity: {ppl:.2f}")
A drop in perplexity on the test set after fine-tuning means the model "understands" the target domain better. A rise in perplexity on a general benchmark is a sign of catastrophic forgetting.
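Numerically, perplexity is just the exponential of the average per-token cross-entropy loss, so loss curves and perplexity are directly interconvertible; a quick sanity check (the loss value here is chosen for illustration):

```python
import math

# Perplexity = exp(average per-token cross-entropy loss).
avg_loss = 2.468               # illustrative domain-text loss in nats
ppl = math.exp(avg_loss)       # ≈ 11.8
loss_back = math.log(ppl)      # recovers the average loss
```

This also explains why perplexity changes look dramatic: a loss improvement of 0.7 nats roughly halves the perplexity.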
Metrics for classification and extraction tasks
from sklearn.metrics import classification_report, f1_score
import json
def evaluate_classification(model_outputs: list[str], ground_truth: list[str]) -> dict:
    """Evaluate LLM classification outputs against ground-truth labels."""
    predictions = []
    for output in model_outputs:
        try:
            # Assume JSON output with a "category" field
            pred = json.loads(output)["category"]
        except (json.JSONDecodeError, KeyError, TypeError):
            pred = "parse_error"
        predictions.append(pred)
    report = classification_report(ground_truth, predictions, output_dict=True, zero_division=0)
    return {
        "macro_f1": report["macro avg"]["f1-score"],
        "weighted_f1": report["weighted avg"]["f1-score"],
        "accuracy": report["accuracy"],
        "per_class": {
            k: v for k, v in report.items()
            if isinstance(v, dict) and k not in ("macro avg", "weighted avg")
        },
    }
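A toy illustration of why reporting both macro and weighted F1 matters: on imbalanced labels, a class the model never predicts drags macro F1 down, while weighted F1 can still look acceptable (the labels below are invented):

```python
from sklearn.metrics import f1_score

# Imbalanced toy set: the model collapses to always predicting "contract".
y_true = ["contract", "contract", "contract", "tort"]
y_pred = ["contract", "contract", "contract", "contract"]

macro = f1_score(y_true, y_pred, average="macro", zero_division=0)       # ≈ 0.43
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0) # ≈ 0.64
```

Here the "tort" class gets F1 = 0; macro averaging weights it equally with the majority class, exposing the collapse that accuracy (0.75) and weighted F1 hide.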
LLM-as-judge: practical implementation
from openai import OpenAI
JUDGE_PROMPT = """You are a strict expert evaluating the quality of AI assistant responses.
Question: {question}
Assistant answer: {answer}
Reference answer: {reference}
Evaluate the answer by criteria (each 1–5):
1. Factual accuracy
2. Topic coverage completeness
3. Structure
4. Style compliance
Return JSON: {{"accuracy": X, "completeness": X, "structure": X, "style": X, "overall": X, "reasoning": "..."}}"""
def llm_judge(question: str, answer: str, reference: str, client: OpenAI) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer, reference=reference),
        }],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(response.choices[0].message.content)
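Per-example judge verdicts are usually aggregated into per-criterion means over the whole eval set; a minimal sketch with made-up verdicts (same shape as the JSON the judge prompt requests):

```python
from statistics import mean

# Made-up verdicts in the shape the judge prompt requests.
verdicts = [
    {"accuracy": 4, "completeness": 5, "structure": 4, "style": 4, "overall": 4},
    {"accuracy": 3, "completeness": 4, "structure": 5, "style": 4, "overall": 4},
    {"accuracy": 5, "completeness": 4, "structure": 4, "style": 5, "overall": 5},
]

criteria = ["accuracy", "completeness", "structure", "style", "overall"]
summary = {c: round(mean(v[c] for v in verdicts), 2) for c in criteria}
```

Keeping the per-criterion breakdown (rather than just "overall") is what tells you whether a regression comes from factuality or merely from style drift.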
Practical example: comprehensive fine-tuned model evaluation
Base model: Llama 3.1 8B Instruct. Fine-tuned model: QLoRA r=16, 2000 legal document examples.
| Metric | Base model | Fine-tuned | Change |
|---|---|---|---|
| ROUGE-L | 0.41 | 0.67 | +63% |
| BLEU-4 | 0.18 | 0.39 | +117% |
| Perplexity (domain) | 24.3 | 11.8 | -51% |
| Perplexity (MMLU) | 8.2 | 9.1 | +11% (forgetting) |
| LLM-judge overall | 3.1 | 4.3 | +39% |
| F1 (NER categories) | 0.61 | 0.89 | +46% |
Perplexity on MMLU increased by 11%, indicating moderate catastrophic forgetting; this is acceptable for a narrow, specialized use case.
Post-deployment monitoring
import mlflow
# Automatic logging with each request
def log_inference_quality(prompt, response, user_feedback):
    # Note: opening a new MLflow run per request is simple but heavyweight;
    # for high-traffic production, batch metrics before logging.
    with mlflow.start_run(run_name="production-monitoring"):
        mlflow.log_metrics({
            "response_length": len(response.split()),
            "refusal_detected": int("cannot" in response.lower()),
            "user_rating": user_feedback.get("rating", -1),
        })
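The logged refusal_detected flags only become actionable once aggregated; one possible sketch is a rolling refusal-rate alert, where the window size, threshold, and the simple "cannot" heuristic are all arbitrary assumptions:

```python
from collections import deque

class RefusalMonitor:
    """Rolling refusal rate over the last N responses; a sketch of turning
    per-request flags into an alert. Window/threshold values are arbitrary."""

    def __init__(self, window: int = 100, threshold: float = 0.2):
        self.flags = deque(maxlen=window)  # old flags fall off automatically
        self.threshold = threshold

    def record(self, response: str) -> bool:
        """Record one response; return True when the refusal rate exceeds the threshold."""
        self.flags.append(int("cannot" in response.lower()))
        return sum(self.flags) / len(self.flags) > self.threshold
```

In practice the boolean return would trigger a pager or a rollback; the crude substring heuristic should eventually be replaced with a proper refusal classifier.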
Evaluation timeline
- Evaluation pipeline development: 3–5 days
- Automatic evaluation (all metrics): several hours
- LLM-as-judge (1000 examples): 1–2 days (cost ~$5–20)
- Human evaluation (200 examples): 1 week
- Total per iteration evaluation: 1–2 weeks