# Embedding Model Fine-tuning for Customer Domain
Standard embedding models are trained on general web text and academic corpora. In specialized domains (medicine, law, finance, technical standards), they struggle to distinguish domain-specific terminology, which lowers retrieval recall. Fine-tuning the embedding model on domain data improves search quality without changing the rest of the RAG infrastructure.
## When Embedding Model Fine-tuning is Necessary
Signs indicating the need for fine-tuning:
- The general embedding model poorly distinguishes domain-specific terms (MeSH terms, legal constructs, technical abbreviations)
- Context recall of the RAG system is stuck below 0.75 despite optimization of chunking and search
- High rate of "false positives" — semantically similar but thematically irrelevant documents appear in top-K
## Training Data Preparation: Triplet vs Contrastive

- Triplet Loss: triplets of (anchor, positive, negative): the query, a relevant document, and an explicitly selected irrelevant document.
- Contrastive / Multiple Negatives Ranking Loss: pairs of (query, relevant document); negatives are taken implicitly from the other documents in the batch.
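The in-batch-negatives idea can be sketched in plain NumPy as a simplified version of what the library's loss computes: for a batch of query and document embeddings, row i's paired document is the positive, and every other document in the batch serves as a negative in a softmax cross-entropy over the similarity matrix.

```python
import numpy as np

def in_batch_negatives_loss(q_emb: np.ndarray, d_emb: np.ndarray, scale: float = 20.0) -> float:
    """Mean cross-entropy where row i's positive is column i; other columns are in-batch negatives."""
    # Cosine similarity matrix between L2-normalized embeddings, scaled as in MNRL
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = scale * (q @ d.T)  # (batch, batch) scaled similarities
    # Softmax cross-entropy with the diagonal as the target class
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_matched = in_batch_negatives_loss(q, q)                     # identical pairs: near-zero loss
loss_random = in_batch_negatives_loss(q, rng.normal(size=(4, 8)))  # unrelated docs: high loss
print(loss_matched < loss_random)  # True
```

This is why larger batch sizes tend to help with this loss: each positive pair gets more in-batch negatives to contrast against.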
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Training data: pairs (query, relevant_document)
train_examples = [
    InputExample(texts=["what is the statute of limitations for labor disputes",
                        "The statute of limitations for individual labor disputes is 3 months..."]),
    InputExample(texts=["calculation of temporary disability benefits",
                        "The amount of temporary disability benefits is determined by..."]),
    # ... more examples
]

# Load base model
model = SentenceTransformer("BAAI/bge-m3")

# Multiple Negatives Ranking Loss (more efficient than triplet loss for retrieval)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Training
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./bge-m3-legal-finetuned",
    show_progress_bar=True,
)
```
## Generating Training Pairs with LLM

To create a training dataset without manual labeling, generate synthetic queries with an LLM:
```python
from openai import OpenAI
import json

client = OpenAI()

def generate_queries_for_document(doc_text: str, n: int = 5) -> list[str]:
    """Generates search queries that the document answers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate {n} search queries that a user might ask
to find the following document. Queries should vary in phrasing,
but all should be answered by this document.
Return a JSON object: {{"queries": ["...", ...]}}.

Document:
{doc_text[:1000]}""",
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["queries"]

# Generate dataset from 1000 documents
training_pairs = []
for doc in domain_documents:
    queries = generate_queries_for_document(doc.text)
    for query in queries:
        training_pairs.append(InputExample(texts=[query, doc.text]))

print(f"Generated {len(training_pairs)} training pairs")
# Typically: 1000 documents × 5 queries = 5000 pairs in ~2 hours and $5-15
```
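LLM-generated queries often repeat themselves across calls, which inflates the dataset with redundant pairs. A minimal deduplication pass before training, using simple text normalization (the helper and its normalization rules are an illustrative assumption, not part of the pipeline above):

```python
import re

def normalize(query: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for duplicate detection."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", query.lower())).strip()

def dedup_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the first occurrence of each (normalized query, document) pair."""
    seen: set[tuple[str, str]] = set()
    out = []
    for query, doc in pairs:
        key = (normalize(query), doc)
        if key not in seen:
            seen.add(key)
            out.append((query, doc))
    return out

pairs = [
    ("What is the statute of limitations?", "doc-1"),
    ("what is the statute of limitations", "doc-1"),  # near-duplicate, dropped
    ("statute of limitations deadline", "doc-1"),
]
print(len(dedup_pairs(pairs)))  # 2
```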
## Evaluating Retrieval Improvement
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Test set: queries = {query_id: query_text}, corpus = {doc_id: doc_text},
# relevant_docs = {query_id: set(relevant_doc_ids)}
evaluator = InformationRetrievalEvaluator(
    queries=test_queries,
    corpus=test_corpus,
    relevant_docs=relevance_labels,
    precision_recall_at_k=[1, 5, 10],
    ndcg_at_k=[10],
    show_progress_bar=True,
)

# Compare the base and fine-tuned models
base_model = SentenceTransformer("BAAI/bge-m3")
finetuned_model = SentenceTransformer("./bge-m3-legal-finetuned")

base_results = evaluator(base_model, output_path="./eval_base")
ft_results = evaluator(finetuned_model, output_path="./eval_finetuned")
```
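The evaluator returns a dict of metrics; the exact key names depend on the sentence-transformers version (e.g. `cosine_ndcg@10`), so treat the keys below as assumptions. A small helper for a side-by-side comparison of the two runs:

```python
def compare_metrics(base: dict[str, float], finetuned: dict[str, float]) -> dict[str, float]:
    """Relative improvement per metric present in both result dicts."""
    return {
        name: (finetuned[name] - base[name]) / base[name]
        for name in base
        if name in finetuned and base[name] > 0
    }

# Hypothetical evaluator outputs (key names vary by library version)
base = {"cosine_ndcg@10": 0.68, "cosine_recall@5": 0.61}
finetuned = {"cosine_ndcg@10": 0.84, "cosine_recall@5": 0.79}
deltas = compare_metrics(base, finetuned)
print(f"NDCG@10: {deltas['cosine_ndcg@10']:+.0%}")  # NDCG@10: +24%
```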
## Practical Case Study: Legal Documents

- Domain: Russian labor and civil law
- Base model: BAAI/bge-m3
- Dataset: 8,000 pairs (6,500 synthetic via GPT-4o-mini + 1,500 manually labeled)
| Metric | Before FT | After FT |
|---|---|---|
| NDCG@10 | 0.68 | 0.84 |
| Recall@5 | 0.61 | 0.79 |
| MRR@5 | 0.65 | 0.82 |
| Latency (inference) | unchanged | unchanged |
A 24% relative improvement in NDCG@10 (0.68 → 0.84) with no infrastructure changes: only the model weights are updated.
## Timeline
- Training dataset generation: 3–7 days
- Fine-tuning (bge-m3, 5000 pairs): 2–4 hours (A100)
- Evaluation and comparison: 2–3 days
- Total: 1–2 weeks