# Embedding Model Fine-tuning for Customer Domain
Standard embedding models are trained on general web text and academic corpora. In specialized domains (medicine, law, finance, technical standards), they struggle to distinguish domain-specific terminology, which lowers retrieval recall. Fine-tuning the embedding model on domain data improves search quality without changing the rest of the RAG infrastructure.
## When Embedding Model Fine-tuning is Necessary
Signs indicating the need for fine-tuning:
- The general embedding model poorly distinguishes domain-specific terms (MeSH terms, legal constructs, technical abbreviations)
- Context recall of the RAG system is stuck below 0.75 despite optimization of chunking and search
- High rate of "false positives" — semantically similar but thematically irrelevant documents appear in top-K
## Training Data Preparation: Triplet vs Contrastive

- Triplet Loss: triplets of (anchor, positive, negative): the query, a relevant document, and an explicitly selected irrelevant document.
- Contrastive / Multiple Negatives Ranking Loss: pairs of (query, relevant document); negatives are taken implicitly from the other documents in the batch.
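The in-batch-negatives idea can be sketched in plain NumPy as a simplified version of what the library's loss computes: for a batch of query and document embeddings, row i's paired document is the positive, and every other document in the batch serves as a negative in a softmax cross-entropy over the similarity matrix.

```python
import numpy as np

def in_batch_negatives_loss(q_emb: np.ndarray, d_emb: np.ndarray, scale: float = 20.0) -> float:
    """Mean cross-entropy where row i's positive is column i; other columns are in-batch negatives."""
    # Cosine similarity matrix between L2-normalized embeddings, scaled as in MNRL
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = scale * (q @ d.T)  # (batch, batch) scaled similarities
    # Softmax cross-entropy with the diagonal as the target class
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_matched = in_batch_negatives_loss(q, q)                     # identical pairs: near-zero loss
loss_random = in_batch_negatives_loss(q, rng.normal(size=(4, 8)))  # unrelated docs: high loss
print(loss_matched < loss_random)  # True
```

This is why larger batch sizes tend to help with this loss: each positive pair gets more in-batch negatives to contrast against.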
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Training data: pairs (query, relevant_document)
train_examples = [
    InputExample(texts=["what is the statute of limitations for labor disputes",
                        "The statute of limitations for individual labor disputes is 3 months..."]),
    InputExample(texts=["calculation of temporary disability benefits",
                        "The amount of temporary disability benefits is determined by..."]),
    # ... more examples
]

# Load base model
model = SentenceTransformer("BAAI/bge-m3")

# Multiple Negatives Ranking Loss (more efficient than triplet loss for retrieval)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Training
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./bge-m3-legal-finetuned",
    show_progress_bar=True,
)
```
## Generating Training Pairs with LLM

To create a training dataset without manual labeling, generate synthetic queries with an LLM:
```python
from openai import OpenAI
import json

client = OpenAI()

def generate_queries_for_document(doc_text: str, n: int = 5) -> list[str]:
    """Generates search queries that the document answers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate {n} search queries that a user might ask
to find the following document. Queries should vary in phrasing,
but all should be answered by this document.
Return a JSON object: {{"queries": ["...", ...]}}.

Document:
{doc_text[:1000]}""",
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["queries"]

# Generate dataset from 1000 documents
training_pairs = []
for doc in domain_documents:
    queries = generate_queries_for_document(doc.text)
    for query in queries:
        training_pairs.append(InputExample(texts=[query, doc.text]))

print(f"Generated {len(training_pairs)} training pairs")
# Typically: 1000 documents × 5 queries = 5000 pairs in ~2 hours and $5-15
```
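LLM-generated queries often repeat themselves across calls, which inflates the dataset with redundant pairs. A minimal deduplication pass before training, using simple text normalization (the helper and its normalization rules are an illustrative assumption, not part of the pipeline above):

```python
import re

def normalize(query: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for duplicate detection."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", query.lower())).strip()

def dedup_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the first occurrence of each (normalized query, document) pair."""
    seen: set[tuple[str, str]] = set()
    out = []
    for query, doc in pairs:
        key = (normalize(query), doc)
        if key not in seen:
            seen.add(key)
            out.append((query, doc))
    return out

pairs = [
    ("What is the statute of limitations?", "doc-1"),
    ("what is the statute of limitations", "doc-1"),  # near-duplicate, dropped
    ("statute of limitations deadline", "doc-1"),
]
print(len(dedup_pairs(pairs)))  # 2
```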
## Evaluating Retrieval Improvement
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Test set: queries = {query_id: query_text}, corpus = {doc_id: doc_text},
# relevant_docs = {query_id: set(relevant_doc_ids)}
evaluator = InformationRetrievalEvaluator(
    queries=test_queries,
    corpus=test_corpus,
    relevant_docs=relevance_labels,
    precision_recall_at_k=[1, 5, 10],
    ndcg_at_k=[10],
    show_progress_bar=True,
)

# Compare the base and fine-tuned models
base_model = SentenceTransformer("BAAI/bge-m3")
finetuned_model = SentenceTransformer("./bge-m3-legal-finetuned")

base_results = evaluator(base_model, output_path="./eval_base")
ft_results = evaluator(finetuned_model, output_path="./eval_finetuned")
```
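The evaluator returns a dict of metrics; the exact key names depend on the sentence-transformers version (e.g. `cosine_ndcg@10`), so treat the keys below as assumptions. A small helper for a side-by-side comparison of the two runs:

```python
def compare_metrics(base: dict[str, float], finetuned: dict[str, float]) -> dict[str, float]:
    """Relative improvement per metric present in both result dicts."""
    return {
        name: (finetuned[name] - base[name]) / base[name]
        for name in base
        if name in finetuned and base[name] > 0
    }

# Hypothetical evaluator outputs (key names vary by library version)
base = {"cosine_ndcg@10": 0.68, "cosine_recall@5": 0.61}
finetuned = {"cosine_ndcg@10": 0.84, "cosine_recall@5": 0.79}
deltas = compare_metrics(base, finetuned)
print(f"NDCG@10: {deltas['cosine_ndcg@10']:+.0%}")  # NDCG@10: +24%
```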
## Practical Case Study: Legal Documents

- Domain: Russian labor and civil law
- Base model: BAAI/bge-m3
- Dataset: 8,000 pairs (6,500 synthetic via GPT-4o-mini + 1,500 manually labeled)
| Metric | Before FT | After FT |
|---|---|---|
| NDCG@10 | 0.68 | 0.84 |
| Recall@5 | 0.61 | 0.79 |
| MRR@5 | 0.65 | 0.82 |
| Latency (inference) | unchanged | unchanged |
A 24% relative improvement in NDCG@10 (0.68 → 0.84) with no infrastructure changes: only the model weights are updated.
## Timeline
- Training dataset generation: 3–7 days
- Fine-tuning (bge-m3, 5000 pairs): 2–4 hours (A100)
- Evaluation and comparison: 2–3 days
- Total: 1–2 weeks