Embedding Model Fine-Tuning for Client Domain

Standard embedding models are trained on general web text and academic corpora. In specialized domains (medicine, law, finance, technical standards) they distinguish domain-specific terminology poorly, and retrieval recall suffers. Fine-tuning the embedding model on domain data improves search quality without touching the rest of the RAG infrastructure.

When Embedding Model Fine-tuning is Necessary

Signs indicating the need for fine-tuning:

  • The general embedding model poorly distinguishes domain-specific terms (MeSH terms, legal constructs, technical abbreviations)
  • Context recall of the RAG system is stuck below 0.75 despite optimization of chunking and search
  • High rate of "false positives" — semantically similar but thematically irrelevant documents appear in top-K
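The recall symptom is easy to confirm directly from retrieval logs before committing to fine-tuning. A minimal sketch (function and data names are illustrative, not part of any library):

```python
def recall_at_k(retrieved: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int = 5) -> float:
    """Average fraction of relevant docs that appear in the top-k results."""
    scores = []
    for qid, docs in retrieved.items():
        rel = relevant.get(qid, set())
        if rel:
            scores.append(len(set(docs[:k]) & rel) / len(rel))
    return sum(scores) / len(scores)

# Hypothetical retrieval log for two queries
retrieved = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d2", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d5"}}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: q1 hits, q2 misses
```

If this number stays below ~0.75 after chunking and query tuning, the embedding model itself is the likely bottleneck.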

Training Data Preparation: Triplet vs Contrastive

Triplet Loss: anchor, positive (relevant document), negative (irrelevant document).
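The objective itself is simple. A plain-NumPy sketch of a cosine-based triplet loss (the margin value is an illustrative hyperparameter, not a recommendation):

```python
import numpy as np

def cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(anchor, positive, negative, margin: float = 0.5) -> float:
    """Penalizes the model unless the positive is at least `margin`
    more similar to the anchor than the negative is."""
    return max(0.0, cos(anchor, negative) - cos(anchor, positive) + margin)

a = np.array([1.0, 0.0])   # anchor query embedding (toy 2-D vectors)
p = np.array([0.9, 0.1])   # relevant document
n = np.array([0.0, 1.0])   # irrelevant document
print(triplet_loss(a, p, n))  # 0.0: already separated by more than the margin
```

The loss is zero once triplets are well separated, which is why hard negatives (topically close but irrelevant documents) matter: easy negatives contribute nothing to training.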

Contrastive / Multiple Negatives Ranking Loss: pairs (query, relevant document); negatives come from the other documents in the same batch.
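The in-batch negatives trick can be sketched in NumPy as a simplified version of what MultipleNegativesRankingLoss computes (the scale factor is illustrative):

```python
import numpy as np

def mnr_loss(query_emb: np.ndarray, doc_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine scores: row i's positive is doc i,
    and every other doc in the batch serves as a free negative."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    scores = scale * (q @ d.T)  # (batch, batch) similarity matrix
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch where every query perfectly matches its own doc: loss is near zero
print(mnr_loss(np.eye(3), np.eye(3)))
```

Because a batch of 32 pairs yields 31 negatives per query for free, this loss usually trains retrieval models faster than explicit triplets.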

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Training data: pairs (query, relevant_document)
train_examples = [
    InputExample(texts=["what is the statute of limitations for labor disputes",
                        "The statute of limitations for individual labor disputes is 3 months..."]),
    InputExample(texts=["calculation of temporary disability benefits",
                        "The amount of temporary disability benefits is determined by..."]),
    # ... more examples
]

# Load base model
model = SentenceTransformer("BAAI/bge-m3")

# Multiple Negatives Ranking Loss (usually more effective than triplet loss for retrieval)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Training
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./bge-m3-legal-finetuned",
    show_progress_bar=True,
)

Generating Training Pairs with LLM

To create a training dataset without manual labeling:

from openai import OpenAI
import json

client = OpenAI()

def generate_queries_for_document(doc_text: str, n: int = 5) -> list[str]:
    """Generates search queries that the document answers"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate {n} search queries that a user might ask
to find the following document. Queries should vary in phrasing,
but all should be answered by this document.
Return a JSON object of the form {{"queries": ["...", "..."]}}.

Document:
{doc_text[:1000]}"""
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["queries"]

# Generate dataset from 1000 documents
training_pairs = []
for doc in domain_documents:
    queries = generate_queries_for_document(doc.text)
    for query in queries:
        training_pairs.append(InputExample(texts=[query, doc.text]))

print(f"Generated {len(training_pairs)} training pairs")
# Typically: 1000 documents × 5 queries = 5000 pairs in ~2 hours and $5-15
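LLM-generated queries are noisy (too generic, or not actually answered by the document). A common cleanup step is round-trip consistency filtering; a sketch, assuming a hypothetical `embed` function that returns unit-norm vectors:

```python
import numpy as np

def round_trip_filter(pairs, embed, corpus_texts, top_k=3):
    """Keep (query, doc_text) pairs whose source document lands in the
    query's top_k nearest neighbours under the base model; queries the
    base model cannot connect to their source are often mis-generated."""
    corpus_emb = embed(corpus_texts)  # (n_docs, dim), unit-norm rows
    kept = []
    for query, doc in pairs:
        scores = corpus_emb @ embed([query])[0]
        top = set(np.argsort(-scores)[:top_k])
        if corpus_texts.index(doc) in top:
            kept.append((query, doc))
    return kept
```

Expect to drop a meaningful fraction of synthetic pairs here; the remaining set is smaller but much cleaner training signal.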

Evaluating Retrieval Improvement

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Test set: queries = {qid: query_text}, corpus = {doc_id: doc_text},
# relevant_docs = {qid: set of relevant doc_ids}
evaluator = InformationRetrievalEvaluator(
    queries=test_queries,
    corpus=test_corpus,
    relevant_docs=relevance_labels,
    precision_recall_at_k=[1, 5, 10],
    ndcg_at_k=[10],
    show_progress_bar=True,
)

# Comparison of base and fine-tuned models
base_model = SentenceTransformer("BAAI/bge-m3")
finetuned_model = SentenceTransformer("./bge-m3-legal-finetuned")

base_results = evaluator(base_model, output_path="./eval_base")
ft_results = evaluator(finetuned_model, output_path="./eval_finetuned")
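With both evaluator runs in hand, comparing them is simple arithmetic. A tiny helper for reporting relative gains per metric (values here are illustrative):

```python
def relative_gain(before: float, after: float) -> str:
    """Relative improvement as a signed percentage, e.g. 0.68 -> 0.84 is +23.5%."""
    return f"{(after - before) / before:+.1%}"

print(relative_gain(0.68, 0.84))  # +23.5%
```

Reporting relative rather than absolute deltas makes gains comparable across metrics that live on different scales.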

Practical Case Study: Legal Documents

Base model: BAAI/bge-m3. Dataset: 8000 pairs (synthetic: 6500 via GPT-4o-mini + 1500 manual). Domain: Russian labor and civil law.

Metric                Before FT    After FT
NDCG@10               0.68         0.84
Recall@5              0.61         0.79
MRR@5                 0.65         0.82
Latency (inference)   unchanged    unchanged
NDCG@10 rose from 0.68 to 0.84 (roughly +24% relative) with no infrastructure changes: only the model weights were updated.

Timeline

  • Training dataset generation: 3–7 days
  • Fine-tuning (bge-m3, 5000 pairs): 2–4 hours (A100)
  • Evaluation and comparison: 2–3 days
  • Total: 1–2 weeks