Whisper Fine-Tuning for Client Domain Vocabulary


The base Whisper Large v3 shows a WER of 6–9% on standard Russian speech, but errors climb to 25–40% on medical terms, legal formulations, or technical product names. Retraining on a specific domain reduces WER to 3–8% on the target vocabulary.

### When retraining is necessary

- Specific terminology with zero or little coverage in the training data
- A strong regional or professional accent
- Low-quality recordings (8 kHz telephony, noisy conditions)
- Code-switching (a mixture of Russian and English technical terms)
- Proper nouns: names of products, brands, people

### Dataset preparation

The minimum volume for a significant improvement is 10–30 hours of labeled audio from the target domain. For a narrow specialization (one speaker, clean conditions), 2–5 hours is sufficient.
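Since WER is the success metric throughout, it helps to pin down how it is computed: word-level edit distance divided by the number of reference words. In practice a library such as `jiwer` is typically used; the `wer` helper below is an illustrative minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

One substituted word in a five-word reference gives a WER of 0.2; measuring this on a held-out set of domain utterances is how the 25–40% baseline error on specialized terms shows up.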

Training format (HuggingFace `datasets`):

```python
from datasets import Dataset, Audio
import pandas as pd

# Format: audio path + transcript
data = pd.read_csv("transcripts.csv")  # columns: audio_path, text
dataset = Dataset.from_pandas(data)
# Decode audio on access, resampled to Whisper's expected 16 kHz
dataset = dataset.cast_column("audio_path", Audio(sampling_rate=16000))
```

Data requirements:

- Sample rate: 16 kHz
- Format: WAV (preferred) or FLAC
- Transcripts: full text, with non-standard words written out rather than abbreviated
- Segment length: 5–30 seconds

### Fine-tuning pipeline

Using `transformers` + `Seq2SeqTrainer`:

```python
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3", language="Russian")

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medical-ru",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,     # trade compute for GPU memory
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    generation_max_length=225,
    predict_with_generate=True,      # compute WER on generated transcripts
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,         # lower WER is better
)
```

### Parameter-efficient fine-tuning (PEFT)

Training via LoRA updates only 1–2% of the parameters while maintaining quality:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
```
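The 1–2% figure can be sanity-checked with back-of-envelope arithmetic: each LoRA-adapted linear layer adds two low-rank matrices of roughly `r × d_model` parameters. The dimensions below (d_model, layer counts, total parameter count) are approximate assumptions about whisper-large-v3, not values from this document:

```python
# Rough estimate of the trainable fraction for LoRA with r=32 on q_proj/v_proj.
# Assumed whisper-large-v3 shape: d_model=1280, 32 encoder layers (one
# attention block each), 32 decoder layers (self- plus cross-attention).
d_model = 1280
r = 32
attention_blocks = 32 * 1 + 32 * 2            # encoder + decoder(self, cross)
lora_modules = attention_blocks * 2           # q_proj and v_proj in each block
lora_params = lora_modules * r * (d_model + d_model)  # A (d×r) + B (r×d)
total_params = 1.55e9                         # ~1.55B params, approximate
fraction = lora_params / total_params
print(f"~{lora_params / 1e6:.1f}M trainable (~{fraction:.1%} of the model)")
```

Under these assumptions the adapter comes out at roughly 16M parameters, i.e. about 1% of the model, consistent with the 1–2% claim above.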

Minimum configuration: 1x A100 80GB. Training time with 20 hours of data:

- 4,000 steps, batch size 16: ~8 hours on an A100
- Cost on AWS (p4d.24xlarge): ~$160

On a smaller budget, training on an RTX 4090 with gradient checkpointing and fp16 takes ~24–36 hours for the same 4,000 steps.

### Project Timeline

- Data preparation and labeling: 1–2 weeks (depending on the availability of transcripts)
- Training and hyperparameter selection: 3–5 days
- Testing and validation: 3–5 days
- Total: 3–4 weeks
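A quick sanity check of the schedule: 4,000 steps at an effective batch of 16 (4 per device × 4 accumulation steps, as in the training arguments above) fixes how many passes the model makes over 20 hours of data. The 15-second average segment length is an assumed midpoint of the 5–30 s range, not a figure from this document:

```python
# Illustrative arithmetic for the 4,000-step / 20-hour schedule.
per_device_batch = 4
grad_accum = 4
effective_batch = per_device_batch * grad_accum   # 16
max_steps = 4000
samples_seen = max_steps * effective_batch        # segments processed in total

hours_of_data = 20
avg_segment_sec = 15                              # assumed midpoint of 5–30 s
segments = hours_of_data * 3600 // avg_segment_sec
epochs = samples_seen / segments
print(f"effective batch {effective_batch}, ~{epochs:.0f} passes over the dataset")
```

At roughly a dozen passes over a small dataset, overfitting is a real risk, which is one reason the training arguments select the checkpoint with the best validation WER (`load_best_model_at_end` with `metric_for_best_model="wer"`).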