## Retraining Whisper for the customer's domain-specific vocabulary

Out of the box, Whisper Large v3 shows a WER of 6–9% on standard Russian speech, but errors increase to 25–40% on medical terms, legal formulations, or technical product names. Retraining for a specific domain reduces the WER to 3–8% on the target vocabulary.

### When retraining is necessary

- Specific terminology with zero or little coverage in the training data
- Strong regional or professional accent
- Low-quality recordings (8 kHz telephony, noisy conditions)
- Code-switching (a mixture of Russian and English technical terms)
- Proper nouns: names of products, brands, and people

### Dataset preparation

The minimum volume for a significant improvement is 10–30 hours of labeled audio from the target domain. For a narrow specialization (one speaker, clean conditions), 2–5 hours is sufficient.
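The WER figures quoted above are word error rates: the word-level edit distance between hypothesis and reference, divided by the reference length. In practice libraries such as `jiwer` or `evaluate` are used; a minimal self-contained reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-row dynamic programming over words.
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev_row[j] + 1,            # deletion
                           cur[j - 1] + 1,             # insertion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = cur
    return prev_row[-1] / max(len(ref), 1)
```

For example, one wrong word out of three gives a WER of 1/3 ≈ 33%.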
Training format (HuggingFace `datasets`):

```python
from datasets import Dataset, Audio
import pandas as pd
# Format: audio path + transcript
data = pd.read_csv("transcripts.csv") # columns: audio_path, text
dataset = Dataset.from_pandas(data)
dataset = dataset.cast_column("audio_path", Audio(sampling_rate=16000))
```
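A sanity check worth running before training is segment duration. A small sketch that drops segments outside a 5–30 second window (the helper name and bounds are illustrative; the `audio_path` column layout follows the loading example above):

```python
MIN_SEC, MAX_SEC = 5.0, 30.0

def in_duration_range(example, min_s=MIN_SEC, max_s=MAX_SEC):
    """True if the decoded audio is between min_s and max_s seconds long."""
    audio = example["audio_path"]  # decoded dict: {"array": ..., "sampling_rate": ...}
    duration = len(audio["array"]) / audio["sampling_rate"]
    return min_s <= duration <= max_s

# dataset = dataset.filter(in_duration_range)  # drops out-of-range segments
```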
Data requirements:

- Sample rate: 16 kHz
- Format: WAV (preferred) or FLAC
- Transcripts: full text, with non-standard words written out in full rather than abbreviated
- Segment length: 5–30 seconds

### Fine-tuning pipeline

Using `transformers` + `Seq2SeqTrainer`:

```python
from transformers import (
WhisperForConditionalGeneration,
WhisperProcessor,
Seq2SeqTrainer,
Seq2SeqTrainingArguments
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3", language="Russian")
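# The raw dataset still needs log-mel features and tokenized labels before
# training. A sketch of the mapping step (the function name and the
# "audio_path"/"text" column names follow the loading example above and are
# assumptions, not a fixed contract):
def prepare_dataset(batch):
    audio = batch["audio_path"]
    # Feature extractor: waveform -> log-mel input_features
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenizer: transcript -> label token ids
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)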
training_args = Seq2SeqTrainingArguments(
output_dir="./whisper-medical-ru",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=1e-5,
warmup_steps=500,
max_steps=4000,
gradient_checkpointing=True,
fp16=True,
evaluation_strategy="steps",
eval_steps=500,
save_steps=500,
generation_max_length=225,
predict_with_generate=True,
load_best_model_at_end=True,
metric_for_best_model="wer",
greater_is_better=False,
)
```
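The arguments above still have to be wired into a `Seq2SeqTrainer` together with a data collator and a WER-based `compute_metrics`. A minimal standalone sketch of the collator, assuming each feature already holds fixed-shape `input_features` and tokenized `labels` (HF examples typically use a processor-backed `DataCollatorSpeechSeq2SeqWithPadding`; this is a simplified version):

```python
import torch

class SpeechSeq2SeqCollator:
    """Stack fixed-shape log-mel features; pad labels with -100 (ignored by the loss)."""

    def __call__(self, features):
        # Log-mel features share one shape after the processor, so just stack.
        input_features = torch.stack(
            [torch.as_tensor(f["input_features"], dtype=torch.float32) for f in features]
        )
        # Pad label sequences to the longest in the batch with -100.
        max_len = max(len(f["labels"]) for f in features)
        labels = torch.full((len(features), max_len), -100, dtype=torch.long)
        for i, f in enumerate(features):
            labels[i, : len(f["labels"])] = torch.as_tensor(f["labels"], dtype=torch.long)
        return {"input_features": input_features, "labels": labels}
```

The trainer is then assembled roughly as `Seq2SeqTrainer(model=model, args=training_args, train_dataset=..., eval_dataset=..., data_collator=SpeechSeq2SeqCollator(), compute_metrics=...)`, where `compute_metrics` decodes the generated ids with the processor and computes WER (e.g. via the `evaluate` or `jiwer` packages).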
### Parameter-efficient fine-tuning (PEFT)

Training via LoRA retrains only 1–2% of the parameters while maintaining quality:

```python
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
```
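To see where the 1–2% figure comes from: each adapted `d×d` linear layer gains two low-rank matrices with `r·2d` parameters in total. A back-of-the-envelope estimate for whisper-large-v3 (hidden size 1280, 32 encoder and 32 decoder layers; `q_proj`/`v_proj` appear in encoder self-attention and in both decoder self- and cross-attention; ~1.55 B total parameters is an approximation):

```python
d_model = 1280                      # whisper-large-v3 hidden size
r = 32                              # LoRA rank from the config above
# q_proj and v_proj occur in: encoder self-attn (2 per layer),
# decoder self-attn (2 per layer), decoder cross-attn (2 per layer).
n_linears = 32 * 2 + 32 * 4
lora_params = n_linears * r * 2 * d_model   # A (d×r) + B (r×d) per linear
total_params = 1.55e9                       # approximate full model size
fraction = lora_params / total_params
print(f"{lora_params / 1e6:.1f}M trainable ({fraction:.1%})")  # ~15.7M, ~1.0%
```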
### Hardware and cost

Minimum configuration: 1x A100 80 GB. Training time with 20 hours of data:

- 4,000 steps, effective batch size 16 (per-device batch 4 × gradient accumulation 4): ~8 hours on an A100
- Cost on AWS (p4d.24xlarge): ~$160

On a smaller budget, an RTX 4090 with gradient checkpointing and fp16 handles the same 4,000 steps in ~24–36 hours.

### Project timeline

- Data preparation and labeling: 1–2 weeks (depending on the availability of transcripts)
- Training and hyperparameter selection: 3–5 days
- Testing and validation: 3–5 days
- Total: 3–4 weeks