Speech Recognition and Synthesis: ASR, TTS, Voice Cloning
A customer arrives with a task: transcribe 40,000 hours of call-center recordings in a week. Cloud ASR (Google Speech-to-Text) gives 28% WER on the industry vocabulary, and the bill is substantial. The task: cut WER below 10% and move to self-hosted inference.
Common Problems
WER does not converge to the target metric. Usually the problem is not the architecture but the data: noisy audio without loudness normalization (vs the -23 LUFS broadcast standard), mixed languages, accents, domain-specific vocabulary. Whisper large-v3 gives 8–12% WER on clean Russian but degrades to 25–35% on PSTN artifacts and the narrow-band G.711 codec.
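Before deciding whether the gap is data or architecture, measure WER on your own recordings. The `jiwer` library computes the same metric; here is a minimal pure-Python sketch using word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("transfer to the savings account", "transfer to a saving account"))  # 0.4
```

Two substitutions out of five reference words gives 0.4; run this over a held-out hour of transcripts to get a trustworthy baseline.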
Diarization breaks down with >2 speakers. pyannote/speaker-diarization-3.1 is stable at 2–3 speakers; DER grows from 6% to 18–22% with 5+ conference participants. Overlapping speech makes it worse: the default min_duration_on=0.1 cuts off short interjections.
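What the min_duration_on threshold does is easy to see in isolation. pyannote applies it inside the segmentation stage; the sketch below only reproduces the effect as a post-filter over plain (start, end, speaker) tuples, independent of the pyannote API:

```python
def filter_short_segments(segments, min_duration=0.1):
    """Keep only speaker turns lasting at least min_duration seconds.

    segments: list of (start, end, speaker) tuples, times in seconds.
    """
    return [(s, e, spk) for s, e, spk in segments if e - s >= min_duration]

turns = [
    (0.0, 2.5, "A"),
    (2.5, 2.56, "B"),  # 60 ms interjection ("uh-huh")
    (2.6, 5.0, "A"),
]
print(filter_short_segments(turns, min_duration=0.1))   # interjection dropped
print(filter_short_segments(turns, min_duration=0.05))  # interjection kept
```

Lowering the threshold recovers short backchannel turns at the cost of more false speaker changes on noise.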
Voice cloning: latency or quality, pick one. XTTS v2 (Coqui) sounds natural, but even when streaming with stream_chunk_size=20 the first chunk arrives in 1.4–2.0 s, unacceptable for interactive use. StyleTTS2 and Kokoro are faster but require careful reference-audio preparation.
Practical Solution Stack
Baseline production pipeline:
- ASR: openai/whisper-large-v3 or faster-whisper (CTranslate2 backend, 4x speedup over the original)
- Diarization: pyannote.audio 3.x, integrated via whisperx for word-level alignment
- TTS: XTTS v2 for quality; Edge-TTS or Silero for low latency
- Cloning: XTTS v2 (3–6 s of reference audio) or OpenVoice v2
A typical call-center pipeline: audio from a Kafka queue → normalization via ffmpeg -af loudnorm to -23 LUFS → faster-whisper with beam_size=5, vad_filter=True → pyannote diarization → post-processing (punctuation restoration via deepmultilingualpunctuation) → PostgreSQL with timestamps.
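The normalization and ASR stages of that pipeline can be sketched as follows. `build_loudnorm_cmd` is a hypothetical helper (the loudnorm TP/LRA values are its defaults, not from the case above); the `transcribe` function follows faster-whisper's documented API and is deferred so the helper runs standalone:

```python
import subprocess  # used by the commented-out run step below

def build_loudnorm_cmd(src: str, dst: str, target_lufs: float = -23.0) -> list[str]:
    """ffmpeg command: normalize loudness (EBU R128 loudnorm) and
    resample to 16 kHz mono WAV, the input format Whisper expects."""
    return ["ffmpeg", "-y", "-i", src,
            "-af", f"loudnorm=I={target_lufs}:TP=-2:LRA=7",
            "-ar", "16000", "-ac", "1", dst]

def transcribe(path: str):
    # deferred import: needs `pip install faster-whisper` and a GPU for float16
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3", compute_type="float16")
    segments, _info = model.transcribe(path, beam_size=5, vad_filter=True)
    return [(s.start, s.end, s.text) for s in segments]

cmd = build_loudnorm_cmd("call.mp3", "call_norm.wav")
# subprocess.run(cmd, check=True)  # then: transcribe("call_norm.wav")
print(" ".join(cmd))
```

Queue consumption, diarization, and the database write slot in around these two calls.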
A real case: a fintech with 12,000 calls/day. Russian with banking vocabulary gave 22% WER on Google STT. After fine-tuning whisper-medium on 200 h of annotated audio via Hugging Face transformers + Seq2SeqTrainer with learning_rate=1e-5, warmup_steps=500, WER dropped to 7.3%. Inference on an A10G via faster-whisper at compute_type=float16 processes a 40-minute call in 55 seconds. Cost: $0.0008/min vs $0.016/min in the cloud.
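Those numbers imply a real-time factor worth checking when sizing hardware. A back-of-the-envelope sketch (the average call length is an assumption here, not a figure from the case):

```python
# speedup from the case: a 40-minute call processed in 55 seconds
call_minutes = 40
processing_seconds = 55
rtf = (call_minutes * 60) / processing_seconds
print(f"speedup: {rtf:.1f}x real time")

calls_per_day = 12_000
avg_call_min = 6  # assumption: average call length, not stated in the case
audio_min_per_day = calls_per_day * avg_call_min
gpu_minutes_needed = audio_min_per_day / rtf
print(f"{audio_min_per_day} audio min/day -> {gpu_minutes_needed:.0f} GPU-min/day")

cloud = audio_min_per_day * 0.016
self_hosted = audio_min_per_day * 0.0008
print(f"cloud ${cloud:.0f}/day vs self-hosted ${self_hosted:.0f}/day")
```

Note that under these assumptions the required GPU-minutes exceed the 1,440 minutes in a day, so one A10G at this real-time factor is not quite enough; plan for headroom.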
Fine-tuning Whisper on Domain Data
When the general model is insufficient, fine-tuning is the first tool. The minimum for a measurable improvement is 20–30 h of annotated audio in the target domain. Annotation is iterative: run the data through the base model → manually fix the 10–15% of errors → retrain → repeat.
```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-medium-domain",  # any checkpoint directory
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    predict_with_generate=True,
    generation_max_length=225,
)
```
Critical: freeze the encoder for the first 1000 steps (model.freeze_encoder()), otherwise acoustic features diverge before the decoder adapts to the new vocabulary.
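In a Seq2SeqTrainer run this schedule would typically live in a TrainerCallback. The framework-free sketch below shows only the scheduling logic; `EncoderFreezeSchedule` and `Param` are illustrative names (`Param` stands in for torch.nn.Parameter, whose requires_grad flag is what freeze_encoder() actually clears):

```python
class EncoderFreezeSchedule:
    """Keep encoder parameters frozen for the first `unfreeze_at` steps."""

    def __init__(self, unfreeze_at: int = 1000):
        self.unfreeze_at = unfreeze_at

    def encoder_trainable(self, step: int) -> bool:
        return step >= self.unfreeze_at

    def apply(self, step: int, encoder_params) -> None:
        trainable = self.encoder_trainable(step)
        for p in encoder_params:
            p.requires_grad = trainable  # frozen params get no gradient updates

class Param:  # stand-in for torch.nn.Parameter
    def __init__(self):
        self.requires_grad = True

params = [Param(), Param()]
sched = EncoderFreezeSchedule(unfreeze_at=1000)
sched.apply(step=0, encoder_params=params)
print(params[0].requires_grad)    # False while frozen
sched.apply(step=1000, encoder_params=params)
print(params[0].requires_grad)    # True after unfreezing
```

Calling `apply` from the trainer's per-step hook gives the "freeze first 1000 steps" behavior described above.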
Speech Synthesis: Task Selection
| Model | Latency (TTFB) | Naturalness MOS | Cloning | Languages |
|---|---|---|---|---|
| XTTS v2 | 1.2–2.0 s | 4.1–4.3 | Yes, 3 s ref | 17 |
| StyleTTS2 | 0.3–0.6 s | 4.0–4.2 | Yes, requires tuning | en, + fine-tune |
| Kokoro-82M | 0.08–0.15 s | 3.7–3.9 | No | en, ja |
| Silero TTS | 0.05–0.1 s | 3.4–3.6 | No | ru, en, de, etc |
| Edge-TTS | ~0.4 s (cloud) | 4.0 | No | 100+ |
For interactive bots requiring TTFB < 300 ms: Silero or Kokoro. For content voiceover where naturalness matters: XTTS v2 with streaming over WebSocket.
Workflow
Start with an audit session: take 2–4 h of your recordings, run them through the candidate models, measure WER/CER, and analyze the error distribution (lexical, acoustic, language). This takes 1–2 days and immediately shows whether fine-tuning is needed.
Then select the architecture for your throughput: a single GPU for 1,000 min/day or a cluster for 100,000+ min/day. Deploy via Docker + FastAPI, or Triton for batched inference.
Timelines depend on complexity: basic model integration takes 1–2 weeks; fine-tuning with data preparation and validation, 4–8 weeks; a complete voice pipeline (ASR + diarization + TTS + monitoring), 2–4 months.