Speech Technology Solution Development

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

Speech Recognition and Synthesis: ASR, TTS, Voice Cloning

A customer arrives with a task: transcribe 40,000 hours of call-center recordings in a week. Cloud ASR (Google Speech-to-Text) yields 28% WER on the industry vocabulary and is costly at that volume. The task: bring WER below 10% and move to self-hosted inference.

Common Problems

WER does not converge to the required metric. The culprit is usually data, not architecture: noisy audio without loudness normalization (far from the -23 LUFS standard), mixed languages, accents, domain-specific vocabulary. Whisper large-v3 gives 8–12% WER on clean Russian speech but degrades to 25–35% on PSTN artifacts and the narrow-band G.711 codec.

Diarization breaks down with more than two speakers. pyannote/speaker-diarization-3.1 is stable with 2–3 speakers, but DER grows from ~6% to 18–22% with 5+ conference participants. Overlapping speech makes it worse, and the default min_duration_on=0.1 cuts off short interjections.

Voice cloning: latency or quality. XTTS v2 (Coqui) sounds natural, but even with streaming at stream_chunk_size=20 the first chunk arrives in 1.4–2.0 s, which is unacceptable for interactive use. StyleTTS2 and Kokoro are faster but require careful preparation of the reference audio.

Practical Solution Stack

Baseline production pipeline:

  • ASR: openai/whisper-large-v3 or faster-whisper (CTranslate2 backend, 4x speed vs original)
  • Diarization: pyannote.audio 3.x + integration via whisperx for word alignment
  • TTS: XTTS v2 for quality, Edge-TTS or Silero for low latency
  • Cloning: XTTS v2 (3–6 s reference audio) or OpenVoice v2

Typical call-center pipeline: Kafka audio queue → normalization via ffmpeg -af loudnorm to -23 LUFS → faster-whisper with beam_size=5, vad_filter=True → pyannote diarization → post-processing (punctuation restoration via deepmultilingualpunctuation) → PostgreSQL with timestamps.
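The normalization step in that pipeline can be sketched as a small helper that builds the ffmpeg command. The file paths are examples, and the loudnorm parameters other than the -23 LUFS target (TP, LRA) are illustrative defaults rather than values from the pipeline above:

```python
import shlex

def loudnorm_cmd(src: str, dst: str, target_lufs: float = -23.0) -> list[str]:
    """ffmpeg command: loudness-normalize to `target_lufs` and resample
    to 16 kHz mono, the input format Whisper models expect."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", f"loudnorm=I={target_lufs}:TP=-2:LRA=7",
        "-ar", "16000", "-ac", "1",
        dst,
    ]

cmd = loudnorm_cmd("raw/call_001.wav", "norm/call_001.wav")
print(shlex.join(cmd))
# in a pipeline worker this would be executed via subprocess.run(cmd, check=True)
```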

A real case: a fintech with 12,000 calls/day, Russian speech with banking vocabulary, 22% WER with Google STT. After fine-tuning whisper-medium on 200 h of annotated audio via Hugging Face transformers + Seq2SeqTrainer with learning_rate=1e-5 and warmup_steps=500, WER dropped to 7.3%. Inference on an A10G via faster-whisper with compute_type=float16 processes a 40-minute call in 55 seconds. Cost: $0.0008/min vs $0.016/min in the cloud.

Fine-tuning Whisper on Domain Data

When a general-purpose model is insufficient, fine-tuning is the first tool. The minimum for a measurable improvement is 20–30 h of annotated audio in the target domain. Annotation is iterative: run the data through the base model → manually fix the 10–15% of errors → retrain → repeat.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetune",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    predict_with_generate=True,
    generation_max_length=225,
)

Critical: freeze the encoder for the first 1000 steps (model.freeze_encoder()), otherwise the acoustic features diverge before the decoder adapts to the new vocabulary.
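That schedule can be sketched as a small framework-agnostic callback. The `model` here is assumed to expose a Whisper-style `freeze_encoder()` helper and an `encoder` with iterable parameters; this interface shape is illustrative, not the exact Hugging Face API:

```python
class EncoderFreezeSchedule:
    """Keep the encoder frozen for the first `freeze_steps` training steps,
    then unfreeze it so acoustic features can adapt as well."""

    def __init__(self, freeze_steps: int = 1000):
        self.freeze_steps = freeze_steps
        self.unfrozen = False

    def on_step_begin(self, step: int, model) -> None:
        if step == 0:
            # freeze at the very start of training
            model.freeze_encoder()
        elif step >= self.freeze_steps and not self.unfrozen:
            # one-time unfreeze once the decoder has adapted
            for p in model.encoder.parameters():
                p.requires_grad = True
            self.unfrozen = True
```

In a real run this logic would live in a transformers `TrainerCallback` hooked into the training loop.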

Speech Synthesis: Task Selection

| Model | Latency (TTFB) | Naturalness (MOS) | Cloning | Languages |
|---|---|---|---|---|
| XTTS v2 | 1.2–2.0 s | 4.1–4.3 | Yes, 3 s reference | 17 |
| StyleTTS2 | 0.3–0.6 s | 4.0–4.2 | Yes, requires tuning | en, + fine-tune |
| Kokoro-82M | 0.08–0.15 s | 3.7–3.9 | No | en, ja |
| Silero TTS | 0.05–0.1 s | 3.4–3.6 | No | ru, en, de, etc. |
| Edge-TTS | ~0.4 s (cloud) | 4.0 | No | 100+ |

For interactive bots that need TTFB < 300 ms, use Silero or Kokoro. For content voiceover where naturalness matters, use XTTS v2 with streaming over WebSocket.
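As a toy illustration of that choice, a filter over the table's lower-bound TTFB numbers (the model keys are informal labels, not package names):

```python
# name: (best-case TTFB in seconds, supports voice cloning)
MODELS = {
    "xtts_v2": (1.2, True),
    "styletts2": (0.3, True),
    "kokoro_82m": (0.08, False),
    "silero": (0.05, False),
    "edge_tts": (0.4, False),
}

def pick_tts(ttfb_budget_s: float, need_cloning: bool) -> list[str]:
    """Return models whose first-byte latency fits the budget."""
    return [
        name for name, (ttfb, clones) in MODELS.items()
        if ttfb <= ttfb_budget_s and (clones or not need_cloning)
    ]

print(pick_tts(0.3, need_cloning=False))  # interactive bot candidates
print(pick_tts(2.0, need_cloning=True))   # voiceover with cloning
```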

Workflow

Start with an audit session: take 2–4 h of your recordings, run them through the candidate models, measure WER/CER, and analyze the error distribution (lexical, acoustic, language). This takes 1–2 days and immediately shows whether fine-tuning is needed.
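The WER metric used in such an audit is just word-level edit distance over reference length; a minimal self-contained implementation (in practice a library like jiwer does the same with text normalization on top):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming table over word sequences, row by row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)

# one substituted word out of four -> 0.25
print(wer("перевести деньги на карту", "перевести деньги на счёт"))  # 0.25
```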

Then select an architecture for the target throughput: a single GPU for ~1,000 min/day or a cluster for 100,000+ min/day. Deploy via Docker + FastAPI, or Triton for batched inference.
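Back-of-the-envelope sizing for that choice, using the ~44x realtime factor implied by the earlier case (a 40-minute call in 55 s on an A10G); the 70% utilization derate is an assumption to leave headroom for queueing:

```python
import math

def gpus_needed(minutes_per_day: float, realtime_factor: float = 43.6,
                utilization: float = 0.7) -> int:
    """Rough GPU count: daily audio minutes divided by what one GPU
    processes in 24 h at the given realtime factor, derated."""
    capacity = 24 * 60 * realtime_factor * utilization  # audio min/day per GPU
    return max(1, math.ceil(minutes_per_day / capacity))

print(gpus_needed(1_000))    # 1 -> single GPU is plenty
print(gpus_needed(100_000))  # 3 -> small cluster
```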

Timelines depend on complexity: basic model integration takes 1–2 weeks; fine-tuning with data preparation and validation, 4–8 weeks; a complete voice pipeline (ASR + diarization + TTS + monitoring), 2–4 months.