How long does it take to develop a TTS system?

It depends on complexity. Basic cloud TTS integration takes 2–3 days, self-hosted with queue about a week, full system with custom voice 3–4 weeks. Exact timelines are determined during analysis.

Can you clone a voice from scratch?

Yes, modern models like Coqui XTTS v2 clone a voice from 6–10 seconds of recording. For high accuracy, 10–30 minutes of speaker material is needed. The result is a unique voice with natural intonations.

Which engine is best for Russian?

For Russian, Yandex SpeechKit (cloud, low latency) and Silero TTS (open-source, excellent quality) are optimal. Coqui XTTS also supports Russian and gives good results with customization.

Do I need a GPU for self-hosted TTS?

Yes, for low latency (under 500 ms) a CUDA-capable GPU is required. T4 or V100 works for experimentation; A10G or A100 is recommended for production. Piper can run on CPU but with higher latency.

What is the difference between cloud and self-hosted TTS?

Cloud TTS is faster to deploy, requires no GPU or infrastructure, but cost depends on generation volume. Self-hosted provides full data control and predictable costs but needs setup and a GPU. Choice depends on latency and privacy requirements.

How long does it take to develop a TTS system?

It depends on complexity. Basic cloud TTS integration takes 2–3 days, self-hosted with queue about a week, full system with custom voice 3–4 weeks. Exact timelines are determined during analysis.

Can you clone a voice from scratch?

Yes, modern models like Coqui XTTS v2 clone a voice from 6–10 seconds of recording. For high accuracy, 10–30 minutes of speaker material is needed. The result is a unique voice with natural intonations.

Which engine is best for Russian?

For Russian, Yandex SpeechKit (cloud, low latency) and Silero TTS (open-source, excellent quality) are optimal. Coqui XTTS also supports Russian and gives good results with customization.

Do I need a GPU for self-hosted TTS?

Yes, for low latency (under 500 ms) a CUDA-capable GPU is required. T4 or V100 works for experimentation; A10G or A100 is recommended for production. Piper can run on CPU but with higher latency.

What is the difference between cloud and self-hosted TTS?

Cloud TTS is faster to deploy, requires no GPU or infrastructure, but cost depends on generation volume. Self-hosted provides full data control and predictable costs but needs setup and a GPU. Choice depends on latency and privacy requirements.

Text-to-Speech System: Speech Synthesis with Voice Customization

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1564 services

Text-to-Speech System: Speech Synthesis with Voice Customization

Medium

from 1 week to 3 months

Frequently Asked Questions

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

B2B ADVANCE company website development
1358
Development of a web application for FEEDME
1250
Website development for BELFINGROUP
956
Development of an online store for the company FURNORO
1188
B2B Advance company logo design
646
Development of a web application for Enviok
929

Show more works

Text-to-Speech System: Speech Synthesis with Voice Customization

You launch a voice assistant. The first problem—synthesis latency: if it exceeds 500 ms, users drop the call. The second—an unnatural voice reduces trust. Building a TTS system is not just picking an engine; it's an integration that accounts for latency, cost, and customization. Our engineers have 10+ years of experience in NLP and audio processing, and we have deployed 5 major TTS systems for banks and telecom operators. After customizing XTTS v2 to a host's voice, call retention increased by 22%.

Modern neural synthesizers like Coqui XTTS v2 and ElevenLabs generate speech indistinguishable from human. Latency is 200–500 ms. A self-hosted solution with a custom voice scores 40% higher in MOS than cloud generic synthesis. For volumes exceeding 100,000 generations per month, self-hosted is 30% cheaper than cloud.

How to Choose a TTS Engine for Production

The choice depends on the scenario. For a voice bot, low latency is critical—Azure Speech or Yandex SpeechKit are suitable. For audiobooks and content, maximum quality is needed—Coqui XTTS or ElevenLabs.

Cloud TTS—fast start, predictable quality:

OpenAI TTS: best quality in English, good in Russian
ElevenLabs: most natural sound, voice cloning
Yandex SpeechKit: optimal for Russian-language products

Self-hosted TTS—data control, predictable cost:

Coqui XTTS v2: multilingual, cloning from 6 seconds
Piper: lightweight, CPU-capable, good quality in Russian
Silero TTS: Russian open-source, excellent Russian

Comparison of cloud vs self-hosted:

Parameter	Cloud	Self-hosted
Latency	100-300 ms	200-500 ms (with GPU)
Cost	Per token/second	Fixed (GPU)
Data control	No	Full
Customization	Limited	Full fine-tuning

What Voice Customization Provides

Standard voices do not fit brands. We perform fine-tuning of a pretrained model on 10–30 minutes of speaker recordings. The result is a unique voice preserving intonations and diction. Such a voice is 40% higher in user MOS than generic synthesis. Example: a voice assistant for a bank after customizing XTTS v2 to a host's voice increased call retention by 22%.

Typical Mistakes in TTS Development

Missing text normalization: numbers, dates, abbreviations must be transformed. Without it, numeric amounts sound unnatural.
Ignoring pauses and punctuation: TTS without pause insertion sounds unnatural, especially in long sentences.
Not considering latency when choosing an engine: for IVR, <200 ms is critical; for audiobooks, 500+ ms is acceptable.
Skimping on GPU for self-hosted: without GPU, latency >1 s, unacceptable for interactive scenarios.

How We Build a TTS System: Process

Scenario and requirements analysis—latency measurements, budget, language.
Engine selection and testing—cloud, self-hosted, custom.
API development and integration—FastAPI, task queue (Celery), caching.
Voice customization—data collection, fine-tuning, MOS evaluation.
Load testing—p99 latency, throughput, GPU utilization.
Deployment and monitoring—Docker, Prometheus, Grafana.

Basic Implementation with FastAPI

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import io
import soundfile as sf
from TTS.api import TTS

app = FastAPI()
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

@app.post("/synthesize")
async def synthesize(text: str, language: str = "ru"):
    wav = tts.tts(
        text=text,
        language=language,
        speaker_wav="reference_voice.wav"  # для клонирования
    )

    buffer = io.BytesIO()
    sf.write(buffer, wav, samplerate=24000, format='WAV')
    buffer.seek(0)

    return StreamingResponse(buffer, media_type="audio/wav")

Text Preprocessing

Before feeding to TTS, a normalizer is mandatory: expand abbreviations, numbers, dates:

def normalize_for_tts(text: str, language: str = "ru") -> str:
    # numbers: e.g., "500" → "five hundred"
    # abbreviations: "ООО" → "общество с ограниченной ответственностью"
    # dates: преобразуются по правилам языка
    ...

Estimated Timelines

Basic cloud TTS integration: from 2 to 3 days
Self-hosted with queue and caching: from 1 week
Full system with custom voice: from 3 to 4 weeks

Cost is calculated individually after analyzing your scenario.

What's Included

Technical architecture documentation
Access to the code repository
Deployment instructions
Team training (1–2 sessions)
One month of support after delivery

Experience and Guarantees

5 years in the market, 20+ projects in voice interfaces. We guarantee synthesis stability under loads of up to 10,000 requests/day. Certifications: compatibility with Kubernetes, experience with NVIDIA Triton. Contact us to evaluate your project. Order a TTS system with a custom voice—get a consultation on engines and timelines.

Additional information on technologies can be found on the Speech synthesis Wikipedia page.

Speech Recognition and Synthesis: ASR, TTS, Voice Cloning

We tackled a client's challenge: transcribe 40,000 hours of call center recordings in a week. Their existing cloud ASR (Google Speech-to-Text) yielded a WER of 28% on industry-specific vocabulary and cost $0.006 per minute — prohibitively expensive at that volume. The goal was to reduce WER below 10% and switch to self-hosted inference. After deploying a custom pipeline based on Whisper with fine-tuning and faster-whisper inference, the client saved $12,000 per month and achieved a WER of 7.3%.

How does speech recognition ASR handle noisy call center recordings?

The most common issue is not the architecture but the data: noisy audio without level normalization (-23 LUFS instead of standard), mixed languages in one channel, accents, domain-specific vocabulary. Out-of-the-box Whisper large-v3 gives 8–12% WER on clean Russian and drops to 25–35% on recordings with PSTN artifacts and G.711 narrowband codec. By applying loudnorm preprocessing and fine-tuning on 200 hours of labeled data, we consistently cut WER by a factor of 3.

Typical problems we encounter

WER does not converge to the desired metric. Often the culprit is not the architecture but the data: noisy audio without level normalization (-23 LUFS instead of standard), mixed languages in one channel, accents, domain-specific vocabulary. Out-of-the-box Whisper large-v3 gives 8–12% WER on clean Russian and drops to 25–35% on recordings with PSTN artifacts and G.711 narrowband codec.

Diarization fails with more than two speakers. pyannote/speaker-diarization-3.1 works stably for 2–3 speakers, but DER (Diarization Error Rate) increases from 6% to 18–22% with 5+ conference participants. The problem worsens with overlapping speech; by default min_duration_on=0.1 cuts short interjections. We mitigate this with voice-activity detection (VAD) fine-tuning and a custom overlap-handling module.

Voice cloning — latency vs. quality. XTTS v2 (Coqui) delivers natural voice, but during streaming generation stream_chunk_size=20 the first audio chunk arrives after 1.4–2.0 seconds — unacceptable for interactive scenarios. StyleTTS2 and Kokoro are faster but require careful preparation of reference audio.

How do we solve it in practice?

The basic stack for a production pipeline:

ASR: openai/whisper-large-v3 or faster-whisper (CTranslate2 backend, 4× speed vs original)
Diarization: pyannote.audio 3.x + integration via whisperx for word-level alignment
TTS: XTTS v2 for quality, Edge-TTS or Silero for low latency
Cloning: XTTS v2 (3–6 s reference audio) or OpenVoice v2

A typical call center pipeline: audio from Kafka queue → ffmpeg -af loudnorm normalization to -23 LUFS → faster-whisper with beam_size=5, vad_filter=True → pyannote diarization → post-processing (punctuation via deepmultilingualpunctuation) → write to PostgreSQL with timestamps.

Case study from our practice. A fintech company with 12,000 calls per day. Initial WER on Russian with banking vocabulary — 22% (Google STT). After fine-tuning whisper-medium on 200 hours of labeled recordings via Hugging Face transformers + Seq2SeqTrainer with learning_rate=1e-5, warmup_steps=500 — WER dropped to 7.3%. Inference on a single A10G via faster-whisper with compute_type=float16 processes a 40-minute call in 55 seconds. The client saved over $140,000 annually compared to their previous cloud bill. Contact us for a free pilot estimate to see similar savings on your data.

How to fine-tune Whisper on domain data?

When a general model underperforms, fine-tuning is the first tool. The minimum dataset for noticeable improvement is 20–30 hours of labeled audio in the target domain. Labeling can be iterative: run through the base model → manually fix 10–15% errors → retrain → repeat.

training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    predict_with_generate=True,
    generation_max_length=225,
)

Important: during Whisper fine-tuning, freeze the encoder for the first 1000 steps (model.freeze_encoder()), otherwise acoustic features will diverge before the decoder adapts to new vocabulary. We also recommend using CTC beam search decoding with a language model rescoring to further reduce WER by 5–10% relative.

Model	WER (clean)	WER (noisy)	RTF (A10G)	Languages
Whisper large-v3	5.2%	27%	0.08	99
Wav2Vec2-XLSR-53	6.8%	32%	0.12	143
Google STT (cloud)	7.0%	28%	–	125
DeepSpeech 0.9.3	11.5%	41%	0.06	8

Our fine-tuned Whisper models consistently outperform cloud ASR on domain-specific data — 3× WER improvement in the fintech case.

Speech synthesis: How to choose a model for your task?

Model	Latency (TTFB)	Naturalness MOS	Cloning	Languages
XTTS v2	1.2–2.0 s	4.1–4.3	Yes, 3 s reference	17
StyleTTS2	0.3–0.6 s	4.0–4.2	Yes, requires adaptation	en, + fine-tune
Kokoro-82M	0.08–0.15 s	3.7–3.9	No	en, ja
Silero TTS	0.05–0.1 s	3.4–3.6	No	ru, en, de, etc.
Edge-TTS	~0.4 s (cloud)	4.0	No	100+

For interactive bots requiring TTFB < 300 ms — Silero or Kokoro. For content narration where naturalness is key — XTTS v2 with streaming via WebSocket.

Our process and deliverables

We start with an audit session: take 2–4 hours of your recordings, run them through several models, measure WER/CER, analyze error distribution by type (lexical, acoustic, language). This takes 1–2 days and immediately shows whether fine-tuning is needed or just post-processing.

Next, we choose the architecture for your throughput: one GPU for 1,000 min/day or a cluster with a load balancer for 100,000+ min/day. Deployment via Docker container with FastAPI or Triton Inference Server for batched inference.

What you get after engagement:

Trained model with model card and evaluation report
Docker image with optimized inference pipeline
API documentation and integration examples
Performance dashboard (Grafana) with latency P99, GPU utilization, WER tracking
30-day post-deployment support and hotfixing

Timelines depend on complexity:

Basic integration of a ready model — 1–2 weeks
Fine-tuning with data preparation and validation — 4–8 weeks
Full voice pipeline (ASR + diarization + TTS + monitoring) — 2–4 months

Project investments typically range from $20,000 to $80,000. Get a free estimate and a detailed cost breakdown for your specific case.

Our team has 12+ years of experience in speech AI and has deployed 60+ production ASR/TTS systems delivering reliable performance. Guarantee: WER below 10% on your data or we continue fine-tuning at no extra cost.

Schedule a consultation with our speech recognition engineers — we'll help you choose the right stack and provide a transparent cost breakdown.