TTS Model Training (VITS/XTTS)
Training a custom TTS model gives full control over voice, language, and style, with no dependency on external APIs and no recurring costs. It is ideal for a unique brand voice, synthesis in rare languages, and edge deployment.
Architecture Choice
For most tasks, choose XTTS v2 for a quick start with minimal data, or VITS for full training on a clean dataset.
Dataset Preparation
Requirements:
- Format: 22050 Hz, 16-bit, mono WAV
- Duration: 2–15 sec per clip
- Minimum: 1000 clips for intelligible TTS
- Recommended: 3000–5000 clips for high quality
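The format and duration requirements above can be checked automatically before training. The following is a minimal sketch that uses only Python's standard `wave` module; the function names `check_clip` and `check_dataset` and the flat directory of `.wav` files are assumptions made for illustration, not part of any TTS toolkit.

```python
import wave
from pathlib import Path

# Targets taken from the requirements list above.
SAMPLE_RATE_HZ = 22050
SAMPLE_WIDTH_BYTES = 2   # 16-bit PCM
CHANNELS = 1             # mono
MIN_SECONDS, MAX_SECONDS = 2.0, 15.0


def check_clip(path):
    """Return a list of problems found in one WAV file (empty list = OK).

    Note: wave.open raises wave.Error for files that are not plain PCM WAV.
    """
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        if rate != SAMPLE_RATE_HZ:
            problems.append(f"sample rate {rate} Hz, expected {SAMPLE_RATE_HZ}")
        if wav.getsampwidth() != SAMPLE_WIDTH_BYTES:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
        if wav.getnchannels() != CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        duration = wav.getnframes() / rate
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            problems.append(f"duration {duration:.2f} s outside 2-15 s")
    return problems


def check_dataset(wav_dir):
    """Map each failing file name in wav_dir to its list of problems."""
    report = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        problems = check_clip(path)
        if problems:
            report[path.name] = problems
    return report
```

Calling `check_dataset("wavs")` (where `wavs` is a hypothetical folder name) returns an empty dictionary when every clip meets the requirements; otherwise each entry lists what to fix, such as resampling or trimming.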
Timeline: dataset preparation takes 2–4 weeks, VITS training takes 1–2 weeks on a GPU, and the full cycle takes 4–6 weeks.







