Multi-Speaker TTS Implementation (Multiple Voices in One System)

Multi-Speaker TTS is a system in which multiple voices coexist within a single architecture. It is needed for audiobooks with multiple characters, dialogue scenarios, and IVR flows with different roles.

### Multi-speaker System Architecture

```python
from dataclasses import dataclass
from enum import Enum

class SpeakerRole(Enum):
    ASSISTANT = "assistant"
    NARRATOR = "narrator"
    CHARACTER_1 = "character_1"
    CHARACTER_2 = "character_2"

@dataclass
class Speaker:
    role: SpeakerRole
    name: str
    voice_config: dict
    reference_audio: str | None = None

class MultiSpeakerTTS:
    def __init__(self, speakers: list[Speaker]):
        # Index speakers by role for O(1) lookup during synthesis
        self.speakers = {s.role: s for s in speakers}
        self._init_engines()  # engine-specific setup (model loading, caching)

    def synthesize(self, text: str, role: SpeakerRole) -> bytes:
        speaker = self.speakers[role]
        # Delegate to the underlying engine with this speaker's settings
        return self._synthesize_with_config(text, speaker.voice_config)
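# Example wiring (hypothetical voice_config values; _init_engines and
# _synthesize_with_config are engine-specific and not defined here):
#
#     speakers = [
#         Speaker(SpeakerRole.NARRATOR, "Narrator", {"voice": "narrator_v1"}),
#         Speaker(SpeakerRole.CHARACTER_1, "Alice", {"voice": "alice_v1"}),
#     ]
#     tts = MultiSpeakerTTS(speakers)
#     audio = tts.synthesize("Once upon a time...", SpeakerRole.NARRATOR)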
```

### Implementation on XTTS v2

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Preload speaker latents (reference WAVs) for speed
SPEAKERS = {
    "narrator": "voices/narrator.wav",
    "alice": "voices/alice.wav",
    "bob": "voices/bob.wav",
}

def synthesize_dialog(dialog: list[dict]) -> list[bytes]:
    """
    dialog: [{"speaker": "alice", "text": "Привет!"},
              {"speaker": "bob", "text": "Здравствуй!"}]
    """
    results = []
    for line in dialog:
        speaker_wav = SPEAKERS[line["speaker"]]
        wav = tts.tts(
            text=line["text"],
            speaker_wav=speaker_wav,
            language="ru"
        )
        results.append(wav)
    return results
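
# Dialogue scripts often arrive as plain text rather than structured dicts.
# A minimal parser for a "NAME: line" format (the format itself is an
# assumption) that produces the structure expected by synthesize_dialog:
def parse_script(script: str) -> list[dict]:
    dialog = []
    for raw in script.strip().splitlines():
        if ":" not in raw:
            continue  # skip blank lines and stage directions
        name, _, text = raw.partition(":")
        dialog.append({"speaker": name.strip().lower(), "text": text.strip()})
    return dialog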
```

### Cloud multi-speaker via Azure

Azure Neural TTS supports multiple voices in a single SSML document:

```xml
<speak version='1.0' xml:lang='ru-RU'>
  <voice name='ru-RU-DmitryNeural'>
    Добрый день! Это Дмитрий.
  </voice>
  <break time='300ms'/>
  <voice name='ru-RU-SvetlanaNeural'>
    Привет! А это Светлана.
  </voice>
</speak>
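
<!-- Delivery can also be tuned per voice via the standard SSML prosody
     element; the rate/pitch values below are illustrative assumptions: -->
<speak version='1.0' xml:lang='ru-RU'>
  <voice name='ru-RU-DmitryNeural'>
    <prosody rate='-10%' pitch='-2st'>
      Добрый день! Говорю чуть медленнее и ниже.
    </prosody>
  </voice>
</speak>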
```

### Dialogue editing

```python
import io

from pydub import AudioSegment

def assemble_dialog(audio_clips: list[bytes], pause_ms: int = 300) -> bytes:
    combined = AudioSegment.empty()
    silence = AudioSegment.silent(duration=pause_ms)

    for i, clip in enumerate(audio_clips):
        segment = AudioSegment.from_wav(io.BytesIO(clip))
        combined += segment
        if i < len(audio_clips) - 1:
            combined += silence

    output = io.BytesIO()
    combined.export(output, format="mp3")
    return output.getvalue()
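
# Estimating the output length before assembly helps with progress reporting
# and storage planning (hypothetical helper, pure arithmetic):
def dialog_duration_ms(clip_durations_ms: list[int], pause_ms: int = 300) -> int:
    """Total duration: all clips plus one pause between each adjacent pair."""
    pauses = max(len(clip_durations_ms) - 1, 0)
    return sum(clip_durations_ms) + pause_ms * pauses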
```

Timeframe: multi-speaker cloud system – 2–3 days; self-hosted with voice control – 1 week.