## Multi-Speaker TTS Implementation (Multiple Voices in a Single System)

Multi-speaker TTS is a system in which several voices coexist within a single architecture. It is needed for audiobooks with multiple characters, dialogue scenarios, and IVR systems with different roles.

### Multi-speaker System Architecture
```python
from dataclasses import dataclass
from enum import Enum

class SpeakerRole(Enum):
    ASSISTANT = "assistant"
    NARRATOR = "narrator"
    CHARACTER_1 = "character_1"
    CHARACTER_2 = "character_2"

@dataclass
class Speaker:
    role: SpeakerRole
    name: str
    voice_config: dict
    reference_audio: str | None = None

class MultiSpeakerTTS:
    def __init__(self, speakers: list[Speaker]):
        self.speakers = {s.role: s for s in speakers}
        self._init_engines()  # engine-specific setup (model loading, API clients)

    def synthesize(self, text: str, role: SpeakerRole) -> bytes:
        speaker = self.speakers[role]
        return self._synthesize_with_config(text, speaker.voice_config)
```
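The engine-specific methods above are left abstract. To illustrate how the role registry dispatches to per-speaker configs, here is a self-contained sketch with a stubbed backend; the `_synthesize_with_config` body and its `[voice] text` return format are purely illustrative, not part of any real engine:

```python
from dataclasses import dataclass
from enum import Enum

class SpeakerRole(Enum):
    NARRATOR = "narrator"
    CHARACTER_1 = "character_1"

@dataclass
class Speaker:
    role: SpeakerRole
    name: str
    voice_config: dict

class MultiSpeakerTTS:
    def __init__(self, speakers: list[Speaker]):
        # Index speakers by role so synthesize() can look them up directly
        self.speakers = {s.role: s for s in speakers}

    def synthesize(self, text: str, role: SpeakerRole) -> bytes:
        speaker = self.speakers[role]
        return self._synthesize_with_config(text, speaker.voice_config)

    def _synthesize_with_config(self, text: str, config: dict) -> bytes:
        # Stub: a real engine would return synthesized audio bytes here
        return f"[{config['voice']}] {text}".encode()

tts = MultiSpeakerTTS([
    Speaker(SpeakerRole.NARRATOR, "Narrator", {"voice": "deep"}),
    Speaker(SpeakerRole.CHARACTER_1, "Alice", {"voice": "bright"}),
])
audio = tts.synthesize("Once upon a time", SpeakerRole.NARRATOR)
```

The point is the dispatch pattern: callers address speakers by role, never by engine-level voice parameters.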
### Implementation with XTTS v2
```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Pre-register speaker reference clips (caching speaker latents speeds up repeated synthesis)
SPEAKERS = {
    "narrator": "voices/narrator.wav",
    "alice": "voices/alice.wav",
    "bob": "voices/bob.wav",
}

def synthesize_dialog(dialog: list[dict]) -> list[bytes]:
    """
    dialog: [{"speaker": "alice", "text": "Привет!"},
             {"speaker": "bob", "text": "Здравствуй!"}]
    """
    results = []
    for line in dialog:
        speaker_wav = SPEAKERS[line["speaker"]]
        wav = tts.tts(
            text=line["text"],
            speaker_wav=speaker_wav,
            language="ru",
        )
        results.append(wav)
    return results
```
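The dialog list format above can also be produced from a plain screenplay-style script. A small helper for that; the `parse_script` name and the `Name: line` convention are our own, not part of the TTS package:

```python
def parse_script(script: str) -> list[dict]:
    """Turn lines like 'Alice: Hello!' into the dialog list format above."""
    dialog = []
    for raw in script.strip().splitlines():
        speaker, _, text = raw.partition(":")
        if text:
            dialog.append({"speaker": speaker.strip().lower(),
                           "text": text.strip()})
    return dialog

dialog = parse_script("""
Alice: Hi!
Bob: Hello!
""")
```

The lower-cased speaker name is then used directly as a key into the `SPEAKERS` registry.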
### Cloud multi-speaker via Azure

Azure Neural TTS supports multiple voices in a single SSML document:
```xml
<speak version='1.0' xml:lang='ru-RU'>
  <voice name='ru-RU-DmitryNeural'>
    Добрый день! Это Дмитрий.
  </voice>
  <break time='300ms'/>
  <voice name='ru-RU-SvetlanaNeural'>
    Привет! А это Светлана.
  </voice>
</speak>
```
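Such documents can be generated from the same dialog list rather than written by hand. A sketch under these assumptions: `build_ssml` and the `VOICES` mapping are our own names, and the text is XML-escaped with the standard library:

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from dialog speaker names to Azure voice names
VOICES = {
    "dmitry": "ru-RU-DmitryNeural",
    "svetlana": "ru-RU-SvetlanaNeural",
}

def build_ssml(dialog: list[dict], lang: str = "ru-RU",
               pause: str = "300ms") -> str:
    """Render a dialog list as one multi-voice SSML document."""
    parts = [f"<speak version='1.0' xml:lang='{lang}'>"]
    for i, line in enumerate(dialog):
        voice = VOICES[line["speaker"]]
        parts.append(f"<voice name='{voice}'>{escape(line['text'])}</voice>")
        if i < len(dialog) - 1:
            parts.append(f"<break time='{pause}'/>")
    parts.append("</speak>")
    return "".join(parts)
```

The resulting string is sent to the synthesis endpoint in a single request, so the pauses and voice switches are rendered server-side.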
### Dialogue assembly
```python
import io

from pydub import AudioSegment

def assemble_dialog(audio_clips: list[bytes], pause_ms: int = 300) -> bytes:
    """Join WAV clips into one MP3, inserting a pause between lines."""
    combined = AudioSegment.empty()
    silence = AudioSegment.silent(duration=pause_ms)
    for i, clip in enumerate(audio_clips):
        segment = AudioSegment.from_wav(io.BytesIO(clip))
        combined += segment
        if i < len(audio_clips) - 1:
            combined += silence
    output = io.BytesIO()
    combined.export(output, format="mp3")  # mp3 export requires ffmpeg
    return output.getvalue()
```
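If the pydub/ffmpeg dependency is unwanted and all clips share one PCM format, the same pause-insertion can be sketched with only the standard-library `wave` module; `assemble_dialog_wav` is our own name, the output stays WAV rather than MP3, and identical sample rate/width/channels across clips is assumed:

```python
import io
import wave

def assemble_dialog_wav(clips: list[bytes], pause_ms: int = 300) -> bytes:
    """Concatenate same-format WAV clips, inserting silence between them."""
    out = io.BytesIO()
    writer = None
    silence = b""
    for i, clip in enumerate(clips):
        with wave.open(io.BytesIO(clip)) as reader:
            params = reader.getparams()
            frames = reader.readframes(reader.getnframes())
        if writer is None:
            # First clip defines the output format and the silence buffer
            writer = wave.open(out, "wb")
            writer.setparams(params)
            silence = b"\x00" * (params.sampwidth * params.nchannels
                                 * int(params.framerate * pause_ms / 1000))
        writer.writeframes(frames)
        if i < len(clips) - 1:
            writer.writeframes(silence)
    if writer is not None:
        writer.close()
    return out.getvalue()
```

This avoids re-encoding entirely, which also keeps the assembly step fast on long audiobooks.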
Timeframe: a cloud-based multi-speaker system takes roughly 2–3 days; a self-hosted system with voice control takes about a week.







