## Streaming TTS implementation

Streaming TTS starts playing audio before the full text has been synthesized: time-to-first-audio drops from 1–3 seconds to 100–400 ms. This is critical for voice bots, where the user should hear the response immediately.

### How streaming TTS works

The text is split into sentences → each sentence is synthesized separately → chunks are sent to the client as soon as they are ready. While the client plays the first chunk, the server is already preparing the second.

### Implementation with OpenAI TTS Streaming

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_tts(text: str):
    """Stream TTS audio from OpenAI, yielding chunks as they arrive."""
    async with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format="pcm",  # raw PCM for minimal latency (no container overhead)
    ) as response:
        async for chunk in response.iter_bytes(chunk_size=4096):
            yield chunk
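# The raw PCM stream carries no header, so to save or play it with standard
# tools the client has to wrap it in a WAV container. A minimal sketch using
# only the stdlib (assumption: tts-1 PCM output is 24 kHz, 16-bit, mono;
# the helper name pcm_to_wav is illustrative):
import io
import wave

def pcm_to_wav(pcm: bytes, rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()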
```

### WebSocket server for real-time TTS

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from TTS.api import TTS
import numpy as np
import asyncio
import re

app = FastAPI()
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def split_into_sentences(text: str) -> list[str]:
    """Split text after sentence-final punctuation.
    E.g. "Hi! How are you? Fine." -> ["Hi!", "How are you?", "Fine."]
    """
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

@app.websocket("/tts-stream")
async def tts_websocket(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            text = await websocket.receive_text()
            for sentence in split_into_sentences(text):
                # Synthesis is blocking, so run it in a thread pool
                wav = await asyncio.get_running_loop().run_in_executor(
                    None,
                    lambda s=sentence: tts.tts(
                        text=s,
                        language="ru",
                        speaker_wav="default.wav",
                    ),
                )
                # Convert float samples to 16-bit PCM bytes and send
                audio_bytes = (np.array(wav) * 32767).astype(np.int16).tobytes()
                await websocket.send_bytes(audio_bytes)
            # Signal the end of synthesis for this request
            await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass
```

### Latency optimization

**Chunk-based synthesis**: split the text into phrases of 10–20 words, synthesize them in parallel, and stream the results in order.

**Pre-generation**: template phrases such as "Please wait" or "Just a moment" are synthesized once, cached, and served without calling the TTS engine.

**ElevenLabs Streaming API**:

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs()

audio_stream = client.text_to_speech.convert_as_stream(
    voice_id="voice_id",
    text="Text for streaming synthesis",
    model_id="eleven_turbo_v2_5",  # ~75 ms model latency
)
```

### Time-to-first-audio by engine

| TTS | TTFA |
|-----|------|
| ElevenLabs Turbo | ~100 ms |
| OpenAI TTS-1 streaming | ~200 ms |
| Azure Neural TTS streaming | ~150 ms |
| Coqui XTTS (self-hosted, GPU) | ~300–500 ms |
| Yandex SpeechKit | ~200–300 ms |

Estimated effort: 2–3 days for a cloud streaming TTS integration; about one week for a self-hosted WebSocket TTS server.
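The chunk-based synthesis idea above can be sketched as a simple word-count splitter that prefers to cut right after sentence-final punctuation (the 10–20-word bounds and the name `chunk_text` are illustrative, not from a library):

```python
def chunk_text(text: str, min_words: int = 10, max_words: int = 20) -> list[str]:
    """Greedily group words into phrases of roughly min_words..max_words,
    cutting at sentence boundaries when possible."""
    chunks: list[str] = []
    current: list[str] = []
    for word in text.split():
        current.append(word)
        at_boundary = word.endswith((".", "!", "?"))
        # Cut at a sentence boundary once the minimum size is reached,
        # or unconditionally once the maximum size is reached.
        if (at_boundary and len(current) >= min_words) or len(current) >= max_words:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned phrase can then be fed to the synthesis loop independently, which is what keeps the pipeline's time-to-first-audio low.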