Real-Time Streaming TTS Implementation


Streaming TTS starts playing audio before all of the text has been synthesized: time-to-first-audio drops from 1–3 seconds to 100–400 ms. This is critical for voice bots, where the user should hear the response immediately.

### Streaming TTS principle

Text is split into sentences → each sentence is synthesized separately → chunks are sent to the client as soon as they are ready. While the client listens to the first chunk, the server prepares the second.

### Implementation with OpenAI TTS Streaming

```python
from openai import AsyncOpenAI
import asyncio

client = AsyncOpenAI()

async def stream_tts(text: str):
    """Потоковая генерация TTS через OpenAI"""
    async with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format="pcm",  # raw PCM для минимальной задержки
    ) as response:
        async for chunk in response.iter_bytes(chunk_size=4096):
            yield chunk
```

### WebSocket server for real-time TTS

```python
from fastapi import FastAPI, WebSocket
from TTS.api import TTS
import numpy as np
import asyncio

app = FastAPI()
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

def split_into_sentences(text: str) -> list[str]:
    import re
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

@app.websocket("/tts-stream")
async def tts_websocket(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            text = await websocket.receive_text()
            sentences = split_into_sentences(text)

            for sentence in sentences:
                wav = await asyncio.get_running_loop().run_in_executor(
                    None,
                    lambda s=sentence: tts.tts(
                        text=s,
                        language="ru",
                        speaker_wav="default.wav"
                    )
                )
                # Convert to 16-bit PCM bytes and send
                audio_bytes = (np.array(wav) * 32767).astype(np.int16).tobytes()
                await websocket.send_bytes(audio_bytes)

            # End-of-synthesis signal
            await websocket.send_json({"type": "done"})
    except Exception:
        # Client disconnected or synthesis failed; the socket is already
        # closing at this point, so calling close() again would raise.
        pass
```

### Latency optimization

**Chunk-based synthesis**: break the text into phrases of 10–20 words, synthesize them, and stream them in parallel.

**Pre-generation**: template phrases such as "Please wait" and "Just a moment" are cached and served without synthesis.

**ElevenLabs Streaming API**:

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs()
audio_stream = client.text_to_speech.convert_as_stream(
    voice_id="voice_id",
    text="Текст для потокового синтеза",
    model_id="eleven_turbo_v2_5",  # 75 мс задержка
)
```

### Time-to-first-audio by engine

| TTS | TTFA |
|-----|------|
| ElevenLabs Turbo | ~100 ms |
| OpenAI TTS-1 streaming | ~200 ms |
| Azure Neural TTS streaming | ~150 ms |
| Coqui XTTS (self-hosted, GPU) | ~300–500 ms |
| Yandex SpeechKit | ~200–300 ms |

Timelines: cloud streaming TTS integration takes 2–3 business days; a self-hosted WebSocket TTS server takes about 1 week.
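The chunk-based synthesis step above can be sketched as a greedy word-count splitter (an illustrative helper; the name `split_into_phrases` and the 15-word default are assumptions, not part of any library):

```python
def split_into_phrases(text: str, max_words: int = 15) -> list[str]:
    """Greedily group words into phrases of at most `max_words` words,
    so each phrase can be synthesized and streamed independently."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

In practice you would split on sentence punctuation first (as `split_into_sentences` does above) and fall back to word counts only for very long sentences.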
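The pre-generation idea can be sketched as a cache warmed once at startup (a sketch; `synthesize` here is a hypothetical stand-in for a real TTS call such as `tts.tts(...)`):

```python
TEMPLATE_PHRASES = ["Please wait", "Just a moment"]

def synthesize(text: str) -> bytes:
    # Hypothetical placeholder for a real TTS call returning PCM bytes.
    return text.encode("utf-8")

# Warm the cache once at startup; template phrases then skip synthesis.
phrase_cache: dict[str, bytes] = {p: synthesize(p) for p in TEMPLATE_PHRASES}

def get_audio(text: str) -> bytes:
    # Cached templates are returned instantly; everything else is synthesized.
    cached = phrase_cache.get(text)
    return cached if cached is not None else synthesize(text)
```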
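Both the OpenAI example (`response_format="pcm"`) and the WebSocket server send raw 16-bit mono PCM, which most players cannot open directly. A small stdlib helper can wrap it in a WAV container for debugging (the 24 kHz default matches OpenAI's `pcm` output; verify the sample rate for your engine):

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24_000) -> bytes:
    """Wrap raw 16-bit signed mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```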