# Speech-to-Speech (Real-Time Voice Translation) System Development

Speech-to-Speech (STS) translates speech in one language into speech in another in real time, preserving the speaker's voice characteristics. This is a non-trivial engineering problem: the end-to-end latency of the entire pipeline must be minimized while maintaining translation and synthesis quality.

### STS System Architecture

```
STT  →  Translation  →  TTS
 ↓          ↓           ↓
~200ms    ~100ms     ~300ms

Total: ~600–1000ms
```

Components:

1. **STT** — transcription of the source speech (Whisper streaming / Deepgram)
2. **MT** — machine translation (GPT-4o, DeepL API, NLLB)
3. **Voice Conversion** — transfer of the original speaker's voice characteristics to the synthesized speech
4. **TTS** — synthesis of the translated text

### Async pipeline implementation

```python
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI()
async def speech_to_speech_pipeline(
    audio_chunk: bytes,
    source_lang: str,
    target_lang: str,
    speaker_voice: str = "alloy",
) -> bytes:
    # Stage 1: STT — transcribe the source audio
    transcript_response = await client.audio.transcriptions.create(
        model="whisper-1",
        file=("audio.wav", audio_chunk, "audio/wav"),
        language=source_lang,
    )
    transcript = transcript_response.text
    if not transcript.strip():
        return b""

    # Stage 2: translation
    translation_response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Translate into {target_lang}. Output only the translation, no explanations."},
            {"role": "user", "content": transcript},
        ],
        temperature=0.1,
    )
    translated = translation_response.choices[0].message.content

    # Stage 3: TTS — synthesize the translation
    tts_response = await client.audio.speech.create(
        model="tts-1",
        voice=speaker_voice,
        input=translated,
        response_format="pcm",
    )
    return tts_response.content
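
# speech_to_speech_pipeline() above takes one short audio chunk; in real-time
# use, the incoming stream is first cut into fixed-duration chunks. A minimal
# helper (illustrative sketch, not part of any API): for 16-bit mono PCM,
# bytes per chunk = sample_rate * 2 * chunk_ms / 1000.
def chunk_pcm(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 200) -> list[bytes]:
    """Split 16-bit mono PCM into fixed-duration chunks (the last may be shorter)."""
    step = sample_rate * 2 * chunk_ms // 1000
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]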
```

### Sentence-level streaming latency optimization

We don't wait for the end of the entire phrase; instead, we translate and synthesize it sentence by sentence:

```python
async def streaming_sts(text_stream):
    buffer = ""
    async for word in text_stream:
        buffer += word
        # Translate once a sentence boundary is detected
        if buffer.endswith((".", "!", "?")):
            yield await translate_and_synthesize(buffer)
            buffer = ""
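
# The endswith() check above also fires on abbreviations like "Dr." or
# "e.g.". A slightly more robust boundary test (heuristic sketch; the
# abbreviation list below is illustrative, not exhaustive):
_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def ends_sentence(buffer: str) -> bool:
    """True if the buffer ends at a likely sentence boundary."""
    stripped = buffer.rstrip()
    if not stripped.endswith((".", "!", "?")):
        return False
    return stripped.split()[-1].lower() not in _ABBREVIATIONS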
```

### Preserving the speaker's voice

To keep the speaker's voice characteristics in the translated speech, we use voice conversion:

```python
# Extract the speaker embedding from the source audio
# Synthesize the translation with a neutral voice
# Apply voice conversion using the original speaker's embedding
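
# The conversion step itself requires a dedicated model, but the quality
# check is simple: compare speaker embeddings of the original and converted
# audio by cosine similarity (pure-Python sketch; values near 1.0 mean the
# converted voice closely matches the original speaker).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))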
```

### Implementation timeframes

- Basic STS without voice preservation: 1 week
- With voice conversion and streaming: 3–4 weeks
- Production system with scaling: 6–8 weeks







