# Speech-to-Speech Implementation for an AI Voice Assistant

An AI voice assistant is a full-cycle solution: the user speaks, the assistant understands, and the assistant responds verbally. The key metric is end-to-end latency: the time from the end of the user's speech to the beginning of the response. Target: < 1.5 seconds.

### System Architecture

```
Microphone → VAD → STT → NLU/LLM → TTS → Speaker
              ↑                     ↓
         Endpointing        First audio chunk
        (600–800 ms)    (<300 ms after TTS start)
```
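The endpointing stage above can be sketched with a simple energy-based VAD: treat a frame as speech when its RMS energy exceeds a threshold, and declare the utterance finished after roughly 700 ms of continuous silence. This is a minimal sketch; the thresholds, frame size, and function names (`is_speech`, `find_endpoint`) are illustrative, not from the original.

```python
from typing import List, Optional

import numpy as np

FRAME_MS = 30          # frame length in milliseconds
SILENCE_MS = 700       # silence needed to declare end of utterance
RMS_THRESHOLD = 0.01   # illustrative energy threshold for audio scaled to [-1, 1]

def is_speech(frame: np.ndarray) -> bool:
    """Energy-based VAD: a frame counts as speech if its RMS exceeds the threshold."""
    return float(np.sqrt(np.mean(frame ** 2))) > RMS_THRESHOLD

def find_endpoint(frames: List[np.ndarray]) -> Optional[int]:
    """Return the index of the frame at which the utterance ends
    (speech was seen, then SILENCE_MS of continuous silence), or None."""
    silence_frames_needed = SILENCE_MS // FRAME_MS
    seen_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if is_speech(frame):
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
    return None
```

In production, a trained model such as webrtcvad or Silero VAD would replace the energy heuristic, which misfires on background noise; the endpointing logic around it stays the same.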
### Full pipeline on OpenAI

```python
import asyncio
from openai import AsyncOpenAI
import sounddevice as sd
import numpy as np

client = AsyncOpenAI()

class VoiceAssistant:
    def __init__(self):
        self.conversation_history = []
        self.system_prompt = (
            "You are a helpful voice assistant. Answer briefly, in 1–3 sentences."
        )

    async def listen_and_respond(self):
        # Record via VAD (record_speech() is assumed to return WAV bytes)
        audio = await self.record_speech()

        # STT
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=("audio.wav", audio, "audio/wav"),
            language="ru",
        )
        user_text = transcript.text
        print(f"User: {user_text}")

        # LLM
        self.conversation_history.append({"role": "user", "content": user_text})
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": self.system_prompt}]
            + self.conversation_history,
        )
        assistant_text = response.choices[0].message.content
        self.conversation_history.append(
            {"role": "assistant", "content": assistant_text}
        )
        print(f"Assistant: {assistant_text}")

        # TTS streaming: play PCM chunks as they arrive
        async with client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=assistant_text,
            response_format="pcm",
        ) as tts_response:
            # One persistent output stream: calling sd.play() per chunk would
            # restart playback each time and produce choppy audio
            stream = sd.OutputStream(samplerate=24000, channels=1, dtype="float32")
            stream.start()
            async for chunk in tts_response.iter_bytes(1024):
                audio_data = np.frombuffer(chunk, dtype=np.int16)
                stream.write((audio_data.astype(np.float32) / 32768.0).reshape(-1, 1))
            stream.stop()
            stream.close()
```

### OpenAI Realtime API (optimal for production)

```python
import asyncio
import json
import os

import websockets

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

async def realtime_voice_assistant():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Session configuration
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "You are a voice assistant. Answer in Russian.",
                "turn_detection": {"type": "server_vad"}
            }
        }))
        # ...event handling
```

### Performance Metrics

| Component | Latency |
|-----------|---------|
| VAD + Endpointing | 600–800 ms |
| Whisper-1 API | 300–600 ms |
| GPT-4o-mini | 200–500 ms |
| TTS-1 first chunk | 200–400 ms |
| **Total** | **1.3–2.3 sec** |

OpenAI Realtime API: end-to-end latency ~500–800 ms.

Timeline: a voice assistant MVP takes about 1 week; production with the OpenAI Realtime API, 2–3 weeks.
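The event handling elided in the Realtime API snippet boils down to reading JSON events off the websocket; audio arrives as base64-encoded 24 kHz 16-bit mono PCM in `response.audio.delta` events. A minimal, hedged sketch of the decoding step (the `decode_audio_event` helper is illustrative, not part of the API):

```python
import base64
import json
from typing import Optional

def decode_audio_event(message: str) -> Optional[bytes]:
    """Return raw PCM bytes if the server event carries audio, else None."""
    event = json.loads(message)
    if event.get("type") == "response.audio.delta":
        # The "delta" field holds a base64-encoded PCM chunk
        return base64.b64decode(event["delta"])
    return None
```

Inside the session, this would be driven by a loop such as `async for message in ws: pcm = decode_audio_event(message)`, feeding each non-None chunk to the audio output stream.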







