Speech-to-Speech for Voice AI Assistant Implementation

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab, but in real business.

Speech-to-Speech Implementation for an AI Voice Assistant. An AI voice assistant is a full-cycle solution: the user speaks, the assistant understands, and the assistant answers out loud. The key metric is end-to-end latency, measured from the end of the user's speech to the start of the spoken response. Target: under 1.5 seconds.

### System Architecture

```
Microphone → VAD → STT → NLU/LLM → TTS → Speaker
                ↑                         ↓
           Endpointing              First audio chunk
           (600–800 ms)             (<300 ms after TTS start)
```

### Full pipeline on OpenAI

```python
import asyncio
from openai import AsyncOpenAI
import sounddevice as sd
import numpy as np

client = AsyncOpenAI()

class VoiceAssistant:
    def __init__(self):
        self.conversation_history = []
        self.system_prompt = "You are a helpful voice assistant. Answer briefly, in 1–3 sentences."

    async def listen_and_respond(self):
        # Record until the VAD detects end of speech (record_speech is
        # assumed to return WAV-encoded bytes; implementation omitted)
        audio = await self.record_speech()

        # STT
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=("audio.wav", audio, "audio/wav"),
            language="ru"
        )
        user_text = transcript.text
        print(f"User: {user_text}")

        # LLM
        self.conversation_history.append({"role": "user", "content": user_text})
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": self.system_prompt}]
                      + self.conversation_history,
        )
        assistant_text = response.choices[0].message.content
        self.conversation_history.append({"role": "assistant", "content": assistant_text})
        print(f"Assistant: {assistant_text}")

        # TTS streaming
        async with client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=assistant_text,
            response_format="pcm",
        ) as tts_response:
            # Play each chunk as soon as it arrives instead of buffering
            # the whole response (sd.play + sd.wait per chunk stutters)
            with sd.OutputStream(samplerate=24000, channels=1, dtype="int16") as stream:
                async for chunk in tts_response.iter_bytes(4096):
                    stream.write(np.frombuffer(chunk, dtype=np.int16))
```

### OpenAI Realtime API (optimal for production)

```python
import json
import os

import websockets

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

async def realtime_voice_assistant():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    async with websockets.connect(url, extra_headers=headers) as ws:
        # Session configuration
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "You are a voice assistant. Reply in Russian.",
                "turn_detection": {"type": "server_vad"}
            }
        }))
        # ...event handling
```

### Performance Metrics

| Component | Latency |
|-----------|---------|
| VAD + endpointing | 600–800 ms |
| Whisper-1 API | 300–600 ms |
| GPT-4o-mini | 200–500 ms |
| TTS-1 first chunk | 200–400 ms |
| **Total** | **1.3–2.3 s** |

With the OpenAI Realtime API, end-to-end latency drops to roughly 500–800 ms.

Timeline: a voice assistant MVP takes about 1 week; a production build on the OpenAI Realtime API takes 2–3 weeks.
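Summing the per-component budgets in the table above reproduces the quoted total range (a quick sanity check, not part of the pipeline code):

```python
# Latency budget per pipeline stage, (best, worst) in milliseconds
budget_ms = {
    "VAD + endpointing": (600, 800),
    "Whisper-1 STT": (300, 600),
    "GPT-4o-mini": (200, 500),
    "TTS-1 first chunk": (200, 400),
}

total_low = sum(lo for lo, _ in budget_ms.values())
total_high = sum(hi for _, hi in budget_ms.values())
print(f"End-to-end: {total_low}–{total_high} ms")  # → End-to-end: 1300–2300 ms
```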
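The VAD + endpointing stage is the one part of the pipeline not shown in code above. A minimal energy-based sketch (a hand-rolled illustration with hypothetical thresholds and frame sizes, not a production VAD like the one behind `server_vad`) could look like:

```python
import numpy as np

def detect_endpoint(frames, threshold=0.01, silence_frames=30):
    """Return the index of the last speech frame once `silence_frames`
    consecutive frames fall below the RMS energy threshold, or None if
    speech never ends. At 20 ms per frame, silence_frames=30 gives a
    600 ms hangover, matching the 600-800 ms endpointing budget."""
    silent = 0
    last_speech = None
    for i, frame in enumerate(frames):
        rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
        if rms >= threshold:
            last_speech = i   # still speaking: reset the silence counter
            silent = 0
        elif last_speech is not None:
            silent += 1       # trailing silence after speech has started
            if silent >= silence_frames:
                return last_speech
    return None

# Synthetic check: 10 loud frames followed by 40 silent ones
speech = [np.full(320, 0.5)] * 10
silence = [np.zeros(320)] * 40
print(detect_endpoint(speech + silence))  # → 9
```

A real deployment would typically use a trained model (e.g. the server-side VAD in the Realtime API) rather than a fixed energy threshold, which is easily fooled by background noise.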