Real-Time Streaming Speech Recognition Implementation

Streaming STT transmits audio in small chunks and returns partial results with a latency of 100–500 ms, unlike the batch approach, which waits until the recording is complete. Low-latency partial results are a key requirement for voice bots, live captions, and agent-assistance systems.

### Streaming Pipeline Architecture

```

Microphone/WebRTC → WebSocket Server → STT Engine → NLP Processing → Response
     16kHz PCM         <200ms RTT       partial+final    intent/NER     UI/TTS
```

Key parameters:

- **Chunk size**: 100–250 ms (optimal balance of latency and quality)
- **VAD**: required to detect pauses in speech
- **Endpointing**: detects the end of an utterance so the final result can be sent

### WebSocket server on FastAPI

```python
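# Quick sanity check on the chunk-size numbers above (illustrative sketch,
# not part of the server): 16 kHz, 16-bit mono PCM carries 32000 bytes/s.
def chunk_bytes(ms: int, sample_rate: int = 16000, bytes_per_sample: int = 2) -> int:
    """Bytes in one audio chunk of `ms` milliseconds."""
    return sample_rate * bytes_per_sample * ms // 1000

# chunk_bytes(100) == 3200, chunk_bytes(250) == 8000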
from fastapi import FastAPI, WebSocket
from faster_whisper import WhisperModel
import numpy as np
import asyncio

app = FastAPI()
model = WhisperModel("medium", device="cuda", compute_type="float16")

@app.websocket("/stream")
async def stream_stt(websocket: WebSocket):
    await websocket.accept()
    audio_buffer = bytearray()

    try:
        while True:
            chunk = await websocket.receive_bytes()
            audio_buffer.extend(chunk)

            # Transcribe every 2 seconds of accumulated audio; each window is
            # processed independently (a simplification for this example)
            if len(audio_buffer) >= 32000 * 2:  # 2 s @ 16 kHz, 16-bit mono = 32000 bytes/s
                audio_array = np.frombuffer(audio_buffer, dtype=np.int16).astype(np.float32) / 32768.0
                segments, _ = model.transcribe(audio_array, language="ru")
                partial_text = " ".join([s.text for s in segments])

                await websocket.send_json({
                    "type": "partial",
                    "text": partial_text
                })
                audio_buffer = bytearray()

    except Exception:
        # Client disconnected or the stream failed; release the socket
        await websocket.close()
```

### Streaming Engine Comparison

| Engine | Latency | Languages | Cost |
|--------|---------|-----------|------|
| Deepgram Nova-2 | 100–200 ms | 30+ | $0.0043/min |
| Google STT Streaming | 150–300 ms | 125+ | $0.006/min |
| Azure Speech | 150–300 ms | 100+ | $0.01/min |
| faster-whisper (self-hosted) | 200–500 ms | 99 | ~$0.001/min |
| Vosk (self-hosted, CPU) | 300–700 ms | 20+ | ~$0/min |

### VAD for endpointing

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3

def is_speech(audio_chunk: bytes, sample_rate: int = 16000) -> bool:
    return vad.is_speech(audio_chunk, sample_rate)
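
# Endpointing sketch (illustrative assumption, not part of webrtcvad itself):
# count consecutive silent frames; ~600 ms of silence marks end-of-utterance.
class Endpointer:
    def __init__(self, frame_ms: int = 30, silence_ms: int = 600):
        self.frames_needed = silence_ms // frame_ms
        self.silent_frames = 0

    def feed(self, frame_is_speech: bool) -> bool:
        """Feed one VAD decision; return True when the utterance is finished."""
        self.silent_frames = 0 if frame_is_speech else self.silent_frames + 1
        return self.silent_frames >= self.frames_needed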
```

WebRTC VAD processes frames of exactly 10, 20, or 30 ms. To declare the end of a phrase, require 500–800 ms of silence after the last speech frame.

### Client-side (browser)

```javascript
// Note: MediaRecorder emits webm/opus chunks, so the server must decode them
// to 16 kHz PCM (e.g. via ffmpeg), or the client should send raw PCM instead.
const socket = new WebSocket('wss://api.example.com/stream');
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(mediaStream, {
    mimeType: 'audio/webm;codecs=opus'
});

recorder.ondataavailable = (event) => {
    if (socket.readyState === WebSocket.OPEN) {
        socket.send(event.data);
    }
};
recorder.start(250);  // emit a chunk every 250 ms
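
// Receiving side, sketched under the assumption that the server sends the
// JSON frames from the FastAPI example above ({type, text}).
function handleServerMessage(raw) {
    const msg = JSON.parse(raw);
    return (msg.type === 'partial' || msg.type === 'final') ? msg.text : null;
}

// socket.onmessage = (e) => console.log(handleServerMessage(e.data));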
```

### Implementation timeline

- Basic WebSocket streamer with a cloud STT: 3–4 days
- Self-hosted with VAD and endpointing: 1 week
- Full pipeline with frontend and error handling: 2 weeks