# Streaming Real-Time Speech Recognition (Streaming STT) Implementation

Streaming STT transmits audio in chunks and returns partial results with a latency of 100–500 ms, unlike the batch approach, which waits until the recording is complete. This is a key requirement for voice bots, live captions, and agent-assistance systems.

### Streaming Pipeline Architecture
```
Microphone/WebRTC → WebSocket Server → STT Engine → NLP Processing → Response
   16 kHz PCM        <200 ms RTT      partial+final    intent/NER     UI/TTS
```

Key parameters:

- **Chunk size**: 100–250 ms (optimal balance of delay and quality)
- **VAD**: required to detect pauses in speech
- **Endpointing**: detects the end of an utterance so the final result can be sent

### WebSocket server on FastAPI

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from faster_whisper import WhisperModel
import numpy as np

app = FastAPI()
model = WhisperModel("medium", device="cuda", compute_type="float16")

@app.websocket("/stream")
async def stream_stt(websocket: WebSocket):
    await websocket.accept()
    audio_buffer = bytearray()
    try:
        while True:
            chunk = await websocket.receive_bytes()
            audio_buffer.extend(chunk)
            # Transcribe every 2 seconds of accumulated audio
            if len(audio_buffer) >= 32000 * 2:  # 2 s @ 16 kHz, 16-bit mono (32000 bytes/s)
                audio_array = np.frombuffer(audio_buffer, dtype=np.int16).astype(np.float32) / 32768.0
                segments, _ = model.transcribe(audio_array, language="ru")
                partial_text = " ".join(s.text for s in segments)
                await websocket.send_json({
                    "type": "partial",
                    "text": partial_text,
                })
                audio_buffer = bytearray()
    except WebSocketDisconnect:
        pass  # client hung up; the socket is already closed
```

### Streaming Engine Comparison

| Engine | Latency | Languages | Cost |
|--------|---------|-----------|------|
| Deepgram Nova-2 | 100–200 ms | 30+ | $0.0043/min |
| Google STT Streaming | 150–300 ms | 125+ | $0.006/min |
| Azure Speech | 150–300 ms | 100+ | $0.01/min |
| faster-whisper (self-hosted) | 200–500 ms | 99 | ~$0.001/min |
| Vosk (self-hosted, CPU) | 300–700 ms | 20+ | ~$0/min |

### VAD for endpointing

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3

def is_speech(audio_chunk: bytes, sample_rate: int = 16000) -> bool:
    return vad.is_speech(audio_chunk, sample_rate)
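
# Endpointing sketch (hypothetical helper, not part of webrtcvad): treat the
# utterance as finished once the trailing run of non-speech frames covers at
# least `silence_ms` of audio. `speech_fn` is a per-frame predicate such as
# the is_speech() defined above.
def end_of_phrase(frames, speech_fn, frame_ms=30, silence_ms=600):
    trailing = 0
    for frame in reversed(frames):
        if speech_fn(frame):
            break
        trailing += 1
    return trailing * frame_ms >= silence_ms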
```

WebRTC VAD processes frames of exactly 10, 20, or 30 ms. To detect the end of a phrase, require 500–800 ms of silence after the last speech frame.

### Client-side (browser)

```javascript
// NOTE: MediaRecorder emits Opus in a WebM container, while the server above
// expects raw 16 kHz 16-bit PCM. Either decode on the server (e.g. with ffmpeg)
// or capture raw PCM via an AudioWorklet instead.
const socket = new WebSocket('wss://api.example.com/stream');

const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(mediaStream, {
  mimeType: 'audio/webm;codecs=opus'
});

recorder.ondataavailable = (event) => {
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(event.data);
  }
};

recorder.start(250); // emit a chunk every 250 ms
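
// Hypothetical companion: the server above pushes JSON like
// {"type": "partial", "text": "..."}; this helper extracts the partial
// transcript. Wire it up with, e.g.:
//   socket.onmessage = (e) => console.log(partialText(e.data));
function partialText(data) {
  const msg = JSON.parse(data);
  return msg && msg.type === 'partial' ? msg.text : null;
}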
```

### Implementation timeline

- Basic WebSocket streamer with cloud STT: 3–4 days
- Self-hosted with VAD and endpointing: 1 week
- Full pipeline with frontend and error handling: 2 weeks
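As a sanity check on the numbers used throughout (the 2-second threshold in the server, the 250 ms client chunks), the raw PCM arithmetic is just sample rate times sample width; a minimal sketch:

```python
# Raw mono PCM stream arithmetic behind the server example above.
SAMPLE_RATE = 16000  # Hz
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)

def buffer_bytes(seconds: float) -> int:
    """Bytes of mono 16-bit PCM audio for a given duration."""
    return int(SAMPLE_RATE * SAMPLE_WIDTH * seconds)

print(buffer_bytes(2))     # 64000 bytes: the 32000 * 2 threshold in the server
print(buffer_bytes(0.25))  # 8000 bytes: one 250 ms client chunk
```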







