# Voice Activity Detection (VAD) implementation for audio segmentation

VAD determines whether a given audio fragment contains speech. This is a critical preprocessing step: without VAD, the STT model wastes resources on silence and noise, and voice bots don't know when the user has finished speaking.

### Key tools

**Silero VAD** — the best balance of quality and speed for production:

```python
import torch

# Download the pretrained Silero VAD model from torch.hub
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad'
)
(get_speech_timestamps, _, read_audio, _, _) = utils

audio = read_audio('audio.wav', sampling_rate=16000)
speech_timestamps = get_speech_timestamps(
    audio,
    model,
    threshold=0.5,               # speech probability threshold
    sampling_rate=16000,
    min_speech_duration_ms=250,  # drop speech chunks shorter than this
    min_silence_duration_ms=100  # silence shorter than this does not split a chunk
)
# [{'start': 1600, 'end': 24320}, ...]  (offsets in samples)
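# The timestamps above are in samples. A minimal sketch (an illustrative
# snippet, assuming the 16 kHz rate configured above) converting the
# example output to seconds:
example = [{'start': 1600, 'end': 24320}]
segments = [(ts['start'] / 16000, ts['end'] / 16000) for ts in example]
# segments == [(0.1, 1.52)]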
```

**WebRTC VAD** — minimal latency (<5 ms), suitable for real-time:

```python
import webrtcvad

vad = webrtcvad.Vad(3)  # aggressiveness 0-3 (3 = most aggressive filtering)

def frame_generator(frame_duration_ms, audio, sample_rate):
    """Slice raw 16-bit mono PCM bytes into fixed-size frames."""
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)  # 2 bytes per sample
    for offset in range(0, len(audio) - n + 1, n):
        yield audio[offset:offset + n]

# WebRTC VAD accepts only 10, 20, or 30 ms frames;
# vad.is_speech(frame, sample_rate) returns True for speech frames.
```

### VAD Library Comparison

| VAD | Latency | Quality | License |
|-----|---------|---------|---------|
| Silero VAD | 5–20 ms | Excellent | MIT |
| WebRTC VAD | <5 ms | Good | BSD |
| pyannote VAD | 50+ ms | Excellent | MIT |
| faster-whisper VAD | 10–30 ms | Excellent | MIT |

Silero VAD is the recommended choice for most tasks. Integration time: 0.5–1 day.
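To make the framing arithmetic in `frame_generator` concrete, here is a self-contained sketch (no `webrtcvad` dependency; assumes 16 kHz, 16-bit mono PCM) showing how many frames one second of audio yields:

```python
def frame_generator(frame_duration_ms, audio, sample_rate):
    # 16-bit PCM -> 2 bytes per sample
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)
    for offset in range(0, len(audio) - n + 1, n):
        yield audio[offset:offset + n]

# One second of 16 kHz, 16-bit mono audio is 32000 bytes.
pcm = bytes(32000)
frames = list(frame_generator(30, pcm, 16000))
# 30 ms at 16 kHz -> 480 samples -> 960 bytes per frame; 33 full frames fit,
# and the trailing partial frame is discarded (WebRTC VAD rejects short frames).
```

In a real pipeline each of these 960-byte frames would be passed to `vad.is_speech(frame, 16000)`; consecutive speech frames are then merged into utterances.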