# Voice Activity Detection (VAD) implementation for audio segmentation

VAD determines whether a given audio fragment contains speech. This is a critical preprocessing step: without VAD, the STT model wastes resources on silence and noise, and voice bots don't know when the user has finished speaking.

### Key tools

**Silero VAD** — the best balance of quality and speed for production:

```python
import torch

# Download the pretrained Silero VAD model from torch.hub
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad'
)
(get_speech_timestamps, _, read_audio, _, _) = utils

audio = read_audio('audio.wav', sampling_rate=16000)
speech_timestamps = get_speech_timestamps(
    audio,
    model,
    threshold=0.5,               # speech probability threshold
    sampling_rate=16000,
    min_speech_duration_ms=250,  # drop speech chunks shorter than this
    min_silence_duration_ms=100  # silence shorter than this does not split a chunk
)
# [{'start': 1600, 'end': 24320}, ...]  (offsets in samples)
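# The timestamps above are in samples. A minimal sketch (an illustrative
# snippet, assuming the 16 kHz rate configured above) converting the
# example output to seconds:
example = [{'start': 1600, 'end': 24320}]
segments = [(ts['start'] / 16000, ts['end'] / 16000) for ts in example]
# segments == [(0.1, 1.52)]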
```

**WebRTC VAD** — minimal latency (<5 ms), suitable for real-time:

```python
import webrtcvad

vad = webrtcvad.Vad(3)  # aggressiveness 0-3 (3 = most aggressive filtering)

def frame_generator(frame_duration_ms, audio, sample_rate):
    """Slice raw 16-bit mono PCM bytes into fixed-size frames."""
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)  # 2 bytes per sample
    for offset in range(0, len(audio) - n + 1, n):
        yield audio[offset:offset + n]

# WebRTC VAD accepts only 10, 20, or 30 ms frames;
# vad.is_speech(frame, sample_rate) returns True for speech frames.
```

### VAD Library Comparison

| VAD | Latency | Quality | License |
|-----|---------|---------|---------|
| Silero VAD | 5–20 ms | Excellent | MIT |
| WebRTC VAD | <5 ms | Good | BSD |
| pyannote VAD | 50+ ms | Excellent | MIT |
| faster-whisper VAD | 10–30 ms | Excellent | MIT |

Silero VAD is the recommended choice for most tasks. Integration time: 0.5–1 day.
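To make the framing arithmetic in `frame_generator` concrete, here is a self-contained sketch (no `webrtcvad` dependency; assumes 16 kHz, 16-bit mono PCM) showing how many frames one second of audio yields:

```python
def frame_generator(frame_duration_ms, audio, sample_rate):
    # 16-bit PCM -> 2 bytes per sample
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)
    for offset in range(0, len(audio) - n + 1, n):
        yield audio[offset:offset + n]

# One second of 16 kHz, 16-bit mono audio is 32000 bytes.
pcm = bytes(32000)
frames = list(frame_generator(30, pcm, 16000))
# 30 ms at 16 kHz -> 480 samples -> 960 bytes per frame; 33 full frames fit,
# and the trailing partial frame is discarded (WebRTC VAD rejects short frames).
```

In a real pipeline each of these 960-byte frames would be passed to `vad.is_speech(frame, 16000)`; consecutive speech frames are then merged into utterances.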