## Implementation of End-of-Speech Detection

Endpointing is the process of determining when the user has finished speaking and the system should process their request. This is critical for voice bots: responding too fast interrupts the user, while responding too slow creates awkward pauses.

### Endpointing Algorithm

```python
from enum import Enum


class SpeechState(Enum):
    SILENCE = 0
    SPEECH = 1


class EndpointDetector:
    def __init__(
        self,
        vad,
        sample_rate: int = 16000,
        frame_ms: int = 30,
        silence_threshold_ms: int = 700,  # pause that ends the utterance
        min_speech_ms: int = 300,         # minimum utterance length
    ):
        self.vad = vad
        self.sample_rate = sample_rate
        # Expected frame size: 16-bit PCM, 2 bytes per sample
        self.frame_bytes = int(sample_rate * frame_ms / 1000) * 2
        self.silence_frames_needed = silence_threshold_ms // frame_ms
        self.min_speech_frames = min_speech_ms // frame_ms
        self.state = SpeechState.SILENCE
        self.silence_counter = 0
        self.speech_buffer = bytearray()
        self.speech_frame_count = 0

    def process_frame(self, frame: bytes) -> tuple[bool, bytes | None]:
        """
        Returns: (endpoint_detected, speech_audio_or_none)
        """
        is_speech = self.vad.is_speech(frame, self.sample_rate)
        if is_speech:
            self.state = SpeechState.SPEECH
            self.silence_counter = 0
            self.speech_buffer.extend(frame)
            self.speech_frame_count += 1
        else:
            if self.state == SpeechState.SPEECH:
                self.silence_counter += 1
                self.speech_buffer.extend(frame)  # keep the trailing silence
                if self.silence_counter >= self.silence_frames_needed:
                    if self.speech_frame_count >= self.min_speech_frames:
                        audio = bytes(self.speech_buffer)
                        self._reset()
                        return True, audio
                    else:
                        # Too short to be a real utterance; discard as noise
                        self._reset()
        return False, None

    def _reset(self):
        self.state = SpeechState.SILENCE
        self.silence_counter = 0
        self.speech_buffer = bytearray()
        self.speech_frame_count = 0
```

### Adaptive endpointing

In real-world dialogues, endpointing needs to be context-aware: after an open-ended question we wait longer for the user to finish, while after a yes/no confirmation we can cut the pause short:

```python
# Different silence thresholds for different request types
THRESHOLDS = {
    "open_question": 1200,  # ms of silence before the endpoint fires
    "yes_no": 500,
    "command": 600,
    "default": 700,
}
```

### Practical parameters

For a telephone voice bot, the optimal parameters are 600–800 ms of silence and a 200 ms minimum utterance. For dictation, use 1500–2000 ms of silence. Timeframe: a basic endpoint implementation takes 2–3 days; an adaptive endpoint with ML prediction takes about a week.
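The adaptive thresholds above can be wired into the detector by retuning its silence requirement before each user turn. A minimal sketch, assuming a 30 ms frame size; the `silence_frames_for` helper and the idea of classifying the bot's last prompt into a request type are illustrative assumptions, not part of the original code:

```python
# Silence thresholds (ms) per request type, matching the table above
THRESHOLDS = {
    "open_question": 1200,
    "yes_no": 500,
    "command": 600,
    "default": 700,
}

FRAME_MS = 30  # must match the frame_ms used by the detector


def silence_frames_for(request_type: str) -> int:
    """Convert the per-context silence threshold (ms) into a frame count."""
    threshold_ms = THRESHOLDS.get(request_type, THRESHOLDS["default"])
    return threshold_ms // FRAME_MS


# Before each user turn, retune the detector for the expected reply type,
# e.g. after the bot asks a confirmation question:
#   detector.silence_frames_needed = silence_frames_for("yes_no")
```

Keeping the conversion in one helper ensures the ms-to-frames math stays consistent with the detector's own `silence_threshold_ms // frame_ms` computation.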