End-of-Speech Detection (Endpointing) Implementation

Implementation of end-of-speech detection (endpointing) determines when the user has finished speaking and the system should process their request. This is critical for voice bots: responding too quickly interrupts the user, while responding too slowly creates awkward pauses.

### Endpointing Algorithm

```python
import collections
import time
from enum import Enum

class SpeechState(Enum):
    SILENCE = 0
    SPEECH = 1

class EndpointDetector:
    def __init__(
        self,
        vad,
        sample_rate: int = 16000,
        frame_ms: int = 30,
        silence_threshold_ms: int = 700,  # pause that ends the utterance
        min_speech_ms: int = 300,          # minimum utterance length
    ):
        self.vad = vad
        self.sample_rate = sample_rate
        self.frame_bytes = int(sample_rate * frame_ms / 1000) * 2
        self.silence_frames_needed = silence_threshold_ms // frame_ms
        self.min_speech_frames = min_speech_ms // frame_ms

        self.state = SpeechState.SILENCE
        self.silence_counter = 0
        self.speech_buffer = bytearray()
        self.speech_frame_count = 0

    def process_frame(self, frame: bytes) -> tuple[bool, bytes | None]:
        """
        Returns: (endpoint_detected, speech_audio_or_none)
        """
        is_speech = self.vad.is_speech(frame, self.sample_rate)

        if is_speech:
            self.state = SpeechState.SPEECH
            self.silence_counter = 0
            self.speech_buffer.extend(frame)
            self.speech_frame_count += 1
        else:
            if self.state == SpeechState.SPEECH:
                self.silence_counter += 1
                self.speech_buffer.extend(frame)  # include the trailing silence

                if self.silence_counter >= self.silence_frames_needed:
                    if self.speech_frame_count >= self.min_speech_frames:
                        audio = bytes(self.speech_buffer)
                        self._reset()
                        return True, audio
                    else:
                        self._reset()

        return False, None

    def _reset(self):
        self.state = SpeechState.SILENCE
        self.silence_counter = 0
        self.speech_buffer = bytearray()
        self.speech_frame_count = 0
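
# --- Toy VAD for local testing ---
# Illustrative stand-in: a production system would use webrtcvad or a
# neural VAD exposing the same is_speech(frame, sample_rate) interface.
# EnergyVAD and its threshold are hypothetical, not part of any library.
import array

class EnergyVAD:
    def __init__(self, threshold: int = 500):
        self.threshold = threshold

    def is_speech(self, frame: bytes, sample_rate: int) -> bool:
        # Mean absolute amplitude of a 16-bit little-endian PCM frame.
        samples = array.array("h", frame)
        if not samples:
            return False
        return sum(abs(s) for s in samples) / len(samples) > self.threshold

# Wiring it up: detector = EndpointDetector(EnergyVAD())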
```

### Adaptive Endpointing

In real-world dialogues, endpointing should be contextual: after an open-ended question the system waits longer; after a yes/no confirmation, less:

```python
# Different thresholds for different request types
THRESHOLDS = {
    "open_question": 1200,   # ms of silence
    "yes_no": 500,
    "command": 600,
    "default": 700,
}
```

### Practical Parameters

For a telephone voice bot, the optimal parameters are a silence threshold of 600–800 ms and a minimum speech length of 200 ms. For dictation, use a longer silence threshold of 1500–2000 ms.

Timeframe: a basic endpointing implementation takes 2–3 days; adaptive endpointing with ML-based prediction takes about 1 week.
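Putting the numbers together, here is a small self-contained sketch (the function names are illustrative, not from the detector above) that converts a per-request-type pause and the 30 ms frame size into the counters the detector operates on, assuming 16 kHz, 16-bit mono PCM:

```python
SAMPLE_RATE = 16000
FRAME_MS = 30
BYTES_PER_SAMPLE = 2  # 16-bit PCM

THRESHOLDS = {
    "open_question": 1200,  # ms of silence before the endpoint fires
    "yes_no": 500,
    "command": 600,
    "default": 700,
}

def frame_bytes() -> int:
    """Size of one audio frame in bytes."""
    return SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE

def silence_frames_needed(expected_response: str) -> int:
    """Frames of silence required to end the turn, given the expected
    response type; unknown types fall back to the default pause."""
    ms = THRESHOLDS.get(expected_response, THRESHOLDS["default"])
    return ms // FRAME_MS
```

With these values one 30 ms frame is 960 bytes, and an open question needs 40 silent frames (1200 ms) before the endpoint fires, while a yes/no answer needs only 16 frames (the 500 ms setting, rounded down to 480 ms by integer division).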