## Implementation of End-of-Speech Detection

Endpointing is the process of determining when the user has finished speaking and the system should process their request. This is critical for voice bots: responding too fast interrupts the user, while responding too slow creates awkward pauses.

### Endpointing Algorithm

```python
from enum import Enum


class SpeechState(Enum):
    SILENCE = 0
    SPEECH = 1


class EndpointDetector:
    def __init__(
        self,
        vad,
        sample_rate: int = 16000,
        frame_ms: int = 30,
        silence_threshold_ms: int = 700,  # pause that ends the utterance
        min_speech_ms: int = 300,         # minimum utterance length
    ):
        self.vad = vad
        self.sample_rate = sample_rate
        # Expected frame size: 16-bit PCM, 2 bytes per sample
        self.frame_bytes = int(sample_rate * frame_ms / 1000) * 2
        self.silence_frames_needed = silence_threshold_ms // frame_ms
        self.min_speech_frames = min_speech_ms // frame_ms
        self.state = SpeechState.SILENCE
        self.silence_counter = 0
        self.speech_buffer = bytearray()
        self.speech_frame_count = 0

    def process_frame(self, frame: bytes) -> tuple[bool, bytes | None]:
        """
        Returns: (endpoint_detected, speech_audio_or_none)
        """
        is_speech = self.vad.is_speech(frame, self.sample_rate)
        if is_speech:
            self.state = SpeechState.SPEECH
            self.silence_counter = 0
            self.speech_buffer.extend(frame)
            self.speech_frame_count += 1
        else:
            if self.state == SpeechState.SPEECH:
                self.silence_counter += 1
                self.speech_buffer.extend(frame)  # keep the trailing silence
                if self.silence_counter >= self.silence_frames_needed:
                    if self.speech_frame_count >= self.min_speech_frames:
                        audio = bytes(self.speech_buffer)
                        self._reset()
                        return True, audio
                    else:
                        # Too short to be a real utterance; discard as noise
                        self._reset()
        return False, None

    def _reset(self):
        self.state = SpeechState.SILENCE
        self.silence_counter = 0
        self.speech_buffer = bytearray()
        self.speech_frame_count = 0
```

### Adaptive endpointing

In real-world dialogues, endpointing needs to be context-aware: after an open-ended question we wait longer for the user to finish, while after a yes/no confirmation we can cut the pause short:

```python
# Different silence thresholds for different request types
THRESHOLDS = {
    "open_question": 1200,  # ms of silence before the endpoint fires
    "yes_no": 500,
    "command": 600,
    "default": 700,
}
```

### Practical parameters

For a telephone voice bot, the optimal parameters are 600–800 ms of silence and a 200 ms minimum utterance. For dictation, use 1500–2000 ms of silence. Timeframe: a basic endpoint implementation takes 2–3 days; an adaptive endpoint with ML prediction takes about a week.
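The adaptive thresholds above can be wired into the detector by retuning its silence requirement before each user turn. A minimal sketch, assuming a 30 ms frame size; the `silence_frames_for` helper and the idea of classifying the bot's last prompt into a request type are illustrative assumptions, not part of the original code:

```python
# Silence thresholds (ms) per request type, matching the table above
THRESHOLDS = {
    "open_question": 1200,
    "yes_no": 500,
    "command": 600,
    "default": 700,
}

FRAME_MS = 30  # must match the frame_ms used by the detector


def silence_frames_for(request_type: str) -> int:
    """Convert the per-context silence threshold (ms) into a frame count."""
    threshold_ms = THRESHOLDS.get(request_type, THRESHOLDS["default"])
    return threshold_ms // FRAME_MS


# Before each user turn, retune the detector for the expected reply type,
# e.g. after the bot asks a confirmation question:
#   detector.silence_frames_needed = silence_frames_for("yes_no")
```

Keeping the conversion in one helper ensures the ms-to-frames math stays consistent with the detector's own `silence_threshold_ms // frame_ms` computation.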