## Speaker Diarization Implementation

Diarization is the task of determining "who spoke when" without prior knowledge of the voices. It is necessary for transcribing meetings, interviews, and court hearings — anywhere each line needs to be attributed to a specific speaker.

### pyannote.audio 3.x

The modern pyannote.audio 3.x stack is a state-of-the-art open-source solution with a DER (Diarization Error Rate) of 7–12% on standard datasets:

```python
from pyannote.audio import Pipeline
import torch

# Load the pretrained diarization pipeline (requires a Hugging Face token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",
)
pipeline.to(torch.device("cuda"))

# Bounding the speaker count helps clustering on known recording types
diarization = pipeline(
    "meeting.wav",
    min_speakers=2,
    max_speakers=6,
)

for segment, track, speaker in diarization.itertracks(yield_label=True):
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {speaker}")
```

### Merging diarization with transcription

```python
from faster_whisper import WhisperModel

def transcribe_with_diarization(audio_path: str) -> list[dict]:
    # 1. Transcribe
    whisper = WhisperModel("large-v3", device="cuda")
    segments, _ = whisper.transcribe(audio_path, word_timestamps=True)

    # 2. Diarize
    diarization = pipeline(audio_path)

    # 3. Match by timestamps: assign each segment to the speaker
    #    whose turn covers the segment midpoint
    result = []
    for seg in segments:
        seg_midpoint = (seg.start + seg.end) / 2
        speaker = "UNKNOWN"
        for turn, _, spk in diarization.itertracks(yield_label=True):
            if turn.start <= seg_midpoint <= turn.end:
                speaker = spk
                break
        result.append({
            "speaker": speaker,
            "start": seg.start,
            "end": seg.end,
            "text": seg.text,
        })
    return result
```

### Speaker Quality

| Speakers | DER (pyannote 3.1) |
|----------|--------------------|
| 2        | 5–8%               |
| 4        | 8–12%              |
| 6        | 12–18%             |
| 8+       | 15–25%             |

### Cloud Alternatives

- AssemblyAI Diarization: $0.012/min, up to 10 speakers
- Google STT: $0.008/min, up to 6 speakers
- AWS Transcribe: $0.029/min, up to 10 speakers

Timeframe: Pyannote + Whisper integration — 3–5 days. Optimization for a specific recording type — up to 2 weeks.
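At a given monthly volume, the per-minute rates above translate directly into a cost comparison. A quick sketch (rates as listed; ignores free tiers and volume discounts):

```python
# Per-minute rates from the provider list above
RATES_PER_MIN = {
    "AssemblyAI": 0.012,
    "Google STT": 0.008,
    "AWS Transcribe": 0.029,
}

def monthly_cost(hours_per_month: float) -> dict[str, float]:
    """Estimated monthly bill per provider for the given audio volume."""
    minutes = hours_per_month * 60
    return {name: round(rate * minutes, 2) for name, rate in RATES_PER_MIN.items()}

print(monthly_cost(100))
# → {'AssemblyAI': 72.0, 'Google STT': 48.0, 'AWS Transcribe': 174.0}
```

At roughly 100 hours per month, the spread between the cheapest and most expensive option is already over 3×, so the rate difference dominates the choice at scale.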

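The midpoint-matching step inside `transcribe_with_diarization` can be exercised on plain tuples, independent of pyannote and Whisper. Here `assign_speaker` and the sample turns are illustrative, not library API:

```python
def assign_speaker(seg_start: float, seg_end: float, turns) -> str:
    """Pick the speaker whose turn contains the segment midpoint.

    turns: list of (start, end, speaker) tuples, mirroring what
    diarization.itertracks(yield_label=True) yields.
    """
    midpoint = (seg_start + seg_end) / 2
    for turn_start, turn_end, speaker in turns:
        if turn_start <= midpoint <= turn_end:
            return speaker
    return "UNKNOWN"

turns = [(0.0, 4.2, "SPEAKER_00"), (4.5, 9.7, "SPEAKER_01")]
print(assign_speaker(1.0, 3.0, turns))  # midpoint 2.00 → SPEAKER_00
print(assign_speaker(5.0, 7.0, turns))  # midpoint 6.00 → SPEAKER_01
print(assign_speaker(4.1, 4.6, turns))  # midpoint 4.35 is in a gap → UNKNOWN
```

The last case shows the main weakness of midpoint matching: segments whose midpoint falls in a silence gap between turns come back as UNKNOWN, which is why matching by largest time overlap is a common refinement for noisy diarization output.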