## Speaker Identification Implementation

Speaker identification is the process of matching a voice against a database of known speakers. Unlike diarization ("who spoke when"), it answers the question "who is this person?" It is used in authentication systems, personalized assistants, and broadcast monitoring.

### System Architecture

```
Audio → VAD → Speaker Encoder → Embedding → Similarity Search → Identity
              (ECAPA-TDNN)      (d-vector)  (cosine / ANN)
```

### Extracting speaker embeddings

```python
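from speechbrain.pretrained import SpeakerRecognition  # SpeechBrain 1.0+: speechbrain.inference.speaker
import torchaudio
import torch

# ECAPA-TDNN: a state-of-the-art speaker embedding architecture
model = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="tmp_spkrec"
)


def get_embedding(audio_path: str) -> torch.Tensor:
    signal, sr = torchaudio.load(audio_path)  # shape: [channels, time]
    # Downmix to mono so a stereo file is not treated as a batch of two utterances
    if signal.shape[0] > 1:
        signal = signal.mean(dim=0, keepdim=True)
    # The VoxCeleb model expects 16 kHz input
    if sr != 16000:
        signal = torchaudio.functional.resample(signal, sr, 16000)
    with torch.no_grad():
        embedding = model.encode_batch(signal)  # shape: [1, 1, 192]
    return embedding.squeeze()  # shape: [192]


# Register a new speaker
def register_speaker(name: str, audio_samples: list[str]) -> torch.Tensor:
    embeddings = [get_embedding(p) for p in audio_samples]
    mean_embedding = torch.stack(embeddings).mean(0)
    return mean_embedding  # store this in the database under `name`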
```

### Searching the voice database

```python
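import faiss
import numpy as np
import torch

# Exact inner-product index; for millions of voices an approximate (ANN) index may be preferable
index = faiss.IndexFlatIP(192)  # inner product equals cosine similarity on L2-normalized vectors
speaker_names = []


def add_speaker(name: str, embedding: torch.Tensor):
    # FAISS expects a contiguous float32 array of shape [n, d]
    emb_np = np.ascontiguousarray(embedding.detach().cpu().numpy().reshape(1, -1), dtype=np.float32)
    faiss.normalize_L2(emb_np)
    index.add(emb_np)
    speaker_names.append(name)


def identify_speaker(audio_path: str, threshold: float = 0.75) -> str:
    embedding = get_embedding(audio_path).detach().cpu().numpy()
    query = np.ascontiguousarray(embedding.reshape(1, -1), dtype=np.float32)
    faiss.normalize_L2(query)
    distances, indices = index.search(query, k=1)
    score = float(distances[0][0])  # cosine similarity of the best match
    # An empty index returns index -1, so guard before the lookup
    if indices[0][0] >= 0 and score >= threshold:
        return speaker_names[indices[0][0]]
    return "UNKNOWN"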
```

### ECAPA-TDNN performance

EER (Equal Error Rate) on VoxCeleb1: 0.87%, which is production-grade. With 10+ seconds of audio for enrollment, accuracy exceeds 95% at a threshold of 0.8.

### Implementation timeline

Basic identification system: 1 week. With FAISS index and voice database management: 2 weeks.
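
The EER quoted in the performance section is the operating point where the false-accept rate (impostors scored at or above the threshold) equals the false-reject rate (genuine speakers scored below it). The acceptance threshold is usually calibrated on your own recordings in the same way. The following is a minimal sketch of that calculation, not the evaluation protocol behind the VoxCeleb1 figure: `compute_eer` is a hypothetical helper, and the scores in the usage line are made-up cosine similarities for illustration only.

```python
def compute_eer(genuine_scores, impostor_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    genuine_scores: similarities for same-speaker trials.
    impostor_scores: similarities for different-speaker trials.
    """
    # Every observed score is a candidate decision threshold
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False-reject rate: genuine trials that fall below the threshold
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False-accept rate: impostor trials that reach the threshold
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Illustrative scores only (not measured data)
print(compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6]))  # (0.25, 0.6)
```

With real data, the returned threshold is a reasonable starting value for the `threshold` argument of `identify_speaker`. Raise it if false accepts are more costly than false rejects, as is typical in authentication.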







