# Implementing a custom vocabulary for the STT system

A custom vocabulary is the fastest way to improve recognition of specific terms, names, and abbreviations without retraining the model. It acts as a hint to the STT engine: "pay special attention to these words."

### Implementation for the main providers

**AWS Transcribe Custom Vocabulary**:

```python
import boto3

transcribe = boto3.client('transcribe')

# Create the vocabulary from a file stored in S3
transcribe.create_vocabulary(
    VocabularyName='corporate-terms-v1',
    LanguageCode='ru-RU',
    VocabularyFileUri='s3://my-bucket/vocabulary.txt'
)

# Format of vocabulary.txt (tab-separated columns):
# Phrase\tSoundsLike\tIPA\tDisplayAs
# Б-Ф-И-О\tбэ эф и о\t\tБФИО
# ИНН\tин эн эн\t\tИНН
```

**Azure Custom Speech**:

```python
# Domain adaptation data is added through the Azure Portal or the REST API
# Supported: pronunciation dictionary, phrase list
import requests

# Phrase list payload (the request that sends it is omitted here)
phrase_list = {
    "kind": "PhraseList",
    "locale": "ru-RU",
    "phrases": ["ОГРН", "СНИЛС", "КПП", "расчётный счёт"]
}
```

**faster-whisper with hints via initial prompt**:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")
audio = "audio.wav"  # placeholder path to the input file

# The initial prompt steers the model toward the target vocabulary
initial_prompt = "ИНН, ОГРН, СНИЛС, КПП, расчётный счёт, генеральный директор."
segments, _ = model.transcribe(
    audio,
    initial_prompt=initial_prompt,
    language="ru"
)
```

The `initial_prompt` method is unreliable for long files: the prompt is applied only to the first window.

### Dictionary maintenance

- Dictionary versioning (v1, v2...)
- Automatic updates when new terms appear
- A/B testing of versions on representative audio

Timeframe: 1–2 days for basic integration, including populating the dictionary.
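The A/B testing step in the maintenance list could be approximated by measuring how many domain terms from reference transcripts survive in the STT output for each dictionary version. The sketch below is one possible way to do that, not a standard metric: `term_recall` and `compare_versions` are hypothetical helper names, and matching is a simple case-insensitive substring count, so a short term can also match inside a longer word.

```python
from typing import Dict, List


def term_recall(references: List[str], hypotheses: List[str],
                terms: List[str]) -> float:
    """Share of term occurrences in the references that also appear
    in the STT output for the same utterance (case-insensitive)."""
    terms_l = [t.lower() for t in terms]
    expected = found = 0
    for ref, hyp in zip(references, hypotheses):
        ref_l, hyp_l = ref.lower(), hyp.lower()
        for term in terms_l:
            n_ref = ref_l.count(term)
            expected += n_ref
            # A term counts as found at most as often as it was expected
            found += min(n_ref, hyp_l.count(term))
    return found / expected if expected else 0.0


def compare_versions(references: List[str],
                     outputs_by_version: Dict[str, List[str]],
                     terms: List[str]) -> Dict[str, float]:
    """Term recall per dictionary version, e.g. {'v1': 0.9, 'v2': 0.95}."""
    return {version: term_recall(references, hyps, terms)
            for version, hyps in outputs_by_version.items()}
```

Running `compare_versions` on the same representative audio set for `v1` and `v2` gives one number per version, which is easier to track over time than reading transcripts by hand. Word error rate on the full text is still worth checking alongside it, since a dictionary can improve term recall while hurting ordinary words.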

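For the `initial_prompt` limitation on long files noted above, one possible workaround is to split the audio into fixed-length chunks and transcribe each chunk with the same prompt. This is a minimal sketch under several assumptions: the audio is already loaded as a sequence of samples (for faster-whisper, a 16 kHz mono array), `transcribe_chunk` is a hypothetical callable that wraps `model.transcribe(chunk, initial_prompt=..., language=...)` and returns plain text, and hard cuts at chunk boundaries are acceptable even though they can split a word.

```python
from typing import Callable, List, Sequence, Tuple

SAMPLE_RATE = 16_000  # assumed sample rate of the loaded audio


def chunk_bounds(n_samples: int, chunk_s: float = 30.0,
                 sample_rate: int = SAMPLE_RATE) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices that cover the audio
    in consecutive fixed-size chunks; the last chunk may be shorter."""
    step = int(chunk_s * sample_rate)
    if step <= 0:
        raise ValueError("chunk_s * sample_rate must be at least 1 sample")
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]


def transcribe_with_prompt(audio: Sequence,
                           transcribe_chunk: Callable[[Sequence], str],
                           chunk_s: float = 30.0,
                           sample_rate: int = SAMPLE_RATE) -> str:
    """Transcribe every chunk separately so each one sees the prompt,
    then join the non-empty chunk texts with spaces."""
    texts = [transcribe_chunk(audio[start:end])
             for start, end in chunk_bounds(len(audio), chunk_s, sample_rate)]
    return " ".join(t.strip() for t in texts if t.strip())
```

With the faster-whisper example above, `transcribe_chunk` might be written as `lambda chunk: " ".join(s.text for s in model.transcribe(chunk, initial_prompt=initial_prompt, language="ru")[0])`. Cutting on silence instead of fixed offsets would likely give cleaner boundaries, at the cost of a voice activity detection step.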






