# Implementing Profanity Filtering in STT

Profanity filtering in transcriptions is necessary for public platforms, corporate systems, and services that work with children. It can be implemented in the STT engine itself, at the post-processing stage, or at both levels.

### Built-in filters in STT providers

**Google STT**:

```python
from google.cloud import speech

config = speech.RecognitionConfig(
    profanity_filter=True,  # masks profanities, keeping only the first letter (e.g. "f***")
    language_code="ru-RU",
)
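```

**AWS Transcribe**:

```python
transcribe.start_transcription_job(
    # ... TranscriptionJobName, LanguageCode, Media and other job parameters ...
    Settings={
        # Name of a custom vocabulary filter created beforehand
        "VocabularyFilterName": "profanity-filter-ru",
        "VocabularyFilterMethod": "mask",  # 'remove' | 'mask' | 'tag'
    },
)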
```

**Azure Speech**:

```python
speech_config.set_profanity(speechsdk.ProfanityOption.Masked)
```

### Post-processing filter

For engines without a built-in filter, or when finer control is needed:

```python
import re

import pymorphy3

PROFANITY_LIST_RU = [...]  # list of words in normalized (dictionary) form

# Building a MorphAnalyzer is relatively expensive, so create it once and reuse it
morph = pymorphy3.MorphAnalyzer()


def filter_profanity(text: str, replacement: str = "***") -> str:
    """Morphology-aware filter: matches words by their normalized form."""

    def mask(match: re.Match) -> str:
        word = match.group(0)
        # Normalize the word (reduce it to its dictionary form)
        parsed = morph.parse(word)
        normal_form = parsed[0].normal_form if parsed else word.lower()
        return replacement if normal_form in PROFANITY_LIST_RU else word

    # Substitute word by word so that punctuation and spacing are preserved
    return re.sub(r"\w+", mask, text)
```

Morphological normalization with pymorphy3 is critical for Russian: an obscene word can appear in any grammatical form, so matching exact strings against a word list would miss inflected variants.

### Logging without storage

To support auditing without storing content, log only the fact that obscene language was detected and a timestamp, never the word itself.

Timeframe: 1–2 days.
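
The content-free audit logging described above could look roughly like the sketch below. It is an illustration under stated assumptions, not a definitive implementation: the function name `audit_profanity`, the `utterance_id` field, the logger name, and the JSON record layout are all hypothetical. Detection is inferred simply by comparing the text before and after filtering, so neither the transcript nor the offending word reaches the log.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

# Hypothetical logger name; route it to whatever audit sink is in use
audit_logger = logging.getLogger("profanity_audit")


def audit_profanity(utterance_id: str, original: str, filtered: str) -> Optional[dict]:
    """Log that profanity was detected, without storing any transcript text.

    Returns the audit record, or None if filtering changed nothing.
    """
    if original == filtered:
        return None  # nothing was masked, so there is nothing to audit

    record = {
        "event": "profanity_detected",
        "utterance_id": utterance_id,  # assumed identifier, not part of the source
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Only metadata is serialized; `original` and `filtered` are never logged
    audit_logger.info(json.dumps(record))
    return record
```

A caller would pass the raw transcript and the output of the filter, for example `audit_profanity("utt-1", text, filter_profanity(text))`, and discard the raw text afterwards.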







