# Implementing speech recognition in noisy environments (Noise-Robust STT)

Standard STT models degrade sharply when the signal-to-noise ratio (SNR) drops below 10 dB: word error rate (WER) rises from 8% to 30–60%. Noise-robust STT addresses this through audio preprocessing and the use of noise-robust models.

### Preprocessing pipeline

```python
import torch
import torchaudio
from denoiser import pretrained

# Facebook Denoiser: state-of-the-art noise suppression
denoiser_model = pretrained.dns64()
denoiser_model.eval()

TARGET_SR = 16000  # sample rate the dns64 model expects


def denoise_audio(audio_path: str) -> torch.Tensor:
    """Load an audio file and return the denoised mono waveform at 16 kHz."""
    waveform, sr = torchaudio.load(audio_path)  # shape: (channels, samples)
    # The model takes single-channel input, so downmix multi-channel audio.
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    with torch.no_grad():
        # Add a batch dimension, run the model, then take the single batch item.
        denoised = denoiser_model(waveform.unsqueeze(0))[0]
    return denoised.squeeze(0)  # shape: (samples,)
```

### Noise reduction tools

| Tool | Type | Quality | Latency |
|------|------|---------|---------|
| Facebook Denoiser | DNN | High | 50–100 ms |
| RNNoise | RNN | Good | <10 ms |
| DeepFilterNet | DNN | High | 20–50 ms |
| Speex DSP | DSP | Medium | <5 ms |
| noisereduce (scipy) | Statistical | Medium | n/a |

For real-time processing, use RNNoise or DeepFilterNet. For batch processing, use Facebook Denoiser.

### Whisper with VAD filtering
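Whisper tends to hallucinate text on noisy stretches of audio that contain no speech. The VAD filter in faster-whisper removes these non-speech segments before transcription:

```python
from faster_whisper import WhisperModel

# Model size is an assumption for illustration; use whichever model you deploy.
model = WhisperModel("small")

segments, _ = model.transcribe(
    audio,  # path to an audio file or a 16 kHz mono float32 NumPy array
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,                 # speech probability threshold
        "min_speech_duration_ms": 250,    # drop speech chunks shorter than this
        "min_silence_duration_ms": 2000,  # silence needed to split a segment
        "speech_pad_ms": 400,             # padding added around each speech chunk
    },
)
```

### Testing on noisy data

We use the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) listening test and the PESQ (Perceptual Evaluation of Speech Quality) metric to evaluate audio quality after noise reduction. The target is a PESQ score above 3.0 for comfortable listening.

Timeframe: basic noise reduction plus STT takes 3–4 days; an optimized pipeline for a specific noise type takes 1–2 weeks.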
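When testing on noisy data, it can also help to track recognition accuracy directly, since the degradation described at the start is expressed in WER at a given SNR. The following is a minimal sketch under stated assumptions: it mixes noise into clean speech at a chosen SNR and computes WER as a word-level edit distance. The names `mix_at_snr` and `word_error_rate` are hypothetical helpers, not part of any library mentioned here, and signals are plain lists of floats to keep the example dependency-free.

```python
import math
from typing import List, Sequence


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_speech / P_scaled_noise), solved for the noise gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Two-row Levenshtein distance computed over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word ("the") and one substituted word ("off" -> "of"): 2 / 4.
    print(word_error_rate("turn the lights off", "turn lights of"))  # 0.5
```

Running the same test set through the pipeline at several SNR levels (for example 20, 10, and 0 dB, chosen here purely for illustration) shows whether the denoising step actually reduces WER, which PESQ alone does not guarantee.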







