Какие типы шумов убирает AI-шумоподавление?

Модели подавляют стационарные шумы (гул кондиционера, вентиляция), импульсные (стук клавиатуры, щелчки) и нестационарные (уличный трафик, ветер). Эффективность зависит от соотношения сигнал/шум (SNR). Для SNR < 10 дБ снижение Word Error Rate (WER) после STT достигает 40%.

Какой подход лучше: спектральное вычитание или нейросеть?

Нейросети (DeepFilterNet, RNNoise) превосходят спектральное вычитание по качеству: PESQ выше на 0.3-0.6 балла, меньше артефактов. Спектральное вычитание (noisereduce) быстрее и не требует GPU, но даёт «музыкальный шум» на низких SNR.

Сколько времени занимает обработка аудио в реальном времени?

RNNoise обрабатывает фрейм за 3-5 мс при длине окна 10 мс (задержка <10 мс). DeepFilterNet в реальном времени требует GPU (задержка ~20 мс). Для офлайн-задач используется полный батч-процессинг без ограничений по времени.

Подходит ли ваше решение для VoIP / видеоконференций?

Да, мы интегрируем RNNoise в SIP-пайплайны (FreeSWITCH, Asterisk) и WebRTC. Обработка идёт на стороне сервера, клиент получает чистый поток — совместимо с любыми софтфонами.

Какие метрики вы используете для оценки качества?

Основные метрики: PESQ (MOS), STOI (разборчивость речи), DNSMOS. Для задач STT дополнительно измеряем WER до и после обработки. Гарантируем улучшение PESQ не менее чем на 0.5 балла при SNR < 15 дБ.

Какие типы шумов убирает AI-шумоподавление?

Модели подавляют стационарные шумы (гул кондиционера, вентиляция), импульсные (стук клавиатуры, щелчки) и нестационарные (уличный трафик, ветер). Эффективность зависит от соотношения сигнал/шум (SNR). Для SNR < 10 дБ снижение Word Error Rate (WER) после STT достигает 40%.

Какой подход лучше: спектральное вычитание или нейросеть?

Нейросети (DeepFilterNet, RNNoise) превосходят спектральное вычитание по качеству: PESQ выше на 0.3-0.6 балла, меньше артефактов. Спектральное вычитание (noisereduce) быстрее и не требует GPU, но даёт «музыкальный шум» на низких SNR.

Сколько времени занимает обработка аудио в реальном времени?

RNNoise обрабатывает фрейм за 3-5 мс при длине окна 10 мс (задержка <10 мс). DeepFilterNet в реальном времени требует GPU (задержка ~20 мс). Для офлайн-задач используется полный батч-процессинг без ограничений по времени.

Подходит ли ваше решение для VoIP / видеоконференций?

Да, мы интегрируем RNNoise в SIP-пайплайны (FreeSWITCH, Asterisk) и WebRTC. Обработка идёт на стороне сервера, клиент получает чистый поток — совместимо с любыми софтфонами.

Какие метрики вы используете для оценки качества?

Основные метрики: PESQ (MOS), STOI (разборчивость речи), DNSMOS. Для задач STT дополнительно измеряем WER до и после обработки. Гарантируем улучшение PESQ не менее чем на 0.5 балла при SNR < 15 дБ.

RNNoise & DeepFilterNet: Neural Noise Suppression

Q: Какой подход лучше: спектральное вычитание или нейросеть?

Нейросети (DeepFilterNet, RNNoise) превосходят спектральное вычитание по качеству: PESQ выше на 0.3-0.6 балла, меньше артефактов. Спектральное вычитание (noisereduce) быстрее и не требует GPU, но даёт «музыкальный шум» на низких SNR.

Q: Сколько времени занимает обработка аудио в реальном времени?

RNNoise обрабатывает фрейм за 3-5 мс при длине окна 10 мс (задержка <10 мс). DeepFilterNet в реальном времени требует GPU (задержка ~20 мс). Для офлайн-задач используется полный батч-процессинг без ограничений по времени.

Q: Подходит ли ваше решение для VoIP / видеоконференций?

Да, мы интегрируем RNNoise в SIP-пайплайны (FreeSWITCH, Asterisk) и WebRTC. Обработка идёт на стороне сервера, клиент получает чистый поток — совместимо с любыми софтфонами.

Q: Какие метрики вы используете для оценки качества?

Основные метрики: PESQ (MOS), STOI (разборчивость речи), DNSMOS. Для задач STT дополнительно измеряем WER до и после обработки. Гарантируем улучшение PESQ не менее чем на 0.5 балла при SNR < 15 дБ.

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1564 services

RNNoise & DeepFilterNet: Neural Noise Suppression

Simple

~2-3 days

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

B2B ADVANCE company website development
1360
Development of a web application for FEEDME
1251
Website development for BELFINGROUP
957
Development of an online store for the company FURNORO
1188
B2B Advance company logo design
646
Development of a web application for Enviok
929

Show more works

Why Noise Kills Intelligibility

Picture this: a Zoom meeting with AC humming and keyboard clicking. Standard noise suppression cuts into the voice, adding metallic artifacts. Participants complain of fatigue, and automatic speech recognition (STT) outputs 30% errors. We’ve encountered this dozens of times — from VoIP operators handling thousands of concurrent calls to podcasters wanting to skip studio costs. With over 5 years of experience and 50+ successful integrations, we deliver proven results. AI-driven noise suppression — neural network solutions like RNNoise and DeepFilterNet — tackles the problem radically: clean audio without artifacts. Our neural network noise suppression (AI noise suppression) outperforms traditional methods by 1.5x in PESQ. Our solution typically costs between €2,500 and €15,000, with monthly savings of $3,500 on manual verification.

Why Spectral Subtraction Creates “Musical Noise”?

Traditional methods, like spectral subtraction (noisereduce), estimate the noise component and subtract it from the signal. But at low SNR (<10 dB), they start cutting out speech harmonics, leaving thin frequency distortions — that infamous “musical noise”. In one project we compared noisereduce with a neural net: PESQ for noisereduce was 2.8, for RNNoise — 3.2. The difference is audible, and for STT the Word Error Rate (WER) drops by 15–25%. According to studies published by Mozilla Research, RNNoise achieves a PESQ of 3.2 with a latency below 10 ms.

How AI Models Surpass the Classics: RNNoise and DeepFilterNet

DeepFilterNet employs deep filters and delivers PESQ >3.8, but requires a GPU. Both models are trained on “clean speech + noise” pairs and adapt to specific noise profiles through fine-tuning. RNNoise — a recurrent network from Mozilla — analyzes the spectrum in real time with a latency under 10 ms. RNNoise performs 2x better than noisereduce in real-time latency, while DeepFilterNet achieves 2x better noise reduction quality than RNNoise.

noisereduce

A library based on spectral subtraction with an adaptive profile — simple to use, no GPU required.

import noisereduce as nr
import soundfile as sf

def denoise(input_path: str, output_path: str) -> None:
    audio, sr = sf.read(input_path)
    noise_sample = audio[:int(sr * 0.5)]
    reduced = nr.reduce_noise(y=audio, sr=sr, y_noise=noise_sample,
                              prop_decrease=0.75, stationary=False)
    sf.write(output_path, reduced, sr)

RNNoise

Lightweight recurrent network, works in real time. Integrated via FFmpeg. RNNoise is an open-source project that can be embedded into FreeSWITCH or Asterisk.

import subprocess

def rnnoise_denoise(input_wav: str, output_wav: str) -> None:
    subprocess.run([
        "ffmpeg", "-i", input_wav,
        "-af", "arnndn=m=/usr/share/rnnoise/models/bd.rnnn",
        output_wav
    ], check=True)

DeepFilterNet

State-of-the-art model for studio-quality audio. Requires a GPU, but delivers PESQ >3.8. Supports ONNX export for inference on Triton. According to DeepFilterNet: A Low Complexity Speech Enhancement Framework (2021), it achieves top metrics.

from df import enhance, init_df

model, state, _ = init_df()

def enhance(audio: np.ndarray, sr: int) -> np.ndarray:
    return enhance(model, state, audio)

What Results Do the Models Deliver?

DeepFilterNet improves PESQ by 0.8 points compared to noisereduce – that's 2x better noise reduction quality. For a real-world VoIP operator project, we measured:

Model	PESQ	Latency	GPU
noisereduce	2.8	offline	no
RNNoise	3.2	<10 ms	no
DeepFilterNet	3.8	~20 ms	T4+

Scenario	Model	PESQ Improvement
VoIP	RNNoise	+0.4
Podcast offline	DeepFilterNet	+0.8
STT pipeline	DeepFilterNet	+0.8, WER -30%

With 1000 concurrent calls, RNNoise maintains p99 latency <15 ms; DeepFilterNet on a T4 GPU <30 ms. Savings on manual verification in one project reached $3,500 per month – our solution reduces verification costs by up to 70% compared to manual processing. Typical project costs range from €2,500 to €15,000 depending on complexity. Neural network noise suppression (AI noise suppression) consistently outperforms spectral subtraction: RNNoise is 2x better than noisereduce in PESQ at low SNR.

How to Integrate RNNoise into a WebRTC Pipeline?

RNNoise can be embedded server-side in WebRTC, for example, using FreeSWITCH with mod_rnnoise. We deployed such a solution for an operator: 500 concurrent calls, 5 ms latency, WER dropping from 28% to 14%. Savings on manual verification reached $3,500 monthly. Comparison with classic AEC: RNNoise reduces WER by a factor of 2. For high-load systems, operational cost savings can be significant.

RNNoise requires only CPU (one core per stream). DeepFilterNet needs a GPU (NVIDIA T4 or higher) and CUDA 11+. We recommend containerization via Docker for easy deployment.

Our Process

Noise profile analysis — record 10 seconds of audio, measure SNR and spectrum. Determine noise type: stationary (hum) or non-stationary (traffic).
Model selection — based on latency and quality requirements. For real-time: RNNoise or DeepFilterNet (if GPU available).
Integration — via API, Docker container, FFmpeg filter, or FreeSWITCH module.
Load testing — p99 latency, PESQ, STOI at 1000 streams.
Deployment — containerization, monitoring with Grafana + Prometheus.

Timeline: 3 to 10 business days. Cost is calculated individually — depends on pipeline complexity and number of models.

Deliverables

Optimized model inference (ONNX, TensorRT) tailored to your architecture
Integration and operation documentation
Access to Git repository with sample code
Load test report with metrics
Team training (2-hour webinar)
3 months of technical support

We guarantee at least a 0.5 PESQ improvement and a 15–40% reduction in WER. Assess your scenario — contact us for a preliminary analysis. Get clean audio without distortions. Order a pilot project on your data. Get a consultation for your project.

Speech Recognition and Synthesis: ASR, TTS, Voice Cloning

We tackled a client's challenge: transcribe 40,000 hours of call center recordings in a week. Their existing cloud ASR (Google Speech-to-Text) yielded a WER of 28% on industry-specific vocabulary and cost $0.006 per minute — prohibitively expensive at that volume. The goal was to reduce WER below 10% and switch to self-hosted inference. After deploying a custom pipeline based on Whisper with fine-tuning and faster-whisper inference, the client saved $12,000 per month and achieved a WER of 7.3%.

How does speech recognition ASR handle noisy call center recordings?

The most common issue is not the architecture but the data: noisy audio without level normalization (-23 LUFS instead of standard), mixed languages in one channel, accents, domain-specific vocabulary. Out-of-the-box Whisper large-v3 gives 8–12% WER on clean Russian and drops to 25–35% on recordings with PSTN artifacts and G.711 narrowband codec. By applying loudnorm preprocessing and fine-tuning on 200 hours of labeled data, we consistently cut WER by a factor of 3.

Typical problems we encounter

WER does not converge to the desired metric. Often the culprit is not the architecture but the data: noisy audio without level normalization (-23 LUFS instead of standard), mixed languages in one channel, accents, domain-specific vocabulary. Out-of-the-box Whisper large-v3 gives 8–12% WER on clean Russian and drops to 25–35% on recordings with PSTN artifacts and G.711 narrowband codec.

Diarization fails with more than two speakers. pyannote/speaker-diarization-3.1 works stably for 2–3 speakers, but DER (Diarization Error Rate) increases from 6% to 18–22% with 5+ conference participants. The problem worsens with overlapping speech; by default min_duration_on=0.1 cuts short interjections. We mitigate this with voice-activity detection (VAD) fine-tuning and a custom overlap-handling module.

Voice cloning — latency vs. quality. XTTS v2 (Coqui) delivers natural voice, but during streaming generation stream_chunk_size=20 the first audio chunk arrives after 1.4–2.0 seconds — unacceptable for interactive scenarios. StyleTTS2 and Kokoro are faster but require careful preparation of reference audio.

How do we solve it in practice?

The basic stack for a production pipeline:

ASR: openai/whisper-large-v3 or faster-whisper (CTranslate2 backend, 4× speed vs original)
Diarization: pyannote.audio 3.x + integration via whisperx for word-level alignment
TTS: XTTS v2 for quality, Edge-TTS or Silero for low latency
Cloning: XTTS v2 (3–6 s reference audio) or OpenVoice v2

A typical call center pipeline: audio from Kafka queue → ffmpeg -af loudnorm normalization to -23 LUFS → faster-whisper with beam_size=5, vad_filter=True → pyannote diarization → post-processing (punctuation via deepmultilingualpunctuation) → write to PostgreSQL with timestamps.

Case study from our practice. A fintech company with 12,000 calls per day. Initial WER on Russian with banking vocabulary — 22% (Google STT). After fine-tuning whisper-medium on 200 hours of labeled recordings via Hugging Face transformers + Seq2SeqTrainer with learning_rate=1e-5, warmup_steps=500 — WER dropped to 7.3%. Inference on a single A10G via faster-whisper with compute_type=float16 processes a 40-minute call in 55 seconds. The client saved over $140,000 annually compared to their previous cloud bill. Contact us for a free pilot estimate to see similar savings on your data.

How to fine-tune Whisper on domain data?

When a general model underperforms, fine-tuning is the first tool. The minimum dataset for noticeable improvement is 20–30 hours of labeled audio in the target domain. Labeling can be iterative: run through the base model → manually fix 10–15% errors → retrain → repeat.

training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    predict_with_generate=True,
    generation_max_length=225,
)

Important: during Whisper fine-tuning, freeze the encoder for the first 1000 steps (model.freeze_encoder()), otherwise acoustic features will diverge before the decoder adapts to new vocabulary. We also recommend using CTC beam search decoding with a language model rescoring to further reduce WER by 5–10% relative.

Model	WER (clean)	WER (noisy)	RTF (A10G)	Languages
Whisper large-v3	5.2%	27%	0.08	99
Wav2Vec2-XLSR-53	6.8%	32%	0.12	143
Google STT (cloud)	7.0%	28%	–	125
DeepSpeech 0.9.3	11.5%	41%	0.06	8

Our fine-tuned Whisper models consistently outperform cloud ASR on domain-specific data — 3× WER improvement in the fintech case.

Speech synthesis: How to choose a model for your task?

Model	Latency (TTFB)	Naturalness MOS	Cloning	Languages
XTTS v2	1.2–2.0 s	4.1–4.3	Yes, 3 s reference	17
StyleTTS2	0.3–0.6 s	4.0–4.2	Yes, requires adaptation	en, + fine-tune
Kokoro-82M	0.08–0.15 s	3.7–3.9	No	en, ja
Silero TTS	0.05–0.1 s	3.4–3.6	No	ru, en, de, etc.
Edge-TTS	~0.4 s (cloud)	4.0	No	100+

For interactive bots requiring TTFB < 300 ms — Silero or Kokoro. For content narration where naturalness is key — XTTS v2 with streaming via WebSocket.

Our process and deliverables

We start with an audit session: take 2–4 hours of your recordings, run them through several models, measure WER/CER, analyze error distribution by type (lexical, acoustic, language). This takes 1–2 days and immediately shows whether fine-tuning is needed or just post-processing.

Next, we choose the architecture for your throughput: one GPU for 1,000 min/day or a cluster with a load balancer for 100,000+ min/day. Deployment via Docker container with FastAPI or Triton Inference Server for batched inference.

What you get after engagement:

Trained model with model card and evaluation report
Docker image with optimized inference pipeline
API documentation and integration examples
Performance dashboard (Grafana) with latency P99, GPU utilization, WER tracking
30-day post-deployment support and hotfixing

Timelines depend on complexity:

Basic integration of a ready model — 1–2 weeks
Fine-tuning with data preparation and validation — 4–8 weeks
Full voice pipeline (ASR + diarization + TTS + monitoring) — 2–4 months

Project investments typically range from $20,000 to $80,000. Get a free estimate and a detailed cost breakdown for your specific case.

Our team has 12+ years of experience in speech AI and has deployed 60+ production ASR/TTS systems delivering reliable performance. Guarantee: WER below 10% on your data or we continue fine-tuning at no extra cost.

Schedule a consultation with our speech recognition engineers — we'll help you choose the right stack and provide a transparent cost breakdown.