Which models do you use for audio separation?

We apply Demucs v4 (htdemucs) for general separation, MDX-Net for maximum quality, and Spleeter for fast batch processing. The choice depends on the task: vocal isolation, drum separation, or remastering.

How long does it take to process a standard-length track?

On GPU, processing one song (3–4 minutes) takes 2 to 10 seconds depending on the model. In batch mode on CPU, it takes up to a minute per track.

Can you separate speech and music in video?

Yes, we integrate separation into the video processing pipeline. We extract the audio track, separate speech from the background, then feed the speech into STT or replace it with a dubbed track. In a typical scenario, WER falls from 18% to 4%.

What formats and bitrates are supported?

We support WAV, MP3, FLAC, M4A, OGG with any bitrate. For maximum quality, we recommend uncompressed WAV 44.1 kHz, stereo.

Do you need to retrain the model from scratch?

No. We use pretrained models and fine-tune only for specific scenarios, such as archival recordings with narrow frequency bands. In most cases, the solutions work out of the box.

AI Audio Separation Integration: Demucs, MDX-Net, Spleeter

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

AI Audio Source Separation Integration for Business

Imagine a concert recording where vocals blend with guitar and drums. Producing a clean karaoke track from it without AI is impractical: older methods (ICA, NMF) leave audible artifacts, and manual processing takes hours. We solve this with modern neural networks. Our experience: 5+ years in audio processing and over 30 source separation projects in media and music production. We guarantee separation quality and adherence to your deadlines.

Source separation is the extraction of individual sound sources from a mixed signal. It is used in music production (stems), speech processing (removing background music), video post-production, and remastering archival recordings.

What Problems We Solve

Low separation quality. Older methods (ICA, NMF) produce strong artifacts. Modern deep learning models score far higher: Demucs v4 and MDX-Net reach an SDR of 9 dB or more, which means clean separation without noticeable noise, while Spleeter trades some quality (6.8 dB) for speed.

Processing speed. For batch processing of hundreds of tracks, performance is critical. Spleeter runs 100× faster than real-time on GPU, Demucs at 1.5×. We optimize pipelines for your hardware.
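To make the realtime factors concrete, the sketch below converts them into wall-clock estimates for a batch. The function name and the example batch (300 tracks of 3.5 minutes each) are illustrative assumptions; the realtime factors are the ones quoted above. It ignores file loading and I/O, so treat the result as a lower bound.

```python
def batch_processing_seconds(
    n_tracks: int, track_minutes: float, realtime_factor: float
) -> float:
    """Estimate wall-clock seconds needed to process a batch.

    realtime_factor is the number of seconds of audio processed per second
    of compute (100 means 100x faster than real time).
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    total_audio_seconds = n_tracks * track_minutes * 60
    return total_audio_seconds / realtime_factor


# Hypothetical batch: 300 tracks, 3.5 minutes each (63,000 s of audio).
spleeter_s = batch_processing_seconds(300, 3.5, 100)  # 630 s, about 10.5 minutes
demucs_s = batch_processing_seconds(300, 3.5, 1.5)    # 42,000 s, about 11.7 hours
```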

Integration into existing workflows. API on FastAPI, batch processing via queues, support for popular formats—we implement it all turnkey.
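As a rough illustration of queue-based batch processing, here is a minimal in-process sketch that uses only the Python standard library. The names `run_batch` and `fake_separate` are hypothetical, and the stand-in function replaces a real separation call; a production deployment would more likely use an external queue and a model loaded once per worker.

```python
import queue
import threading
from typing import Callable


def run_batch(
    paths: list[str],
    separate: Callable[[str], dict],
    n_workers: int = 2,
) -> dict[str, dict]:
    """Process files with worker threads fed from a shared queue."""
    jobs: queue.Queue = queue.Queue()
    results: dict[str, dict] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            path = jobs.get()
            if path is None:  # sentinel: no more work for this worker
                return
            try:
                outcome = {"status": "ok", "stems": separate(path)}
            except Exception as exc:  # one bad file must not stop the batch
                outcome = {"status": "error", "error": str(exc)}
            with lock:
                results[path] = outcome

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for p in paths:
        jobs.put(p)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


def fake_separate(path: str) -> dict:
    """Stand-in for a real separator, used only to exercise the queue."""
    if not path.endswith((".wav", ".mp3", ".flac")):
        raise ValueError("unsupported format")
    return {"vocals": f"{path}.vocals.wav"}


results = run_batch(["a.wav", "b.mp3", "notes.txt"], fake_separate)
```

Each file gets its own status entry, so a failed file shows up in the report without interrupting the rest of the batch.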

How to Choose an Audio Separation Model

Choice depends on three factors: target stems, required quality, and speed.

Model Comparison Table

Model	Separation Type	Quality (SDR)	Speed
Demucs v4 (htdemucs)	Vocals/drums/bass/other	9.0 dB	1.5× realtime on GPU
Spleeter (Deezer)	2/4/5 stems	6.8 dB	100× realtime
Open-Unmix (UMX)	4 stems	7.2 dB	10× realtime
MDX-Net	4 stems (MDX Challenge entry)	9.5 dB	2× realtime
BS-RoFormer	4 stems (current SOTA)	10.1 dB	0.8× realtime

SDR (Signal-to-Distortion Ratio) is the main metric: higher = cleaner separation.
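For intuition, the sketch below computes the basic form of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²), with NumPy. The function name, the test tone, and the noise level are illustrative assumptions. Benchmark figures are normally produced with dedicated evaluation toolkits over many tracks, so values from this simplified formula are not directly comparable with the table.

```python
import numpy as np


def sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-10) -> float:
    """Basic signal-to-distortion ratio in dB between a reference stem and its estimate."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if reference.shape != estimate.shape:
        raise ValueError("reference and estimate must have the same shape")
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite for silent or perfectly reconstructed signals
    return float(10.0 * np.log10((signal_power + eps) / (error_power + eps)))


# Synthetic check: a 440 Hz tone plus Gaussian noise at one tenth of its amplitude.
t = np.linspace(0.0, 1.0, 44100, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

example_sdr = sdr_db(clean, noisy)  # about 17 dB (power ratio of roughly 50)
```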

Why Demucs v4 is the Best Choice for Production

Demucs v4 (htdemucs) offers the best balance of quality and speed among open-source solutions. It is trained on large datasets and works reliably across genres. Below is a comparison of models by latency (processing time for 1 minute of audio on an A100 GPU):

Model	Latency (sec)	VRAM Usage
Demucs v4	0.8	2.1 GB
MDX-Net	1.2	3.8 GB
Spleeter	0.1	1.0 GB

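For production, we recommend Demucs v4 (htdemucs). Its fine-tuned variant htdemucs_ft can give slightly better quality, but it is not a lightweight option: separation takes roughly four times longer.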
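The selection logic can be sketched as a small filter over the figures above. The names `SeparationModel` and `pick_separator` are hypothetical; the SDR values come from the comparison table, and the latency and VRAM values from the A100 table. The rule (highest SDR within a latency and VRAM budget) is one reasonable assumption, not the only way to choose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeparationModel:
    name: str
    sdr_db: float
    latency_s: float  # seconds per minute of audio on an A100
    vram_gb: float


MODELS = [
    SeparationModel("Demucs v4", 9.0, 0.8, 2.1),
    SeparationModel("MDX-Net", 9.5, 1.2, 3.8),
    SeparationModel("Spleeter", 6.8, 0.1, 1.0),
]


def pick_separator(
    max_latency_s: float = float("inf"),
    max_vram_gb: float = float("inf"),
) -> Optional[SeparationModel]:
    """Return the highest-SDR model that fits both budgets, or None."""
    candidates = [
        m for m in MODELS
        if m.latency_s <= max_latency_s and m.vram_gb <= max_vram_gb
    ]
    return max(candidates, key=lambda m: m.sdr_db, default=None)


small_gpu = pick_separator(max_vram_gb=3.0)       # Demucs v4
low_latency = pick_separator(max_latency_s=0.5)   # Spleeter
```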

What Integration of Demucs into Your Pipeline Provides

We use Demucs v4 in production. Below is an example inference class:

import os
from typing import Optional

import torch
from demucs.apply import apply_model
from demucs.audio import AudioFile, save_audio
from demucs.pretrained import get_model


class AudioSourceSeparator:
    def __init__(self, model_name: str = "htdemucs"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = get_model(model_name)
        self.model.eval()
        self.model.to(self.device)

    def separate(
        self,
        audio_path: str,
        output_dir: str,
        stems: Optional[list[str]] = None,  # None = all stems
    ) -> dict[str, str]:
        """Separate a track into stems and return {stem name: file path}."""
        # Decode and resample to the rate and channel count the model expects
        wav = AudioFile(audio_path).read(
            streams=0,
            samplerate=self.model.samplerate,
            channels=self.model.audio_channels,
        )
        # Normalize using the statistics of the mono mixdown
        ref = wav.mean(0)
        wav = (wav - ref.mean()) / ref.std()

        sources = apply_model(
            self.model,
            wav[None],  # add a batch dimension
            device=self.device,
            progress=True,
            num_workers=2,
        )[0]
        # Undo the normalization
        sources = sources * ref.std() + ref.mean()

        os.makedirs(output_dir, exist_ok=True)  # save_audio needs an existing directory
        result = {}
        available_stems = self.model.sources  # ['drums', 'bass', 'other', 'vocals']
        target_stems = stems or available_stems

        for stem, source in zip(available_stems, sources):
            if stem in target_stems:
                output_path = os.path.join(output_dir, f"{stem}.wav")
                save_audio(source, output_path, samplerate=self.model.samplerate)
                result[stem] = output_path

        return result

Separating Speech from Background Music

For content processing, we use the htdemucs_ft model (the fine-tuned variant of htdemucs); speech ends up in the vocals stem. Example:

import asyncio
import subprocess
from pathlib import Path


class SpeechFromMusicExtractor(AudioSourceSeparator):
    """Extract speech from video; reuses AudioSourceSeparator.separate() defined above."""

    def __init__(self):
        super().__init__("htdemucs_ft")

    async def process_video_audio(
        self,
        video_path: str,
        output_dir: str = "/tmp/stems",
    ) -> dict:
        video = Path(video_path)
        audio_path = str(video.with_name(f"{video.stem}_audio.wav"))
        # ffmpeg and model inference are blocking, so run them in worker
        # threads to keep the event loop responsive
        await asyncio.to_thread(
            subprocess.run,
            ["ffmpeg", "-y", "-i", video_path,
             "-ac", "2", "-ar", "44100",
             "-vn", audio_path],
            check=True,
        )
        stems = await asyncio.to_thread(self.separate, audio_path, output_dir)
        music_stems = ["drums", "bass", "other"]
        return {
            "speech": stems.get("vocals"),  # speech lands in the vocals stem
            "music_components": {k: stems[k] for k in music_stems if k in stems},
        }

Our Work Process

Task analysis – determine target stems, quality and speed requirements.
Model selection – choose optimal architecture (Demucs, MDX-Net, Spleeter).
Integration – embed the model into your pipeline (API, batch, real-time).
Testing – evaluate metrics (SDR, WER) on your data.
Deployment – deploy on GPU/CPU, set up monitoring.

What's Included in the Work

Model selection and adaptation.
API or batch handler implementation.
Operational documentation.
Team training (1–2 hours).
One month of technical support.

Typical Applications

Music production: remixing—isolate drums or bass for rework; karaoke—remove vocals, keep instrumental; mastering stems—process each layer independently.

Content and media: remove background music before STT—WER drops from 18% to 4%; remaster archival recordings—separation + denoise each stem; video localization—isolate speech, replace with dubbing.

Post-production: ADR (Automated Dialogue Replacement)—clean vocals for replacing lines; music scoring—extract music for reuse.

Limitations and Nuances

Demucs performs worse when:

Very loud percussion over speech (SDR drops 2–3 dB).
Low-quality mono recordings (< 22 kHz).
Complex polyphonic overlays (4+ sources simultaneously).

For maximum vocal quality, use mdx_extra or htdemucs_ft. For speed in batch mode, use Spleeter (10–15× faster than Demucs on CPU).

Timeline and Cost

Integrating Demucs into a media file processing pipeline takes 1–2 weeks and starts at $1,200. A full service with a queue and web interface takes 3–4 weeks and starts at $3,500. Pricing is determined individually after analyzing your requirements. Contact us for a project evaluation and a consultation on AI separation implementation; we'll help you choose the right solution for your tasks.

Sources

Demucs: Hybrid Spectrogram and Waveform Source Separation, Defossez et al., 2021
Spleeter: A Fast and State-of-the-Art Music Source Separation Tool, Hennequin et al., 2019
MDX-Net: Music Demixing Challenge, 2021

Speech Recognition and Synthesis: ASR, TTS, Voice Cloning

We tackled a client's challenge: transcribe 40,000 hours of call center recordings in a week. Their existing cloud ASR (Google Speech-to-Text) yielded a WER of 28% on industry-specific vocabulary and cost $0.006 per minute — prohibitively expensive at that volume. The goal was to reduce WER below 10% and switch to self-hosted inference. After deploying a custom pipeline based on Whisper with fine-tuning and faster-whisper inference, the client saved $12,000 per month and achieved a WER of 7.3%.

How does speech recognition ASR handle noisy call center recordings?

The most common issue is not the architecture but the data: noisy audio without level normalization (levels far from the standard -23 LUFS target), mixed languages in one channel, accents, and domain-specific vocabulary. Out-of-the-box Whisper large-v3 gives 8–12% WER on clean Russian and degrades to 25–35% on recordings with PSTN artifacts and the G.711 narrowband codec. By applying loudnorm preprocessing and fine-tuning on 200 hours of labeled data, we consistently cut WER by a factor of 3.

Typical problems we encounter

WER does not converge to the desired metric. The cause is usually the data rather than the architecture: unnormalized levels, mixed languages in one channel, accents, and domain-specific vocabulary, as described in the previous section.

Diarization degrades as the number of speakers grows. pyannote/speaker-diarization-3.1 works stably for 2–3 speakers, but DER (Diarization Error Rate) rises from 6% to 18–22% with 5+ conference participants. Overlapping speech makes this worse, and by default min_duration_on=0.1 cuts short interjections. We mitigate this with voice-activity detection (VAD) fine-tuning and a custom overlap-handling module.

Voice cloning: latency vs. quality. XTTS v2 (Coqui) delivers a natural voice, but with streaming generation at stream_chunk_size=20 the first audio chunk arrives after 1.4–2.0 seconds, which is unacceptable for interactive scenarios. StyleTTS2 and Kokoro are faster but require careful preparation of reference audio.

How do we solve it in practice?

The basic stack for a production pipeline:

ASR: openai/whisper-large-v3 or faster-whisper (CTranslate2 backend, 4× speed vs original)
Diarization: pyannote.audio 3.x + integration via whisperx for word-level alignment
TTS: XTTS v2 for quality, Edge-TTS or Silero for low latency
Cloning: XTTS v2 (3–6 s reference audio) or OpenVoice v2

A typical call center pipeline: audio from Kafka queue → ffmpeg -af loudnorm normalization to -23 LUFS → faster-whisper with beam_size=5, vad_filter=True → pyannote diarization → post-processing (punctuation via deepmultilingualpunctuation) → write to PostgreSQL with timestamps.
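The normalization and transcription steps of this pipeline can be sketched as follows. The helper names `loudnorm_command` and `transcribe_call` are hypothetical, resampling to 16 kHz mono is an added assumption, and the Kafka, diarization, punctuation, and PostgreSQL stages are left out. The model is passed in so that it is loaded once rather than per call.

```python
import subprocess
from pathlib import Path


def loudnorm_command(src: str, dst: str, target_lufs: float = -23.0) -> list[str]:
    """Build an ffmpeg command that normalizes loudness and writes 16 kHz mono audio."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", f"loudnorm=I={target_lufs}",  # integrated loudness target in LUFS
        "-ar", "16000", "-ac", "1",          # assumption: 16 kHz mono for ASR
        dst,
    ]


def transcribe_call(src: str, model) -> list[dict]:
    """Normalize one recording, then transcribe it with a faster-whisper model."""
    source = Path(src)
    normalized = str(source.with_name(f"{source.stem}_norm.wav"))
    subprocess.run(loudnorm_command(src, normalized), check=True)

    # segments is a lazy generator; iterating it runs the actual decoding
    segments, _info = model.transcribe(normalized, beam_size=5, vad_filter=True)
    return [
        {"start": seg.start, "end": seg.end, "text": seg.text.strip()}
        for seg in segments
    ]


# Usage (requires ffmpeg, faster-whisper, and a GPU):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v3", device="cuda", compute_type="float16")
#   rows = transcribe_call("call_0001.wav", model)
```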

Case study from our practice. A fintech company with 12,000 calls per day. Initial WER on Russian with banking vocabulary — 22% (Google STT). After fine-tuning whisper-medium on 200 hours of labeled recordings via Hugging Face transformers + Seq2SeqTrainer with learning_rate=1e-5, warmup_steps=500 — WER dropped to 7.3%. Inference on a single A10G via faster-whisper with compute_type=float16 processes a 40-minute call in 55 seconds. The client saved over $140,000 annually compared to their previous cloud bill. Contact us for a free pilot estimate to see similar savings on your data.

How to fine-tune Whisper on domain data?

When a general model underperforms, fine-tuning is the first tool. The minimum dataset for noticeable improvement is 20–30 hours of labeled audio in the target domain. Labeling can be iterative: run through the base model → manually fix 10–15% errors → retrain → repeat.

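from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",  # placeholder path for checkpoints
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,     # effective batch size of 32 per device
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,                         # mixed precision; requires a CUDA GPU
    predict_with_generate=True,        # decode with generate() during evaluation
    generation_max_length=225,
)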

Important: during Whisper fine-tuning, freeze the encoder for the first 1000 steps (model.freeze_encoder()); otherwise the acoustic features drift before the decoder adapts to the new vocabulary. Whisper is an attention-based encoder-decoder rather than a CTC model, so CTC decoding does not apply to it. We instead recommend beam search decoding with language model rescoring, which further reduces WER by 5–10% relative.

Model	WER (clean)	WER (noisy)	RTF (A10G)	Languages
Whisper large-v3	5.2%	27%	0.08	99
Wav2Vec2-XLSR-53	6.8%	32%	0.12	53
Google STT (cloud)	7.0%	28%	–	125
DeepSpeech 0.9.3	11.5%	41%	0.06	8

Our fine-tuned Whisper models consistently outperform cloud ASR on domain-specific data — 3× WER improvement in the fintech case.

Speech synthesis: How to choose a model for your task?

Model	Latency (TTFB)	Naturalness MOS	Cloning	Languages
XTTS v2	1.2–2.0 s	4.1–4.3	Yes, 3 s reference	17
StyleTTS2	0.3–0.6 s	4.0–4.2	Yes, requires adaptation	en, + fine-tune
Kokoro-82M	0.08–0.15 s	3.7–3.9	No	en, ja
Silero TTS	0.05–0.1 s	3.4–3.6	No	ru, en, de, etc.
Edge-TTS	~0.4 s (cloud)	4.0	No	100+

For interactive bots requiring TTFB < 300 ms — Silero or Kokoro. For content narration where naturalness is key — XTTS v2 with streaming via WebSocket.

Our process and deliverables

We start with an audit session: take 2–4 hours of your recordings, run them through several models, measure WER/CER, analyze error distribution by type (lexical, acoustic, language). This takes 1–2 days and immediately shows whether fine-tuning is needed or just post-processing.
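For reference, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is a minimal standard-library version; the function names and the example sentences are illustrative, and a real audit would also normalize punctuation and numerals before scoring.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.lower().replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.lower().replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution ("money" -> "many") and one deletion ("my") over 7 words.
example_wer = wer(
    "please transfer money to my savings account",
    "please transfer many to savings account",
)  # 2 / 7, about 0.286
```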

Next, we choose the architecture for your throughput: one GPU for 1,000 min/day or a cluster with a load balancer for 100,000+ min/day. Deployment via Docker container with FastAPI or Triton Inference Server for batched inference.
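A back-of-the-envelope sizing can be sketched from the real-time factor (RTF, compute seconds per second of audio). The function name and the 70% utilization figure are assumptions; the RTF is derived from the case study above (a 40-minute call processed in 55 seconds). The estimate ignores traffic peaks and the cost of diarization and post-processing.

```python
import math


def gpus_needed(audio_minutes_per_day: float, rtf: float, utilization: float = 0.7) -> int:
    """Estimate how many GPUs keep up with a daily audio volume.

    rtf: compute seconds needed per second of audio.
    utilization: assumed fraction of the day a GPU spends on useful work.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    compute_minutes = audio_minutes_per_day * rtf
    available_minutes_per_gpu = 24 * 60 * utilization
    return max(1, math.ceil(compute_minutes / available_minutes_per_gpu))


rtf = 55 / (40 * 60)                      # about 0.023
small_load = gpus_needed(1_000, rtf)      # 1
large_load = gpus_needed(100_000, rtf)    # 3
```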

What you get after engagement:

Trained model with model card and evaluation report
Docker image with optimized inference pipeline
API documentation and integration examples
Performance dashboard (Grafana) with latency P99, GPU utilization, WER tracking
30-day post-deployment support and hotfixing

Timelines depend on complexity:

Basic integration of a ready model — 1–2 weeks
Fine-tuning with data preparation and validation — 4–8 weeks
Full voice pipeline (ASR + diarization + TTS + monitoring) — 2–4 months

Project investments typically range from $20,000 to $80,000. Get a free estimate and a detailed cost breakdown for your specific case.

Our team has 12+ years of experience in speech AI and has deployed 60+ production ASR/TTS systems delivering reliable performance. Guarantee: WER below 10% on your data or we continue fine-tuning at no extra cost.

Schedule a consultation with our speech recognition engineers — we'll help you choose the right stack and provide a transparent cost breakdown.