## Implementation of speech recognition from audio files (Batch STT)

Batch STT processes pre-recorded files without latency requirements, allowing you to use the highest-quality models and squeeze out maximum accuracy. Typical tasks include archival processing of call center recordings, podcast transcription, and creating subtitles for video content.

### Batch pipeline architecture

```
Upload → S3/Local Storage → Queue (Celery/SQS) → Worker → STT → Post-Processing → Storage
```

Key design decisions:

- Slicing long files into 5–10 minute segments (improves accuracy)
- Parallel processing of multiple files
- Retry logic for failed tasks
- Storage of intermediate results

### Complete processing pipeline

```python
import os

from celery import Celery
from faster_whisper import WhisperModel
import ffmpeg

app = Celery(
    'batch_stt',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/1',
)

# Loaded once per worker process; int8_float16 roughly halves VRAM usage
# at a negligible accuracy cost
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")


def convert_to_wav(input_path: str) -> str:
    """Normalize any input container to 16 kHz mono PCM WAV."""
    output_path = input_path.rsplit('.', 1)[0] + '_converted.wav'
    ffmpeg.input(input_path).output(
        output_path,
        ar=16000,            # 16 kHz sample rate
        ac=1,                # mono
        acodec='pcm_s16le'   # 16-bit PCM
    ).overwrite_output().run(quiet=True)
    return output_path
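

# The design notes above call for slicing long files into 5-10 minute
# segments. A minimal sketch using the stdlib wave module (the function
# name and the 600-second default are assumptions; in production you
# would typically slice with ffmpeg instead):
import wave

def slice_wav(input_path: str, segment_seconds: int = 600) -> list:
    """Split a WAV file into consecutive chunks of at most segment_seconds."""
    paths = []
    with wave.open(input_path, 'rb') as src:
        frames_per_chunk = src.getframerate() * segment_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{input_path.rsplit('.', 1)[0]}_part{index:03d}.wav"
            with wave.open(out_path, 'wb') as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths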
@app.task(bind=True, max_retries=3, time_limit=3600)
def process_audio_file(self, file_path: str, options: dict = None):
    options = options or {}
    wav_path = None
    try:
        # Convert to the format expected by the model
        wav_path = convert_to_wav(file_path)

        segments, info = model.transcribe(
            wav_path,
            language=options.get('language'),  # None enables auto-detection
            vad_filter=True,
            word_timestamps=options.get('word_timestamps', False),
            beam_size=5
        )

        result = {
            "file": file_path,
            "language": info.language,
            "language_probability": info.language_probability,
            "duration": info.duration,
            "segments": []
        }

        for seg in segments:
            segment_data = {
                "start": round(seg.start, 3),
                "end": round(seg.end, 3),
                "text": seg.text.strip()
            }
            if options.get('word_timestamps'):
                segment_data["words"] = [
                    {"word": w.word, "start": w.start, "end": w.end,
                     "probability": w.probability}
                    for w in (seg.words or [])
                ]
            result["segments"].append(segment_data)

        return result
    except Exception as exc:
        # Back off 60 s, 120 s, 180 s before giving up
        raise self.retry(exc=exc, countdown=60 * (self.request.retries + 1))
    finally:
        # Remove the temporary WAV even when transcription fails
        if wav_path and os.path.exists(wav_path):
            os.unlink(wav_path)
```

### Supported Formats

We process any format FFmpeg can decode: MP3, WAV, FLAC, M4A, OGG, AAC, OPUS, MP4, MKV. Everything is normalized to WAV 16 kHz mono — the optimal format for ASR.

### Performance

| Hardware       | Model           | Speed       |
|----------------|-----------------|-------------|
| RTX 3080       | medium (int8)   | 6–8x RT     |
| RTX 4090       | large-v3 (int8) | 3–4x RT     |
| A10G           | large-v3 (int8) | 4–5x RT     |
| CPU (16 cores) | medium          | 0.3–0.5x RT |
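An "Nx RT" factor means N minutes of audio are processed per minute of wall-clock time, so expected processing time is simply the audio duration divided by the factor. A trivial helper (the name is an assumption) makes the table concrete:

```python
def processing_minutes(audio_minutes: float, rt_factor: float) -> float:
    """Wall-clock minutes needed at a given real-time (RT) factor."""
    return audio_minutes / rt_factor

# One hour of audio at the 3-4x RT of large-v3 on an RTX 4090:
print(processing_minutes(60, 4))  # 15.0
print(processing_minutes(60, 3))  # 20.0
```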
One hour of audio on an RTX 4090 with large-v3 takes roughly 15–20 minutes to process.

### Implementation Timeline

- Script for single files: 1 day
- Pipeline with queue and API: 3–5 days
- Full system with status dashboard: 1 week
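Subtitle creation, one of the typical tasks listed at the top, reduces to rendering the `segments` list returned by `process_audio_file` into SRT. A minimal sketch (the helper name is an assumption, not part of the pipeline above):

```python
def to_srt(segments: list) -> str:
    """Render segments [{'start', 'end', 'text'}, ...] as an SRT document."""
    def ts(seconds: float) -> str:
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}\n")
    return "\n".join(blocks)

print(to_srt([{"start": 0.0, "end": 2.5, "text": "Hello world"}]))
# 1
# 00:00:00,000 --> 00:00:02,500
# Hello world
```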