## Implementation of speech recognition from audio files (Batch STT)

Batch STT processes pre-recorded files without latency requirements, allowing you to use the highest-quality models and squeeze out maximum accuracy. Typical tasks include archival processing of call center recordings, podcast transcription, and creating subtitles for video content.

### Batch pipeline architecture

```
Upload → S3/Local Storage → Queue (Celery/SQS) → Worker → STT → Post-Processing → Storage
```

Key design decisions:

- Slicing long files into 5–10 minute segments (improves accuracy)
- Parallel processing of multiple files
- Retry logic for failed tasks
- Storage of intermediate results

### Complete processing pipeline

```python
import os

from celery import Celery
from faster_whisper import WhisperModel
import ffmpeg

app = Celery(
    'batch_stt',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/1',
)

# Loaded once per worker process; int8_float16 roughly halves VRAM usage
# at a negligible accuracy cost
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")


def convert_to_wav(input_path: str) -> str:
    """Normalize any input container to 16 kHz mono PCM WAV."""
    output_path = input_path.rsplit('.', 1)[0] + '_converted.wav'
    ffmpeg.input(input_path).output(
        output_path,
        ar=16000,            # 16 kHz sample rate
        ac=1,                # mono
        acodec='pcm_s16le'   # 16-bit PCM
    ).overwrite_output().run(quiet=True)
    return output_path
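

# The design notes above call for slicing long files into 5-10 minute
# segments. A minimal sketch using the stdlib wave module (the function
# name and the 600-second default are assumptions; in production you
# would typically slice with ffmpeg instead):
import wave

def slice_wav(input_path: str, segment_seconds: int = 600) -> list:
    """Split a WAV file into consecutive chunks of at most segment_seconds."""
    paths = []
    with wave.open(input_path, 'rb') as src:
        frames_per_chunk = src.getframerate() * segment_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{input_path.rsplit('.', 1)[0]}_part{index:03d}.wav"
            with wave.open(out_path, 'wb') as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths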
@app.task(bind=True, max_retries=3, time_limit=3600)
def process_audio_file(self, file_path: str, options: dict = None):
    options = options or {}
    wav_path = None
    try:
        # Convert to the format expected by the model
        wav_path = convert_to_wav(file_path)

        segments, info = model.transcribe(
            wav_path,
            language=options.get('language'),  # None enables auto-detection
            vad_filter=True,
            word_timestamps=options.get('word_timestamps', False),
            beam_size=5
        )

        result = {
            "file": file_path,
            "language": info.language,
            "language_probability": info.language_probability,
            "duration": info.duration,
            "segments": []
        }

        for seg in segments:
            segment_data = {
                "start": round(seg.start, 3),
                "end": round(seg.end, 3),
                "text": seg.text.strip()
            }
            if options.get('word_timestamps'):
                segment_data["words"] = [
                    {"word": w.word, "start": w.start, "end": w.end,
                     "probability": w.probability}
                    for w in (seg.words or [])
                ]
            result["segments"].append(segment_data)

        return result
    except Exception as exc:
        # Back off 60 s, 120 s, 180 s before giving up
        raise self.retry(exc=exc, countdown=60 * (self.request.retries + 1))
    finally:
        # Remove the temporary WAV even when transcription fails
        if wav_path and os.path.exists(wav_path):
            os.unlink(wav_path)
```

### Supported Formats

We process any format FFmpeg can decode: MP3, WAV, FLAC, M4A, OGG, AAC, OPUS, MP4, MKV. Everything is normalized to WAV 16 kHz mono — the optimal format for ASR.

### Performance

| Hardware       | Model           | Speed       |
|----------------|-----------------|-------------|
| RTX 3080       | medium (int8)   | 6–8x RT     |
| RTX 4090       | large-v3 (int8) | 3–4x RT     |
| A10G           | large-v3 (int8) | 4–5x RT     |
| CPU (16 cores) | medium          | 0.3–0.5x RT |
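An "Nx RT" factor means N minutes of audio are processed per minute of wall-clock time, so expected processing time is simply the audio duration divided by the factor. A trivial helper (the name is an assumption) makes the table concrete:

```python
def processing_minutes(audio_minutes: float, rt_factor: float) -> float:
    """Wall-clock minutes needed at a given real-time (RT) factor."""
    return audio_minutes / rt_factor

# One hour of audio at the 3-4x RT of large-v3 on an RTX 4090:
print(processing_minutes(60, 4))  # 15.0
print(processing_minutes(60, 3))  # 20.0
```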
One hour of audio on an RTX 4090 with large-v3 takes roughly 15–20 minutes to process.

### Implementation Timeline

- Script for single files: 1 day
- Pipeline with queue and API: 3–5 days
- Full system with status dashboard: 1 week
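Subtitle creation, one of the typical tasks listed at the top, reduces to rendering the `segments` list returned by `process_audio_file` into SRT. A minimal sketch (the helper name is an assumption, not part of the pipeline above):

```python
def to_srt(segments: list) -> str:
    """Render segments [{'start', 'end', 'text'}, ...] as an SRT document."""
    def ts(seconds: float) -> str:
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}\n")
    return "\n".join(blocks)

print(to_srt([{"start": 0.0, "end": 2.5, "text": "Hello world"}]))
# 1
# 00:00:00,000 --> 00:00:02,500
# Hello world
```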