# Implementation of automatic transcription of court hearings

Transcribing court hearings is a highly demanding task, with accuracy requirements above 98% (WER <2%). The difficulty comes from multiple speakers, interruptions, legal vocabulary, procedural formulas, and proper names. The transcript is a procedural document.

### System requirements

- WER <5% on legal vocabulary (after post-processing)
- Accurate speaker attribution (presiding judge, prosecutor, defense counsel, witness, defendant)
- Timestamps for each utterance
- Automatic normalization of numerals, dates, and references to articles of law
- Secure storage (data never leaves the secure perimeter)

### On-premise architecture
```python
# Fully local deployment, no cloud services
# Imports inferred from the calls below (faster-whisper and pyannote.audio)
from datetime import datetime

from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

# LegalTextNormalizer, _label_speakers and _get_duration are project-specific
# and are not shown here.


class CourtTranscriptionSystem:
    def __init__(self):
        # Whisper large-v3 fine-tuned on legal data
        self.stt = WhisperModel(
            "/models/whisper-legal-ru",
            device="cuda",
            compute_type="int8_float16"
        )
        # Diarization is mandatory
        self.diarizer = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1"
        )
        # Normalizer for legal texts
        self.normalizer = LegalTextNormalizer()

    async def transcribe_session(self, audio_path: str, participants: dict) -> dict:
        """
        participants maps diarization labels to participant names, e.g.
        {"SPEAKER_00": "Председатель Иванова И.И.", ...}
        """
        # 1. Transcribe with word-level timestamps
        segments, _ = self.stt.transcribe(
            audio_path,
            word_timestamps=True,
            language="ru",
            vad_filter=True,
            # Russian prompt that primes the model with courtroom roles
            initial_prompt="Судебное заседание. Председатель суда, прокурор, адвокат, подсудимый."
        )
        # 2. Diarization
        diarization = self.diarizer(audio_path)
        # 3. Map diarized speakers to participants
        labeled_transcript = self._label_speakers(
            list(segments), diarization, participants
        )
        # 4. Normalization, e.g. "сто пятьдесят вторая статья" → "ст. 152"
        for segment in labeled_transcript:
            segment["text"] = self.normalizer.normalize(segment["text"])
        return {
            # Timestamp of processing, taken from the system clock
            "session_date": datetime.now().isoformat(),
            "transcript": labeled_transcript,
            "metadata": {
                "audio_duration": self._get_duration(audio_path),
                "speaker_count": len(participants)
            }
        }
```

### Specialized dictionary for the court

```python
# Terms stay in Russian: they must match the language of the audio
LEGAL_VOCABULARY = [
    "апелляционное определение",
    "постановление о прекращении дела",
    "кассационная жалоба",
    "статья двести шестьдесят четвёртая",
    "часть первая статьи",
    "Уголовно-процессуальный кодекс",
    "Гражданский процессуальный кодекс",
    # ... several thousand more terms
]
```

### Export formats

- **DOCX** with a table: time | speaker | text
- **XML** for court electronic document management systems
- **PDF** with signature and hash for integrity

### Post-editing

The system is designed to **support the court clerk**, not replace them: it produces a draft with >90% accuracy, which the clerk then corrects and certifies.

### Implementation time

- Basic system: 4–6 weeks
- With additional training on legal data: +4–6 weeks
- Certification and integration into the State Automated System “Justice” (GAS Pravosudie): a separate project
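
The accuracy targets in this document are stated as word error rate (WER), which can be measured by comparing the machine draft with the transcript the clerk has corrected. The sketch below shows one way to compute it with a word-level edit distance. The function name `word_error_rate` and the simple lowercase, whitespace-based tokenization are assumptions for illustration; a production evaluation would likely also normalize punctuation and numerals first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


corrected = "суд постановил прекратить производство по делу"
draft = "суд постановил прекратить производство делу"
# One of six reference words is missing from the draft, so WER is 1/6
print(round(word_error_rate(corrected, draft), 3))
```

Tracking this figure per hearing would show whether a given deployment meets the <5% target on its own recordings.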
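
Speaker attribution depends on combining the recognizer's segments with the diarization output, a step the architecture code delegates to `_label_speakers` without showing it. A minimal sketch of one common approach, giving each segment the speaker turn with the largest time overlap, might look like the following. The function name `label_by_overlap`, the plain-tuple inputs, and the output keys are assumptions for illustration, not the system's actual implementation.

```python
def label_by_overlap(segments, turns, participants):
    """Attach a speaker to each transcript segment.

    segments: list of (start, end, text) tuples from the recognizer
    turns: list of (start, end, label) tuples from the diarizer
    participants: dict mapping diarization labels to participant names
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_label, best_overlap = None, 0.0
        for turn_start, turn_end, label in turns:
            # Length of the time interval shared by segment and turn
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        labeled.append({
            "start": seg_start,
            "end": seg_end,
            # Fall back to the raw label, or "UNKNOWN" if no turn overlaps
            "speaker": participants.get(best_label, best_label or "UNKNOWN"),
            "text": text.strip(),
        })
    return labeled


segments = [(0.0, 2.0, " Прошу встать. "), (2.0, 5.0, "Возражаю.")]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 5.0, "SPEAKER_01")]
names = {"SPEAKER_00": "Председатель Иванова И.И."}
for row in label_by_overlap(segments, turns, names):
    print(row["start"], row["speaker"], row["text"])
```

With pyannote, the `turns` list could be built from `diarization.itertracks(yield_label=True)`, and the segment tuples from the `start`, `end`, and `text` attributes of the faster-whisper segments. Segments that straddle a speaker change are the weak point of this approach; splitting them on word timestamps would be a natural refinement.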