Implementation of adding timestamps to recognized text (word-level timestamps). Word-level timestamps allow precise synchronization of text with audio. They are necessary for creating synchronized subtitles, karaoke effects, clickable transcriptions, and keyword search in video.

### Whisper with word-level timestamps

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

segments, _ = model.transcribe(
    "audio.wav",
    word_timestamps=True,
    language="ru"
)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.3f}s → {word.end:.3f}s] {word.word} (p={word.probability:.2f})")
```

Whisper timestamp accuracy: ±50–150 ms on clean audio.

### Export to formats

```python
def words_to_srt(words: list) -> str:
    """Each word becomes its own subtitle cue (for karaoke)."""
    srt = []
    for i, w in enumerate(words, 1):
        start = format_srt_time(w.start)
        end = format_srt_time(w.end)
        srt.append(f"{i}\n{start} --> {end}\n{w.word.strip()}\n")
    return "\n".join(srt)

def format_srt_time(seconds: float) -> str:
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```

### Cloud providers with word timestamps

- **Deepgram**: native, `timestamps=true` in parameters
- **Google STT**: `enable_word_time_offsets=True`
- **AWS Transcribe**: built-in by default
- **AssemblyAI**: `timestamps=True`

Label accuracy for Deepgram and Google is ±30–80 ms, better than Whisper. Timeframe: 0.5–1 day to connect to an existing STT pipeline.
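Per-word cues work for karaoke, but ordinary subtitles usually merge several words into one cue. A minimal sketch of such grouping, assuming word objects with `word`/`start`/`end` fields like Whisper's (the `Word` dataclass and the `max_gap`/`max_words` thresholds here are illustrative, not part of any library API): a new cue starts when the pause before a word exceeds a threshold or the current cue is full.

```python
from dataclasses import dataclass

@dataclass
class Word:
    # Minimal stand-in for a Whisper word object (illustrative only)
    word: str
    start: float
    end: float

def group_words(words, max_gap=0.6, max_words=7):
    """Merge word-level timestamps into multi-word subtitle cues.

    A new cue starts when the pause before a word exceeds max_gap
    seconds, or when the current cue already holds max_words words.
    Returns a list of (start, end, text) tuples.
    """
    cues, current = [], []
    for w in words:
        if current and (w.start - current[-1].end > max_gap or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)
    return [(c[0].start, c[-1].end, " ".join(w.word.strip() for w in c)) for c in cues]

words = [Word("Hello", 0.00, 0.40), Word("world", 0.45, 0.90),
         Word("Next", 2.10, 2.50), Word("cue", 2.55, 2.90)]
print(group_words(words))  # → [(0.0, 0.9, 'Hello world'), (2.1, 2.9, 'Next cue')]
```

The 1.2 s pause between "world" and "Next" exceeds `max_gap`, so the words split into two cues; each tuple can then be fed to the SRT formatter above.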
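The keyword-search use case mentioned at the top reduces to an inverted index from normalized words to their start times. A hedged sketch over plain `(text, start, end)` tuples (the tuple layout and the crude normalization are assumptions — adapt them to whichever STT output you use):

```python
from collections import defaultdict

def build_word_index(words):
    """Build an inverted index: normalized word -> list of start times (seconds).

    `words` is an iterable of (text, start, end) tuples; lowercasing plus
    punctuation stripping is a simplistic placeholder for real normalization.
    """
    index = defaultdict(list)
    for text, start, _end in words:
        key = text.strip().lower().strip(".,!?")
        if key:
            index[key].append(start)
    return dict(index)

# Jump points for every occurrence of a keyword in the video:
words = [("Hello", 0.00, 0.40), ("world,", 0.45, 0.90), ("hello", 2.10, 2.50)]
index = build_word_index(words)
print(index["hello"])  # → [0.0, 2.1]
```

A player UI can seek to each returned start time, which is exactly what ±30–80 ms label accuracy is good enough for.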