Word-Level Timestamps in Recognized Text Implementation

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.
Showing 1 of 1 servicesAll 1566 services
Word-Level Timestamps in Recognized Text Implementation
Simple
from 1 business day to 3 business days
FAQ
AI Development Areas
AI Solution Development Stages
Latest works
  • image_website-b2b-advance_0.png
    B2B ADVANCE company website development
    1212
  • image_web-applications_feedme_466_0.webp
    Development of a web application for FEEDME
    1161
  • image_websites_belfingroup_462_0.webp
    Website development for BELFINGROUP
    852
  • image_ecommerce_furnoro_435_0.webp
    Development of an online store for the company FURNORO
    1041
  • image_logo-advance_0.png
    B2B Advance company logo design
    561
  • image_crm_enviok_479_0.webp
    Development of a web application for Enviok
    822

Implementation of adding timestamps to recognized text (Word-level Timestamps). Word-level timestamps allow for precise synchronization of text with audio. These are necessary for creating synchronized subtitles, karaoke effects, clickable transcriptions, and video keyword searches. ### Whisper with word-level timestamps

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

segments, _ = model.transcribe(
    "audio.wav",
    word_timestamps=True,
    language="ru"
)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.3f}s → {word.end:.3f}s] {word.word} (p={word.probability:.2f})")
```Whisper timestamp accuracy: ±50–150 ms on clean audio. ### Export to formats```python
def words_to_srt(words: list) -> str:
    """Каждое слово — отдельный субтитр (для karaoke)"""
    srt = []
    for i, w in enumerate(words, 1):
        start = format_srt_time(w.start)
        end = format_srt_time(w.end)
        srt.append(f"{i}\n{start} --> {end}\n{w.word.strip()}\n")
    return "\n".join(srt)

def format_srt_time(seconds: float) -> str:
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```### Cloud providers with word timestamps - **Deepgram**: native, `timestamps=true` in parameters - **Google STT**: `enable_word_time_offsets=True` - **AWS Transcribe**: built-in by default - **AssemblyAI**: `timestamps=True` The accuracy of labels for Deepgram and Google is ±30–80 ms, better than Whisper. Timeframe: 0.5–1 day to connect to an existing STT pipeline.