AI Automatic Subtitle Generation for Video in Mobile App

Implementing AI Auto-Subtitle Generation for Video in a Mobile App

AI subtitles here means Whisper or a comparable speech-to-text model. On the surface the task is simple: send a video, get text with timecodes. The mobile context raises several non-trivial questions, though: whether to transcribe on-device or via an API, how to render subtitles over the video, and how to let the user edit the result.

Whisper on-device vs API

Whisper via the OpenAI API is the simplest path. Send an audio file (up to 25 MB) and get back JSON with segments (timecodes + text):

POST https://api.openai.com/v1/audio/transcriptions (multipart/form-data)
fields: file=<audio file>, model=whisper-1, response_format=verbose_json, timestamp_granularities[]=word

verbose_json with word granularity returns a timecode for every word, which is what subtitle sync needs. Processing time: roughly 2–4 seconds for a 10-second clip and 10–20 seconds for a one-minute video.
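The endpoint expects multipart/form-data, so the fields from the request above have to be encoded as separate parts. A minimal sketch of building such a body with only the Kotlin/Java standard library follows; `buildMultipartBody` and `transcriptionFields` are hypothetical helper names, and in a real app an HTTP client such as OkHttp would normally do this for you.

```kotlin
import java.io.ByteArrayOutputStream

// Builds a multipart/form-data body by hand (standard library only).
// `fields` are plain text parts; the audio goes into a part named "file".
fun buildMultipartBody(
    boundary: String,
    fields: List<Pair<String, String>>,
    fileName: String,
    fileContentType: String,
    fileBytes: ByteArray
): ByteArray {
    val out = ByteArrayOutputStream()
    fun write(text: String) = out.write(text.toByteArray(Charsets.UTF_8))

    for ((name, value) in fields) {
        write("--$boundary\r\n")
        write("Content-Disposition: form-data; name=\"$name\"\r\n\r\n")
        write("$value\r\n")
    }
    write("--$boundary\r\n")
    write("Content-Disposition: form-data; name=\"file\"; filename=\"$fileName\"\r\n")
    write("Content-Type: $fileContentType\r\n\r\n")
    out.write(fileBytes)
    // Closing delimiter ends the body
    write("\r\n--$boundary--\r\n")
    return out.toByteArray()
}

// The form fields from the request above (hypothetical helper).
fun transcriptionFields(): List<Pair<String, String>> = listOf(
    "model" to "whisper-1",
    "response_format" to "verbose_json",
    "timestamp_granularities[]" to "word"
)
```

The resulting bytes are sent with the headers `Content-Type: multipart/form-data; boundary=<boundary>` and `Authorization: Bearer <API key>`. The boundary must be a string that cannot occur in the audio data, for example a random UUID.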

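On-device Whisper is realistic on iOS 16+ via WhisperKit (built on swift-transformers). whisper-small has 244M parameters, which is about 244 MB with 8-bit weights and roughly double that at 16 bits; it runs at about 0.3× real-time speed on an iPhone 14, i.e. one minute of audio takes about three minutes to process. whisper-tiny is about 77 MB and runs at 0.7× real time, but its accuracy is noticeably worse. Russian is recognized worse than English.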
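The real-time figures translate into waiting time by a simple division, which is handy when deciding whether on-device transcription is acceptable for a given clip length. A small sketch, where `estimateProcessingSeconds` is a hypothetical helper and the factor is whatever you measure on the target device:

```kotlin
// Real-time factor: seconds of audio processed per second of wall-clock time.
// A factor of 0.3 means the model runs slower than playback.
fun estimateProcessingSeconds(audioSeconds: Double, realTimeFactor: Double): Double {
    require(realTimeFactor > 0.0) { "realTimeFactor must be positive" }
    return audioSeconds / realTimeFactor
}
```

With the numbers above, `estimateProcessingSeconds(60.0, 0.3)` comes out at about 200 seconds (a little over three minutes), and the same minute at 0.7× takes about 86 seconds.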

On Android the options are whisper.cpp via JNI or openai-whisper-tflite, but both are trickier to build. For most apps the API is simpler.

Extract audio from video on client

Before sending to Whisper, extract the audio track: uploading the whole video is redundant.

// iOS: AVAssetExportSession for audio extraction
import AVFoundation

enum SubtitleError: Error { case exportFailed }

func extractAudio(from videoURL: URL) async throws -> URL {
    let asset = AVURLAsset(url: videoURL)
    guard let exportSession = AVAssetExportSession(
        asset: asset, presetName: AVAssetExportPresetAppleM4A
    ) else { throw SubtitleError.exportFailed }

    let outputURL = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString + ".m4a")
    exportSession.outputURL = outputURL
    exportSession.outputFileType = .m4a
    await exportSession.export()

    // export() does not throw; failure is reported through status and error
    guard exportSession.status == .completed else {
        throw exportSession.error ?? SubtitleError.exportFailed
    }
    return outputURL
}

An m4a/mp3 track is typically 3–5× smaller than the original video, which means a faster upload, less mobile traffic, and more headroom under the 25 MB limit.
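Whether an extracted track fits under the upload limit can be estimated from duration and bitrate before exporting anything. The sketch below assumes a constant bitrate and reads the 25 MB limit conservatively as 25,000,000 bytes; both function names are hypothetical.

```kotlin
// Approximate size of an audio file encoded at a constant bitrate.
fun estimateAudioBytes(durationSeconds: Double, bitrateKbps: Int): Long =
    (durationSeconds * bitrateKbps * 1000 / 8).toLong()

// Assumption: the limit is treated as 25 * 1_000_000 bytes (the conservative reading).
fun fitsUploadLimit(
    durationSeconds: Double,
    bitrateKbps: Int,
    limitBytes: Long = 25_000_000L
): Boolean = estimateAudioBytes(durationSeconds, bitrateKbps) <= limitBytes
```

For example, 10 minutes at 128 kbps is about 9.6 MB and fits, while 30 minutes at the same bitrate is about 28.8 MB and does not. Dropping to 64 kbps brings those 30 minutes down to about 14.4 MB. Longer recordings have to be split into chunks before upload.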

Subtitle segments: on-client processing

With word granularity, Whisper's verbose_json returns segments (start, end, text) plus a word list with per-word timing. Group the words into subtitles of 5–7 words for readability:

// Android: split into subtitles
// One word-level entry from the verbose_json response
data class WhisperWord(val word: String, val start: Double, val end: Double)
data class SubtitleCue(val start: Double, val end: Double, val text: String)

fun segmentsToSubtitles(words: List<WhisperWord>, maxWords: Int = 7): List<SubtitleCue> {
    val cues = mutableListOf<SubtitleCue>()
    val chunk = mutableListOf<WhisperWord>()

    // Close the current chunk and emit it as one cue
    fun flush() {
        if (chunk.isEmpty()) return
        cues.add(SubtitleCue(
            start = chunk.first().start,
            end = chunk.last().end,
            text = chunk.joinToString(" ") { it.word.trim() }
        ))
        chunk.clear()
    }

    for (word in words) {
        chunk.add(word)
        val token = word.word.trim()
        // Break on the word limit or at sentence-ending punctuation
        val endsSentence = token.endsWith(".") || token.endsWith("!") || token.endsWith("?")
        if (chunk.size >= maxWords || endsSentence) flush()
    }
    flush() // remaining words
    return cues
}
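When the cues need to leave the app as a file, for instance as the subs.srt input for burned-in rendering, they can be serialized to SRT. The following is a minimal sketch: `formatSrtTime` and `toSrt` are hypothetical helper names, `SubtitleCue` is repeated only to keep the block self-contained, and times are assumed to be non-negative seconds.

```kotlin
import java.util.Locale

data class SubtitleCue(val start: Double, val end: Double, val text: String)

// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
fun formatSrtTime(seconds: Double): String {
    val totalMs = Math.round(seconds * 1000)
    val ms = totalMs % 1000
    val s = (totalMs / 1000) % 60
    val m = (totalMs / 60_000) % 60
    val h = totalMs / 3_600_000
    return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", h, m, s, ms)
}

// Serializes cues as SRT blocks: 1-based index, time range, text, blank line.
fun toSrt(cues: List<SubtitleCue>): String =
    cues.withIndex().joinToString("\n") { (i, cue) ->
        "${i + 1}\n${formatSrtTime(cue.start)} --> ${formatSrtTime(cue.end)}\n${cue.text}\n"
    }
```

Locale.ROOT keeps the digits ASCII regardless of the device locale, which matters because SRT parsers expect plain Western digits.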

Render subtitles over video

Two approaches:

Overlay during playback: a UILabel/TextView over AVPlayerLayer/ExoPlayer. Update the text on a timer from player.currentTime(). Simple, and it doesn't modify the source file.

Burned-in subtitles: the FFmpeg subtitles filter renders the subtitles into the video pixels, so they are permanently visible in any player:

ffmpeg -i input.mp4 -vf "subtitles=subs.srt:force_style='FontSize=20,PrimaryColour=&HFFFFFF'" output.mp4

Burned-in is better for sharing, overlay is better for editing.
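For the overlay approach, each timer tick needs to answer one question: which cue, if any, is visible at the current playback position? One possible sketch uses a binary search and assumes the cues are sorted by start time and do not overlap. `activeCue` is a hypothetical helper, and `SubtitleCue` is repeated to keep the block self-contained.

```kotlin
data class SubtitleCue(val start: Double, val end: Double, val text: String)

// Returns the cue visible at `positionSec`, or null when no cue covers that time.
// Assumes `cues` is sorted by start time and cues do not overlap.
// The interval is half-open: start is inclusive, end is exclusive.
fun activeCue(cues: List<SubtitleCue>, positionSec: Double): SubtitleCue? {
    var low = 0
    var high = cues.size - 1
    while (low <= high) {
        val mid = (low + high) ushr 1
        val cue = cues[mid]
        when {
            positionSec < cue.start -> high = mid - 1
            positionSec >= cue.end -> low = mid + 1
            else -> return cue
        }
    }
    return null
}
```

The overlay view then shows `activeCue(cues, position)?.text` and hides itself when the result is null. A linear scan works just as well for short clips; the binary search only pays off on long videos with many cues.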

Edit corrections

Whisper makes mistakes, so let the user correct them:

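// iOS: editable subtitle cues (SwiftUI)
import SwiftUI

struct SubtitleCue { var start: Double; var end: Double; var text: String }

struct SubtitleEditor: View {
    @State private var subtitles: [SubtitleCue] = []
    @State private var editingIndex: Int? = nil
    var body: some View {
        // Guard against a stale index before binding to the cue text
        if let index = editingIndex, subtitles.indices.contains(index) {
            TextField("Edit subtitle", text: $subtitles[index].text)
        }
    }
}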

Store the edits locally or sync them to the backend.
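On Android, where the cues above are immutable data classes, a correction can be modelled as producing a new list rather than mutating the old one, which makes it easy to keep the original transcript for undo or sync. A minimal sketch, with `updateCueText` as a hypothetical helper and `SubtitleCue` repeated for self-containment:

```kotlin
data class SubtitleCue(val start: Double, val end: Double, val text: String)

// Returns a new list with the text of the cue at `index` replaced.
// Timing is left untouched; an out-of-range index returns the list unchanged.
fun updateCueText(cues: List<SubtitleCue>, index: Int, newText: String): List<SubtitleCue> {
    if (index !in cues.indices) return cues
    return cues.mapIndexed { i, cue ->
        if (i == index) cue.copy(text = newText.trim()) else cue
    }
}
```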

Timeline

Whisper API integration + audio extraction + subtitle rendering: 4–6 days. With the on-device option, user editing, and burned-in export: 2–3 weeks.