Whisper API Transcription Integration for Mobile App

TRUETECH develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience publishing mobile applications in popular markets such as Google Play, the App Store, Amazon Appstore, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with; each may have its own features and functionality tailored to the client's specific needs and goals.

Whisper API Integration for Transcription in Mobile Applications

Whisper is not just "send audio, get text". The API has specific limitations that require client-side preparation: a 25 MB file cap, a fixed set of supported formats, no streaming, and a synchronous response. If these details are not accounted for at the architecture stage, the integration turns into a series of hotfixes.

Limits and How to Work Around Them

25 MB is a hard limit on the POST /v1/audio/transcriptions endpoint. One minute of 128 kbps MP3 is roughly 1 MB, so the limit works out to about 25 minutes of audio. That is fine for most voice notes, but not for meeting recordings.
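As a sanity check on that arithmetic, here is the back-of-the-envelope math as code (a planning helper of our own, not an SDK call):

```kotlin
// Seconds of audio at a constant bitrate that fit under the 25 MB
// request limit. Rough planning math: real encoders vary and the
// container adds some overhead, so leave a safety margin.
val MAX_UPLOAD_BYTES = 25L * 1024 * 1024

fun maxChunkSeconds(bitrateKbps: Int): Long =
    MAX_UPLOAD_BYTES / (bitrateKbps * 1000L / 8)

fun main() {
    // 128 kbps MP3 comes out to about 27 minutes, consistent with the
    // rough "~25 minutes" figure above once you leave a margin.
    println(maxChunkSeconds(128) / 60)  // minutes
}
```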

Solution: split the file on the client. On iOS, use AVAssetExportSession with its timeRange property to export a slice. On Android, MediaExtractor + MediaMuxer give precise slicing without re-encoding, provided the source codec is already compatible (AAC in an MP4 container usually is).
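The MediaExtractor + MediaMuxer approach can be sketched roughly as follows (Android-only sketch under stated assumptions; it assumes track 0 is the audio track, which real code should verify by MIME type):

```kotlin
import android.media.*
import java.nio.ByteBuffer

// Copy one time range of an already-compatible track (e.g. AAC in MP4)
// into a new file without re-encoding. Assumes track 0 is audio.
fun sliceTrack(srcPath: String, dstPath: String, startUs: Long, endUs: Long) {
    val extractor = MediaExtractor().apply { setDataSource(srcPath) }
    val format = extractor.getTrackFormat(0)
    extractor.selectTrack(0)
    extractor.seekTo(startUs, MediaExtractor.SEEK_TO_CLOSEST_SYNC)

    val muxer = MediaMuxer(dstPath, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4)
    val track = muxer.addTrack(format)
    muxer.start()

    val buffer = ByteBuffer.allocate(1 shl 20)
    val info = MediaCodec.BufferInfo()
    while (true) {
        val size = extractor.readSampleData(buffer, 0)
        if (size < 0 || extractor.sampleTime > endUs) break
        // Rebase timestamps so each chunk starts at zero
        info.set(0, size, extractor.sampleTime - startUs, extractor.sampleFlags)
        muxer.writeSampleData(track, buffer, info)
        extractor.advance()
    }
    muxer.stop(); muxer.release(); extractor.release()
}
```

Chunk boundaries should land on sync samples; SEEK_TO_CLOSEST_SYNC handles the start, and for AAC audio every frame is effectively a sync point.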

Codecs. The API accepts mp3, mp4, mpeg, mpga, m4a, wav, and webm. Important: that list names containers, but the codec inside matters too. An .m4a with AAC passes; an .m4a with ALAC does not, and you get a 400. On iOS, AVAssetExportSession with AVAssetExportPresetAppleM4A always produces AAC. On Android, it is safer to decode via MediaCodec to PCM and wrap the result in a WAV container if you are unsure about the source.
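A cheap pre-flight guard based on that list can reject obviously unsupported files before paying for an upload that returns a 400 (our own helper; note it checks only the container name, so an .m4a with ALAC inside will still slip through):

```kotlin
// Containers the endpoint accepts, per the list above. This checks only
// the file extension; the codec inside the container still matters.
val SUPPORTED_CONTAINERS = setOf("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm")

fun isUploadable(fileName: String): Boolean =
    fileName.substringAfterLast('.', "").lowercase() in SUPPORTED_CONTAINERS

fun main() {
    println(isUploadable("note.m4a"))   // true
    println(isUploadable("note.aiff"))  // false
}
```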

Language. The language parameter takes ISO 639-1 codes (ru, en, uk) and both speeds up transcription and reduces errors. Without it, Whisper spends time detecting the language and sometimes gets it wrong on short fragments.

iOS Implementation (Swift)

struct TranscriptionResponse: Decodable {
    let text: String
}

struct WhisperService {
    private let apiKey: String
    private let session = URLSession.shared

    func transcribe(audioURL: URL, language: String = "ru") async throws -> String {
        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/transcriptions")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")

        let boundary = UUID().uuidString
        request.setValue("multipart/form-data; boundary=\(boundary)", forHTTPHeaderField: "Content-Type")

        var body = Data()
        // File part. Data(contentsOf:) loads the whole file into memory;
        // acceptable for short clips, use a streamed upload for large files.
        body.append("--\(boundary)\r\n".data(using: .utf8)!)
        body.append("Content-Disposition: form-data; name=\"file\"; filename=\"audio.m4a\"\r\n".data(using: .utf8)!)
        body.append("Content-Type: audio/m4a\r\n\r\n".data(using: .utf8)!)
        body.append(try Data(contentsOf: audioURL))
        body.append("\r\n".data(using: .utf8)!)
        // Model and language parts
        body.append("--\(boundary)\r\n".data(using: .utf8)!)
        body.append("Content-Disposition: form-data; name=\"model\"\r\n\r\nwhisper-1\r\n".data(using: .utf8)!)
        body.append("--\(boundary)\r\n".data(using: .utf8)!)
        body.append("Content-Disposition: form-data; name=\"language\"\r\n\r\n\(language)\r\n".data(using: .utf8)!)
        body.append("--\(boundary)--\r\n".data(using: .utf8)!)

        request.httpBody = body
        let (data, response) = try await session.data(for: request)
        // Surface HTTP errors instead of failing on JSON decoding
        guard let http = response as? HTTPURLResponse, http.statusCode == 200 else {
            throw URLError(.badServerResponse)
        }
        return try JSONDecoder().decode(TranscriptionResponse.self, from: data).text
    }
}

For files larger than 25 MB: before calling transcribe, split the audio with a helper like AudioChunker.split(url:maxBytes:), get an array of chunk URLs, run transcribe on them in parallel via TaskGroup, and merge the results in index order.

Android Implementation (Kotlin)

suspend fun transcribe(file: File, language: String = "ru"): String {
    // In real code, reuse a single OkHttpClient for the whole app.
    val client = OkHttpClient.Builder()
        .readTimeout(120, TimeUnit.SECONDS)
        .build()

    val requestBody = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        // asRequestBody streams the file instead of loading it into memory
        .addFormDataPart("file", file.name, file.asRequestBody("audio/mp4".toMediaType()))
        .addFormDataPart("model", "whisper-1")
        .addFormDataPart("language", language)
        .build()

    val request = Request.Builder()
        .url("https://api.openai.com/v1/audio/transcriptions")
        .header("Authorization", "Bearer $apiKey")
        .post(requestBody)
        .build()

    return withContext(Dispatchers.IO) {
        client.newCall(request).execute().use { response ->
            if (!response.isSuccessful) error("Whisper API error ${response.code}")
            val json = response.body!!.string()
            JSONObject(json).getString("text")
        }
    }
}

Note: set readTimeout to at least 120 seconds. Whisper responds slowly on long files, and OkHttp's default of 10 seconds practically guarantees a SocketTimeoutException.
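The chunk-merge step described for iOS has a direct Android counterpart: with kotlinx.coroutines, awaitAll() returns results in launch order, so chunks reassemble correctly even when later chunks finish first. A sketch, where transcribeChunk stands in for the transcribe function above:

```kotlin
import kotlinx.coroutines.*

// Transcribe chunks in parallel and merge in index order.
// awaitAll() preserves launch order regardless of which
// network call completes first.
suspend fun transcribeAll(
    chunks: List<String>,
    transcribeChunk: suspend (String) -> String
): String = coroutineScope {
    chunks.map { chunk -> async { transcribeChunk(chunk) } }
        .awaitAll()
        .joinToString(" ")
}
```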

Parameters Often Ignored

response_format: verbose_json returns not just the text but also segments with start, end, and text fields. You need it for audio-text sync, time-based search, and subtitles.
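For the subtitle case, the segment list maps almost directly onto SRT. A sketch with our own Segment type mirroring the start/end/text fields verbose_json returns:

```kotlin
// Minimal verbose_json-segments-to-SRT converter. Segment is our own
// type; its field names mirror the keys in the API response.
data class Segment(val start: Double, val end: Double, val text: String)

// SRT timestamps are HH:MM:SS,mmm
fun srtTime(seconds: Double): String {
    val ms = (seconds * 1000).toLong()
    return "%02d:%02d:%02d,%03d".format(ms / 3_600_000, ms / 60_000 % 60, ms / 1000 % 60, ms % 1000)
}

fun toSrt(segments: List<Segment>): String =
    segments.mapIndexed { i, s ->
        "${i + 1}\n${srtTime(s.start)} --> ${srtTime(s.end)}\n${s.text.trim()}"
    }.joinToString("\n\n")

fun main() {
    println(toSrt(listOf(Segment(0.0, 2.5, " Hello "), Segment(2.5, 4.0, "world"))))
}
```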

prompt accepts up to 224 tokens of context and hints the model about style and domain vocabulary. Pass domain terminology: "spec, MVP, backlog, Jira" for IT meetings, "ECG, BP, medical history" for medicine. It noticeably reduces errors on specialized terms.

temperature: 0 gives deterministic output, which is better for production than the default.

Common Integration Mistakes

Loading the whole file into memory before sending (Data(contentsOf:) on iOS) causes OOM on a 100 MB file on budget devices. Use file.asRequestBody() in OkHttp, or a streamed upload on iOS via URLSession.uploadTask(withStreamedRequest:).

No retry logic. The Whisper API periodically returns 503 Service Unavailable under load. Exponential backoff with three attempts covers 99% of these cases.
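A minimal retry wrapper along those lines (our own helper; a production version should retry only on transient failures such as 5xx responses and IO errors, since a 400 will fail identically every time):

```kotlin
// Retry with exponential backoff: 1 s, 2 s, 4 s between attempts
// at the default baseDelayMs. Rethrows the last failure once all
// attempts are exhausted.
fun <T> withRetry(attempts: Int = 3, baseDelayMs: Long = 1000, block: () -> T): T {
    var last: Exception? = null
    repeat(attempts) { i ->
        try {
            return block()
        } catch (e: Exception) {
            last = e
            if (i < attempts - 1) Thread.sleep(baseDelayMs shl i)
        }
    }
    throw last!!
}
```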

Storing the API key in code or BuildConfig. The key should be delivered via your backend; a mobile client should not have direct OpenAI API access in production.

Timeline and Process

A basic Whisper integration (record → transcribe → display text) on one platform takes 3–5 days. Adding chunking, verbose_json timestamps, retry logic, and background processing via WorkManager/BackgroundTasks takes another 5–8 days. Multi-language support and an audio-text sync UI are a separate phase.