Whisper API Integration for Transcription in Mobile Applications
Whisper is not just "send audio, get text". The API has specific limitations that require client-side preparation: 25 MB per file, a specific set of supported formats, no streaming, and a synchronous response. If these constraints aren't accounted for at the architecture stage, the integration turns into a series of hotfixes.
Limits and How to Work Around Them
25 MB is a hard limit on the POST /v1/audio/transcriptions endpoint. One minute of 128 kbps MP3 is roughly 1 MB, so the limit works out to about 25 minutes of audio. That's fine for most voice notes, but not for meeting recordings.
Solution: split on the client. On iOS, use AVAssetExportSession with a time range set via its timeRange property. On Android, MediaExtractor + MediaMuxer give precise slicing without re-encoding, provided the source codec is already compatible (AAC in MP4 usually is).
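The Android approach above can be sketched as follows. This is a minimal, not production-hardened sketch: it copies the samples between two timestamps from an MP4/M4A file into a new container without re-encoding, assuming the track is already AAC; error handling is omitted for brevity.

```kotlin
import android.media.MediaCodec
import android.media.MediaExtractor
import android.media.MediaFormat
import android.media.MediaMuxer
import java.nio.ByteBuffer

fun sliceAudio(srcPath: String, dstPath: String, startUs: Long, endUs: Long) {
    val extractor = MediaExtractor().apply { setDataSource(srcPath) }
    // Find and select the first audio track
    val trackIndex = (0 until extractor.trackCount).first {
        extractor.getTrackFormat(it).getString(MediaFormat.KEY_MIME)!!.startsWith("audio/")
    }
    extractor.selectTrack(trackIndex)

    val muxer = MediaMuxer(dstPath, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4)
    val dstTrack = muxer.addTrack(extractor.getTrackFormat(trackIndex))
    muxer.start()

    // Jump to the nearest sync sample before the requested start
    extractor.seekTo(startUs, MediaExtractor.SEEK_TO_CLOSEST_SYNC)
    val buffer = ByteBuffer.allocate(1 shl 20)
    val info = MediaCodec.BufferInfo()
    while (true) {
        val size = extractor.readSampleData(buffer, 0)
        if (size < 0 || extractor.sampleTime > endUs) break
        // Rebase timestamps so the chunk starts at zero
        info.set(0, size, extractor.sampleTime - startUs, extractor.sampleFlags)
        muxer.writeSampleData(dstTrack, buffer, info)
        extractor.advance()
    }
    muxer.stop(); muxer.release(); extractor.release()
}
```

Because no decoding happens, slicing even an hour-long file takes a fraction of a second; the trade-off is that cuts land on sync-sample boundaries rather than exact timestamps.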
Codec. The API accepts mp3, mp4, mpeg, mpga, m4a, wav, and webm. Important nuance: what matters is the codec inside, not just the container extension. An .m4a with AAC passes; an .m4a with ALAC does not — you get a 400. On iOS, AVAssetExportSession with AVAssetExportPresetAppleM4A always produces AAC. On Android, if you're unsure about the source, it's safer to decode via MediaCodec to PCM and wrap it in a WAV.
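Before deciding whether to re-encode on Android, you can inspect the actual codec with MediaExtractor. A small sketch (the function name is illustrative):

```kotlin
import android.media.MediaExtractor
import android.media.MediaFormat

// Returns true if the file's audio track is not AAC and should be
// converted before uploading to the transcription endpoint.
fun needsReencoding(path: String): Boolean {
    val extractor = MediaExtractor()
    return try {
        extractor.setDataSource(path)
        val mime = (0 until extractor.trackCount)
            .map { extractor.getTrackFormat(it) }
            .firstOrNull { it.getString(MediaFormat.KEY_MIME)?.startsWith("audio/") == true }
            ?.getString(MediaFormat.KEY_MIME)
        // MIMETYPE_AUDIO_AAC is "audio/mp4a-latm"; ALAC, FLAC etc. won't match
        mime != MediaFormat.MIMETYPE_AUDIO_AAC
    } finally {
        extractor.release()
    }
}
```

This check is cheap (it only reads container metadata), so it's reasonable to run it on every file before upload.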
Language. The language parameter in ISO-639-1 format (ru, en, uk) speeds up transcription and reduces errors. Without it, Whisper spends time on language detection and sometimes guesses wrong on short fragments.
iOS Implementation (Swift)
struct TranscriptionResponse: Decodable {
    let text: String
}

struct WhisperService {
    private let apiKey: String
    private let session = URLSession.shared

    func transcribe(audioURL: URL, language: String = "ru") async throws -> String {
        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/transcriptions")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")

        let boundary = UUID().uuidString
        request.setValue("multipart/form-data; boundary=\(boundary)", forHTTPHeaderField: "Content-Type")

        var body = Data()
        // File part
        body.append("--\(boundary)\r\n".data(using: .utf8)!)
        body.append("Content-Disposition: form-data; name=\"file\"; filename=\"audio.m4a\"\r\n".data(using: .utf8)!)
        body.append("Content-Type: audio/m4a\r\n\r\n".data(using: .utf8)!)
        body.append(try Data(contentsOf: audioURL))
        body.append("\r\n".data(using: .utf8)!)
        // Model and language parts
        body.append("--\(boundary)\r\n".data(using: .utf8)!)
        body.append("Content-Disposition: form-data; name=\"model\"\r\n\r\nwhisper-1\r\n".data(using: .utf8)!)
        body.append("--\(boundary)\r\n".data(using: .utf8)!)
        body.append("Content-Disposition: form-data; name=\"language\"\r\n\r\n\(language)\r\n".data(using: .utf8)!)
        body.append("--\(boundary)--\r\n".data(using: .utf8)!)
        request.httpBody = body

        let (data, response) = try await session.data(for: request)
        // Fail fast on non-2xx instead of feeding an error payload to the decoder
        guard let http = response as? HTTPURLResponse, (200..<300).contains(http.statusCode) else {
            throw URLError(.badServerResponse)
        }
        return try JSONDecoder().decode(TranscriptionResponse.self, from: data).text
    }
}
For files larger than 25 MB, split before calling transcribe: run AudioChunker.split(url:maxBytes:), get an array of URLs, transcribe them in parallel via TaskGroup, and merge the results in index order.
Android Implementation (Kotlin)
suspend fun transcribe(file: File, language: String = "ru"): String {
    // In production, reuse a single OkHttpClient instead of building one per call
    val client = OkHttpClient.Builder()
        .readTimeout(120, TimeUnit.SECONDS)
        .build()
    val requestBody = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("file", file.name, file.asRequestBody("audio/mp4".toMediaType()))
        .addFormDataPart("model", "whisper-1")
        .addFormDataPart("language", language)
        .build()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/audio/transcriptions")
        .header("Authorization", "Bearer $apiKey")
        .post(requestBody)
        .build()
    return withContext(Dispatchers.IO) {
        client.newCall(request).execute().use { response ->
            if (!response.isSuccessful) throw IOException("Whisper API error: ${response.code}")
            JSONObject(response.body!!.string()).getString("text")
        }
    }
}
Note: readTimeout should be at least 120 seconds. Whisper responds slowly on long files, and OkHttp's default of 10 seconds guarantees a SocketTimeoutException.
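On Android, the chunk-parallel-merge workflow described for iOS maps naturally onto coroutines. A sketch, assuming the `transcribe` function above and a list of chunk files produced by a client-side splitter:

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import java.io.File

// Transcribe pre-split chunks concurrently; awaitAll preserves the
// original list order, so the merged text stays in chronological order.
suspend fun transcribeChunks(chunks: List<File>, language: String = "ru"): String =
    coroutineScope {
        chunks.map { chunk -> async { transcribe(chunk, language) } }
            .awaitAll()
            .joinToString(" ")
    }
```

If a recording splits into many chunks, consider capping concurrency (e.g. with a Semaphore) to stay under the API's rate limits.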
Parameters Often Ignored
response_format: verbose_json returns not just the text but an array of segments, each with start, end, and text. You need it for audio-text synchronization, time-based search, and subtitles.
prompt — up to 224 tokens of context that hints the model about style and domain vocabulary. Pass domain terminology: "spec, MVP, backlog, Jira" for IT meetings, "ECG, BP, anamnesis" for medicine. It genuinely reduces errors on specialized terms.
temperature: 0 gives deterministic output — better for production than the default.
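One concrete use of verbose_json is generating subtitles. A sketch that converts the segments array into SRT, assuming the response shape described above ({"segments": [{"start", "end", "text"}, ...]}); it uses org.json, which is already in the examples here:

```kotlin
import org.json.JSONObject

// Convert a verbose_json response body into SRT subtitle text.
fun verboseJsonToSrt(json: String): String {
    val segments = JSONObject(json).getJSONArray("segments")
    val sb = StringBuilder()
    for (i in 0 until segments.length()) {
        val s = segments.getJSONObject(i)
        sb.append(i + 1).append('\n')
            .append(srtTime(s.getDouble("start")))
            .append(" --> ")
            .append(srtTime(s.getDouble("end"))).append('\n')
            .append(s.getString("text").trim()).append("\n\n")
    }
    return sb.toString()
}

// SRT timestamps use the HH:MM:SS,mmm format
private fun srtTime(sec: Double): String {
    val ms = (sec * 1000).toLong()
    return "%02d:%02d:%02d,%03d".format(
        ms / 3_600_000, ms / 60_000 % 60, ms / 1000 % 60, ms % 1000
    )
}
```

The same segment data drives tap-to-seek in a transcript view: find the segment whose [start, end] contains the tapped word's timestamp and seek the player there.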
Common Integration Mistakes
Loading the whole file into memory before sending (Data(contentsOf:) on iOS and the equivalent on Android) — a 100 MB file can OOM a budget device. Use file.asRequestBody() in OkHttp, which streams from disk, or a streamed upload on iOS via URLSession.uploadTask(withStreamedRequest:).
No retry logic. The Whisper API periodically returns 503 Service Unavailable under load. Exponential backoff with 3 attempts covers 99% of cases.
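A minimal backoff wrapper might look like this — a sketch, not a definitive implementation; the function name is illustrative, and it retries only on IOException (which is what a 503 surfaces as if transcribe throws on non-2xx):

```kotlin
import kotlinx.coroutines.delay
import java.io.IOException

// Retry a suspending call with exponential backoff: 1s, 2s, 4s between attempts.
suspend fun <T> withRetry(
    attempts: Int = 3,
    baseDelayMs: Long = 1_000,
    block: suspend () -> T
): T {
    var lastError: IOException? = null
    repeat(attempts) { attempt ->
        try {
            return block()
        } catch (e: IOException) {
            lastError = e
            delay(baseDelayMs shl attempt)  // double the wait each time
        }
    }
    throw lastError!!
}

// Usage: val text = withRetry { transcribe(file) }
```

Don't retry on 400/401 — those are client errors that will fail identically on every attempt.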
Storing the API key in code or BuildConfig. The key must come via your backend — the mobile client shouldn't have direct OpenAI API access in production.
Timeline and Process
Basic Whisper integration (record → transcribe → display text) on one platform takes 3–5 days. Adding chunking, verbose_json with timestamps, retry logic, and background processing via WorkManager/BackgroundTasks — another 5–8 days. Multi-language support and an audio-text sync UI are a separate phase.