Implementing Multimodal AI Input (Text + Image) in a Mobile Application
When a user photographs a product label and wants the composition decoded immediately, that is multimodal input: not "upload a photo, then type a question in a separate field," but a unified flow in which the photo and its context go to the model in one request. Implementing this correctly is harder than it first appears.
Where First Prototypes Break
The most common mistake is to send the image in a separate request, get back a text description, and concatenate it with the user's question. That is not multimodality; it is a two-call chain with context loss. GPT-4o, Claude 3, and Gemini 1.5 accept image parts directly inside messages[], so use that.
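For reference, a minimal OpenAI-style request body with an inline image might look like this (model name, prompt, and the truncated base64 value are illustrative):

```json
{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      { "type": "text", "text": "What is in this product's composition?" },
      { "type": "image_url",
        "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQ..." } }
    ]
  }]
}
```

The key point is the single message whose content is an array of typed parts: the text and the image travel together, so the model sees them in one context.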
A common Android problem: the Bitmap from BitmapFactory.decodeFile() for a large camera shot weighs 15–20 MB, and its base64 balloons past 25 MB, so the API returns 400 Bad Request with a vague image_too_large. The fix is to scale via Bitmap.createScaledBitmap() to around 1024×1024, or to crop with BitmapRegionDecoder before sending; JPEG at 85% quality is usually sufficient.
iOS is the same story with different pitfalls: UIImagePickerController returns a UIImage whose imageOrientation != .up, so the model receives the image rotated or upside down. Apply the orientation via ImageIO or CGImagePropertyOrientation before base64 encoding; otherwise text recognition degrades.
How Real Integration Is Built
Exchange protocol. The OpenAI-compatible format (messages whose content is an array of typed parts) works with most providers. Build a MultimodalMessage abstraction that packs a List<ContentPart> (text, image, optionally a document) into one payload. This lets you switch providers (OpenAI → Anthropic → Google) by replacing a single adapter.
// Android (Kotlin)
import android.graphics.Bitmap
import android.util.Base64
import java.io.ByteArrayOutputStream
import org.json.JSONArray
import org.json.JSONObject

data class ImagePart(val base64: String, val mimeType: String = "image/jpeg")
data class TextPart(val text: String)

// Builds an OpenAI-style chat JSON body; wrap the result in an OkHttp RequestBody to send.
fun buildPayload(text: String, bitmap: Bitmap): String {
    val scaled = Bitmap.createScaledBitmap(bitmap, 1024, 1024, true) // stay under API size limits
    val stream = ByteArrayOutputStream()
    scaled.compress(Bitmap.CompressFormat.JPEG, 85, stream)
    val b64 = Base64.encodeToString(stream.toByteArray(), Base64.NO_WRAP)
    // Pack text + image into a single messages[] entry.
    val content = JSONArray()
        .put(JSONObject().put("type", "text").put("text", text))
        .put(JSONObject().put("type", "image_url")
            .put("image_url", JSONObject().put("url", "data:image/jpeg;base64,$b64")))
    val user = JSONObject().put("role", "user").put("content", content)
    return JSONObject().put("model", "gpt-4o")
        .put("messages", JSONArray().put(user)).toString()
}
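The MultimodalMessage abstraction described above could be sketched like this (a minimal sketch; the adapter interface and all member names beyond MultimodalMessage and ContentPart are hypothetical):

```kotlin
// Provider-neutral message model: one message, many typed parts.
sealed interface ContentPart {
    data class Text(val text: String) : ContentPart
    data class Image(val base64: String, val mimeType: String = "image/jpeg") : ContentPart
}

data class MultimodalMessage(val role: String, val parts: List<ContentPart>)

// One adapter per provider turns the neutral model into that provider's wire JSON.
interface ProviderAdapter {
    fun toRequestJson(messages: List<MultimodalMessage>): String
}
```

Switching from OpenAI to Anthropic or Google then means supplying a different ProviderAdapter implementation; the capture, encoding, and UI layers never see the wire format.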
Response streaming. For long answers (medical image analysis, invoice breakdowns), stream: true with Server-Sent Events gives the user the sense of a live response. On Android, use OkHttp with EventSource; on iOS, URLSession plus AsyncSequence. Without streaming, a user analyzing a dense document stares at a blank screen for 8–12 seconds.
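On Android this can be sketched with OkHttp's optional okhttp-sse artifact (the listener wiring is the point here; request construction and JSON delta parsing are elided, and the [DONE] terminator follows the OpenAI convention):

```kotlin
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.Response
import okhttp3.sse.EventSource
import okhttp3.sse.EventSourceListener
import okhttp3.sse.EventSources

// Opens an SSE stream and forwards each raw event to onToken as it arrives.
fun streamCompletion(client: OkHttpClient, request: Request, onToken: (String) -> Unit): EventSource {
    val listener = object : EventSourceListener() {
        override fun onEvent(eventSource: EventSource, id: String?, type: String?, data: String) {
            if (data == "[DONE]") return   // end-of-stream marker; parse JSON deltas otherwise
            onToken(data)
        }
        override fun onFailure(eventSource: EventSource, t: Throwable?, response: Response?) {
            // Surface this as a UI error state, never as an endless spinner.
        }
    }
    return EventSources.createFactory(client).newEventSource(request, listener)
}
```

The returned EventSource can be cancelled when the user navigates away, which matters for the cancellation and reconnect testing mentioned later.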
Caching and repeat requests. If the user sends the same image with a different question, there is no need to re-encode it. Cache the base64 string keyed by a bitmap hash (MD5 of the pixel array or the file Uri) in a 10–20 MB LruCache; on iOS, use NSCache with similar logic.
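A minimal Android sketch of such a cache, assuming the compressed JPEG bytes are available as the hash input (the cache size and helper names are illustrative):

```kotlin
import android.util.LruCache
import java.security.MessageDigest

// ~16 MB of base64 strings, keyed by an MD5 content hash.
val base64Cache = object : LruCache<String, String>(16 * 1024 * 1024) {
    override fun sizeOf(key: String, value: String) = value.length // ~1 byte per char
}

fun cacheKey(imageBytes: ByteArray): String =
    MessageDigest.getInstance("MD5").digest(imageBytes).joinToString("") { "%02x".format(it) }

// Returns the cached base64 if the same bytes were sent before, otherwise encodes and stores it.
fun base64For(imageBytes: ByteArray, encode: (ByteArray) -> String): String {
    val key = cacheKey(imageBytes)
    return base64Cache.get(key) ?: encode(imageBytes).also { base64Cache.put(key, it) }
}
```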
Complexities at the UX and Architecture Level
Camera and gallery permissions split on Android 13+: READ_MEDIA_IMAGES replaces the old READ_EXTERNAL_STORAGE. On iOS, declare NSPhotoLibraryUsageDescription and NSCameraUsageDescription in Info.plist; since iOS 14, PHPickerViewController works without requesting full library access. Don't use UIImagePickerController for new projects: Apple is steering it toward deprecation in favor of PHPicker.
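On Android the permission split reduces to a version check when requesting gallery access:

```kotlin
import android.Manifest
import android.os.Build

// Pick the correct gallery-read permission for the running OS version.
val galleryPermission: String =
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.TIRAMISU)
        Manifest.permission.READ_MEDIA_IMAGES     // Android 13+
    else
        Manifest.permission.READ_EXTERNAL_STORAGE // pre-13 fallback
```

Both permissions still need to be declared in the manifest; this only selects which one to request at runtime.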
Many teams underestimate model error handling. If the image is blurry, too dark, or contains prohibited content, the provider returns finish_reason: content_filter or an empty content. The UI must distinguish these cases and give the user clear feedback, not an infinite loading indicator.
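A sketch of that mapping (the UI state names are hypothetical; the finish_reason values follow the OpenAI convention):

```kotlin
// Hypothetical UI states the screen can render distinctly.
sealed interface AnalysisResult {
    data class Success(val text: String) : AnalysisResult
    data class Rejected(val userMessage: String) : AnalysisResult
    data class Failed(val userMessage: String) : AnalysisResult
}

// Maps the provider's finish_reason + content into an explicit UI state.
fun mapResponse(finishReason: String?, content: String?): AnalysisResult = when {
    finishReason == "content_filter" ->
        AnalysisResult.Rejected("This image can't be analyzed due to content policy.")
    content.isNullOrBlank() ->
        AnalysisResult.Failed("The model returned no answer; try a clearer photo.")
    else -> AnalysisResult.Success(content)
}
```

Rendering Rejected and Failed as distinct states is what prevents the infinite-spinner failure mode described above.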
Stack and Tools
| Component | Android | iOS |
|---|---|---|
| Image Capture | CameraX 1.3+ | AVFoundation / PHPickerViewController |
| Encoding | Base64 (java.util) | Data.base64EncodedString() |
| HTTP Client | OkHttp 4 + Retrofit | URLSession / Alamofire |
| Streaming | OkHttp EventSource | AsyncStream / Combine |
| Cache | LruCache / Coil | NSCache / Kingfisher |
Flutter: image_picker → dart:convert (base64Encode) → http or dio with chunked streaming. Architecturally, use provider or BLoC to manage load and streaming state.
Workflow Stages
Audit current app architecture and choose AI provider → design MultimodalMessage protocol and provider abstraction → implement capture, encoding, send → integrate streaming response render → test edge cases (portrait/landscape, HDR, large files) → load testing (parallel requests, cancellation, reconnect) → release and monitor via Firebase Crashlytics + custom events.
Timeline: an MVP with basic text+image input takes 1–2 weeks; a full implementation with streaming, caching, error handling, and multi-provider support takes 3–5 weeks, depending on the existing codebase.