Multimodal AI Input (Text + Image) for Mobile App

TRUETECH develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular marketplaces such as Google Play, the App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with; each has its own features and functionality, tailored to the client's specific needs and goals.

Complexity: Medium · Estimated timeline: ~3–5 business days
Latest works:
  • Development of a mobile application for FEEDME
  • Development of a mobile application for XOOMER
  • Development of a mobile application for RHL
  • Development of a mobile application for ZIPPY
  • Development of a mobile application for Affhome
  • Development of a mobile application for the FLAVORS company

Implementing Multimodal AI Input (Text + Image) in a Mobile Application

When a user photographs a product label and wants the ingredient list decoded immediately, that is multimodal input. Not "upload a photo, then type a question into another field," but a unified flow: the photo and its textual context go to the model in a single request. Implementing this correctly is harder than it first appears.

Where First Prototypes Break

The most common mistake is to send the image in a separate request, receive a text description, and then concatenate it with the user's question. That is not multimodality; it is a two-call chain with context loss. GPT-4o, Claude 3, and Gemini 1.5 all accept image_url entries directly inside messages[], so use that.
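For reference, a single request carrying both modalities in the OpenAI-compatible format looks roughly like this (the model name, prompt text, and truncated base64 value are illustrative):

```json
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What are the ingredients on this label?" },
        { "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQ..." } }
      ]
    }
  ]
}
```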

A common Android problem: a Bitmap produced by BitmapFactory.decodeFile() from a large camera shot weighs 15–20 MB. Base64 from such an image balloons to 25+ MB, and the API answers 400 Bad Request with a vague image_too_large. The solution is to downscale via Bitmap.createScaledBitmap() to around 1024×1024, or use BitmapRegionDecoder to crop before sending. JPEG at 85% quality is usually sufficient.

On iOS the story is similar, with different pitfalls: UIImagePickerController returns a UIImage whose imageOrientation != .up, and the model receives the image rotated or upside down. Apply the orientation via ImageIO or CGImagePropertyOrientation before base64 encoding; otherwise text recognition degrades.

How Real Integration Is Built

Exchange protocol. The OpenAI-compatible format (messages with a content array of typed parts) works with most providers. Build a MultimodalMessage abstraction that packs a List<ContentPart> (text, image, optionally document) into a single payload. This lets you switch providers (OpenAI → Anthropic → Google) by replacing a single adapter.

// Android (Kotlin)
import android.graphics.Bitmap
import android.util.Base64
import java.io.ByteArrayOutputStream
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.RequestBody
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

data class ImagePart(val base64: String, val mimeType: String = "image/jpeg")
data class TextPart(val text: String)

fun buildPayload(text: String, bitmap: Bitmap): RequestBody {
    val scaled = Bitmap.createScaledBitmap(bitmap, 1024, 1024, true)
    val stream = ByteArrayOutputStream()
    scaled.compress(Bitmap.CompressFormat.JPEG, 85, stream)
    val b64 = Base64.encodeToString(stream.toByteArray(), Base64.NO_WRAP)
    // pack text + image into one OpenAI-style messages[] entry
    val content = JSONArray()
        .put(JSONObject().put("type", "text").put("text", text))
        .put(JSONObject().put("type", "image_url")
            .put("image_url", JSONObject().put("url", "data:image/jpeg;base64,$b64")))
    val payload = JSONObject()
        .put("messages", JSONArray().put(JSONObject().put("role", "user").put("content", content)))
    return payload.toString().toRequestBody("application/json".toMediaType())
}

Response streaming. For long answers (medical image analysis, invoice breakdown), stream: true with Server-Sent Events gives the user the sense of a live response. On Android, use OkHttp with EventSource; on iOS, URLSession plus AsyncSequence. Without streaming, the user stares at a blank screen for 8–12 seconds while a dense document is analyzed.
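A provider-agnostic way to consume such a stream is to parse the `data:` lines and accumulate the deltas. A minimal sketch in plain Kotlin; the `choices[0].delta.content` layout assumes the OpenAI streaming format, and production code should use a real JSON parser rather than a regex:

```kotlin
// Naive SSE delta extraction for OpenAI-style streaming chunks.
// Assumes frames like: data: {"choices":[{"delta":{"content":"..."}}]}
fun extractDelta(line: String): String? {
    if (!line.startsWith("data: ")) return null          // ignore comments and blank lines
    val json = line.removePrefix("data: ").trim()
    if (json == "[DONE]") return null                    // end-of-stream sentinel
    val match = Regex("\"content\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"").find(json) ?: return null
    // Unescape the two most common JSON escapes for display purposes
    return match.groupValues[1].replace("\\n", "\n").replace("\\\"", "\"")
}

// Concatenate all deltas into the full answer (or feed them to the UI one by one).
fun accumulate(lines: List<String>): String =
    lines.mapNotNull(::extractDelta).joinToString("")
```

In the app, each extracted delta would be appended to the visible answer as it arrives instead of being joined at the end.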

Cache and repeat requests. If the user sends the same image with a different question, there is no need to re-encode it. Cache the base64 string keyed by a Bitmap hash (MD5 of the pixel array or the file Uri) in a 10–20 MB LruCache. On iOS, use NSCache with similar logic.
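A sketch of that idea in plain Kotlin; the class name EncodedImageCache is hypothetical, and on Android you would typically back this with android.util.LruCache sized in bytes rather than entries:

```kotlin
import java.security.MessageDigest

// LRU cache of base64 strings keyed by an MD5 hash of the image bytes.
class EncodedImageCache(private val maxEntries: Int = 16) {
    private val map = object : LinkedHashMap<String, String>(16, 0.75f, true) {
        override fun removeEldestEntry(eldest: MutableMap.MutableEntry<String, String>): Boolean =
            size > maxEntries   // evict the least-recently-used entry
    }

    private fun keyFor(bytes: ByteArray): String =
        MessageDigest.getInstance("MD5").digest(bytes).joinToString("") { "%02x".format(it) }

    // Returns the cached base64 string, or computes and stores it on a miss.
    fun getOrPut(bytes: ByteArray, encode: () -> String): String =
        map.getOrPut(keyFor(bytes), encode)
}
```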

Complexities at the UX and Architecture Level

Camera and gallery permissions split on Android 13+: READ_MEDIA_IMAGES replaces the old READ_EXTERNAL_STORAGE. On iOS, declare NSPhotoLibraryUsageDescription and NSCameraUsageDescription in Info.plist; since iOS 14, PHPickerViewController works without requesting full library access. Avoid UIImagePickerController in new projects; Apple is steering developers away from it.
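The Android side of that permission split can be isolated in one helper. A sketch; in real code the argument would be Build.VERSION.SDK_INT, and the string values correspond to the Manifest.permission constants:

```kotlin
// Picks the image-read permission to request, depending on the Android SDK level.
// Android 13 is API level 33, where READ_MEDIA_IMAGES was introduced.
fun imageReadPermission(sdkInt: Int): String =
    if (sdkInt >= 33) "android.permission.READ_MEDIA_IMAGES"
    else "android.permission.READ_EXTERNAL_STORAGE"
```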

Many teams underestimate model error handling. If the image is blurry, too dark, or contains prohibited content, the provider returns finish_reason: content_filter or empty content. The UI must distinguish these cases and give the user clear feedback, not an infinite loading indicator.
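One way to force the UI to handle these cases is to map the provider response into an exhaustive sealed type. A sketch; the names are illustrative, and the finish_reason values follow the OpenAI-style convention:

```kotlin
// Exhaustive result type: a when() in the UI layer cannot silently ignore a case.
sealed class VisionResult {
    data class Success(val text: String) : VisionResult()
    object ContentFiltered : VisionResult()   // provider refused the image
    object EmptyAnswer : VisionResult()       // blank content, e.g. an unreadable photo
}

fun classify(finishReason: String?, content: String?): VisionResult = when {
    finishReason == "content_filter" -> VisionResult.ContentFiltered
    content.isNullOrBlank() -> VisionResult.EmptyAnswer
    else -> VisionResult.Success(content)
}
```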

Stack and Tools

Component        Android                  iOS
Image capture    CameraX 1.3+             AVFoundation / PHPickerViewController
Encoding         Base64 (java.util)       Data.base64EncodedString()
HTTP client      OkHttp 4 + Retrofit      URLSession / Alamofire
Streaming        OkHttp EventSource       AsyncStream / Combine
Cache            LruCache / Coil          NSCache / Kingfisher

Flutter: image_picker → dart:convert (base64Encode) → http or dio with chunked streaming. Architecturally, use provider or BLoC to manage loading/streaming state.

Workflow Stages

Audit the current app architecture and choose an AI provider → design the MultimodalMessage protocol and the provider abstraction → implement capture, encoding, and sending → integrate streaming response rendering → test edge cases (portrait/landscape, HDR, large files) → load-test (parallel requests, cancellation, reconnect) → release and monitor via Firebase Crashlytics plus custom events.

Timeline: an MVP with basic text + image input takes 1–2 weeks. A full implementation with streaming, caching, error handling, and multi-provider support takes 3–5 weeks, depending on the existing codebase.