Multimodal AI Input (Text + Document) for Mobile App


Implementing Multimodal AI Input (Text + Document) in a Mobile Application

A user opens the mobile app, attaches a PDF contract, and asks: "What's the contract termination period?" The task looks simple, but between file_picker and a meaningful model response lie a dozen non-trivial decisions.

Problem Number One: How to Send the Document to an LLM

Most LLMs accept text, not PDF, which means conversion is needed. The options:

Direct upload via a Files API. The OpenAI Assistants API and the Gemini Files API accept PDF, DOCX, and TXT directly. For a mobile app this is the cleanest path: upload the file, get a file_id, and reference it in messages[]. There are limits, though: OpenAI caps uploads at 512 MB per file and restricts how many files an assistant can attach, and its Files API is tied to the Assistants and Batch APIs, not Chat Completions.
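The upload-then-reference flow can be sketched as a request-body builder. A minimal sketch: the field names (`attachments`, `file_id`, `file_search`) follow OpenAI's Assistants v2 conventions as I understand them, but treat them as an assumption and verify against the current API reference.

```kotlin
// Sketch of the message shape that references a previously uploaded file.
// The file_id comes back from the Files API upload step; the attachments
// field name is an assumption based on the Assistants v2 API.
fun buildMessageWithFile(question: String, fileId: String): Map<String, Any> =
    mapOf(
        "role" to "user",
        "content" to question,
        "attachments" to listOf(
            mapOf(
                "file_id" to fileId,  // returned by the Files API upload
                "tools" to listOf(mapOf("type" to "file_search"))
            )
        )
    )

fun main() {
    val msg = buildMessageWithFile("What's the contract termination period?", "file-abc123")
    println(msg["role"])  // user
    @Suppress("UNCHECKED_CAST")
    val attachments = msg["attachments"] as List<Map<String, Any>>
    println(attachments.first()["file_id"])  // file-abc123
}
```

The point of isolating this in one function: when the provider changes its schema, the mobile client only touches one place.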

Text extraction on the client. For PDF on Android: PdfRenderer (built in since API 21) renders pages to a Bitmap for OCR via the ML Kit TextRecognizer, or use the Apache PDFBox port for typed PDFs. On iOS: PDFKit with PDFPage.string for typed PDFs; for scans, the Vision framework with VNRecognizeTextRequest. The extracted text goes into content[] as a string.

The problem with scanned documents: PDFKit's string property returns nothing for PDFs assembled from scanned pages, because there is no text layer. ML Kit TextRecognizer handles them, but each page must be rendered to a Bitmap/CGImage and run through OCR. For a 50-page document that means 2–5 seconds on device.

Text Extraction: Pitfalls

Android's PdfRenderer requires a ParcelFileDescriptor opened with the MODE_READ_ONLY flag. If the file arrived via a content:// URI from a FileProvider, use contentResolver.openFileDescriptor(). Constructing a File() directly from a content:// URI throws FileNotFoundException, a common mistake for developers unfamiliar with SAF (Storage Access Framework).

Process multi-page documents page by page instead of loading everything into memory. Each PdfRenderer.Page must be closed before opening the next: page.close() is mandatory, otherwise the next iteration throws IllegalStateException.
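The close-per-page discipline maps naturally onto Kotlin's `use`. Below is a pure-JVM sketch of the loop shape, with a hypothetical `FakePage` standing in for `PdfRenderer.Page` so the pattern can run off-device; on Android you would call `renderer.openPage(i)` and render to a Bitmap inside the `use` block.

```kotlin
// FakePage is a stand-in for android.graphics.pdf.PdfRenderer.Page, which
// implements AutoCloseable. Opening a second page while one is still open
// throws IllegalStateException, so each page must be closed before the next
// iteration. Kotlin's `use` guarantees close() even if OCR throws.
class FakePage(val index: Int) : AutoCloseable {
    var closed = false
        private set
    override fun close() { closed = true }
}

fun processPages(pageCount: Int, ocr: (FakePage) -> String): List<String> {
    val results = mutableListOf<String>()
    for (i in 0 until pageCount) {
        val page = FakePage(i)      // on Android: renderer.openPage(i)
        page.use { p ->             // close() runs even on exception
            results += ocr(p)       // on Android: render to Bitmap, run ML Kit OCR
        }
        check(page.closed)          // invariant: never carry an open page forward
    }
    return results
}

fun main() {
    val texts = processPages(3) { p -> "page ${p.index} text" }
    println(texts.size)  // 3
}
```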

On iOS, PDFDocument(url:) can return nil for corrupt files, and encrypted PDFs load in a locked state: check isEncrypted, prompt for the password in the UI, and don't fail silently.

Architectural Solution for Large Documents

The full text of a 100-page contract won't fit into most models' context window, or fits but at real cost. The right approach for large documents is RAG: split the text into 500–1000-token chunks with 50–100 tokens of overlap, index them in a vector DB, and on each query fetch the top-5 relevant chunks and pass only those into the context.
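The chunking step above can be sketched in a few lines. In this sketch words stand in for tokens (a real pipeline would use the model's tokenizer); the default sizes follow the 500–1000 token chunk and 50–100 token overlap rule of thumb.

```kotlin
// Fixed-size chunking with overlap. Words approximate tokens here; swap in
// a real tokenizer for production. Overlap makes each chunk re-read the tail
// of the previous one so answers spanning a boundary aren't lost.
fun chunkWithOverlap(text: String, chunkSize: Int = 500, overlap: Int = 50): List<String> {
    require(overlap < chunkSize) { "overlap must be smaller than chunk size" }
    val words = text.split(Regex("\\s+")).filter { it.isNotBlank() }
    if (words.isEmpty()) return emptyList()
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkSize, words.size)
        chunks += words.subList(start, end).joinToString(" ")
        if (end == words.size) break
        start = end - overlap  // next chunk re-reads the last `overlap` words
    }
    return chunks
}

fun main() {
    val doc = (1..1200).joinToString(" ") { "w$it" }
    val chunks = chunkWithOverlap(doc, chunkSize = 500, overlap = 50)
    println(chunks.size)  // 3: words 1..500, 451..950, 901..1200
}
```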

For a mobile app this usually means server-side processing: the client uploads the file to the backend, and the backend handles chunking and embeddings. The client keeps the request UI and response rendering. Running vector search directly on the phone only makes sense for offline scenarios.
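For the offline scenario, on-device retrieval over a few hundred chunk embeddings can simply be brute force; a vector DB only pays off at much larger scale. A minimal sketch, assuming the embeddings are already computed (server-side or by an on-device model):

```kotlin
import kotlin.math.sqrt

// Brute-force cosine-similarity retrieval over precomputed chunk embeddings.
fun cosine(a: FloatArray, b: FloatArray): Double {
    var dot = 0.0; var na = 0.0; var nb = 0.0
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return if (na == 0.0 || nb == 0.0) 0.0 else dot / (sqrt(na) * sqrt(nb))
}

// Returns the k chunk texts most similar to the query embedding.
fun topK(query: FloatArray, chunks: List<Pair<String, FloatArray>>, k: Int = 5): List<String> =
    chunks.sortedByDescending { (_, emb) -> cosine(query, emb) }
        .take(k)
        .map { it.first }

fun main() {
    // Toy 2-dimensional embeddings for illustration only.
    val chunks = listOf(
        "termination clause" to floatArrayOf(1f, 0f),
        "payment terms" to floatArrayOf(0f, 1f),
        "notice period" to floatArrayOf(0.9f, 0.1f),
    )
    println(topK(floatArrayOf(1f, 0f), chunks, k = 2))  // [termination clause, notice period]
}
```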

Formats and Limits

| Format | Android | iOS | OpenAI limit |
| --- | --- | --- | --- |
| PDF (text) | PdfRenderer + PDFBox | PDFKit | 512 MB |
| PDF (scan) | ML Kit OCR | Vision VNRecognizeTextRequest | — (needs preprocessing) |
| DOCX | Apache POI (Java) | — | 512 MB (via Files API) |
| TXT / MD | Native | Native | No limit |
| XLSX | Apache POI | — | 512 MB |

DOCX on iOS without third-party libraries is painful: either convert on the server (LibreOffice headless) or limit mobile support to PDF + TXT.

Workflow

Audit the document formats in your product → choose a strategy (Files API vs client-side extraction vs RAG) → implement file loading (file_picker, SAF, UIDocumentPickerViewController) → text conversion and cleanup → LLM integration → progress indicators for long operations → testing on real documents of varying quality.
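The "choose strategy" step above can be made explicit as a small decision function. The thresholds here (page count, format set) are illustrative assumptions to tune per model and budget, not fixed rules from the text.

```kotlin
enum class Strategy { FILES_API, CLIENT_EXTRACTION, RAG }

// Illustrative routing for the strategy-selection step. The 50-page cutoff
// and the format set are assumptions; only the 512 MB Files API cap comes
// from the provider limits discussed above.
fun chooseStrategy(format: String, pages: Int, sizeMb: Int): Strategy = when {
    pages > 50 -> Strategy.RAG                                        // large docs: chunk + retrieve
    format in setOf("pdf", "docx") && sizeMb <= 512 -> Strategy.FILES_API
    else -> Strategy.CLIENT_EXTRACTION                                // txt/md or oversized files
}

fun main() {
    println(chooseStrategy("pdf", pages = 120, sizeMb = 8))  // RAG
    println(chooseStrategy("pdf", pages = 10, sizeMb = 8))   // FILES_API
    println(chooseStrategy("txt", pages = 1, sizeMb = 1))    // CLIENT_EXTRACTION
}
```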

Timeline: basic PDF + TXT support with direct pass-through takes roughly 1–2 weeks. A full pipeline with OCR, multiple formats, and RAG for large documents takes 4–6 weeks.