Implementing Multimodal AI Input (Text + Document) in a Mobile Application
A user opens the mobile app, attaches a PDF contract, and asks: "What's the contract termination period?" The task looks simple, but between `file_picker` and a meaningful model response lie a dozen non-trivial decisions.
Problem One: How to Send the Document to the LLM
Most LLMs accept text, not PDFs, so conversion is needed. The options:
Direct upload via a Files API. The OpenAI Assistants API and the Gemini Files API accept PDF, DOCX, and TXT directly. For a mobile app this is the cleanest path: upload the file, get a file_id, and reference it in messages[]. Limits apply, though: OpenAI caps files at 512 MB and 100 files per assistant, and its Files API is tied to Assistants/Batch, not Chat Completions. A sketch of the upload step follows.
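A minimal sketch of that upload, assuming OkHttp 4.x on Android; the `uploadPdf` helper name is illustrative, and in production the API key should come from your backend or secure storage, never from the app binary:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody
import org.json.JSONObject
import java.io.File

// Uploads a PDF to the OpenAI Files API and returns the file_id.
// Assumes OkHttp 4.x; apiKey must be supplied securely, not hard-coded.
fun uploadPdf(client: OkHttpClient, apiKey: String, pdf: File): String {
    val body = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("purpose", "assistants") // file is for the Assistants API
        .addFormDataPart("file", pdf.name, pdf.asRequestBody("application/pdf".toMediaType()))
        .build()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/files")
        .header("Authorization", "Bearer $apiKey")
        .post(body)
        .build()
    client.newCall(request).execute().use { response ->
        check(response.isSuccessful) { "Upload failed: HTTP ${response.code}" }
        return JSONObject(response.body!!.string()).getString("id")
    }
}
```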
Text extraction on the client. For PDFs on Android: PdfRenderer (built in since API 21) to render pages to a Bitmap plus OCR via ML Kit's TextRecognizer, or the Apache PDFBox port. On iOS: PDFKit with PDFPage.string for typed PDFs; for scans, the Vision framework with VNRecognizeTextRequest. The extracted text goes into content[] as a plain string; see the extraction sketch below.
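A sketch of the text-layer path, assuming the PdfBox-Android port (com.tom-roush:pdfbox-android); `extractTypedPdfText` is a hypothetical helper name:

```kotlin
import android.content.Context
import com.tom_roush.pdfbox.android.PDFBoxResourceLoader
import com.tom_roush.pdfbox.pdmodel.PDDocument
import com.tom_roush.pdfbox.text.PDFTextStripper
import java.io.InputStream

// Extracts the text layer from a typed PDF using the PdfBox-Android port.
// Returns an empty string for scanned PDFs with no text layer; that is
// the signal to fall back to the OCR path.
fun extractTypedPdfText(context: Context, input: InputStream): String {
    PDFBoxResourceLoader.init(context) // one-time init required by the port
    PDDocument.load(input).use { doc ->
        return PDFTextStripper().getText(doc)
    }
}
```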
The problem with scanned documents. PDFPage.string returns an empty result for PDFs assembled from scanned pages: there is no text layer. ML Kit's TextRecognizer handles them, but each page must be rendered to a Bitmap/CGImage and run through OCR, which takes 2–5 seconds on device for a 50-page document.
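A sketch of the OCR step for one rendered page, using ML Kit's on-device Latin text recognizer; the `ocrPage` helper and its callback shape are illustrative:

```kotlin
import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

// Runs ML Kit on-device OCR over one rendered PDF page. Recognition is
// asynchronous, so collect per-page results in page order yourself.
fun ocrPage(page: Bitmap, onResult: (String) -> Unit, onError: (Exception) -> Unit) {
    val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
    val image = InputImage.fromBitmap(page, /* rotationDegrees = */ 0)
    recognizer.process(image)
        .addOnSuccessListener { result -> onResult(result.text) }
        .addOnFailureListener { e -> onError(e) }
}
```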
Text Extraction: Pitfalls
Android's PdfRenderer requires a ParcelFileDescriptor opened with MODE_READ_ONLY. If the file arrived via a content:// URI from a FileProvider, you need contentResolver.openFileDescriptor(). Constructing a File() directly from a content:// URI throws FileNotFoundException, a common mistake for developers unfamiliar with the Storage Access Framework (SAF).
Process multi-page documents page by page rather than loading everything into memory. Each PdfRenderer.Page must be closed before the next is opened: page.close() is mandatory, or the next iteration throws IllegalStateException. Both pitfalls appear in the sketch below.
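A sketch tying the two pitfalls together: opening the content:// URI through the resolver and closing each page before the next. `renderPages` is a hypothetical helper, and the 2x scale factor is an assumption to improve OCR on small fonts:

```kotlin
import android.content.ContentResolver
import android.graphics.Bitmap
import android.graphics.pdf.PdfRenderer
import android.net.Uri
import android.os.ParcelFileDescriptor

// Opens a PDF received as a content:// URI via SAF and renders it page by
// page. PdfRenderer allows only one open page at a time, so each page is
// closed in a finally block before the next is opened.
fun renderPages(resolver: ContentResolver, uri: Uri, onPage: (Bitmap) -> Unit) {
    val pfd: ParcelFileDescriptor = resolver.openFileDescriptor(uri, "r")
        ?: error("Cannot open $uri")
    PdfRenderer(pfd).use { renderer ->
        for (i in 0 until renderer.pageCount) {
            val page = renderer.openPage(i)
            try {
                // 2x scale is an assumption: it helps OCR on small fonts
                val bitmap = Bitmap.createBitmap(
                    page.width * 2, page.height * 2, Bitmap.Config.ARGB_8888
                )
                page.render(bitmap, null, null, PdfRenderer.Page.RENDER_MODE_FOR_DISPLAY)
                onPage(bitmap)
            } finally {
                page.close() // mandatory, or the next openPage() throws
            }
        }
    }
}
```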
On iOS, PDFDocument(url:) can return nil for an encrypted PDF. Check isEncrypted and prompt for a password in the UI rather than failing silently.
Architectural Solution for Large Documents
The full text of a 100-page contract won't fit most models' context windows, or fits but at significant cost. The right path for large documents is RAG: split the text into 500–1000-token chunks with a 50–100-token overlap, index them in a vector DB, fetch the top-5 relevant chunks per query, and pass only those into the context. A chunking sketch follows below.
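A minimal chunker sketch; the word-count stand-in for tokens is an assumption, and production code should measure chunks with the target model's tokenizer:

```kotlin
// Splits text into overlapping chunks for embedding and retrieval.
// Token counts are approximated by word count here (a rough stand-in);
// measure with the real tokenizer before relying on chunk sizes.
fun chunk(text: String, chunkTokens: Int = 800, overlapTokens: Int = 80): List<String> {
    val words = text.split(Regex("\\s+")).filter { it.isNotBlank() }
    if (words.isEmpty()) return emptyList()
    val step = chunkTokens - overlapTokens
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkTokens, words.size)
        chunks += words.subList(start, end).joinToString(" ")
        if (end == words.size) break
        start += step // overlap keeps sentence context across chunk borders
    }
    return chunks
}
```

The overlap is what keeps a clause that straddles a chunk boundary retrievable from both sides; without it, top-k retrieval can miss exactly the sentence the user asked about.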
For a mobile app this usually means server-side processing: the client uploads the file to the backend, and the backend handles chunking and embeddings. The client keeps the request UI and response rendering. Running vector search directly on the phone makes sense only for offline scenarios.
Formats and Limits
| Format | Android | iOS | OpenAI Limit |
|---|---|---|---|
| PDF (text) | PdfRenderer + PDFBox | PDFKit | 512 MB |
| PDF (scan) | ML Kit OCR | Vision VNRecognizeTextRequest | — (needs preprocessing) |
| DOCX | Apache POI (Java) | — | 512 MB (via Files API) |
| TXT / MD | Native | Native | No limit |
| XLSX | Apache POI | — | 512 MB |
DOCX on iOS without third-party libraries is painful. Either convert on the server (LibreOffice headless) or limit mobile support to PDF + TXT.
Workflow
Audit the document formats in your product → choose a strategy (Files API vs. client extraction vs. RAG) → implement file loading (file_picker, SAF, UIDocumentPickerViewController) → text conversion and cleanup → LLM integration → progress indicators for long operations → test on real documents of varying quality.
Timeline: basic PDF + TXT support with direct pass-through takes 1–2 weeks. A full pipeline with OCR, multiple formats, and RAG for large documents takes 4–6 weeks.