Implementing Multimodal AI Input (Text + Document) in a Mobile Application
A user opens the mobile app, attaches a PDF contract, and asks: "What's the contract termination period?" The task looks simple, but between `file_picker` and a meaningful model response lie a dozen non-trivial decisions.
Problem One: How to Send the Document to the LLM
Most LLMs accept text, not PDFs, so conversion is needed. The options:
Direct upload via a Files API. The OpenAI Assistants API and the Gemini Files API accept PDF, DOCX, and TXT directly. For a mobile app this is the cleanest path: upload the file, get a file_id, and reference it in messages[]. Limits apply, though: OpenAI caps files at 512 MB and 100 files per assistant, and its Files API is tied to Assistants/Batch, not Chat Completions. A sketch of the upload step follows.
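A minimal sketch of that upload, assuming OkHttp 4.x on Android; the `uploadPdf` helper name is illustrative, and in production the API key should come from your backend or secure storage, never from the app binary:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody
import org.json.JSONObject
import java.io.File

// Uploads a PDF to the OpenAI Files API and returns the file_id.
// Assumes OkHttp 4.x; apiKey must be supplied securely, not hard-coded.
fun uploadPdf(client: OkHttpClient, apiKey: String, pdf: File): String {
    val body = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("purpose", "assistants") // file is for the Assistants API
        .addFormDataPart("file", pdf.name, pdf.asRequestBody("application/pdf".toMediaType()))
        .build()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/files")
        .header("Authorization", "Bearer $apiKey")
        .post(body)
        .build()
    client.newCall(request).execute().use { response ->
        check(response.isSuccessful) { "Upload failed: HTTP ${response.code}" }
        return JSONObject(response.body!!.string()).getString("id")
    }
}
```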
Text extraction on the client. For PDFs on Android: PdfRenderer (built in since API 21) to render pages to a Bitmap plus OCR via ML Kit's TextRecognizer, or the Apache PDFBox port. On iOS: PDFKit with PDFPage.string for typed PDFs; for scans, the Vision framework with VNRecognizeTextRequest. The extracted text goes into content[] as a plain string; see the extraction sketch below.
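A sketch of the text-layer path, assuming the PdfBox-Android port (com.tom-roush:pdfbox-android); `extractTypedPdfText` is a hypothetical helper name:

```kotlin
import android.content.Context
import com.tom_roush.pdfbox.android.PDFBoxResourceLoader
import com.tom_roush.pdfbox.pdmodel.PDDocument
import com.tom_roush.pdfbox.text.PDFTextStripper
import java.io.InputStream

// Extracts the text layer from a typed PDF using the PdfBox-Android port.
// Returns an empty string for scanned PDFs with no text layer; that is
// the signal to fall back to the OCR path.
fun extractTypedPdfText(context: Context, input: InputStream): String {
    PDFBoxResourceLoader.init(context) // one-time init required by the port
    PDDocument.load(input).use { doc ->
        return PDFTextStripper().getText(doc)
    }
}
```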
The problem with scanned documents. PDFPage.string returns an empty result for PDFs assembled from scanned pages: there is no text layer. ML Kit's TextRecognizer handles them, but each page must be rendered to a Bitmap/CGImage and run through OCR, which takes 2–5 seconds on device for a 50-page document.
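A sketch of the OCR step for one rendered page, using ML Kit's on-device Latin text recognizer; the `ocrPage` helper and its callback shape are illustrative:

```kotlin
import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

// Runs ML Kit on-device OCR over one rendered PDF page. Recognition is
// asynchronous, so collect per-page results in page order yourself.
fun ocrPage(page: Bitmap, onResult: (String) -> Unit, onError: (Exception) -> Unit) {
    val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
    val image = InputImage.fromBitmap(page, /* rotationDegrees = */ 0)
    recognizer.process(image)
        .addOnSuccessListener { result -> onResult(result.text) }
        .addOnFailureListener { e -> onError(e) }
}
```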
Text Extraction: Pitfalls
Android's PdfRenderer requires a ParcelFileDescriptor opened with MODE_READ_ONLY. If the file arrived via a content:// URI from a FileProvider, you need contentResolver.openFileDescriptor(). Constructing a File() directly from a content:// URI throws FileNotFoundException, a common mistake for developers unfamiliar with the Storage Access Framework (SAF).
Process multi-page documents page by page rather than loading everything into memory. Each PdfRenderer.Page must be closed before the next is opened: page.close() is mandatory, or the next iteration throws IllegalStateException. Both pitfalls appear in the sketch below.
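A sketch tying the two pitfalls together: opening the content:// URI through the resolver and closing each page before the next. `renderPages` is a hypothetical helper, and the 2x scale factor is an assumption to improve OCR on small fonts:

```kotlin
import android.content.ContentResolver
import android.graphics.Bitmap
import android.graphics.pdf.PdfRenderer
import android.net.Uri
import android.os.ParcelFileDescriptor

// Opens a PDF received as a content:// URI via SAF and renders it page by
// page. PdfRenderer allows only one open page at a time, so each page is
// closed in a finally block before the next is opened.
fun renderPages(resolver: ContentResolver, uri: Uri, onPage: (Bitmap) -> Unit) {
    val pfd: ParcelFileDescriptor = resolver.openFileDescriptor(uri, "r")
        ?: error("Cannot open $uri")
    PdfRenderer(pfd).use { renderer ->
        for (i in 0 until renderer.pageCount) {
            val page = renderer.openPage(i)
            try {
                // 2x scale is an assumption: it helps OCR on small fonts
                val bitmap = Bitmap.createBitmap(
                    page.width * 2, page.height * 2, Bitmap.Config.ARGB_8888
                )
                page.render(bitmap, null, null, PdfRenderer.Page.RENDER_MODE_FOR_DISPLAY)
                onPage(bitmap)
            } finally {
                page.close() // mandatory, or the next openPage() throws
            }
        }
    }
}
```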
On iOS, PDFDocument(url:) can return nil for an encrypted PDF. Check isEncrypted and prompt for a password in the UI rather than failing silently.
Architectural Solution for Large Documents
The full text of a 100-page contract won't fit most models' context windows, or fits but at significant cost. The right path for large documents is RAG: split the text into 500–1000-token chunks with a 50–100-token overlap, index them in a vector DB, fetch the top-5 relevant chunks per query, and pass only those into the context. A chunking sketch follows below.
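A minimal chunker sketch; the word-count stand-in for tokens is an assumption, and production code should measure chunks with the target model's tokenizer:

```kotlin
// Splits text into overlapping chunks for embedding and retrieval.
// Token counts are approximated by word count here (a rough stand-in);
// measure with the real tokenizer before relying on chunk sizes.
fun chunk(text: String, chunkTokens: Int = 800, overlapTokens: Int = 80): List<String> {
    val words = text.split(Regex("\\s+")).filter { it.isNotBlank() }
    if (words.isEmpty()) return emptyList()
    val step = chunkTokens - overlapTokens
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkTokens, words.size)
        chunks += words.subList(start, end).joinToString(" ")
        if (end == words.size) break
        start += step // overlap keeps sentence context across chunk borders
    }
    return chunks
}
```

The overlap is what keeps a clause that straddles a chunk boundary retrievable from both sides; without it, top-k retrieval can miss exactly the sentence the user asked about.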
For a mobile app this usually means server-side processing: the client uploads the file to the backend, and the backend handles chunking and embeddings. The client keeps the request UI and response rendering. Running vector search directly on the phone makes sense only for offline scenarios.
Formats and Limits
| Format | Android | iOS | OpenAI Limit |
|---|---|---|---|
| PDF (text) | PdfRenderer + PDFBox | PDFKit | 512 MB |
| PDF (scan) | ML Kit OCR | Vision VNRecognizeTextRequest | — (needs preprocessing) |
| DOCX | Apache POI (Java) | — | 512 MB (via Files API) |
| TXT / MD | Native | Native | No limit |
| XLSX | Apache POI | — | 512 MB |
DOCX on iOS without third-party libraries is painful. Either convert on the server (LibreOffice headless) or limit mobile support to PDF + TXT.
Workflow
Audit the document formats in your product → choose a strategy (Files API vs. client extraction vs. RAG) → implement file loading (file_picker, SAF, UIDocumentPickerViewController) → text conversion and cleanup → LLM integration → progress indicators for long operations → test on real documents of varying quality.
Timeline: basic PDF + TXT support with direct pass-through takes 1–2 weeks. A full pipeline with OCR, multiple formats, and RAG for large documents takes 4–6 weeks.