How does AI handle long documents?

We use a map-reduce approach: split the text into chunks of 2000-3000 tokens with overlap, summarize each chunk in parallel, then merge the results into a final summary. This works efficiently for documents up to 500 pages.

What output formats are supported?

We implement 5 types: brief summary (3-5 sentences), bullet points, mind-map JSON, Q&A, and action items. For structured output, we use response_format: json_object in the OpenAI API — the model must return valid JSON without a Markdown wrapper.

How is offline access to summaries ensured?

Summaries are stored locally: on iOS with Core Data and a full-text index via FTS5, on Android with Room and FTS4/FTS5. For semantic search, we use vector embeddings with server-side caching via pgvector.

How long does development take?

Basic API summarization takes 2-3 days, map-reduce with multiple formats takes 1.5 weeks, and live summarization with transcription takes 3-4 weeks. Timelines are refined after requirement analysis.

How does AI handle long documents?

We use a map-reduce approach: split the text into chunks of 2000-3000 tokens with overlap, summarize each chunk in parallel, then merge the results into a final summary. This works efficiently for documents up to 500 pages.

What output formats are supported?

We implement 5 types: brief summary (3-5 sentences), bullet points, mind-map JSON, Q&A, and action items. For structured output, we use response_format: json_object in the OpenAI API — the model must return valid JSON without a Markdown wrapper.

How is offline access to summaries ensured?

Summaries are stored locally: on iOS with Core Data and a full-text index via FTS5, on Android with Room and FTS4/FTS5. For semantic search, we use vector embeddings with server-side caching via pgvector.

How long does development take?

Basic API summarization takes 2-3 days, map-reduce with multiple formats takes 1.5 weeks, and live summarization with transcription takes 3-4 weeks. Timelines are refined after requirement analysis.

Implementing AI-Powered Text Summarization in Mobile Apps

Q: Can summarization happen in real time?

Yes, via live transcription from the microphone: 30-second audio fragments are processed by SpeechRecognizer, and the accumulated transcript is summarized with a rolling window every 2000 words. Supported on iOS (AVAudioEngine + SFSpeechRecognizer) and Android (SpeechRecognizer + MediaRecorder).

TRUETECH is engaged in the development, support and maintenance of iOS, Android, PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery and others.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Development and support of all types of mobile applications:

Information and entertainment mobile applications

News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators

E-commerce mobile applications

Online stores, B2B apps, marketplaces, online exchanges, cashback services, exchanges, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.

Business process management mobile applications

CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems

Electronic services mobile applications

Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.

Services we offer

Showing 1 of 1All 1734 services

Implementing AI-Powered Text Summarization in Mobile Apps

Simple

~2-3 days

Frequently Asked Questions

Our competencies:

Free consultation

Book a free consultation if you have any questions. A dedicated specialist will advise you.

Cost calculation

If you know what exactly you need to develop, or you already have a ready-made technical task.

Development stages

Latest works

Development of a mobile application for FEEDME
858
Development of a mobile application for XOOMER
745
Development of a mobile application for RHL
1162
Development of a mobile application for ZIPPY
1034
Development of a mobile application for Affhome
968
Development of a mobile application for the FLAVORS company
563

Show more works

Problem: LLM Context Window and Streaming Data

A typical scenario: a user dictates a lecture or pastes a 100-page PDF. The LLM takes the text but the context window overflows — the response cuts off or ignores the middle. Or a meeting transcript streams in, and a summary is needed immediately after. We solve these problems on iOS and Android with map-reduce summarization, live transcription, and structured output. We use Swift 5.9, Kotlin, Flutter 3.x, gpt-4o, Core Data, and Room. Our solutions include adaptive chunking, parallel processing, and local caching of summaries for offline access. Experience — over 50 projects with AI integration.

For example, for one EdTech client, we implemented live lecture summarization: 30-second audio fragments are processed by SpeechRecognizer, and the accumulated transcript is summarized with a rolling window every 2000 words. The result — structured notes available 5 minutes after the lecture. The key technique is map-reduce: split the document into chunks of 2500 tokens with 200-token overlap, summarize in parallel, then reduce into a final summary. This allows processing documents up to 500 pages without losing coherence.

How Map-Reduce Summarization Solves the Long Document Problem

gpt-4o supports 128k tokens of context, but running the entire document through every time is expensive and slow. The standard pattern — MapReduce:

Split the document into chunks of 2000–3000 tokens with ~200-token overlap
Summarize each chunk independently (map)
Summarize the list of summaries into a final summary (reduce)

More about chunking parameters

In practice, the chunk size depends on the model: for gpt-4o-mini, 3000 tokens is optimal, for gpt-3.5-turbo — 2000. The 200-token overlap ensures no sentence is broken at chunk boundaries.

// iOS
func summarizeDocument(_ text: String) async throws -> String {
    let chunks = chunkText(text, maxTokens: 2500, overlap: 200)

    // Parallel summarization of chunks
    let partialSummaries = try await withThrowingTaskGroup(of: String.self) { group in
        for chunk in chunks {
            group.addTask { try await self.summarizeChunk(chunk) }
        }
        var results = [String]()
        for try await result in group { results.append(result) }
        return results
    }

    // Final reduce
    let combined = partialSummaries.joined(separator: "\n\n")
    return try await summarizeChunk(combined, isFinal: true)
}

func chunkText(_ text: String, maxTokens: Int, overlap: Int) -> [String] {
    // ~4 characters = 1 token for Russian text (approximate)
    let chunkSize = maxTokens * 3
    let overlapSize = overlap * 3

    var chunks = [String]()
    var start = text.startIndex
    while start < text.endIndex {
        let end = text.index(start, offsetBy: chunkSize, limitedBy: text.endIndex) ?? text.endIndex
        chunks.append(String(text[start..<end]))
        guard let nextStart = text.index(start, offsetBy: chunkSize - overlapSize, limitedBy: text.endIndex) else { break }
        start = nextStart
    }
    return chunks
}

withThrowingTaskGroup allows parallel execution of tasks for each chunk. For 10 chunks, this is 5–7 times faster than sequential processing.

Why Structured Output Improves UX

Summaries can be of several types. Prompts for each:

Type	Prompt Instruction
Brief summary	«Summarize in 3-5 sentences. Key points only.»
Bullets	«Extract 5-8 key points as bullet list. Each point = one idea.»
Mind-map JSON	«Return JSON: {title, branches: [{topic, subtopics: []}]}»
Q&A	«Generate 5 questions and answers based on the text.»
Action items	«Extract only action items and deadlines. Format: - [Task]: [Deadline/Owner]»

For structured output, we use response_format: { type: "json_object" } in the OpenAI API — the model must return valid JSON without a markdown wrapper.

let requestBody: [String: Any] = [
    "model": "gpt-4o-mini",
    "messages": messages,
    "response_format": ["type": "json_object"],
    "temperature": 0.2
]

Live Transcription: Real-Time Audio Processing

If the source is a microphone (lecture recording, meeting), the summary builds on top of transcription. The flow:

AVAudioEngine → 30-second fragments → SpeechRecognizer (Whisper API or native SFSpeechRecognizer) → accumulated transcript → summarization with rolling window.

// Summarize every 5 minutes of transcript with overlap
class LiveSummaryEngine {
    private var transcript = ""
    private var lastSummaryLength = 0

    func onNewTranscript(_ chunk: String) {
        transcript += " " + chunk

        // Summarize new block when ~2000 words accumulated
        let wordCount = transcript.split(separator: " ").count
        if wordCount - lastSummaryLength > 2000 {
            Task { await summarizeNewBlock() }
            lastSummaryLength = wordCount
        }
    }

    private func summarizeNewBlock() async {
        let newContent = transcript.components(separatedBy: " ")
            .dropFirst(max(0, lastSummaryLength - 200))  // overlap 200 words
            .joined(separator: " ")

        let summary = try? await llmService.summarize(newContent)
        await MainActor.run { appendToNotes(summary ?? "") }
    }
}

On Android, use SpeechRecognizer + MediaRecorder with chunking by RECOGNIZER_RESULT_STABILITY.

Where to Store Summaries?

Summaries must be available offline and support search. On iOS — Core Data or SwiftData with full-text index via NSPersistentStoreDescription with SQLite FTS5. According to Apple documentation, FTS5 full-text index speeds up search by 10x. On Android — Room with @Fts4 or @Fts5 annotation.

Semantic search (by meaning, not words) — via vector embeddings stored locally in SQLite-VSS or on the server via pgvector. For mobile apps, server-side embedding search with cached results is sufficient.

Step-by-Step Implementation Plan

Requirements analysis: determine content type (text/audio), frequency, need for offline access.
Model selection: gpt-4o-mini for speed, gpt-4o for complex cases.
Implement chunking and summarization: use map-reduce with parallel tasks.
Integrate transcription: connect AVAudioEngine/SpeechRecognizer on iOS or SpeechRecognizer on Android.
Configure storage: choose Core Data or Room with FTS for search.
Testing: run on real data, verify summarization quality.

Each stage is accompanied by architectural documentation and source code. We also train your team on AI features and provide a warranty of up to 6 months after delivery.

What's Included

Architecture and API integration documentation
Source code with comments and tests
Repository access (Git) and CI/CD pipeline
Team training (2 sessions)
6-month warranty support

Estimated Timelines

Task	Timeframe
Basic API summarization	2–3 days
Map-reduce + multiple output formats	1.5 weeks
Live summarization with transcription	3–4 weeks

Cost is calculated individually after project analysis. Time savings on materials preparation for one project reached 80%.

Our Experience and Guarantees

We specialize in mobile development with AI for over 6 years. We have completed 50+ projects, including apps with document summarization, speech-to-text, and live transcription. We use a modern stack: Swift 5.9, Kotlin, Flutter 3.x, OpenAI API, Firebase, Core Data, Room. We guarantee compliance with App Store Review Guidelines and Google Play Policy.

Machine Learning in Mobile Apps: CoreML, TFLite, and On-Device Models

We distinguish two fundamentally different approaches: an app with on-device AI and an app that simply calls a cloud API. The former works without internet, does not send user data to third-party servers, and responds within 50 milliseconds. The latter depends on network latency and pricing plans. Choosing the architecture is a key step that directly affects cost, privacy, and user experience in machine learning in mobile apps. Our experience shows that in 70% of projects, on-device inference is cheaper in the long run due to eliminating server costs.

How to Choose Between CoreML and TFLite for On-Device Inference?

CoreML — Apple's native framework for running ML models on device. Supports Neural Engine (starting with A11 Bionic), GPU, and CPU as fallback. Models are converted to .mlmodel format via coremltools from PyTorch, ONNX, or TensorFlow. Conversion is not always trivial: custom layers require implementing MLCustomLayer, and INT8 quantization can sometimes noticeably reduce accuracy on specific data. We ensure the final model passes validation on real data before and after conversion.

TensorFlow Lite — cross-platform alternative for Android and Flutter. On Android it uses NNAPI (Neural Networks API) for hardware acceleration — since Android 10 NNAPI is more stable; before that it's better to explicitly use GPU delegate via GpuDelegate. A typical mistake: the model is trained on normalized data in range [0,1], but the app feeds [0,255] — inference runs but produces meaningless results without any error. We include an automatic input data validation module in the SDK.

For image classification, object detection, and segmentation tasks, ready-to-use optimized models are available. YOLOv8 in CoreML format runs detection on a 640×640 frame in 15–20 ms on iPhone 14 Neural Engine. MobileNetV3 on TFLite with GPU delegate runs around 8 ms on Pixel 7 for classification.

Parameter	CoreML	TFLite
Platforms	iOS, macOS, watchOS	Android, iOS, Linux, embedded
Hardware acceleration	Neural Engine, GPU, CPU	NNAPI, GPU (OpenCL/OpenGL), CPU
Quantization support	FP16, INT8 (with coremltools)	FP16, INT8, dynamic range
Custom operations	Via MLCustomLayer (Swift)	Via delegates (Java/Kotlin)
Model bundle size	~3–5 MB (MobileNetV2 quantized)	~2–4 MB

What If You Need Text Generation On-Device?

Running small language models on device has become a reality in the last few years. Apple Intelligence uses its own models via Private Cloud Compute, but for third-party developers other paths are available.

llama.cpp with Metal backend on iOS is a working approach for phi-3-mini (3.8B parameters, 4-bit quantization, ~2.3 GB). Inference: 15–25 tokens/second on iPhone 15 Pro. For integration in Swift, use the Swift Package llama.swift or a wrapper via C interface llama.h. The binary is not bundled with the app — the model is downloaded on first launch and stored in Application Support. Our certified developers configure incremental download to avoid blocking the first launch.

On Android, the analog is Google AI Edge (formerly MediaPipe LLM Inference API) supporting Gemma-2B. It works via GPU delegate, on Tensor G3 chip Pixel 8 Pro — about 20 tokens/second.

Limitations are real: models larger than 4B parameters are still slow on mobile devices. For complex reasoning tasks, on-device LLM falls behind GPT-4o in quality. A hybrid approach — on-device for short tasks and private data, cloud for complex queries — is often optimal. We will evaluate your case and propose a balance of performance and privacy — contact us.

How Does On-Device Inference Compare to Cloud in Terms of Cost and Performance?

On-device inference is typically 10x cheaper per request than cloud APIs for image recognition tasks, while also eliminating latency variability and privacy risks. The table below summarizes the trade-offs.

Criteria	On-Device Inference	Cloud API
Latency	<50ms	200–500ms (including network)
Cost per 1M requests	$0 (no server)	$10–50 (AWS Rekognition, Google Vision)
Privacy	Data stays on device	Data sent to server
Offline	Yes	No
Scalability	No server scaling issues	Need to provision API capacity

For an app with 100k MAU running 10 image recognitions per user per month, on-device inference can save up to $5,000 monthly compared to cloud API. Get a free consultation on your ML architecture today.

Integrating OpenAI API and Other Cloud Models

For scenarios where cloud inference is acceptable, integrating OpenAI, Anthropic, or Google Gemini is an HTTP client + streaming SSE. In Swift, AsyncThrowingStream is convenient for streaming responses. In Kotlin, use Flow.

Critically: API keys must never be stored in the app bundle. Even an obfuscated key can be extracted from the IPA in 10 minutes using strings or frida. Correct architecture: mobile app → your own backend → OpenAI API. The backend controls rate limiting, logs requests, and protects the key.

What Is Included in the Work (Deliverables)

Trained and quantized model for the target device (documentation with metrics)
SDK for integration (Swift/Kotlin/Flutter) with call examples
Performance tests on 3–5 real devices
Instructions for OTA model updates
Support during App Store / Google Play moderation (compliance with Guidelines 4.2, 5.1)
2 weeks of technical support after release

Typical Project Pipeline

Task analysis — measure latency, privacy, size, supported devices.
Model prototyping — in Python, evaluate accuracy on target data.
Conversion and quantization — for CoreML/TFLite with validation.
Integration into the app — model wrapped in a service layer (easy to swap CoreML ↔ TFLite ↔ cloud).
Testing — on real devices, measure FPS, RAM, battery.
Deployment — via TestFlight / Firebase App Distribution, monitor metrics.

Timelines: integration of a ready CoreML/TFLite model — 1–2 weeks, development of a custom model with mobile optimization — from 6 weeks, on-device LLM chat with personalization — 4–8 weeks.

Why We Take on Complex Cases?

10+ years of experience in mobile development, 50+ implemented AI/ML solutions, guarantee of compatibility with current iOS and Android versions. All projects undergo code review and load testing. The cost includes preparation of moderation documentation and training of your team.

Contact us — we will help you choose the architecture and implement ML in your app turnkey. Order an audit of your existing solution — we will assess the potential for server cost savings free of charge. In some projects, savings can reach significant amounts per month.