How does the sliding window work in managing AI dialogue history?

The sliding window keeps only the last N messages, discarding older ones. It is fast and predictable, but the model loses context from the beginning of the conversation. Suitable for simple chats without long-term memory.

What is dialogue summarization and when should it be used?

Summarization is automatically creating a brief summary of old messages using a cheaper model. It is used when the history exceeds the token limit. The downside is loss of specific facts.

How to implement hybrid memory on mobile?

Hybrid memory combines short-term (last 10–15 messages) and long-term (structured facts) memory. Facts are updated via an additional model call after each response.

Why is token counting important on mobile devices?

Token counting helps control API costs and avoid window overflow. On mobile, heuristics are used; on the server, exact tokenizers. It is recommended to save token_count when storing messages.

What pitfalls are there when switching between different AI models?

Each model has its own token limit: GPT-4o – 128K, Claude – 200K, YandexGPT – 8K. An app optimized for one model may not work with another. Universal context management is required.

AI Dialogue Context Management for Mobile Apps: Strategies and Implementation

A conversational AI assistant in a mobile app quickly loses context if the dialogue history is not managed. Each user request and model response adds tokens, and the accumulated history is re-sent with every request: with active use, a small context window can be exhausted within 10–15 messages, and the cost of each request keeps growing with the length of the conversation. We integrate context window and AI dialogue history management into mobile apps on a turnkey basis. Based on our experience (10+ years in mobile development and 50+ AI projects), we guarantee effective context usage. The right context management strategy reduces API costs by up to 30%: for example, from $250/day down to $175/day for 1000 active users. For a client with 500 active users, our hybrid memory strategy reduced monthly API costs from $3,750 to $2,625, saving $1,125.

Why Does Dialogue History Grow and How Does It Affect Cost?

Each exchange adds tokens: a user request plus a model response. With an average message of 50–100 tokens, 20 request/response pairs (40 messages) already amount to 2000–4000 tokens of history, plus the system prompt. With gpt-4o at $5 per 1M input tokens, that is a trifle for a single request, but with 1000 active users sending 50 messages per day (50,000 requests), costs reach about $250 per day for history alone if each request carries roughly 1,000 history tokens on average, and more for longer histories. Efficient management can save up to 30% of API request costs.
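The arithmetic above can be checked with a short sketch. The function and parameter names are illustrative, and the figure of 1,000 history tokens per request is an assumed average chosen to reproduce the $250 estimate, not a measured value.

```swift
// Rough daily cost of re-sending dialogue history as input tokens.
// Names are illustrative; the price is the $5 per 1M input tokens quoted in the text.
func dailyHistoryCost(users: Int,
                      requestsPerUserPerDay: Int,
                      avgHistoryTokensPerRequest: Int,
                      pricePerMillionInputTokens: Double) -> Double {
    let tokensPerDay = Double(users * requestsPerUserPerDay * avgHistoryTokensPerRequest)
    return tokensPerDay / 1_000_000 * pricePerMillionInputTokens
}

// 1000 users x 50 requests x ~1,000 history tokens (assumed average).
let baseline = dailyHistoryCost(users: 1000,
                                requestsPerUserPerDay: 50,
                                avgHistoryTokensPerRequest: 1000,
                                pricePerMillionInputTokens: 5.0)

// Applying the 30% saving mentioned in the text.
let optimized = baseline * (1 - 0.30)
print("baseline: \(baseline), optimized: \(optimized)")
```

Plugging in your own average history length is the quickest way to see whether context management is worth the engineering effort for a given product.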

The second problem: different models have different limits (GPT-4o — 128K, Claude — 200K, YandexGPT — 8K). An app optimized for one model may malfunction when switching to another.

Managing Context When Switching AI Models

For universality, implement an abstraction over context handling: set a maximum number of tokens for the current model and adjust the strategy dynamically (for example, a sliding window with a threshold that leaves a reserve). For YandexGPT with an 8K limit, keep the window under 7K tokens so that there is room for the response. For GPT-4o with 128K, hybrid memory fits comfortably, although a token budget is still worth enforcing to keep costs under control. This approach allows switching models without changing the application logic.
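One possible shape for such an abstraction is sketched below. The context limits are the ones quoted in the text; the type name `ModelProfile`, the reserve sizes, and the fallback rule are assumptions made for illustration.

```swift
// Per-model context budget. Limits come from the text; reserve sizes
// and all names are assumptions for this sketch.
struct ModelProfile {
    let name: String
    let contextLimit: Int      // total window in tokens
    let responseReserve: Int   // tokens kept free for the model's answer

    // Tokens available for the system prompt plus history.
    var inputBudget: Int { max(0, contextLimit - responseReserve) }
}

let profiles: [String: ModelProfile] = [
    "yandexgpt": ModelProfile(name: "yandexgpt", contextLimit: 8_000, responseReserve: 1_000),
    "gpt-4o": ModelProfile(name: "gpt-4o", contextLimit: 128_000, responseReserve: 28_000),
]

// Returns the input budget for a model. Unknown models fall back to the
// smallest known budget so they never overflow their window.
func inputBudget(forModel model: String) -> Int {
    if let profile = profiles[model] { return profile.inputBudget }
    return profiles.values.map { $0.inputBudget }.min() ?? 0
}
```

The rest of the app asks only for `inputBudget(forModel:)`, so switching providers means adding a profile rather than touching the history logic.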

Three Strategies for Managing AI Dialogue History

Details on strategies

Sliding window — keep the last N messages, discard earlier ones. Fast, predictable. Downside: the model forgets the beginning of the conversation.

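// Sliding window: keep the newest messages that fit into the token budget.
// `Message` and `countTokens` are assumed to be defined elsewhere in the app.
func buildMessages(history: [Message], systemPrompt: String, maxTokens: Int = 3000) -> [Message] {
    var window: [Message] = []
    // The system prompt always consumes part of the budget.
    var tokenCount = countTokens(systemPrompt)
    // Walk from newest to oldest and stop at the first message that does not fit.
    for message in history.reversed() {
        let msgTokens = countTokens(message.content)
        if tokenCount + msgTokens > maxTokens { break }
        window.append(message)
        tokenCount += msgTokens
    }
    // Restore chronological order; the caller prepends the system prompt itself.
    return Array(window.reversed())
}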

The token budget (maxTokens, which determines how many messages N fit into the window) depends on the model: for example, 7000 for YandexGPT and 100000 for GPT-4o. The optimal value is selected experimentally.

Summarization — when the history exceeds the threshold, send old messages for summarization via a cheaper model (gpt-4o-mini, claude-haiku). Get a summary, save it as a system message, delete the summarized messages. Example prompt: "Summarize the following dialogue, highlighting key facts and decisions. Keep details important for further communication."
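The summarization step might be wired up roughly as follows. This is a minimal sketch: `ChatMessage`, `compact`, and the injected `summarize` closure (standing in for the call to the cheaper model) are hypothetical names, and error handling is omitted.

```swift
struct ChatMessage {
    let role: String      // "user", "assistant" or "system"
    let content: String
    let tokenCount: Int
}

// Compacts the history when it exceeds `threshold` tokens: everything except
// the last `keepRecent` messages is replaced by one system message holding a
// summary. `summarize` stands in for a request to a cheaper model.
func compact(history: [ChatMessage],
             threshold: Int,
             keepRecent: Int,
             countTokens: (String) -> Int,
             summarize: ([ChatMessage]) -> String) -> [ChatMessage] {
    let total = history.reduce(0) { $0 + $1.tokenCount }
    guard total > threshold, keepRecent >= 0, history.count > keepRecent else {
        return history   // under the limit, or nothing old enough to summarize
    }

    let old = Array(history.dropLast(keepRecent))
    let recent = Array(history.suffix(keepRecent))
    let content = "Summary of earlier conversation: " + summarize(old)
    let summary = ChatMessage(role: "system",
                              content: content,
                              tokenCount: countTokens(content))
    return [summary] + recent
}
```

Injecting `summarize` and `countTokens` keeps the compaction rule testable without a network call; in the app they would be backed by the cheaper model and by the token counter in use.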

Hybrid approach with memory: intended for long-term assistants. It combines short-term memory (the last 10–15 messages), long-term memory (structured facts), and semantic search via embeddings. This approach retains context 3 times better compared to a simple sliding window.

Comparison of strategies

Strategy	Speed	Context Quality	Implementation Complexity
Sliding window	High (2x faster than hybrid)	Low (forgetting)	Low
Summarization	Medium	Medium (loss of details)	Medium
Hybrid memory	Medium	High (3x better retention)	High

Get a consultation on context window architecture — we will help you choose the optimal strategy for your project.

Implementing Hybrid Memory: Step-by-Step Guide

Define fact types. What data needs to be remembered? For example, for a medical assistant: allergies, current medications, chronic diseases.
Create a long-term memory schema. Use SQLite or an in-memory dictionary. Each fact is a key-value pair with a timestamp.
On each request, retrieve the relevant facts. Send the last 10–15 messages plus all relevant facts in the prompt.
After receiving the response, update the facts. An additional call to a smaller model extracts new facts from the dialogue.
Periodically clean up outdated facts. Delete facts not confirmed for more than a week.
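The steps above can be sketched with an in-memory store. The types `Fact` and `FactStore` and the prompt layout are assumptions for illustration; a production version would persist facts in SQLite and select only the facts relevant to the current request.

```swift
import Foundation

struct Fact {
    let key: String
    var value: String
    var confirmedAt: Date
}

// Long-term memory as key-value pairs with a timestamp (an in-memory
// dictionary here; the same shape maps onto a SQLite table).
struct FactStore {
    private(set) var facts: [String: Fact] = [:]

    // Insert a new fact or refresh an existing one.
    mutating func upsert(key: String, value: String, now: Date = Date()) {
        facts[key] = Fact(key: key, value: value, confirmedAt: now)
    }

    // Drop facts that have not been confirmed within `maxAge` seconds.
    mutating func prune(maxAge: TimeInterval, now: Date = Date()) {
        facts = facts.filter { now.timeIntervalSince($0.value.confirmedAt) <= maxAge }
    }

    // Render facts as text for the prompt, sorted by key for stable output.
    func promptBlock() -> String {
        facts.values.sorted { $0.key < $1.key }
            .map { "\($0.key): \($0.value)" }
            .joined(separator: "\n")
    }
}

// Prompt = known facts + the last `recentCount` messages.
func buildPrompt(store: FactStore, messages: [String], recentCount: Int = 15) -> String {
    let recent = messages.suffix(recentCount).joined(separator: "\n")
    return "Known facts:\n" + store.promptBlock() + "\n\nRecent messages:\n" + recent
}
```

Passing `now` explicitly makes the one-week cleanup rule easy to test without waiting for real time to pass.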

Implementation details

Long-term memory can be stored as a knowledge graph or as simple key-value pairs. To extract facts, use the models' structured output, e.g., JSON. This simplifies parsing and updating. It is important to verify the correctness of extracted facts and to avoid feedback loops, for example re-extracting as "new" the facts that were themselves injected into the prompt.
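Parsing a JSON extraction result could look like the sketch below. The schema (a `facts` array of `key`/`value` objects) is an assumed format that the extraction prompt would have to request; it is not a standard of any particular model.

```swift
import Foundation

// Assumed shape of the extraction model's output, e.g.
// {"facts":[{"key":"allergy","value":"penicillin"}]}
struct ExtractedFact: Codable, Equatable {
    let key: String
    let value: String
}

struct ExtractionResult: Codable {
    let facts: [ExtractedFact]
}

// Decodes the model's JSON and drops entries with empty keys or values.
// Returns an empty array when the output is not valid JSON in the expected
// shape, so a bad extraction never overwrites stored facts.
func parseFacts(fromModelOutput output: String) -> [ExtractedFact] {
    guard let data = output.data(using: .utf8),
          let result = try? JSONDecoder().decode(ExtractionResult.self, from: data) else {
        return []
    }
    return result.facts.filter {
        !$0.key.trimmingCharacters(in: .whitespaces).isEmpty &&
        !$0.value.trimmingCharacters(in: .whitespaces).isEmpty
    }
}
```

Treating malformed output as "no new facts" is a conservative default; logging such cases helps tune the extraction prompt.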

Choosing a Strategy for Your Mobile App

If the app requires remembering key facts (allergies, preferences), hybrid memory is the only working option. Summarization without explicit fact retention can lead to legal risks in medical or financial assistants. We recommend conducting a requirements audit before choosing. Contact us for a consultation — we will help you decide.

Technical Aspects: Storage, Token Counting, and UI

Storing history on mobile

SQLite is the standard. Structure:

CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    created_at INTEGER,
    title TEXT,
    model TEXT,
    summary TEXT
);
CREATE TABLE messages (
    id TEXT PRIMARY KEY,
    conversation_id TEXT REFERENCES conversations(id),
    role TEXT CHECK(role IN ('user', 'assistant', 'system')),
    content TEXT,
    token_count INTEGER,
    created_at INTEGER
);
CREATE INDEX idx_messages_conversation ON messages(conversation_id, created_at);

token_count is calculated at save — not on every load.

Token counting on mobile

Accurate counting requires a tokenizer for the specific model (tokenization is the process of splitting text into tokens). On the server, use tiktoken for OpenAI models or the tokenizers library from HuggingFace. On mobile, use heuristics: English ~4 characters = 1 token, Russian ~2–2.5 characters = 1 token, code ~3 characters = 1 token. For counting that affects billing or hard limits, validate on the server.
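A heuristic estimator based on these ratios might look like this. It is a rough sketch: it distinguishes only Cyrillic from everything else, uses 2.25 as the midpoint of the 2–2.5 range, and does not apply the separate ratio for code.

```swift
// Heuristic token estimate: about 4 characters per token for Latin text and
// about 2.25 (midpoint of 2–2.5) for Cyrillic. Suitable for UI hints and
// pre-checks, not for billing.
func estimateTokens(_ text: String) -> Int {
    var cyrillic = 0
    var other = 0
    for scalar in text.unicodeScalars {
        // U+0400...U+04FF is the basic Cyrillic block.
        if (0x0400...0x04FF).contains(scalar.value) {
            cyrillic += 1
        } else {
            other += 1
        }
    }
    let estimate = Double(cyrillic) / 2.25 + Double(other) / 4.0
    return Int(estimate.rounded(.up))
}
```

Rounding up errs on the side of overestimating, which is the safer direction when the result decides whether a message still fits into the window.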

UI: displaying history

Message list — UITableView with reverse order (new at bottom) or LazyColumn in Compose with reverseLayout = true. When streaming, the last message updates in place without scroll jumping. Context window indication (visual bar or token counter) reduces complaints about assistant forgetfulness.

Commercial Deliverables

What is included in the work

Documentation on history management architecture
Source code for the context window module with unit tests
Integration with the selected AI model (GPT, Claude, YandexGPT)
Configuration of token counting and expense monitoring dashboard
Training your team on the system (2 hours online session)

Company metrics

10+ years in mobile development
50+ AI projects delivered
5 years on the market
Proven cost reduction for clients (average 25% API savings)

Timeline estimates

Stage	Duration
Sliding window with SQLite	3–4 days
Hybrid memory with summarization	1.5–2.5 weeks
Full cycle (analysis → deploy)	2–4 weeks

The cost is calculated individually. Order an audit of your project — we will estimate the scope of work.


Machine Learning in Mobile Apps: CoreML, TFLite, and On-Device Models

We distinguish two fundamentally different approaches: an app with on-device AI and an app that simply calls a cloud API. The former works without internet, does not send user data to third-party servers, and can respond within 50 milliseconds. The latter depends on network latency and pricing plans. Choosing the architecture is a key step that directly affects the cost, privacy, and user experience of machine learning in a mobile app. Our experience shows that in 70% of projects, on-device inference is cheaper in the long run because it eliminates server costs.

How to Choose Between CoreML and TFLite for On-Device Inference?

CoreML is Apple's native framework for running ML models on device. It supports the Neural Engine (available to Core ML on A12 Bionic and later; the Neural Engine in A11 Bionic is not exposed to third-party models), the GPU, and the CPU as a fallback. Models are converted to the .mlmodel format (or the newer .mlpackage) via coremltools from PyTorch, ONNX, or TensorFlow. Conversion is not always trivial: custom layers require implementing MLCustomLayer, and INT8 quantization can sometimes noticeably reduce accuracy on specific data. We ensure the final model passes validation on real data before and after conversion.

TensorFlow Lite — cross-platform alternative for Android and Flutter. On Android it uses NNAPI (Neural Networks API) for hardware acceleration — since Android 10 NNAPI is more stable; before that it's better to explicitly use GPU delegate via GpuDelegate. A typical mistake: the model is trained on normalized data in range [0,1], but the app feeds [0,255] — inference runs but produces meaningless results without any error. We include an automatic input data validation module in the SDK.
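A guard against the normalization mistake described above can be very small. The sketch below is not tied to any TFLite API; the function names are illustrative, and the expected range is assumed to be [0, 1] as in the example.

```swift
// Converts 8-bit pixel values in [0, 255] to floats in [0, 1],
// matching a model trained on normalized data.
func normalize(pixels: [UInt8]) -> [Float] {
    pixels.map { Float($0) / 255.0 }
}

// Returns false if any value falls outside the range the model expects,
// which catches raw [0, 255] data before it reaches the interpreter.
func isValidInput(_ input: [Float], expectedRange: ClosedRange<Float> = 0.0...1.0) -> Bool {
    input.allSatisfy { expectedRange.contains($0) }
}
```

Running the check in debug builds and in automated tests is usually enough; it turns a silent quality problem into an explicit failure.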

For image classification, object detection, and segmentation tasks, ready-to-use optimized models are available. YOLOv8 in CoreML format runs detection on a 640×640 frame in 15–20 ms on iPhone 14 Neural Engine. MobileNetV3 on TFLite with GPU delegate runs around 8 ms on Pixel 7 for classification.

Parameter	CoreML	TFLite
Platforms	iOS, macOS, watchOS	Android, iOS, Linux, embedded
Hardware acceleration	Neural Engine, GPU, CPU	NNAPI, GPU (OpenCL/OpenGL), CPU
Quantization support	FP16, INT8 (with coremltools)	FP16, INT8, dynamic range
Custom operations	Via MLCustomLayer (Swift)	Via custom operators (C/C++)
Model bundle size	~3–5 MB (MobileNetV2 quantized)	~2–4 MB

What If You Need Text Generation On-Device?

Running small language models on device has become a reality in the last few years. Apple Intelligence combines its own on-device models with Private Cloud Compute, but for third-party developers other paths are available.

llama.cpp with the Metal backend on iOS is a working approach for phi-3-mini (3.8B parameters, 4-bit quantization, ~2.3 GB). Inference: 15–25 tokens/second on iPhone 15 Pro. For integration in Swift, use the Swift Package llama.swift or a wrapper over the C interface llama.h. The model file is not bundled with the app: it is downloaded on first launch and stored in Application Support. Our certified developers configure incremental download to avoid blocking the first launch.
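The "download once, then reuse" logic can be sketched with Foundation alone. The helper names and the `Models` subfolder are assumptions; the download itself (and resuming it incrementally) is left out.

```swift
import Foundation

// Location of the model file inside a base directory. In an app the base would
// typically be the Application Support directory; it is a parameter here so
// the logic can be exercised against any folder.
func modelURL(named fileName: String, in baseDirectory: URL) -> URL {
    baseDirectory.appendingPathComponent("Models", isDirectory: true)
        .appendingPathComponent(fileName)
}

// True when the model file is not on disk yet and has to be fetched.
func needsDownload(_ url: URL, fileManager: FileManager = .default) -> Bool {
    !fileManager.fileExists(atPath: url.path)
}

// Moves a finished download into place, creating the folder if needed and
// replacing any stale copy.
func install(downloadedFile: URL, at destination: URL,
             fileManager: FileManager = .default) throws {
    try fileManager.createDirectory(at: destination.deletingLastPathComponent(),
                                    withIntermediateDirectories: true)
    if fileManager.fileExists(atPath: destination.path) {
        try fileManager.removeItem(at: destination)
    }
    try fileManager.moveItem(at: downloadedFile, to: destination)
}
```

In an iOS app the base directory can be obtained with `FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask)`. Moving the file into place only after the download completes means a half-downloaded model is never mistaken for a usable one.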

On Android, the counterpart is the MediaPipe LLM Inference API (part of Google AI Edge), which supports Gemma-2B. It works via the GPU delegate; on the Tensor G3 chip of the Pixel 8 Pro it reaches about 20 tokens/second.

Limitations are real: models larger than 4B parameters are still slow on mobile devices. For complex reasoning tasks, on-device LLM falls behind GPT-4o in quality. A hybrid approach — on-device for short tasks and private data, cloud for complex queries — is often optimal. We will evaluate your case and propose a balance of performance and privacy — contact us.
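A hybrid setup needs a routing rule. The sketch below shows one simple possibility; the `Query` type, the privacy flag, and the 200-character cutoff are placeholders rather than recommended values.

```swift
enum InferenceTarget {
    case onDevice
    case cloud
}

struct Query {
    let text: String
    let containsPrivateData: Bool
}

// Private data never leaves the device; otherwise short prompts stay local and
// longer ones go to the cloud. The length cutoff is an arbitrary placeholder
// to be tuned per product.
func route(_ query: Query, maxOnDeviceLength: Int = 200) -> InferenceTarget {
    if query.containsPrivateData { return .onDevice }
    return query.text.count <= maxOnDeviceLength ? .onDevice : .cloud
}
```

Real routing usually considers more signals (task type, connectivity, battery), but keeping the decision in one function makes it easy to evolve.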

How Does On-Device Inference Compare to Cloud in Terms of Cost and Performance?

On-device inference is typically far cheaper per request than cloud APIs for image recognition tasks because there is no per-request API fee, and it also eliminates latency variability and privacy risks. The table below summarizes the trade-offs.

Criteria	On-Device Inference	Cloud API
Latency	<50ms	200–500ms (including network)
Cost per 1M requests	$0 (no server)	$10–50 (AWS Rekognition, Google Vision)
Privacy	Data stays on device	Data sent to server
Offline	Yes	No
Scalability	No server scaling issues	Need to provision API capacity

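For an app with 100k MAU running 10 image recognitions per user per month (1M requests per month), the monthly saving equals what the cloud API would charge for those requests. A saving of up to $5,000 per month corresponds to an effective cloud price of about $5 per 1,000 requests, which is well above the $10–50 per 1M requests listed in the table, so check your provider's actual pricing before estimating. Get a free consultation on your ML architecture today.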

Integrating OpenAI API and Other Cloud Models

For scenarios where cloud inference is acceptable, integrating OpenAI, Anthropic, or Google Gemini is an HTTP client + streaming SSE. In Swift, AsyncThrowingStream is convenient for streaming responses. In Kotlin, use Flow.
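As a rough illustration of the streaming side, the sketch below parses Server-Sent Events lines and exposes them as an AsyncThrowingStream. It assumes the OpenAI-style convention of `data: <payload>` lines terminated by `data: [DONE]`; the function names are hypothetical, and the network layer that produces the lines is omitted.

```swift
import Foundation

// Extracts payloads from SSE lines of the form "data: <payload>".
// Stops at the "[DONE]" sentinel that OpenAI-style streams send last.
func ssePayloads(from lines: [String]) -> [String] {
    var payloads: [String] = []
    for line in lines {
        // Skip blank lines, comments (": ...") and other SSE fields.
        guard line.hasPrefix("data:") else { continue }
        let payload = line.dropFirst("data:".count).trimmingCharacters(in: .whitespaces)
        if payload == "[DONE]" { break }
        payloads.append(payload)
    }
    return payloads
}

// Wraps the parsed payloads in an AsyncThrowingStream so that callers can
// consume them with `for try await`.
func payloadStream(from lines: [String]) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        for payload in ssePayloads(from: lines) {
            continuation.yield(payload)
        }
        continuation.finish()
    }
}
```

In a real client the lines would arrive incrementally from the HTTP response, and each payload would be decoded as JSON before the text delta is appended to the last message in the UI.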

Critically: API keys must never be stored in the app bundle. Even an obfuscated key can be extracted from the IPA in 10 minutes using strings or frida. Correct architecture: mobile app → your own backend → OpenAI API. The backend controls rate limiting, logs requests, and protects the key.
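From the app's side, this architecture means requests go to your own backend and carry a user session token, not the provider key. The endpoint path `v1/chat`, the JSON field `message`, and the function name below are hypothetical.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // URLRequest lives here on Linux
#endif

// Builds a request to your own backend, which holds the provider API key.
// The path and JSON fields are hypothetical; the app authenticates with a
// per-user session token and never sees the provider key.
func makeChatRequest(backendBaseURL: URL,
                     sessionToken: String,
                     userMessage: String) throws -> URLRequest {
    var request = URLRequest(url: backendBaseURL.appendingPathComponent("v1/chat"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue("Bearer \(sessionToken)", forHTTPHeaderField: "Authorization")
    request.httpBody = try JSONSerialization.data(withJSONObject: ["message": userMessage])
    return request
}
```

Because the session token is issued per user, the backend can revoke it, rate-limit it, and attribute usage, none of which is possible with a shared key embedded in the app.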

What Is Included in the Work (Deliverables)

Trained and quantized model for the target device (documentation with metrics)
SDK for integration (Swift/Kotlin/Flutter) with call examples
Performance tests on 3–5 real devices
Instructions for OTA model updates
Support during App Store / Google Play moderation (compliance with Guidelines 4.2, 5.1)
2 weeks of technical support after release

Typical Project Pipeline

Task analysis: define requirements for latency, privacy, model size, and supported devices.
Model prototyping: in Python, evaluate accuracy on target data.
Conversion and quantization: for CoreML/TFLite, with validation.
Integration into the app: the model is wrapped in a service layer (easy to swap CoreML ↔ TFLite ↔ cloud).
Testing: on real devices, measure FPS, RAM, and battery usage.
Deployment: via TestFlight / Firebase App Distribution, with metric monitoring.
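The service layer mentioned in the integration step can be as simple as a protocol with interchangeable backends. The types below are stand-ins: real implementations would wrap Core ML, TFLite, or a cloud endpoint behind the same interface.

```swift
// Common interface so feature code does not depend on a specific runtime.
protocol ImageClassifier {
    func classify(pixels: [Float]) -> String
}

// Placeholder backends; each returns a tagged string instead of running a model.
struct OnDeviceClassifier: ImageClassifier {
    func classify(pixels: [Float]) -> String { "on-device:\(pixels.count)" }
}

struct CloudClassifier: ImageClassifier {
    func classify(pixels: [Float]) -> String { "cloud:\(pixels.count)" }
}

// Feature code depends only on the protocol, so the backend can be swapped in
// one place, for example via configuration or a feature flag.
struct ClassificationService {
    var backend: ImageClassifier

    func label(for pixels: [Float]) -> String {
        backend.classify(pixels: pixels)
    }
}
```

The same seam is also where a fake backend is injected in unit tests, which keeps UI tests independent of model files.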

Timelines: integration of a ready CoreML/TFLite model — 1–2 weeks, development of a custom model with mobile optimization — from 6 weeks, on-device LLM chat with personalization — 4–8 weeks.

Why We Take on Complex Cases

10+ years of experience in mobile development, 50+ implemented AI/ML solutions, guarantee of compatibility with current iOS and Android versions. All projects undergo code review and load testing. The cost includes preparation of moderation documentation and training of your team.

Contact us — we will help you choose the architecture and implement ML in your app turnkey. Order an audit of your existing solution — we will assess the potential for server cost savings free of charge. In some projects, savings can reach significant amounts per month.