Implementing Context Window and Dialog History Management in a Mobile Application
Context is what turns scattered questions into a coherent conversation. Sending the entire conversation history with every request is the most naive solution. It works until the first context overflow or the first complaint about the API bill. History management is a separate engineering task that needs to be designed early.
How Context Grows and Why It's a Problem
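Each exchange adds tokens: the user request plus the model response. With an average message of 50–100 tokens, 20 request-response pairs already amount to 2,000–4,000 tokens of history alone, on top of the system prompt. For a single user at GPT-4o's $5 per 1M input tokens, that is trivial. With 1,000 active users sending 50 messages a day, it comes to about $250/day (which implies roughly 1,000 history tokens per request on average), spent on history that could be more compact.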
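The arithmetic behind the daily figure can be checked with a small helper. This is a minimal sketch: the function name and parameters are illustrative, the average of 1,000 history tokens per request is the value implied by the $250 figure rather than a measurement, and the price is the one quoted above.

```swift
import Foundation

/// Estimated daily spend on input tokens that are pure history.
/// All inputs are illustrative assumptions, not measured values.
func dailyHistoryCost(users: Int,
                      requestsPerUserPerDay: Int,
                      avgHistoryTokensPerRequest: Int,
                      pricePerMillionInputTokens: Double) -> Double {
    let tokensPerDay = Double(users * requestsPerUserPerDay * avgHistoryTokensPerRequest)
    return tokensPerDay / 1_000_000 * pricePerMillionInputTokens
}

let cost = dailyHistoryCost(users: 1000,
                            requestsPerUserPerDay: 50,
                            avgHistoryTokensPerRequest: 1000,
                            pricePerMillionInputTokens: 5.0)
print(cost) // 250.0
```

Halving the average history size halves the bill, which is the whole argument for keeping history compact.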
The second problem is that different models have different limits: GPT-4o has 128K tokens, Claude 200K, YandexGPT 8K. An app that works fine with GPT-4o can break when switched to another model.
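One way to reduce that risk is to derive the history budget from the active model's limit instead of hard-coding it. The sketch below assumes the limits quoted above; the dictionary keys, the function name, and the idea of reserving room for the reply are illustrative choices, not part of any provider's API.

```swift
import Foundation

// Context limits as quoted in the text. The keys are illustrative labels,
// not guaranteed API model identifiers.
let contextLimits: [String: Int] = [
    "gpt-4o": 128_000,
    "claude": 200_000,
    "yandexgpt": 8_000,
]

/// Tokens left for history after reserving room for the system prompt and the reply.
/// Returns nil for an unknown model so the caller has to pick a fallback explicitly.
func historyBudget(model: String, systemPromptTokens: Int, reservedForReply: Int) -> Int? {
    guard let limit = contextLimits[model] else { return nil }
    return max(0, limit - systemPromptTokens - reservedForReply)
}
```

With an 8K model, a 500-token system prompt and 1,000 tokens reserved for the reply leave 6,500 tokens for history, so the same trimming code keeps working after a model switch.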
Three History Management Strategies
1. Sliding Window
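The simplest approach: keep the last N messages (or, as in the code below, as many recent messages as fit a token budget) and discard earlier ones. It is fast and predictable. The downside is that the model "forgets" the start of the conversation: the user's name, agreements made in early messages.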
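// Assumes a Message type with a `content` string and a countTokens(_:) helper defined elsewhere.
func buildMessages(history: [Message], systemPrompt: String, maxTokens: Int = 3000) -> [Message] {
    var result: [Message] = []
    // The system prompt counts toward the budget; only the history part is returned.
    var tokenCount = countTokens(systemPrompt)
    // Walk from the newest message to the oldest
    for message in history.reversed() {
        let msgTokens = countTokens(message.content)
        // Stop at the first message that no longer fits
        if tokenCount + msgTokens > maxTokens { break }
        result.append(message)
        tokenCount += msgTokens
    }
    // Restore chronological order
    return result.reversed()
}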
2. Summarization
When the history exceeds a threshold, send the accumulated messages to a cheaper model (gpt-4o-mini, claude-haiku, mistral-small) for summarization. Save the resulting summary as a system message or a dedicated assistant block, and remove the summarized messages from the active history.
The problem: specific facts get lost ("the user said they are allergic to penicillin"). For medical, legal, and financial assistants, summarization without explicit fact preservation is risky.
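The compaction step itself can be sketched as follows. This is an illustration under stated assumptions: ChatMessage, compact, and the keepRecent parameter are hypothetical names, and the summarize closure stands in for the call to a cheaper model, which in a real app would be an asynchronous network request.

```swift
import Foundation

struct ChatMessage: Equatable {
    let role: String      // "user", "assistant" or "system"
    let content: String
    let tokenCount: Int
}

/// Folds older messages into a single summary message once the history
/// exceeds `threshold` tokens, keeping the most recent `keepRecent` messages intact.
/// `summarize` and `countTokens` are injected so the sketch does not depend on any API.
func compact(history: [ChatMessage],
             threshold: Int,
             keepRecent: Int,
             summarize: ([ChatMessage]) -> String,
             countTokens: (String) -> Int) -> [ChatMessage] {
    let keep = max(0, keepRecent)
    let total = history.reduce(0) { $0 + $1.tokenCount }
    // Nothing to do while under the threshold or when there is nothing old to fold
    guard total > threshold, history.count > keep else { return history }

    let splitIndex = history.count - keep
    let old = Array(history[..<splitIndex])
    let recent = Array(history[splitIndex...])

    let content = "Summary of earlier conversation: " + summarize(old)
    let summary = ChatMessage(role: "system",
                              content: content,
                              tokenCount: countTokens(content))
    return [summary] + recent
}
```

To address the fact-loss risk described above, the summarization prompt behind summarize can be told to copy critical facts verbatim, or those facts can be stored separately and never summarized.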
3. Hybrid with Memory
Most reliable for long-term assistants:
- Short-term memory — last 10–15 messages, always in context
- Long-term memory — structured facts about user and conversation, stored separately
- Semantic search — fetch relevant facts from long-term memory via embeddings on each request
Long-term memory is updated through an additional call: after each model response, ask the model to extract facts worth remembering ("What new facts about the user can be extracted from this dialog?").
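The semantic search step can be sketched with plain cosine similarity over stored embeddings. This is a minimal illustration: MemoryFact and relevantFacts are hypothetical names, and the embeddings are assumed to come from an embedding model called elsewhere. A linear scan like this is reasonable for the small number of facts a single user accumulates.

```swift
import Foundation

struct MemoryFact {
    let text: String
    let embedding: [Double]   // produced elsewhere by an embedding model
}

/// Cosine similarity between two vectors; returns 0 for mismatched,
/// empty, or all-zero input instead of dividing by zero.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot = 0.0, normA = 0.0, normB = 0.0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

/// Returns up to `limit` facts, most similar to the query embedding first.
func relevantFacts(for queryEmbedding: [Double],
                   in facts: [MemoryFact],
                   limit: Int) -> [MemoryFact] {
    let ranked = facts
        .map { (fact: $0, score: cosineSimilarity(queryEmbedding, $0.embedding)) }
        .sorted { $0.score > $1.score }
    return ranked.prefix(max(0, limit)).map { $0.fact }
}
```

The selected facts are then added to the request alongside the short-term window, for example as part of the system prompt.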
Storing History on Mobile
SQLite is the standard choice. A schema along these lines:
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    created_at INTEGER,
    title TEXT,
    model TEXT,
    summary TEXT -- summarization of old messages
);
CREATE TABLE messages (
    id TEXT PRIMARY KEY,
    conversation_id TEXT REFERENCES conversations(id),
    role TEXT CHECK(role IN ('user', 'assistant', 'system')),
    content TEXT,
    token_count INTEGER,
    created_at INTEGER
);
CREATE INDEX idx_messages_conversation ON messages(conversation_id, created_at);
token_count is calculated once, when the message is saved, not on every load. This matters for performance with long histories.
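On the Swift side, the same idea can be expressed as a row type that runs the tokenizer exactly once, at creation. This is a sketch only: MessageRow and totalTokens are hypothetical names, the tokenizer is injected as a closure, and the actual SQLite write is left out.

```swift
import Foundation

/// Mirrors a row of the `messages` table; `tokenCount` is filled in once, at save time.
struct MessageRow {
    let id: String
    let conversationID: String
    let role: String
    let content: String
    let tokenCount: Int
    let createdAt: Int   // Unix timestamp in seconds

    init(conversationID: String, role: String, content: String,
         tokenizer: (String) -> Int, now: Date = Date()) {
        self.id = UUID().uuidString
        self.conversationID = conversationID
        self.role = role
        self.content = content
        self.tokenCount = tokenizer(content)   // computed once, then persisted
        self.createdAt = Int(now.timeIntervalSince1970)
    }
}

/// Total size of a loaded history, using only the stored counts (no re-tokenizing).
func totalTokens(_ rows: [MessageRow]) -> Int {
    rows.reduce(0) { $0 + $1.tokenCount }
}
```

Deciding whether a conversation needs trimming or summarization then costs one pass over integers, regardless of how long the message bodies are.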
Token Counting on Mobile
Accurate counting requires the tokenizer for the specific model. On a server, that means tiktoken for OpenAI models and Hugging Face tokenizers for others. On mobile, heuristics are the usual substitute:
- English text: ~4 characters ≈ 1 token
- Russian text: ~2–2.5 characters ≈ 1 token (Cyrillic encodes to more tokens)
- Code: ~3 characters ≈ 1 token
Where the count has consequences (billing, hard limits), validate it on the server.
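The ratios above translate into a small estimator. This is a rough sketch, not a tokenizer: the function name is hypothetical, 2.25 characters per token for Cyrillic is simply the midpoint of the 2–2.5 range, and mixed text is handled by counting Cyrillic and non-Cyrillic characters separately.

```swift
import Foundation

/// Rough token estimate from character counts, using the ratios listed above.
/// An approximation only; real tokenizers can differ noticeably.
func estimateTokens(_ text: String, isCode: Bool = false) -> Int {
    if text.isEmpty { return 0 }
    if isCode {
        // Code: about 3 characters per token
        return Int((Double(text.count) / 3.0).rounded(.up))
    }
    var cyrillic = 0
    var other = 0
    for scalar in text.unicodeScalars {
        // Basic Cyrillic block, U+0400...U+04FF
        if (0x0400...0x04FF).contains(scalar.value) {
            cyrillic += 1
        } else {
            other += 1
        }
    }
    // Cyrillic: about 2.25 characters per token; everything else: about 4
    let estimate = Double(cyrillic) / 2.25 + Double(other) / 4.0
    return max(1, Int(estimate.rounded(.up)))
}
```

Rounding up keeps the estimate on the conservative side, which is the safer direction when the result is compared against a context limit.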
UI: Displaying History
The message list is a UITableView in reverse order (newest at the bottom) or, in Compose, a LazyColumn with reverseLayout = true. During streaming, the last message should update in place without a scroll jump.
Context window indicator: show the user how much "memory" is in use, as a visual bar or a token counter. It is not essential, but apps that add it get fewer complaints about the assistant's "forgetfulness."
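The value behind such an indicator is a simple ratio. A minimal sketch, with hypothetical function names; the budget is whatever history limit the app has chosen for the active model.

```swift
import Foundation

/// Fraction of the context budget in use, clamped to 0...1 for a progress bar.
/// A non-positive budget is reported as full.
func contextUsage(usedTokens: Int, budget: Int) -> Double {
    guard budget > 0 else { return 1.0 }
    return min(1.0, max(0.0, Double(usedTokens) / Double(budget)))
}

/// Short label for the indicator, e.g. "1500 / 3000 tokens (50%)".
func contextUsageLabel(usedTokens: Int, budget: Int) -> String {
    let percent = Int((contextUsage(usedTokens: usedTokens, budget: budget) * 100).rounded())
    return "\(usedTokens) / \(budget) tokens (\(percent)%)"
}
```

The clamped fraction can be passed to a SwiftUI ProgressView(value:) or assigned to UIProgressView's progress property after conversion to Float.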
Timeline Estimates
A sliding window with SQLite storage takes 3–4 days. A hybrid system with summarization and long-term memory takes 1.5–2.5 weeks.







