Implementation of Text Autocomplete System
Text autocomplete suggests the next words or phrases to the user during input. Applications range from search suggestions to full AI assistants in text editors.
Types of Autocomplete
Next word/token (predictive typing): predicting one or two next words. Used in mobile keyboards and search. Models: small n-gram or RNN, latency < 20ms is critical.
Phrase completion: given the beginning of a sentence, suggest several completion options. Example: Google search suggestions.
Paragraph completion (full AI assist): GitHub Copilot-style — completing a paragraph or text block. Requires a more powerful model.
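The "small n-gram" model mentioned for predictive typing can be sketched in a few lines. The class below is a minimal illustration (the name `BigramAutocomplete` and the training corpus are hypothetical, not a standard library): it counts word pairs in a corpus and suggests the most frequent followers of the last typed word.

```python
from collections import Counter, defaultdict

class BigramAutocomplete:
    """Minimal next-word predictor: count word pairs in a corpus
    and suggest the most frequent followers of the last typed word."""

    def __init__(self):
        # maps a word -> Counter of words observed immediately after it
        self.followers = defaultdict(Counter)

    def train(self, text: str) -> None:
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.followers[prev][nxt] += 1

    def suggest(self, prefix: str, k: int = 3) -> list[str]:
        words = prefix.lower().split()
        if not words:
            return []
        last = words[-1]
        return [w for w, _ in self.followers[last].most_common(k)]
```

A lookup like this is a dictionary access plus a small sort, so it easily meets the < 20 ms latency budget; the trade-off is that suggestions only reflect the training corpus, with no understanding of wider context.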
Implementation with LLM
from openai import OpenAI

client = OpenAI()

def autocomplete(text_prefix: str, context: str = "") -> list[str]:
    """Return several continuation options for the given text prefix."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"You help write texts. Context: {context}"},
            {"role": "user", "content": f"Continue the text with three different options:\n{text_prefix}"},
        ],
        max_tokens=50,    # short completions keep latency low
        n=3,              # generate multiple candidate continuations
        temperature=0.7,  # some variety between the candidates
    )
    return [choice.message.content for choice in response.choices]
Latency Optimization for Real-time
For live input, latency must be < 200ms. Strategies:
Streaming: return tokens as they're generated via SSE (Server-Sent Events). The first token appears within 100–200ms, which gives a feeling of fast response.
Speculative decoding: small model generates draft, large model validates — 2–3x faster at same quality.
Caching: if the user hasn't changed the last N characters, return the cached suggestion instead of calling the model again.
Debouncing: trigger completion only after 300–500ms pause in input.
Contextual Adaptation
Autocomplete quality improves dramatically with document context. Pass into the prompt: the document topic, the style (technical/business/conversational), and the previous paragraphs. For specialized editors (legal, medical), use a system prompt with a domain dictionary.