Implementation of Text Autocomplete System
Text autocomplete suggests the next words or phrases to the user during input. Applications range from search suggestions to full AI assistants in text editors.
Types of Autocomplete
Next word/token (predictive typing): predicting one or two next words. Used in mobile keyboards and search. Models: small n-gram or RNN, latency < 20ms is critical.
Phrase completion: given the beginning of a sentence, suggest several completion options. Example: Google search suggestions.
Paragraph completion (full AI assist): GitHub Copilot-style — completing a paragraph or text block. Requires a more powerful model.
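The "small n-gram" model mentioned for predictive typing can be sketched in a few lines. The class below is a minimal illustration (the name `BigramAutocomplete` and the training corpus are hypothetical, not a standard library): it counts word pairs in a corpus and suggests the most frequent followers of the last typed word.

```python
from collections import Counter, defaultdict

class BigramAutocomplete:
    """Minimal next-word predictor: count word pairs in a corpus
    and suggest the most frequent followers of the last typed word."""

    def __init__(self):
        # maps a word -> Counter of words observed immediately after it
        self.followers = defaultdict(Counter)

    def train(self, text: str) -> None:
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.followers[prev][nxt] += 1

    def suggest(self, prefix: str, k: int = 3) -> list[str]:
        words = prefix.lower().split()
        if not words:
            return []
        last = words[-1]
        return [w for w, _ in self.followers[last].most_common(k)]
```

A lookup like this is a dictionary access plus a small sort, so it easily meets the < 20 ms latency budget; the trade-off is that suggestions only reflect the training corpus, with no understanding of wider context.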
Implementation with LLM
from openai import OpenAI

client = OpenAI()

def autocomplete(text_prefix: str, context: str = "") -> list[str]:
    """Return several continuation options for the given text prefix."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"You help write texts. Context: {context}"},
            {"role": "user", "content": f"Continue the text with three different options:\n{text_prefix}"},
        ],
        max_tokens=50,    # short completions keep latency low
        n=3,              # generate multiple candidate continuations
        temperature=0.7,  # some variety between the candidates
    )
    return [choice.message.content for choice in response.choices]
Latency Optimization for Real-time
For live input, latency must be < 200ms. Strategies:
Streaming: return tokens as they're generated via SSE (Server-Sent Events). The first token appears within 100–200ms, which gives a feeling of fast response.
Speculative decoding: small model generates draft, large model validates — 2–3x faster at same quality.
Caching: if the user hasn't changed the last N characters, return the cached suggestion instead of calling the model again.
Debouncing: trigger completion only after 300–500ms pause in input.
Contextual Adaptation
Autocomplete quality improves dramatically with document context. Pass into the prompt: the document topic, the style (technical/business/conversational), and the previous paragraphs. For specialized editors (legal, medical), use a system prompt with a domain dictionary.