Language Detection Implementation
Language detection is a basic NLP task solved in milliseconds. It is typically the first step in a multilingual pipeline: before applying a language-specific model, you must know the text's language.
Tools and Selection
fastText lid.176.bin is the industry standard: a Facebook model that recognizes 176 languages:
import fasttext

model = fasttext.load_model("lid.176.bin")
predictions = model.predict("Hello, how are you?", k=3)  # top-3 candidates
# (('__label__en', '__label__cy', '__label__is'), array([0.99, 0.003, 0.002]))
Latency: < 1 ms. Model size: 126 MB (bin) or 917 KB (compressed ftz). Accuracy: 97%+ for texts > 20 words.
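fastText returns labels with a `__label__` prefix; a small helper (the name `parse_predictions` is a hypothetical, not part of the fastText API) can turn the raw output into clean (language, probability) pairs:

```python
def parse_predictions(labels, probs):
    """Convert fastText ('__label__en', ...) output into [(lang, prob), ...]."""
    return [
        (label.removeprefix("__label__"), round(float(prob), 3))
        for label, prob in zip(labels, probs)
    ]

# shaped like the return value of model.predict(text, k=3)
labels = ("__label__en", "__label__cy", "__label__is")
probs = [0.99, 0.003, 0.002]
print(parse_predictions(labels, probs))
# [('en', 0.99), ('cy', 0.003), ('is', 0.002)]
```

`removeprefix` requires Python 3.9+; on older versions, slice off the first nine characters instead.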
langdetect (Python): a port of Google's language-detection library; 55 languages. Drawback: non-deterministic (repeated runs can give different results unless you fix `DetectorFactory.seed`).
langid.py: 97 languages, deterministic, but weaker than fastText on short texts.
lingua-py: best accuracy for short texts (1–10 words), 75 languages.
Complex Cases
- Mixed text (code-switching): e.g. a Russian sentence containing English insertions like "zoom call" or "5pm" is technically Russian, but the English tokens confuse detectors. Strategy: identify the dominant language; for short texts, don't attempt per-segment language splitting
- Short texts (< 5 words): accuracy drops for all models. For critical cases use lingua-py or an ensemble
- Closely related languages: Russian/Bulgarian/Serbian, Spanish/Portuguese are the main source of errors
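The ensemble idea for short texts can be sketched as a simple majority vote. The detector functions below are stubs standing in for real fastText/langid/lingua calls; `ensemble_detect` is a hypothetical helper, not a library function:

```python
from collections import Counter

def ensemble_detect(text, detectors):
    """Majority vote across detectors; ties go to the earliest-voting detector."""
    votes = [detect(text) for detect in detectors]
    counts = Counter(votes)
    top = counts.most_common(1)[0][1]
    # first vote that reached the winning count wins ties
    for vote in votes:
        if counts[vote] == top:
            return vote

# stubs standing in for real model calls
detectors = [
    lambda t: "en",   # e.g. fastText
    lambda t: "de",   # e.g. langid.py
    lambda t: "en",   # e.g. lingua-py
]
print(ensemble_detect("short text", detectors))  # → "en"
```

Ordering the list by per-model accuracy on short texts makes tie-breaking favor the most reliable detector.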
Multilingual Pipeline Application
def process_multilingual(text: str):
    lang = detect_language(text)  # e.g. "ru", "en", "de"
    router = {
        "ru": russian_pipeline,
        "en": english_pipeline,
        "de": german_pipeline,
    }
    pipeline = router.get(lang, default_pipeline)  # fallback for unknown languages
    return pipeline.run(text)
For production: cache detection results keyed by a hash of the text, so repeated requests skip the model call entirely.