Language Detection Implementation
Language detection is a basic NLP task solved in milliseconds. It is typically the first step in a multilingual pipeline: before applying a language-specific model, you must know the text's language.
Tools and Selection
fastText lid.176.bin is the industry standard: a Facebook model that recognizes 176 languages:
import fasttext

model = fasttext.load_model("lid.176.bin")
predictions = model.predict("Hello, how are you?", k=3)  # top-3 candidates
# (('__label__en', '__label__cy', '__label__is'), array([0.99, 0.003, 0.002]))
Latency: < 1 ms. Model size: 126 MB (bin) or 917 KB (compressed ftz). Accuracy: 97%+ for texts > 20 words.
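fastText returns labels with a `__label__` prefix; a small helper (the name `parse_predictions` is a hypothetical, not part of the fastText API) can turn the raw output into clean (language, probability) pairs:

```python
def parse_predictions(labels, probs):
    """Convert fastText ('__label__en', ...) output into [(lang, prob), ...]."""
    return [
        (label.removeprefix("__label__"), round(float(prob), 3))
        for label, prob in zip(labels, probs)
    ]

# shaped like the return value of model.predict(text, k=3)
labels = ("__label__en", "__label__cy", "__label__is")
probs = [0.99, 0.003, 0.002]
print(parse_predictions(labels, probs))
# [('en', 0.99), ('cy', 0.003), ('is', 0.002)]
```

`removeprefix` requires Python 3.9+; on older versions, slice off the first nine characters instead.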
langdetect (Python): a port of Google's language-detection library; 55 languages. Drawback: non-deterministic (repeated runs can give different results unless you fix `DetectorFactory.seed`).
langid.py: 97 languages, deterministic, but weaker than fastText on short texts.
lingua-py: best accuracy for short texts (1–10 words), 75 languages.
Complex Cases
- Mixed text (code-switching): e.g. a Russian sentence containing English insertions like "zoom call" or "5pm" is technically Russian, but the English tokens confuse detectors. Strategy: identify the dominant language; for short texts, don't attempt per-segment language splitting
- Short texts (< 5 words): accuracy drops for all models. For critical cases use lingua-py or an ensemble
- Closely related languages: Russian/Bulgarian/Serbian, Spanish/Portuguese are the main source of errors
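The ensemble idea for short texts can be sketched as a simple majority vote. The detector functions below are stubs standing in for real fastText/langid/lingua calls; `ensemble_detect` is a hypothetical helper, not a library function:

```python
from collections import Counter

def ensemble_detect(text, detectors):
    """Majority vote across detectors; ties go to the earliest-voting detector."""
    votes = [detect(text) for detect in detectors]
    counts = Counter(votes)
    top = counts.most_common(1)[0][1]
    # first vote that reached the winning count wins ties
    for vote in votes:
        if counts[vote] == top:
            return vote

# stubs standing in for real model calls
detectors = [
    lambda t: "en",   # e.g. fastText
    lambda t: "de",   # e.g. langid.py
    lambda t: "en",   # e.g. lingua-py
]
print(ensemble_detect("short text", detectors))  # → "en"
```

Ordering the list by per-model accuracy on short texts makes tie-breaking favor the most reliable detector.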
Multilingual Pipeline Application
def process_multilingual(text: str):
    lang = detect_language(text)  # e.g. "ru", "en", "de"
    router = {
        "ru": russian_pipeline,
        "en": english_pipeline,
        "de": german_pipeline,
    }
    pipeline = router.get(lang, default_pipeline)  # fallback for unknown languages
    return pipeline.run(text)
For production: cache detection results keyed by a hash of the text, so repeated requests skip the model call entirely.