NLP: Text Classification, NER, Embeddings, and Information Extraction
A task arrives: "classify support requests — around 50,000 per month, currently all manual." You get 3,000 labeled examples across 12 categories, with heavy imbalance: one category covers 40% of the set, three categories 1-2% each. Baseline accuracy is 78%. Looks decent until you check recall on the rare classes: 0.31, 0.44, 0.28. And those classes, complaints and churn threats, matter most to the business.
This is a typical NLP project: the problem is not the algorithm, it's that accuracy is the wrong metric.
Text Classification: From BERT to Distillation
BERT-like models are the standard for classification. ruBERT-base or ruBERT-large (DeepPavlov) for Russian; multilingual-e5-large for a multi-language pipeline; XLM-RoBERTa-large is a strong multilingual backbone with good quality on ru/uk/de/fr.
Fine-tuning for classification: add a classification head atop the [CLS] token, train 3-5 epochs with lr=2e-5 and weight_decay=0.01. With class imbalance: weighted CrossEntropyLoss or focal loss with gamma=2.0.
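The focal loss mentioned above down-weights easy, confidently-classified examples so rare classes dominate the gradient. A minimal NumPy sketch of the math (the probabilities and class counts here are illustrative, not from the case study):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, weights=None):
    """Focal loss per batch: -(1 - p_t)^gamma * log(p_t).

    probs:   (N, C) softmax probabilities
    targets: (N,) integer class ids
    weights: optional (C,) per-class weights (the alpha term)
    """
    p_t = probs[np.arange(len(targets)), targets]
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    if weights is not None:
        loss = loss * weights[targets]
    return loss.mean()

# A confident correct prediction contributes almost nothing;
# a less confident one dominates the batch loss.
probs = np.array([[0.95, 0.05],   # confident, correct
                  [0.30, 0.70]])  # less confident
targets = np.array([0, 1])
print(focal_loss(probs, targets))
```

With gamma=0 this reduces to plain cross-entropy, which is a quick sanity check when wiring it into a training loop.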
A case with imbalance. In the example above: 3,000 examples, 1:20 imbalance on the rare classes. Solution: class weights via sklearn's compute_class_weight('balanced', ...) fed into CrossEntropyLoss. Additionally, augment the rare classes via backtranslation (ru → en → ru via the Google Translate API or MarianMT) or LLM paraphrase. Result: rare-class recall rose from 0.31 to 0.67 with a minor accuracy drop (76% → 74%).
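The class-weight computation itself is two lines; a sketch on a toy label distribution mimicking the 1:20 imbalance (the counts are made up for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy distribution: class 0 is the frequent one, class 2 the rare one.
y = np.array([0] * 400 + [1] * 80 + [2] * 20)

classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# In PyTorch these then go to:
#   nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```

'balanced' sets each weight to n_samples / (n_classes * count), so the 20-example class gets a weight 20x that of the 400-example class.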
Distillation for production. BERT-large achieves F1 0.89, but CPU inference takes 180 ms. Distillation to DistilBERT or ruBERT-tiny2 cuts latency to 25 ms at F1 0.84, an acceptable tradeoff for most classification tasks. If 25 ms is still too high, ONNX Runtime export gives an additional 1.5-2x.
NER: Named Entity Recognition
NER extracts entities from text: persons, organizations, locations, dates, amounts, document numbers. For general categories (PER, ORG, LOC) pretrained models work well. For specialized ones (medical terms, legal concepts, engineering names) fine-tuning is required.
Data annotation is the main cost of an NER project. A quality model needs 500-2,000 annotated sentences per entity type. Tools: Label Studio (open source), Prodigy (paid, from the spaCy creators, excellent UX). IOB2 is the standard format.
Architecture. Token classification atop BERT: each token gets a label (B-PER, I-PER, O, etc.). spaCy 3.x with a transformer pipeline is a good production choice: convenient API, built-in serving, support for custom components.
Nested entities. Standard IOB models can't represent nested entities (an organization inside an address). For those, use span-based NER: SpanBERT or the SpERT family. The implementation is more complex but solves the task properly.
Post-processing is mandatory. The model predicts tokens; you need normalized entities. For dates, dateparser normalization. For amounts, regex plus validation. For names, deduplication via fuzzy matching (rapidfuzz).
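Whatever the framework, downstream code needs entity spans, not per-token labels, so somewhere you decode IOB2. A self-contained sketch of that decoding step (token/label lists are illustrative):

```python
def iob2_to_spans(tokens, labels):
    """Collapse parallel IOB2 token labels into (entity_type, text) spans."""
    spans, current, ctype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == ctype:
            current.append(tok)
        else:  # "O" or an inconsistent I- tag closes the open span
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["Anna", "Petrova", "works", "at", "Acme", "Bank"]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
print(iob2_to_spans(tokens, labels))
# [('PER', 'Anna Petrova'), ('ORG', 'Acme Bank')]
```

Note the inconsistent-tag branch: real model output contains I- tags without a preceding B-, and the decoder must not crash on them.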
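A sketch of two such post-processing steps. The amount regex and the name list are illustrative; stdlib difflib stands in for rapidfuzz here (in production rapidfuzz is much faster), and dateparser would replace the hand-rolled handling of dates:

```python
import re
from difflib import SequenceMatcher

# Illustrative pattern: amounts like "1 250,50 RUB"
AMOUNT_RE = re.compile(r"(\d[\d\s]*(?:[.,]\d{1,2})?)\s*(?:RUB|rub)")

def normalize_amount(text):
    """Extract the first amount and normalize '1 250,50' -> 1250.5."""
    m = AMOUNT_RE.search(text)
    if not m:
        return None
    raw = m.group(1).replace(" ", "").replace(",", ".")
    return float(raw)

def dedupe_names(names, threshold=0.85):
    """Greedy fuzzy dedup: keep a name only if it is not
    near-identical to one already kept."""
    kept = []
    for name in names:
        if all(SequenceMatcher(None, name.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(name)
    return kept

print(normalize_amount("total 1 250,50 RUB due"))
print(dedupe_names(["Ivanov I.I.", "Ivanov I. I.", "Petrov A."]))
```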
Sentiment Analysis and Opinion Mining
Binary positive/negative classification works with BERT out of the box. The real complexity is aspect-based sentiment (ABSA): a restaurant review like "great food, awful service" carries two aspects with opposite polarity.
For ABSA the task splits into aspect extraction (essentially NER) plus sentiment per aspect, or joint models like BERT-for-ABSA trained end-to-end. Russian data quality is lower than English: fewer annotated datasets. RuSentiment and SentiRuEval are the main Russian resources.
For simple positive/negative/neutral classification in production, distilled models are sufficient. Three classes, a balanced dataset, 2,000+ examples: F1 macro 0.82-0.87 is achievable in 1-2 days.
Text Summarization
Extractive summarization (selecting sentences from the text) works via TextRank or BM25-based scoring without any training. Fast, predictable, no hallucinations. Good for long documents that need a quick summary.
Abstractive summarization (generating new text) uses seq2seq models: mT5 and mBART for multilingual, FRED-T5 and ruT5-large for Russian. For production, deployment via an LLM API (GPT-4, Claude) is often the best cost/quality/speed tradeoff.
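A minimal TextRank-style sketch: rank sentences by centrality in a word-overlap similarity graph via power iteration (i.e. PageRank). The similarity function and example sentences are illustrative; a production system would use a library such as sumy, or BM25 scoring as mentioned above:

```python
import re
import numpy as np

def extractive_summary(sentences, top_k=1, damping=0.85, iters=50):
    """Score sentences by graph centrality; return top_k in original order."""
    sets = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                denom = np.log(len(sets[i]) + 1) + np.log(len(sets[j]) + 1)
                sim[i, j] = len(sets[i] & sets[j]) / denom if denom else 0.0
    # Row-normalize into a transition matrix, then power-iterate.
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * trans.T @ scores
    best = sorted(np.argsort(scores)[-top_k:])
    return [sentences[i] for i in best]

doc = [
    "The model classifies support tickets into twelve categories.",
    "Rare classes such as complaints need class weighting.",
    "Class weighting and augmentation improve recall on rare classes.",
    "Lunch was good today.",
]
print(extractive_summary(doc, top_k=1))
```

The off-topic "Lunch" sentence shares no vocabulary with the rest, gets no incoming edges, and ends up with the lowest score, which is exactly the behavior that makes extractive methods predictable.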
Embeddings: Vector Representations of Text
Embeddings are the foundation for semantic search, deduplication, clustering, and RAG. Embedding quality critically affects downstream tasks.
Models. E5-large-v2, BGE-M3, and multilingual-e5-large are strong multilingual embedders. sentence-transformers/paraphrase-multilingual-mpnet-base-v2 is a faster variant with lower quality. For Russian, ru-en-RoSBERTa shows good semantic textual similarity results.
Embedding quality evaluation. The MTEB benchmark is the standard for comparison; check the Retrieval, STS, and Clustering subtasks. Important: top MTEB results are not always best on your domain. Always build a domain-specific eval.
Embedding fine-tuning. If standard models are insufficient, use contrastive learning on domain pairs (positives and negatives). sentence-transformers supports MultipleNegativesRankingLoss. 500-2,000 pairs and 1-3 epochs often yield a 5-15% Recall@k gain on domain data.
Dimensionality and storage. E5-large: 1024 dimensions, float32, 4 KB per vector. At 10M documents that is a 40 GB index. INT8 quantization cuts it to 10 GB with negligible quality loss. FAISS IVF_PQ is even more compact, but with recall tradeoffs.
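MultipleNegativesRankingLoss is cross-entropy over an in-batch similarity matrix: for query i, the paired positive is the target and the other positives in the batch act as negatives. A NumPy sketch of the math (random vectors stand in for real embeddings; sentence-transformers implements this loss for actual training):

```python
import numpy as np

def mnr_loss(query_emb, pos_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities; the diagonal
    (each query's own positive) is the target class."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    sim = scale * (q @ p.T)                      # (B, B) similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)   # softmax stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.05 * rng.normal(size=(8, 16))  # near-duplicate pairs
random_ = rng.normal(size=(8, 16))                   # unrelated "pairs"
print(mnr_loss(anchors, aligned), mnr_loss(anchors, random_))
```

Aligned pairs give a loss near zero, unrelated ones a high loss, which is the gradient signal that pulls domain paraphrases together during fine-tuning.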
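The storage figures above are straightforward arithmetic, easy to redo for your own model and corpus size (GB here means 10^9 bytes):

```python
def index_size_gb(n_vectors, dim, bytes_per_component):
    """Raw vector storage, ignoring index overhead (IDs, graph links, etc.)."""
    return n_vectors * dim * bytes_per_component / 1e9

# E5-large: 1024-dim vectors at 10M documents
print(index_size_gb(10_000_000, 1024, 4))  # float32: ~41 GB
print(index_size_gb(10_000_000, 1024, 1))  # INT8:    ~10 GB
```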
Information Extraction and Document Parsing
Structured extraction from unstructured text is one of the most frequent tasks. Examples: extract key contract terms, parse product specs from a description, pull dates and amounts from invoices.
Approach 1: regex and rules. For standard fields with a predictable format (INN, OGRN, amounts, dates) this is more reliable than a neural model. No training data required, fully predictable.
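A sketch of the rule-based approach for two such fields. The patterns are deliberately simplified: a real extractor would add the INN/OGRN checksum validation and more date formats (INN is the Russian tax ID, 10 digits for companies, 12 for individuals):

```python
import re

INN_RE = re.compile(r"\b(\d{12}|\d{10})\b")          # checksum not verified here
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")  # DD.MM.YYYY

def extract_fields(text):
    out = {}
    if m := INN_RE.search(text):
        out["inn"] = m.group(1)
    if m := DATE_RE.search(text):
        day, month, year = m.groups()
        out["date"] = f"{year}-{month}-{day}"  # normalize to ISO 8601
    return out

text = "Contract dated 15.03.2024, supplier INN 7707083893."
print(extract_fields(text))
# {'inn': '7707083893', 'date': '2024-03-15'}
```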
Approach 2: NER + post-processing. For entities with variable format.
Approach 3: LLM with structured output. GPT-4 or Claude with a JSON schema, for complex unstructured documents where rule-based fails. Cost: ~$0.001-0.01 per document depending on volume and model. At 10k+ documents/day, calculate the economics first.
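Whichever API you use, the pipeline side looks the same: a prompt pinning the output to a fixed set of keys, then strict validation of the reply so schema violations trigger a retry or a rule-based fallback. A sketch with a simulated model reply (the field names and prompt are illustrative; no real API call is made):

```python
import json

# Illustrative schema: key -> expected JSON type
REQUIRED = {"party_a": str, "party_b": str, "amount": (int, float), "due_date": str}

PROMPT_TEMPLATE = (
    "Extract contract fields and reply with ONLY a JSON object "
    "with keys: party_a, party_b, amount, due_date.\n\nText:\n{doc}"
)

def parse_llm_json(raw):
    """Return the parsed dict, or None so the caller can retry
    or fall back to regex/NER on a schema violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for key, typ in REQUIRED.items():
        if key not in data or not isinstance(data[key], typ):
            return None
    return data

# Simulated model reply in place of an actual API response:
reply = ('{"party_a": "Acme LLC", "party_b": "Globex", '
         '"amount": 120000, "due_date": "2024-06-01"}')
print(parse_llm_json(reply))
```

Treating validation failure as a routing signal, rather than an error, is what makes the hybrid setup below work.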
Production is usually hybrid: regex/NER for standard fields plus an LLM for edge cases and complex structures.
Workflow
Data analysis and baseline metrics. Check class distribution, text length, annotation quality, and language composition. Define metrics up front: F1 macro for imbalance, Precision@k for ranking, Recall@k for search.
Baseline. TF-IDF + LogReg or CatBoost on bag-of-words is a baseline you can build in a day. The gap to a fine-tuned BERT shows how much semantic context matters for the task.
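The whole baseline fits in a few lines of sklearn. The corpus here is a tiny made-up stand-in for the 3,000 labeled tickets; note class_weight='balanced', which matters from day one given the imbalance above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus, repeated so each class has enough examples.
texts = [
    "cannot log in to my account", "login page shows an error",
    "please cancel my subscription", "i want to close my account and leave",
    "great service, thank you", "support was very helpful",
] * 5
labels = ["access", "access", "churn", "churn", "praise", "praise"] * 5

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
clf.fit(texts, labels)
print(clf.predict(["error when i log in", "please cancel everything"]))
```

In the real workflow this pipeline goes straight into the k-fold cross-validation below, and its F1 macro becomes the number every BERT experiment has to beat.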
Training and validation. k-fold cross-validation (k=5), early stopping by F1 on the validation set. Manual error analysis yields the best improvement ideas.
Deployment. ONNX Runtime for CPU inference. FastAPI + uvicorn for a standard REST API. Request batching if throughput is needed.
A prototype on existing data takes 1-3 weeks. A production system with CI/CD and monitoring: 1.5-2.5 months.