NLP & Text Processing Solution Development

We design and deploy artificial intelligence systems, from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab but in real business.

NLP: Text Classification, NER, Embeddings, and Information Extraction

A task arrives: "classify support requests, around 50,000 per month, currently all handled manually." There are 3,000 labeled examples, 12 categories, and heavy imbalance: one category covers 40% of the set, three categories sit at 1-2% each. Baseline accuracy is 78%. Looks decent until you check recall on the rare classes: 0.31, 0.44, 0.28. And those classes (complaints and churn threats) matter most to the business.

This is a typical NLP project. The problem is not the algorithm; it is that accuracy is the wrong metric.

Text Classification: From BERT to Distillation

BERT-like models are the standard for classification. For Russian: ruBERT-base or ruBERT-large from DeepPavlov. For multilingual pipelines: multilingual-e5-large, or XLM-RoBERTa-large, a strong multilingual backbone with good quality on ru/uk/de/fr.

Fine-tuning for classification: add a classification head on top of the [CLS] token and train for 3-5 epochs with lr=2e-5 and weight_decay=0.01. With class imbalance, use a weighted CrossEntropyLoss or focal loss with gamma=2.0.
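
The focal-loss option can be sketched in a few lines (a minimal NumPy version for illustration; in a real training loop you would implement the same formula in torch):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0):
    """Multi-class focal loss: FL = -(1 - p_t)^gamma * log(p_t).

    probs   -- (n, k) array of softmax probabilities
    targets -- (n,) array of integer class labels
    gamma   -- focusing parameter; gamma=0 reduces to plain cross-entropy
    """
    p_t = probs[np.arange(len(targets)), targets]  # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

probs = np.array([[0.9, 0.1],    # easy, confident example
                  [0.3, 0.7]])   # harder example
targets = np.array([0, 1])

# With gamma=2.0 the confident example is down-weighted by (1 - 0.9)^2 = 0.01,
# so hard (often rare-class) examples dominate the loss.
print(focal_loss(probs, targets, gamma=2.0))
```

With gamma=0 the function reproduces ordinary cross-entropy, which is a quick sanity check for any focal-loss implementation.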

A case with imbalance. In the example above: 3,000 examples, a 1:20 imbalance on the rare classes. Solution: class weights computed with sklearn's compute_class_weight('balanced', ...) and fed into CrossEntropyLoss. Additionally, augment the rare classes via backtranslation (ru → en → ru through the Google Translate API or MarianMT) or LLM paraphrasing. Result: rare-class recall rose from 0.31 to 0.67 with a minor accuracy drop (76% → 74%).
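
The class-weight step from this case looks roughly like this (toy labels with the same 1:20 skew; passing the result into torch's CrossEntropyLoss is one extra line, shown as a comment):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a ~1:20 imbalance, mimicking the rare-class situation above
y = np.array([0] * 200 + [1] * 10)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
# "balanced" weight = n_samples / (n_classes * class_count):
# class 0: 210 / (2 * 200) = 0.525, class 1: 210 / (2 * 10) = 10.5
print(weights)

# In a torch training loop:
# loss_fn = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```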

Distillation for production. BERT-large reaches F1 0.89, but CPU inference takes 180 ms. Distillation to DistilBERT or ruBERT-tiny2 cuts latency to 25 ms at F1 0.84, an acceptable tradeoff for most classification tasks. If 25 ms is still too high, ONNX Runtime export gives another 1.5-2x.

NER: Named Entity Recognition

NER extracts entities from text: persons, organizations, locations, dates, amounts, document numbers. For general categories (PER, ORG, LOC), pretrained models work well. For specialized ones (medical terms, legal concepts, engineering part names), fine-tuning is required.

Data annotation is the main cost of a NER project. A quality model needs 500-2,000 annotated sentences per entity type. Tools: Label Studio (open source) or Prodigy (paid, from the spaCy creators, excellent UX). IOB2 is the standard tagging format.

Architecture. Token classification on top of BERT: each token gets a label (B-PER, I-PER, O, etc.). spaCy 3.x with a transformer pipeline is a good production choice: convenient API, built-in serving, support for custom components.
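
Decoding IOB2 tags back into entity spans is a step every token-classification pipeline needs regardless of backbone; a minimal stand-alone decoder (illustrative, not spaCy's internal implementation):

```python
def iob2_to_spans(tokens, tags):
    """Decode parallel IOB2 tags (B-X / I-X / O) into (type, text) spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # a new entity always starts with B-
            if current:
                spans.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)         # continuation of the open entity
        else:                              # "O" or an ill-formed I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["John", "Smith", "works", "at", "Acme", "Corp"]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
print(iob2_to_spans(tokens, tags))  # [('PER', 'John Smith'), ('ORG', 'Acme Corp')]
```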

Nested entities. Standard IOB models cannot handle nested entities (an organization inside an address). For those, use span-based NER: SpanBERT or the SpERT family. The implementation is more complex but solves the task properly.

Post-processing is mandatory. The model predicts tokens; the business needs normalized entities. For dates, normalization via dateparser. For amounts, regex plus validation. For names, deduplication via fuzzy matching (rapidfuzz).
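
For the name-deduplication step, a stdlib-only sketch of the idea (difflib's ratio as the similarity score; rapidfuzz.fuzz.ratio is the faster drop-in, returning 0-100 instead of 0-1):

```python
from difflib import SequenceMatcher

def dedup_entities(names, threshold=0.85):
    """Greedy fuzzy dedup: keep a name only if no already-kept name
    is near-identical to it (case-insensitive)."""
    kept = []
    for name in names:
        norm = name.lower().strip()
        if not any(SequenceMatcher(None, norm, k.lower()).ratio() >= threshold
                   for k in kept):
            kept.append(name)
    return kept

names = ["Acme Corp", "ACME Corp.", "Globex", "acme corp"]
print(dedup_entities(names))  # ['Acme Corp', 'Globex']
```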

Sentiment Analysis and Opinion Mining

Binary positive/negative classification works with BERT out of the box. The real complexity is aspect-based sentiment analysis (ABSA): "the food was great, but the service at the restaurant was awful."

For ABSA, the task splits into aspect extraction (essentially NER) plus sentiment classification per aspect, or joint models like BERT-for-ABSA trained end-to-end. Russian data quality is lower than English, with fewer annotated datasets; RuSentiment and SentiRuEval are the main Russian resources.

For simple positive/negative/neutral classification in production, distilled models are sufficient. With three classes, a balanced dataset, and 2,000+ examples, a macro F1 of 0.82-0.87 is achievable in 1-2 days.

Text Summarization

Extractive summarization (selecting sentences from the text) works via TextRank or BM25-based scoring without any training. Fast, predictable, no hallucinations. Good for long documents that need a quick summary.
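
A toy version of the idea, with sentence centrality approximated by summed word overlap (a simplification of TextRank, which runs PageRank over the full sentence-similarity graph; tokenization here is naive whitespace splitting):

```python
def extractive_summary(sentences, k=2):
    """Score each sentence by its total Jaccard word overlap with the
    others and return the top-k sentences in their original order."""
    bags = [set(s.lower().split()) for s in sentences]

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    scores = [sum(jaccard(bags[i], bags[j])
                  for j in range(len(bags)) if j != i)
              for i in range(len(bags))]
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
    return [sentences[i] for i in top]

doc = [
    "The cat sat on the mat.",
    "The cat ate and then slept on the mat.",
    "Quarterly revenue was unaffected.",
    "A cat on a mat is a classic example sentence.",
]
print(extractive_summary(doc, k=2))
```

The off-topic sentence gets the lowest centrality and drops out of the summary, which is exactly the extractive behavior: no new text is generated, so nothing can be hallucinated.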

Abstractive summarization (generating new text) uses seq2seq models: mT5 and mBART for multilingual setups, FRED-T5 and ruT5-large for Russian. For production, summarization via an LLM API (GPT-4, Claude) is often the best cost/quality/speed tradeoff.

Embeddings: Vector Representations of Text

Embeddings are the foundation of semantic search, deduplication, clustering, and RAG. Embedding quality critically affects every downstream task.

Models. BGE-M3 and multilingual-e5-large are strong multilingual embedders; E5-large-v2 is a strong English one. sentence-transformers/paraphrase-multilingual-mpnet-base-v2 is a faster variant with lower quality. For Russian, ru-en-RoSBERTa shows good semantic textual similarity results.

Evaluating embedding quality. The MTEB benchmark is the standard for comparison; check the Retrieval, STS, and Clustering subtasks. Important: the top MTEB models are not always the best on your domain. Always build a domain-specific eval.

Fine-tuning embeddings. If standard models are insufficient, use contrastive learning on domain pairs (positives and negatives). sentence-transformers supports MultipleNegativesRankingLoss. 500-2,000 pairs and 1-3 epochs often yield a 5-15% Recall@k gain on domain data.

Dimensionality and storage. E5-large produces 1024-dimensional vectors; in float32 that is 4 KB per vector, so a 10M-document index takes 40 GB. INT8 quantization cuts it to 10 GB with negligible quality loss. FAISS IVF_PQ is even more compact, at some cost in recall.
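
These sizes are simple arithmetic worth re-running for your own model and corpus:

```python
# Back-of-envelope index sizing for the numbers above
dim, n_docs = 1024, 10_000_000

bytes_per_vec_f32 = dim * 4          # float32: 4 bytes per dimension -> 4 KB
float32_gb = n_docs * bytes_per_vec_f32 / 1e9
int8_gb    = n_docs * dim * 1 / 1e9  # INT8 quantization: 1 byte per dimension

print(f"float32: {float32_gb:.1f} GB, int8: {int8_gb:.1f} GB")
# float32: 41.0 GB, int8: 10.2 GB -- matching the ~40 GB / ~10 GB above
```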

Information Extraction and Document Parsing

Structured extraction from unstructured text is one of the most frequent tasks. Examples: extracting key terms from contracts, parsing product specs from descriptions, pulling dates and amounts from invoices.

Approach 1: regex + rules. For standard fields with a predictable format (INN, OGRN, amounts, dates) this is more reliable than a neural model: no training data required, fully predictable behavior.
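
A sketch of this approach for two of the listed fields, dates and ruble amounts (hypothetical patterns for illustration; a production rule set would also validate the INN/OGRN checksums):

```python
import re

DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")                    # DD.MM.YYYY
AMOUNT_RE = re.compile(r"(\d{1,3}(?:[ \u00a0]\d{3})*(?:,\d{2})?)\s*руб")  # "1 250 000,50 руб"

text = "Договор от 15.03.2024 на сумму 1 250 000,50 руб."
# Translation: "Contract dated 15.03.2024 for the amount of 1,250,000.50 RUB."

date = DATE_RE.search(text).group(0)
raw = AMOUNT_RE.search(text).group(1)
# Normalize: strip thousand separators, swap the decimal comma for a dot
amount = float(raw.replace(" ", "").replace("\u00a0", "").replace(",", "."))

print(date, amount)  # 15.03.2024 1250000.5
```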

Approach 2: NER + post-processing. For entities with variable formats.

Approach 3: LLM with structured output. GPT-4 or Claude with a JSON schema handles complex unstructured documents where rule-based approaches fail. Cost: roughly $0.001-0.01 per document depending on volume and model. At 10k+ documents per day, run the economics first.

Production is usually a hybrid: regex/NER for standard fields plus an LLM for edge cases and complex structures.

Workflow

Data analysis and baseline metrics. Check the class distribution, text lengths, annotation quality, and language composition. Define the metrics: macro F1 for imbalanced classification, Precision@k for ranking, Recall@k for search.

Baseline. TF-IDF + logistic regression, or CatBoost on bag-of-words: a baseline you can build in a day. The gap to a fine-tuned BERT shows how much semantic context matters for the task.
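
Such a baseline fits in a dozen lines (hypothetical toy data; the point is the pipeline shape, not the numbers):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy request/category pairs standing in for real labeled data
texts = [
    "refund my payment please", "the payment failed again",
    "how do I change my password", "the reset password link is broken",
    "cancel my subscription now", "I want to cancel the plan",
]
labels = ["billing", "billing", "account", "account", "churn", "churn"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["please reset my password"]))
```

On a real dataset the comparison to run is macro F1 of this pipeline versus the fine-tuned BERT; if the gap is small, the cheap model may be enough.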

Training and validation. k-fold cross-validation (k=5), early stopping on validation F1. Manual error analysis yields the best improvement ideas.

Deployment. ONNX Runtime for CPU inference. FastAPI + uvicorn for a standard REST API. Request batching when throughput matters.

A prototype on existing data takes 1-3 weeks. A production system with CI/CD and monitoring: 1.5-2.5 months.