Named Entity Recognition (NER) Implementation
NER (Named Entity Recognition) is the task of recognizing and classifying entities mentioned in text: persons, organizations, locations, dates, monetary amounts, products. A fundamental component of most text processing systems.
Standard Entity Types and Extensions
Base types (CoNLL-2003 standard): PER (persons), ORG (organizations), LOC (locations), MISC (miscellaneous).
For business applications, the standard set is insufficient. Typical extensions:
- Finance: MONEY, PERCENT, DATE, TICKER, FINANCIAL_INSTRUMENT
- Medicine: DISEASE, DRUG, DOSAGE, PROCEDURE, ANATOMY
- Law: LAW, COURT, CASE_NUMBER, LEGAL_ENTITY
- Logistics: ADDRESS, POSTAL_CODE, VEHICLE_ID, CARGO
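A custom type set maps mechanically onto an IOB2 label list for token classification. A minimal sketch in Python (the helper and variable names are illustrative, not from any library; the type names follow the finance extension above):

```python
# Build an IOB2 (BIO) label list from a custom entity-type set.
FINANCE_TYPES = ["MONEY", "PERCENT", "DATE", "TICKER", "FINANCIAL_INSTRUMENT"]

def make_iob2_labels(entity_types):
    """Return the flat label list: O plus B-/I- for each entity type."""
    labels = ["O"]
    for t in entity_types:
        labels += [f"B-{t}", f"I-{t}"]
    return labels

label_list = make_iob2_labels(FINANCE_TYPES)
id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}
# 5 types -> 11 labels: O, B-MONEY, I-MONEY, ..., I-FINANCIAL_INSTRUMENT
```

The resulting label_list, id2label, and label2id are exactly what a token-classification head expects at fine-tuning time.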
Tools for Russian NER
natasha—best choice for basic Russian tasks:
from natasha import Segmenter, NewsEmbedding, NewsNERTagger, Doc
segmenter = Segmenter()
emb = NewsEmbedding()
ner_tagger = NewsNERTagger(emb)
doc = Doc("Gazprom signed contract with German company Wintershall in Berlin.")
doc.segment(segmenter)  # segmentation must run before NER tagging
doc.tag_ner(ner_tagger)
for span in doc.spans:  # extracted entity spans
    print(span.text, span.type)
# [(Gazprom, ORG), (Wintershall, ORG), (Berlin, LOC)]
spaCy with Russian model (ru_core_news_lg): good speed-quality balance, integration into production pipelines.
BERT-based (DeepPavlov, Hugging Face): DeepPavlov/rubert-base-cased-ner—for high quality on complex texts.
Fine-tuning for Custom Entities
For custom entity types you need your own corpus and fine-tuning:
- Annotation: Prodigy, Label Studio, or Doccano. Minimum 200–500 examples per entity type
- Format: IOB2 (BIO-tagging)—NER standard
- Training: HuggingFace TokenClassification with pretrained RuBERT
from transformers import AutoModelForTokenClassification
# label_list, id2label, label2id are built from the annotated corpus
model = AutoModelForTokenClassification.from_pretrained(
    "DeepPavlov/rubert-base-cased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)
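Annotation tools export character-offset spans, while training needs token-level IOB2 tags. A minimal conversion sketch using whitespace tokenization (illustrative only; a real pipeline aligns labels against the model's subword tokenizer):

```python
def spans_to_iob2(text, spans):
    """Convert character-offset annotations [(start, end, type), ...]
    into IOB2 tags over whitespace tokens (illustrative sketch)."""
    tokens, tags, pos = [], [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # token at the span start gets B-, continuation gets I-
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

tokens, tags = spans_to_iob2(
    "Gazprom signed contract with Wintershall",
    [(0, 7, "ORG"), (29, 40, "ORG")],
)
# tags -> ['B-ORG', 'O', 'O', 'O', 'B-ORG']
```

The same token/tag pairs, mapped through label2id, become the labels tensor for the TokenClassification model above.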
NER Quality Assessment
Entity-level F1 (strict)—main metric. "Strict" means: correct type AND correct span boundaries. Partial match counts as error.
Typical Russian text scores:
- PER: F1 95–97% (easily recognizable patterns)
- ORG: F1 88–93% (many abbreviations, acronyms)
- LOC: F1 90–95%
- Custom domain entities: 80–90% after fine-tuning on 1K+ examples
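Strict entity-level F1 is straightforward to compute from (start, end, type) tuples; a minimal sketch (function and variable names are illustrative):

```python
def entity_f1_strict(gold, pred):
    """Strict entity-level F1: an entity counts as correct only when
    the type AND the exact span boundaries both match."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # exact (start, end, type) matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 7, "ORG"), (29, 40, "ORG"), (45, 51, "LOC")]
pred = [(0, 7, "ORG"), (29, 40, "PER"), (45, 51, "LOC")]
# one type error -> tp=2, precision=recall=2/3
print(round(entity_f1_strict(gold, pred), 3))  # -> 0.667
```

In practice the seqeval library computes the same metric directly from IOB2 tag sequences.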
Complex Cases
- Nested entities: "Ministry of Finance of Russia"—(ORG + LOC). Most standard models don't support nesting; needs specialized architectures (Span-BERT, biaffine NER)
- Split mentions: "LLC... (hereinafter—Company)"—linking the alias back to the entity requires a separate coreference module
- Ambiguity: "Apple"—company or fruit? Resolved via context (transformers handle well)
Deployment
- spaCy: export to .spacy format, serving via FastAPI
- BERT: ONNX export for CPU, TorchServe for GPU
- Latency: spaCy CPU ~5 ms/sentence, BERT ONNX CPU ~30 ms/sentence