Named Entity Recognition (NER) Implementation
NER (Named Entity Recognition) is the task of recognizing and classifying entities mentioned in text: persons, organizations, locations, dates, monetary amounts, products. A fundamental component of most text processing systems.
Standard Entity Types and Extensions
Base types (CoNLL-2003 standard): PER (persons), ORG (organizations), LOC (locations), MISC (miscellaneous).
For business applications, the standard set is insufficient. Typical extensions:
- Finance: MONEY, PERCENT, DATE, TICKER, FINANCIAL_INSTRUMENT
- Medicine: DISEASE, DRUG, DOSAGE, PROCEDURE, ANATOMY
- Law: LAW, COURT, CASE_NUMBER, LEGAL_ENTITY
- Logistics: ADDRESS, POSTAL_CODE, VEHICLE_ID, CARGO
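A custom type set maps mechanically onto an IOB2 label list for token classification. A minimal sketch in Python (the helper and variable names are illustrative, not from any library; the type names follow the finance extension above):

```python
# Build an IOB2 (BIO) label list from a custom entity-type set.
FINANCE_TYPES = ["MONEY", "PERCENT", "DATE", "TICKER", "FINANCIAL_INSTRUMENT"]

def make_iob2_labels(entity_types):
    """Return the flat label list: O plus B-/I- for each entity type."""
    labels = ["O"]
    for t in entity_types:
        labels += [f"B-{t}", f"I-{t}"]
    return labels

label_list = make_iob2_labels(FINANCE_TYPES)
id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}
# 5 types -> 11 labels: O, B-MONEY, I-MONEY, ..., I-FINANCIAL_INSTRUMENT
```

The resulting label_list, id2label, and label2id are exactly what a token-classification head expects at fine-tuning time.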
Tools for Russian NER
natasha—best choice for basic Russian tasks:
from natasha import Segmenter, NewsEmbedding, NewsNERTagger, Doc
segmenter = Segmenter()
emb = NewsEmbedding()
ner_tagger = NewsNERTagger(emb)
doc = Doc("Gazprom signed contract with German company Wintershall in Berlin.")
doc.segment(segmenter)  # segmentation must run before NER tagging
doc.tag_ner(ner_tagger)
for span in doc.spans:  # extracted entity spans
    print(span.text, span.type)
# [(Gazprom, ORG), (Wintershall, ORG), (Berlin, LOC)]
spaCy with Russian model (ru_core_news_lg): good speed-quality balance, integration into production pipelines.
BERT-based (DeepPavlov, Hugging Face): DeepPavlov/rubert-base-cased-ner—for high quality on complex texts.
Fine-tuning for Custom Entities
For custom entity types you need your own corpus and fine-tuning:
- Annotation: Prodigy, Label Studio, or Doccano. Minimum 200–500 examples per entity type
- Format: IOB2 (BIO-tagging)—NER standard
- Training: HuggingFace TokenClassification with pretrained RuBERT
from transformers import AutoModelForTokenClassification
# label_list, id2label, label2id are built from the annotated corpus
model = AutoModelForTokenClassification.from_pretrained(
    "DeepPavlov/rubert-base-cased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)
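Annotation tools export character-offset spans, while training needs token-level IOB2 tags. A minimal conversion sketch using whitespace tokenization (illustrative only; a real pipeline aligns labels against the model's subword tokenizer):

```python
def spans_to_iob2(text, spans):
    """Convert character-offset annotations [(start, end, type), ...]
    into IOB2 tags over whitespace tokens (illustrative sketch)."""
    tokens, tags, pos = [], [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # token at the span start gets B-, continuation gets I-
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

tokens, tags = spans_to_iob2(
    "Gazprom signed contract with Wintershall",
    [(0, 7, "ORG"), (29, 40, "ORG")],
)
# tags -> ['B-ORG', 'O', 'O', 'O', 'B-ORG']
```

The same token/tag pairs, mapped through label2id, become the labels tensor for the TokenClassification model above.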
NER Quality Assessment
Entity-level F1 (strict)—main metric. "Strict" means: correct type AND correct span boundaries. Partial match counts as error.
Typical Russian text scores:
- PER: F1 95–97% (easily recognizable patterns)
- ORG: F1 88–93% (many abbreviations, acronyms)
- LOC: F1 90–95%
- Custom domain entities: 80–90% after fine-tuning on 1K+ examples
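Strict entity-level F1 is straightforward to compute from (start, end, type) tuples; a minimal sketch (function and variable names are illustrative):

```python
def entity_f1_strict(gold, pred):
    """Strict entity-level F1: an entity counts as correct only when
    the type AND the exact span boundaries both match."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # exact (start, end, type) matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 7, "ORG"), (29, 40, "ORG"), (45, 51, "LOC")]
pred = [(0, 7, "ORG"), (29, 40, "PER"), (45, 51, "LOC")]
# one type error -> tp=2, precision=recall=2/3
print(round(entity_f1_strict(gold, pred), 3))  # -> 0.667
```

In practice the seqeval library computes the same metric directly from IOB2 tag sequences.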
Complex Cases
- Nested entities: "Ministry of Finance of Russia"—(ORG + LOC). Most standard models don't support nesting; needs specialized architectures (Span-BERT, biaffine NER)
- Split mentions: "LLC... (hereinafter—Company)"—linking the alias back to the entity requires a separate coreference module
- Ambiguity: "Apple"—company or fruit? Resolved via context (transformers handle well)
Deployment
- spaCy: export to .spacy format, serving via FastAPI
- BERT: ONNX export for CPU, TorchServe for GPU
- Latency: spaCy CPU ~5 ms/sentence, BERT ONNX CPU ~30 ms/sentence