What data can be extracted from a medical record?

Diagnoses (ICD-10), medications with dosages, procedures, lab results, and allergies. The model handles both structured fields and narrative text.

Is on-premise deployment mandatory?

For compliance with 152-FZ (special categories of personal data) — yes. The model is installed on your server or in a Russian cloud. No data transfer abroad.

How long does implementation take?

From 6 to 8 months depending on the number of entity types and the complexity of the MIS. Includes corpus annotation, model training, integration, and pilot.

How do you handle LLM hallucinations?

For data extraction we use a BERT classifier, not an LLM. The LLM is used only for summary generation under human oversight. Manual validation remains mandatory until F1 exceeds 95%.

Does the project include training for doctors?

Yes, in the final stage we train staff on the system, prepare documentation, and hand over the model source code.

Extract Data from Medical Records: On-Premise Medical NLP

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+ years of work, 900+ completed projects, 100+ in-house employees, 19+ partners.

Extract Data from Medical Records: Medical NLP

Physicians spend up to 30% of their working time on paperwork, yet key clinical data remains in narrative text. We turn these texts into structured tables: diagnoses, medications, lab results, procedures. In 6–8 months we build a fully custom NLP pipeline for your MIS. The model runs on-premise, data security complies with 152-FZ. Contact us for a free consultation and project evaluation.

How we extract data from medical records using NLP

We use a combination of methods: Clinical NER based on BERT, assertion detection for context, temporal reasoning for chronology, and term normalization to standard classifiers. Each stage is optimized for Russian-language medical texts.

What problems we solve

Clinical NER: not just word search, but context

Standard regex cannot distinguish "angina" in a list of past diseases from a current diagnosis. We use a clinical BERT model fine-tuned on a labeled corpus of 10,000+ sentences annotated by physicians. Entity-level F1 is 94% for the most common entities (diagnosis, medication, dosage).

Assertion detection: where is the truth?

The note "no chest pain" means the opposite of "chest pain", even though both mention the same symptom. The model assigns each entity one of four contexts: present, absent, uncertain, history. An error here is clinically critical; we achieve 97% accuracy on the test set.
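
To make the four contexts concrete, here is a minimal rule-based sketch. It is not the trained model described above: it is a simple trigger-phrase baseline, and the trigger lists and function name are hypothetical, chosen only for illustration.

```python
from enum import Enum


class Assertion(Enum):
    PRESENT = "present"
    ABSENT = "absent"
    UNCERTAIN = "uncertain"
    HISTORY = "history"


# Hypothetical trigger phrases; a trained model learns such cues from labeled data.
TRIGGERS = [
    (Assertion.ABSENT, (" no ", " denies ", " without ")),
    (Assertion.UNCERTAIN, (" possible ", " suspected ", " cannot rule out ")),
    (Assertion.HISTORY, (" history of ", " previously ", " in the past ")),
]


def rule_based_assertion(sentence: str, entity: str) -> Assertion:
    """Label an entity by looking for a trigger phrase to its left."""
    text = sentence.lower()
    pos = text.find(entity.lower())
    if pos == -1:
        raise ValueError(f"entity {entity!r} not found in sentence")
    # Pad with a space so triggers match whole words only.
    left_context = " " + text[:pos]
    for label, phrases in TRIGGERS:
        if any(phrase in left_context for phrase in phrases):
            return label
    return Assertion.PRESENT


print(rule_based_assertion("Patient reports no chest pain.", "chest pain"))
print(rule_based_assertion("History of angina, currently stable.", "angina"))
```

A baseline like this breaks on long sentences and on negation that follows the entity, which is one reason to use a learned classifier for this step.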

Temporal reasoning: treatment chronology

Physicians care about chronology: when was diabetes detected, when was Metformin prescribed, did the dosage change? Without a timeline, analytics is useless. We build an event timeline using a CRF layer on top of BERT.

Term normalization: from text to code

"SD2", "type 2 diabetes", "Diabetes mellitus type 2": all three refer to the same condition. Mapping them to ICD-10 (E11) and UMLS CUI (C0011860) gives a unified picture across the clinic.
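
The simplest form of this mapping is a synonym dictionary lookup, sketched below. Only the E11 and C0011860 codes come from the text; the dictionary layout and the function name are assumptions, and a production normalizer would typically add fuzzy or embedding-based matching on top.

```python
import re

# Canonical concepts with their codes (codes taken from the example above).
CONCEPTS = {
    "type 2 diabetes mellitus": {"icd10": "E11", "cui": "C0011860"},
}

# Hypothetical synonym table: surface form -> canonical concept name.
SYNONYMS = {
    "sd2": "type 2 diabetes mellitus",
    "type 2 diabetes": "type 2 diabetes mellitus",
    "diabetes mellitus type 2": "type 2 diabetes mellitus",
}


def normalize_term(mention: str):
    """Return the canonical name and codes for a mention, or None if unknown."""
    key = re.sub(r"\s+", " ", mention.strip().lower())
    canonical = SYNONYMS.get(key, key)
    codes = CONCEPTS.get(canonical)
    if codes is None:
        return None
    return {"canonical": canonical, **codes}


for mention in ("SD2", "type 2 diabetes", "Diabetes mellitus type 2"):
    print(mention, "->", normalize_term(mention))
```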

How we do it

We use DeepPavlov/rubert-base-cased as the base model. We fine-tune on a corpus of discharge notes (5,000 documents labeled by physicians). Inference is on an on-premise cluster of 4× A100 — p99 latency 120 ms per medical record.

Why on-premise is mandatory

Medical data is a special category (152-FZ, Article 10). Transfer to a foreign vendor's cloud is impossible without de-identification. We deploy the model on your server or in a Russian cloud (Yandex Cloud, SberCloud). No cross-border data transfer.

How we fight LLM hallucinations

For safety-critical tasks we follow the rule: LLM only for summary generation, not for extraction. NER is done with a separate BERT classifier. Manual audit is mandatory until F1 > 95%.

What works better: BERT or LLM for extraction?

Criterion	BERT (token classification)	LLM (prompting)
F1 score	94–97%	70–85% (model-dependent)
Latency per document	120 ms	2–5 s
Inference cost	Low	High (tokens)
Hallucinations	None	Yes (up to 15% on rare terms)
152-FZ compliance	Yes (on-premise)	Only if self-hosted or applied to de-identified data

BERT outperforms LLM prompting by about 20 percentage points of F1 on average and is roughly 30 times faster (120 ms versus 2–5 s per document). Therefore, we use BERT for data extraction and an LLM only for human-supervised summary generation.

What is included in the work

The implementation process consists of five key stages:

Corpus labeling by physicians (months 1–2): you get a labeled corpus of 5,000 documents and a baseline NER model with F1 ~90%.
Full pipeline training (months 3–4): NER + assertion + normalization + de-identification, delivered as a ready Docker image with a REST API.
Integration with the MIS via HL7 FHIR or REST (months 5–6), including a pilot on 100 medical records.
Deployment of the CDSS module and dashboards (months 7–8): drug interaction checks and analytics.
Staff training and documentation, including handover of the model source code.

Period	What you get
Months 1–2	Labeled corpus of 5,000 documents, baseline NER model with F1 ~90%
Months 3–4	Full pipeline: NER + assertion + normalization + de-identification. Docker image with REST API
Months 5–6	Integration with MIS (HL7 FHIR or REST setup), pilot on 100 records
Months 7–8	CDSS module (drug interaction checks), analytics dashboard, documentation, staff training
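
For the HL7 FHIR integration path, an extracted diagnosis could be handed to the MIS as a FHIR R4-style Condition resource. The sketch below is illustrative only: the function name, the patient identifier, and in particular the mapping from assertion contexts to verificationStatus codes are assumptions that would need to be agreed with the clinic.

```python
import json

# Hypothetical mapping from assertion contexts to FHIR R4
# Condition.verificationStatus codes; adjust to your clinical policy.
VERIFICATION_STATUS = {
    "present": "confirmed",
    "history": "confirmed",
    "uncertain": "unconfirmed",
    "absent": "refuted",
}


def to_fhir_condition(patient_id, icd10_code, display, assertion):
    """Wrap one extracted diagnosis in a minimal Condition resource."""
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {
            "coding": [{
                "system": "http://hl7.org/fhir/sid/icd-10",
                "code": icd10_code,
                "display": display,
            }]
        },
        "verificationStatus": {
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/condition-ver-status",
                "code": VERIFICATION_STATUS[assertion],
            }]
        },
    }


resource = to_fhir_condition("example-123", "E11", "Type 2 diabetes mellitus", "present")
print(json.dumps(resource, indent=2))
```

Mapping "history" to "confirmed" drops the temporal information; a fuller integration would probably also set clinicalStatus or onset fields.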

Technical requirements

OS: Ubuntu 20.04+
RAM: 64 GB+
GPU: 4× NVIDIA A100 80GB
CUDA 11.8
Docker 20.10+

Typical mistakes and how to avoid them

Using off-the-shelf models trained on English: on Russian texts F1 drops by 30–40%. Fine-tuning on Russian clinical data is required.
Ignoring de-identification: without it, data cannot be used for training and analytics.
Relying solely on LLMs: they hallucinate on rare terms — human-in-the-loop is mandatory.

Quality assessment

Metrics: F1 for each entity type, assertion detection accuracy, drug-dosage linkage accuracy. For safety-critical applications, human validation stays in place until F1 exceeds 95%.
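
As an illustration of the first metric, the sketch below computes strict entity-level F1, overall and per entity type. An entity counts as correct only if its start, end, and label all match exactly. The tuple layout and the sample spans are assumptions made for the example.

```python
def entity_f1(gold, predicted):
    """Strict precision, recall, and F1 over (start, end, label) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_by_type(gold, predicted):
    """Strict F1 computed separately for every entity label."""
    labels = {e[2] for e in gold} | {e[2] for e in predicted}
    return {
        label: entity_f1(
            [e for e in gold if e[2] == label],
            [e for e in predicted if e[2] == label],
        )[2]
        for label in sorted(labels)
    }


# The predicted DOSAGE span is one character short, so it does not count.
gold = [(0, 9, "MEDICATION"), (10, 16, "DOSAGE"), (30, 38, "DIAGNOSIS")]
pred = [(0, 9, "MEDICATION"), (10, 15, "DOSAGE"), (30, 38, "DIAGNOSIS")]
print(entity_f1(gold, pred))
print(f1_by_type(gold, pred))
```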

Our expertise: 5 years of experience in medical NLP, 20+ implementations in private and public clinics. Certified specialists (NVIDIA DLI, Bioinformatics).

Timeline and cost

Timeline: from 6 to 8 months depending on MIS complexity and number of entity types. Cost is calculated individually. Typical savings from automation: 3 to 8 million rubles per year per clinic. Get a free assessment of your project — request a consultation. You keep the model, code, documentation, and trained physicians.

NLP Development: Text Classification, NER, Embeddings, and Information Extraction

We often receive a task like this: process 50,000 support tickets that are currently handled manually. The dataset has 3,000 labeled examples in 12 categories and is imbalanced: one category makes up 40% of the sample, and three account for 1–2% each. Baseline accuracy is 78%. That sounds decent until you look at recall for the rare classes: 0.31, 0.44, 0.28. These classes (complaints and churn threats) matter most to the business.

This is a typical NLP development project. The problem is not the algorithm but that accuracy is the wrong metric. Our experience across 30+ projects shows: we start by analyzing business metrics and only then choose the model.

Why is accuracy the wrong metric for rare classes?

Accuracy ignores imbalance. If the "churn" class appears in 2% of cases, the model can predict "all good" and get 98% accuracy — but the business loses clients. Solution: F1 macro (averaged over all classes) or weighted F1. For NER — strict entity F1 (exact matches only). We guarantee: after choosing the correct metric, model quality becomes measurable and predictable.
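
The 2% example can be checked in a few lines. The sketch below uses plain Python and made-up labels ("ok", "churn") to compare accuracy with macro F1 for a model that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# 2% of tickets are churn threats; the model always answers "ok".
y_true = ["ok"] * 98 + ["churn"] * 2
y_pred = ["ok"] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

Accuracy comes out at 0.98 while macro F1 is just under 0.50, because the F1 of the "churn" class is zero.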

Text Classification: From BERT to Distillation

BERT-like models are the standard for classification. ruBERT-base or ruBERT-large from DeepPavlov for Russian. multilingual-e5-large — for multiple languages in one pipeline. XLM-RoBERTa-large — a strong multilingual backbone.

Fine-tuning for classification: add a classification head on top of the [CLS] token, train for 3-5 epochs with lr=2e-5, weight decay=0.01. For imbalance — weighted CrossEntropyLoss or focal loss with gamma=2.0. Contact us — we will show a code snippet.
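
A minimal sketch of the two loss options, written in plain Python for readability. A real training loop would more likely use the framework's built-in losses. The function names are hypothetical, the weighting follows the "balanced" heuristic n_samples / (n_classes * class_count), and the losses are shown per example without batch averaging.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, target, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    p_true = softmax(logits)[target]
    return -weights[target] * math.log(p_true)


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: easy examples (high p_true) are down-weighted."""
    p_true = softmax(logits)[target]
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


labels = [0] * 40 + [1] * 2          # 20:1 imbalance
weights = balanced_class_weights(labels)
print(weights)
print(weighted_cross_entropy([2.0, 0.5], target=1, weights=weights))
print(focal_loss([2.0, 0.5], target=1))
```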

Imbalance case study. Dataset — 3,000 examples, imbalance 1:20. Solution: class_weight via sklearn + CrossEntropyLoss. Additionally — augmentation of rare classes via backtranslation (ru→en→ru through MarianMT). Recall for rare classes rose from 0.31 to 0.67 with a slight drop in accuracy (76%→74%). Full NLP development end-to-end took 3 weeks.

Distillation for production. BERT-large gives F1 0.89, but CPU inference takes 180 ms. Distilling into DistilBERT cuts latency to 25 ms at F1 0.84; ruBERT-tiny2 goes further, to 12 ms at F1 0.81. Export to ONNX Runtime provides an additional 1.5–2x speedup. DistilBERT thus runs about 7x faster than BERT-large at a cost of 5 points of macro F1, a typical production trade-off.

Model	F1 macro	Latency (CPU)	Size
BERT-large	0.89	180 ms	1.3 GB
DistilBERT	0.84	25 ms	250 MB
ruBERT-tiny2	0.81	12 ms	120 MB
DistilBERT + ONNX	0.84	14 ms	150 MB

How to choose between BERT and LLM for your task?

For most classification and extraction tasks, BERT-sized models offer the best trade-off between cost and performance. Shift to LLMs only when the task demands generation, complex reasoning, or zero-shot generalization.

NER: Named Entity Recognition

NER — extracting persons, organizations, locations, dates, amounts, document numbers. For general categories (PER, ORG, LOC), pre-trained models work well. For specialized ones (medical terms, legal concepts) — fine-tuning is needed.

Data annotation. The main cost of an NER project. For a quality model — 500-2,000 labeled sentences per entity type. Tools: Label Studio (open source) or Prodigy (by spaCy creators). IOB2 format — standard.

Architecture. Token classification on top of BERT: each token gets a label (B-PER, I-PER, O). spaCy 3.x with transformer pipeline — a convenient production choice.
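
Whatever model produces the tags, the per-token labels still have to be grouped into entities. The sketch below shows one way to decode IOB2 tags into spans; the function name and the sample sentence are illustrative, and real pipelines also have to merge subword tokens first.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2 token tags into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []

    def flush():
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_entity:
            # A stray I- tag with a new label is treated like B-.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open entity
            flush()
            current_label, current_tokens = None, []
    flush()
    return spans


tokens = ["Anna", "Petrova", "works", "at", "Acme", "Corp"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
print(iob2_to_spans(tokens, tags))
```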

Nested entities. Standard IOB models cannot handle nested entities (organization inside an address). For such tasks — span-based NER: SpanBERT or SpERT. More complex but correct.

Post-processing is mandatory. The model predicts tokens — normalized entities are needed. Date — dateparser. Amounts — regex + validation. Names — deduplication via rapidfuzz. Included in our standard delivery.
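
A rough sketch of two of these post-processing steps using only the standard library. Here difflib stands in for rapidfuzz to keep the example dependency-free, date parsing is omitted, and the regex, threshold, and function names are assumptions.

```python
import re
from decimal import Decimal
from difflib import SequenceMatcher

# Digits with optional space-separated thousands and a "," or "." fraction.
AMOUNT_RE = re.compile(r"^\s*(\d{1,3}(?:[ \u00a0]\d{3})*|\d+)(?:[.,](\d{1,2}))?\s*$")


def parse_amount(raw):
    """Turn '1 250,50' into Decimal('1250.50'); return None if it does not validate."""
    match = AMOUNT_RE.match(raw)
    if match is None:
        return None
    whole = re.sub(r"[ \u00a0]", "", match.group(1))
    fraction = match.group(2) or "0"
    return Decimal(f"{whole}.{fraction}")


def _key(name):
    return " ".join(name.lower().split())


def dedupe_names(names, threshold=0.85):
    """Keep the first spelling of each name; drop later near-duplicates."""
    unique = []
    for name in names:
        is_duplicate = any(
            SequenceMatcher(None, _key(name), _key(kept)).ratio() >= threshold
            for kept in unique
        )
        if not is_duplicate:
            unique.append(name)
    return unique


print(parse_amount("1 250,50"))
print(dedupe_names(["Ivan Petrov", "ivan  petrov", "Ivan Petrof", "Anna Sidorova"]))
```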

Sentiment Analysis and Opinion Mining

Binary classification positive/negative works out of the box with BERT. Complexity — aspect-based sentiment analysis (ABSA): "the restaurant has good food but terrible service." For ABSA: aspect extraction (NER) + sentiment per aspect. Joint models BERT-for-ABSA — quality on Russian data is lower due to dataset scarcity. RuSentiment, SentiRuEval — main resources.

For production with simple positive/negative/neutral: distil models are enough. Three classes, balanced dataset, 2,000+ examples — F1 macro 0.82-0.87 in 1-2 days.

Text Summarization

Extractive summarization (select sentences) — TextRank or BM25 without training. Fast, no hallucinations. Good for long documents.
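
To show why extractive methods need no training, here is a compact TextRank-style sketch: sentences are graph nodes, edges are weighted by word overlap, and a PageRank-like iteration ranks them. The sentence splitter, the similarity guard, and the parameter defaults are simplifying assumptions.

```python
import math
import re


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Word overlap normalized by sentence sizes, guarded against log(1) = 0."""
    a, b = set(words_a), set(words_b)
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank_summary(text, k=2, damping=0.85, iterations=50):
    """Return the k highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= k:
        return sentences
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    return [sentences[i] for i in top]


text = (
    "The patient was admitted with chest pain. "
    "Chest pain started two days before admission. "
    "The weather was cold. "
    "An ECG was performed on admission and showed no acute changes."
)
print(textrank_summary(text, k=2))
```

Because every output sentence is copied from the input, this kind of summary cannot introduce facts that are absent from the source.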

Abstractive (generates new text) — seq2seq: mT5, mBART, FRED-T5, ruT5-large. For production via LLM API (GPT-4, Claude) — often the best cost/quality/speed trade-off.

Embeddings: Vector Representations of Text

Embeddings are the foundation of semantic search, deduplication, clustering, RAG. Quality critically affects downstream tasks.

Models. BGE-M3 and multilingual-e5-large are strong multilingual embedders; E5-large-v2 is a strong English-only option. sentence-transformers/paraphrase-multilingual-mpnet-base-v2 is a fast option. For Russian, ru-en-RoSBERTa (ai-forever) performs well on semantic textual similarity.

Embedding quality evaluation uses the MTEB benchmark as standard. But top results on MTEB don't guarantee success on a domain dataset — we build domain-specific eval.

Fine-tuning embeddings. If standard models don't give the required Recall@k — contrastive learning on domain pairs with MultipleNegativesRankingLoss. How to perform this for domain data:

Collect 500–2,000 semantically similar pairs from your domain.
Apply MultipleNegativesRankingLoss with a batch size of 32–64.
Train for 1–3 epochs using AdamW (lr=2e-5).
Evaluate Recall@k on a held-out domain test set.

This approach yields a 5–15% improvement in Recall@k in practice.
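
The idea behind the loss and the metric in the steps above can be sketched in plain Python. This is an illustration of in-batch-negatives training, not the library implementation: each anchor's paired text is the positive, and all other positives in the batch serve as negatives. The scale of 20 and cosine similarity are assumptions meant to mirror common defaults, and the toy vectors and document ids are made up.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of queries whose relevant document appears in the top k results."""
    hits = sum(rel in ranked[:k] for ranked, rel in zip(ranked_ids, relevant_ids))
    return hits / len(relevant_ids)


anchors = [[1.0, 0.0], [0.0, 1.0]]
good_positives = [[0.9, 0.1], [0.1, 0.9]]   # pairs point the same way
bad_positives = [[0.1, 0.9], [0.9, 0.1]]    # pairs are swapped
print(in_batch_negatives_loss(anchors, good_positives))
print(in_batch_negatives_loss(anchors, bad_positives))
print(recall_at_k([["d1", "d2", "d3"], ["d4", "d5", "d6"]], ["d2", "d6"], k=2))
```

Larger batches give each anchor more negatives, which is why the batch size in the steps above matters for this loss.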

Dimensionality and storage. E5-large: 1024 dim, float32 — 4KB per vector. For 10M documents — 40GB. Quantization int8 reduces to 10GB. FAISS IVF_PQ — more compact but with losses. Included in our deployment recommendations.
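
The arithmetic behind these figures, as a small calculator. It counts raw vector bytes only (decimal gigabytes) and ignores index overhead such as ids and codebooks. The product-quantization setting of 64 sub-vectors at 8 bits each is a hypothetical example, not a recommendation from the text.

```python
def index_size_gb(num_vectors, dim, bytes_per_value):
    """Raw storage for dense vectors, in decimal gigabytes."""
    return num_vectors * dim * bytes_per_value / 1e9


def pq_size_gb(num_vectors, num_subvectors, bits_per_code=8):
    """Raw storage for product-quantized codes (codebooks not included)."""
    return num_vectors * num_subvectors * bits_per_code / 8 / 1e9


NUM_DOCS = 10_000_000
DIM = 1024

print(f"bytes per float32 vector: {DIM * 4}")                       # 4096 bytes, about 4 KB
print(f"float32: {index_size_gb(NUM_DOCS, DIM, 4):.2f} GB")         # 40.96 GB
print(f"int8:    {index_size_gb(NUM_DOCS, DIM, 1):.2f} GB")         # 10.24 GB
print(f"PQ (64 sub-vectors, 8 bits): {pq_size_gb(NUM_DOCS, 64):.2f} GB")  # 0.64 GB
```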

Information Extraction

Structured extraction is a frequent task. Examples: key contract terms, technical characteristics, dates and amounts from invoices.

Regex + rule-based. For INN, OGRN, amounts, dates — more reliable than neural networks. No data required.
NER + post-processing. For variable formats.
LLM with structured output. GPT‑4 / Claude with JSON schema — for complex documents. Cost: minimal per document. For 10k+ documents/day — we calculate the economics.

We build a hybrid: regex/NER for typical fields plus an LLM for edge cases. This approach is backed by years of production experience and more than 30 projects.
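
One way such a hybrid could be wired together is sketched below for a single field, the 10-digit INN of a legal entity. The regex finds candidates, the check-digit rule validates them, and only documents with no valid match are passed to a fallback. The function names and the llm_fallback callable are hypothetical placeholders; no real LLM API is called.

```python
import re

INN_RE = re.compile(r"\b\d{10}\b")
# Check-digit weights for the 10-digit INN.
INN10_WEIGHTS = (2, 4, 10, 3, 5, 9, 4, 6, 8)


def is_valid_inn10(inn):
    """Validate a 10-digit INN using its check digit."""
    if not re.fullmatch(r"\d{10}", inn):
        return False
    digits = [int(ch) for ch in inn]
    checksum = sum(w * d for w, d in zip(INN10_WEIGHTS, digits)) % 11 % 10
    return checksum == digits[9]


def extract_inn(text, llm_fallback=None):
    """Try regex plus validation first; call the fallback only for edge cases."""
    for candidate in INN_RE.findall(text):
        if is_valid_inn10(candidate):
            return {"inn": candidate, "source": "regex"}
    if llm_fallback is not None:
        return {"inn": llm_fallback(text), "source": "llm"}
    return {"inn": None, "source": "none"}


print(extract_inn("Supplier INN 7707083893, invoice No. 12"))
print(extract_inn("INN is written out in words", llm_fallback=lambda text: None))
```

Routing this way keeps the cheap, deterministic path for the bulk of documents and limits LLM cost to the remainder.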

Work Stages

Stage	Duration	What's included
Data and metric analysis	3-5 days	Class distribution, text lengths, baseline
Baseline (TF‑IDF + LogReg)	1 day	Quick estimate of gap with deep models
Training and validation	1-2 weeks	k‑fold, early stopping, error analysis
Deployment (ONNX + FastAPI)	1-2 weeks	REST API, batching, monitoring
Documentation and training	2-3 days	Model card, API docs, team training

Prototype on existing data — 1-3 weeks. Production system with CI/CD — 1.5–2.5 months. Cost is calculated individually — get a consultation for a project estimate.

What's Included

Model and pipeline architecture documentation
Access to the model via REST API (FastAPI + ONNX)
Client team training (2-hour webinar + Q&A)
Accuracy guarantee on the agreed test set
Months of post-delivery support (bug fixes, adaptation to new data)

Our Experience

Years of NLP projects from classification to RAG systems. The team includes ML engineers experienced with Hugging Face, spaCy, LangChain, MLOps. We use vLLM, Kubeflow, Weights & Biases — a production stack, not toys. Contact us to evaluate your NLP project within two days — request a free consultation on your text processing pipeline.