Which job sites does the system support?

The system works with hh.ru (via API or parsing), SuperJob (API), Rabota.ru (parsing), and LinkedIn (parsing). We can add other sources on request.

How do you handle duplicate resumes?

We use multi-level deduplication: exact contact match (phone, email), semantic similarity of experience text via embeddings (threshold 0.85), and fuzzy matching by name+city+employer. When similarity > 0.95, merging is automatic.

What does AI enrichment of a resume include?

An AI model (based on GPT-4o or LLaMA 3) determines the candidate's grade (junior/middle/senior/lead), extracts the tech stack, calculates total experience in years, and adds missing skills from context. Enrichment accuracy exceeds 90%.

How are outdated resumes updated?

The system checks each resume for freshness every 30 days. It also reacts to webhook notifications from job sites (where available) and to candidate applications; an application triggers a priority update.

How long does system implementation take?

Basic implementation (2 sources + ATS) takes 2 to 4 weeks. The extended version with AI enrichment and custom rules takes up to 8 weeks. After launch, we provide support and further adjustments.

AI-Powered Resume Parsing from Job Sites: hh.ru, SuperJob, Rabota.ru

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+ years of work, 900+ completed projects, 100+ in-house employees, 19+ partners.

We handle mass resume parsing from hh.ru, SuperJob, Rabota.ru, and LinkedIn as a turnkey service. The system processes thousands of new resumes daily and reduces manual recruiter work by 90%. Instead of copying data from 3–4 sources, you get a single, automatically updated candidate database with AI enrichment: grade (junior/middle/senior/lead), tech stack, and total experience in years. Costs drop by up to 80%, which saves over 500,000 RUB per year for every 1,000 resumes. With official APIs, collecting one resume costs about 0.1 RUB, versus 10 RUB for manual copying. Below we cover the technical details: how to avoid blocks during parsing, how to normalize heterogeneous data schemas, and how not to drown in duplicates. All solutions respect robots.txt and the terms of the official APIs.

API vs Parsing: Strategy Choice

Criterion	Official API	Parsing (HTML scraping)
Reliability	High, not blocked	Medium, requires anti-blocking measures
Speed	High (up to 1000 requests/min)	Low (≤5 requests/sec, i.e. ≤300 requests/min)
Data Completeness	Full structured info	Only visible fields; captchas possible
Legal Safety	Allowed by ToS	Gray areas, IP ban risk
Cost	Paid	Free but resource-heavy

For Russia: hh.ru and SuperJob have official APIs for employers, and we recommend starting with them. We use parsing only for Rabota.ru and LinkedIn, where APIs are absent or limited. With an API, the cost per resume is minimal and reliability is about 10 times higher than with HTML scraping.

How to Reduce Parsing Block Risks?

For LinkedIn and Rabota.ru, we use Playwright with user-agent rotation and proxies. On one project with 500 resumes per day, we ran into a captcha on Rabota.ru and had to integrate an image recognition service. After this adaptation, parsing stability reached 98%.
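The rotation and throttling idea can be sketched without any browser code. The example below is a minimal illustration, not our production parser: the class name `RotatingSession`, the user-agent strings, and the proxy URLs are placeholders, and the 5 requests per second limit is taken from the comparison table above.

```python
import itertools
import time

# Placeholder values for illustration only
USER_AGENTS = ["agent-a", "agent-b", "agent-c"]
PROXIES = ["http://proxy-1.example:8080", "http://proxy-2.example:8080"]


class RotatingSession:
    """Cycles user agents and proxies and keeps a minimum delay between requests."""

    def __init__(self, user_agents, proxies, max_rps=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._agents = itertools.cycle(user_agents)
        self._proxies = itertools.cycle(proxies)
        self._min_interval = 1.0 / max_rps
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def next_request_settings(self):
        """Wait if needed, then return the user agent and proxy for the next request."""
        now = self._clock()
        if self._last_request is not None:
            wait = self._min_interval - (now - self._last_request)
            if wait > 0:
                self._sleep(wait)
                now = self._clock()
        self._last_request = now
        return {"user_agent": next(self._agents), "proxy": next(self._proxies)}
```

Each call returns the settings for one request; they can then be handed to whatever opens the page, for example when a new browser context is created. The clock and sleep functions are injectable so the throttling can be tested without real waiting.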

Data Normalization: Key to a Unified Database

Each job site returns data in its own format. Without normalization, you cannot merge resumes into a single database. We convert all resumes to a unified schema using Pydantic:

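from datetime import datetime
from pydantic import BaseModel

class WorkExperience(BaseModel):   # nested models: fields follow the example resume below
    company: str
    position: str
    start_date: str              # "YYYY-MM"
    end_date: str | None = None
    description: str | None = None

class Education(BaseModel):
    institution: str
    degree: str | None = None
    field: str | None = None
    graduation_year: int | None = None

class LanguageSkill(BaseModel):
    language: str
    level: str                   # e.g. "B2"

class NormalizedResume(BaseModel):
    source: str                  # "hh.ru" | "superjob" | "rabota.ru"
    source_id: str               # ID on the source
    full_name: str
    age: int | None
    city: str | None
    desired_position: str
    desired_salary: int | None
    currency: str

    experience: list[WorkExperience]
    education: list[Education]
    skills: list[str]            # normalized skills
    languages: list[LanguageSkill]
    last_updated: datetime

    # AI enrichment
    seniority_level: str         # junior/middle/senior/lead — AI estimate
    tech_stack: list[str]        # tech stack — AI extracted
    experience_years: float      # total experience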

Example of a normalized resume

{
  "source": "hh.ru",
  "source_id": "123456",
  "full_name": "Ivan Ivanov",
  "age": 30,
  "city": "Moscow",
  "desired_position": "Python Developer",
  "desired_salary": 200000,
  "currency": "RUB",
  "experience": [
    {
      "company": "LLC Technologies",
      "position": "Senior Python developer",
      "start_date": "2020-01",
      "end_date": "2023-06",
      "description": "Backend development on FastAPI"
    }
  ],
  "education": [
    {
      "institution": "Moscow State University",
      "degree": "Bachelor",
      "field": "Applied Mathematics",
      "graduation_year": 2016
    }
  ],
  "skills": ["Python", "FastAPI", "PostgreSQL"],
  "languages": [{"language": "English", "level": "B2"}],
  "last_updated": "2025-02-01T10:00:00",
  "seniority_level": "senior",
  "tech_stack": ["Python", "FastAPI", "PostgreSQL", "Docker"],
  "experience_years": 8.5
}
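One way the `experience_years` field could be derived from the `experience` entries is sketched below. This is an assumption for illustration, not a description of the enrichment model: it expects `start_date` and `end_date` in the "YYYY-MM" format used in the example, requires both dates to be present, and counts overlapping jobs only once. The function names are hypothetical.

```python
def parse_month(value):
    """Convert "YYYY-MM" to a running month index."""
    year, month = value.split("-")
    return int(year) * 12 + int(month) - 1


def total_experience_years(experience):
    """Sum employment time in years, counting overlapping jobs once."""
    intervals = sorted(
        (parse_month(job["start_date"]), parse_month(job["end_date"]))
        for job in experience
    )
    total_months = 0
    current_start, current_end = None, None
    for start, end in intervals:
        if current_end is None or start > current_end:
            # Disjoint interval: close the previous one and open a new one
            if current_end is not None:
                total_months += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the open one
            current_end = max(current_end, end)
    if current_end is not None:
        total_months += current_end - current_start
    return round(total_months / 12, 1)


jobs = [
    {"start_date": "2015-01", "end_date": "2018-01"},
    {"start_date": "2017-07", "end_date": "2020-01"},  # overlaps the first job
    {"start_date": "2020-01", "end_date": "2023-07"},
]
print(total_experience_years(jobs))  # 8.5
```

The three jobs cover January 2015 to July 2023 without gaps, which is 102 months, so the overlap between the first two jobs is not counted twice.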

What Does AI Enrichment of Resumes Provide?

The AI model (GPT-4o or LLaMA 3) determines the grade and technologies on the fly—it is 1.4 times more accurate than manual tagging. Recruiter time savings amount to up to 80% per resume.

Candidate Deduplication: Three-Level Method

One candidate often posts resumes on 2–3 sites. Our system detects duplicates using a three-level method:

Method	Basis	Accuracy / threshold	Action on Match
Exact contact match	Phone/email (if public)	100% accuracy	Automatic merge
Semantic similarity	Embeddings `intfloat/multilingual-e5-large`	Similarity > 0.85	Suggest merge
Fuzzy matching	Name + city + current employer (Levenshtein distance)	Similarity > 0.95	Automatic merge

A semantic similarity above 0.85 produces a merge suggestion; an exact contact match or a fuzzy-match score above 0.95 triggers an automatic merge. This eliminates up to 95% of duplicates without data loss.
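The three levels can be combined into a single decision function. The sketch below is a simplified illustration under several assumptions: embeddings are taken as precomputed lists of floats, `difflib.SequenceMatcher` from the standard library stands in for a Levenshtein-based score, and the dictionary keys (`contacts`, `employer`, `embedding`) are hypothetical names rather than fields of the schema above.

```python
import math
from difflib import SequenceMatcher

SUGGEST_THRESHOLD = 0.85  # semantic similarity threshold from the table above
AUTO_THRESHOLD = 0.95     # fuzzy-match threshold from the table above


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuzzy_key(resume):
    """Name + city + current employer as one lowercase string."""
    return " ".join([resume["full_name"], resume["city"], resume["employer"]]).lower()


def dedup_decision(a, b):
    """Return "merge", "suggest", or "distinct" for a pair of resumes."""
    # Level 1: any shared phone or email means the same person
    if set(a["contacts"]) & set(b["contacts"]):
        return "merge"
    # Level 2: near-identical name + city + employer
    if SequenceMatcher(None, fuzzy_key(a), fuzzy_key(b)).ratio() > AUTO_THRESHOLD:
        return "merge"
    # Level 3: similar experience text, left for a human to confirm
    if cosine(a["embedding"], b["embedding"]) > SUGGEST_THRESHOLD:
        return "suggest"
    return "distinct"
```

Checking the cheap exact match first means embeddings are only compared for pairs that the first two levels could not resolve.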

Trigger-Based Resume Database Update

Resumes become outdated, so the system updates them based on triggers:

Candidate updates resume on the source (webhook or periodic poll every hour).
30 days without changes—background reparsing.
Candidate applies for a vacancy—priority update.

With hourly polling, a change made on the source reaches the database within at most 1 hour.
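The trigger rules above can be expressed as a small priority function. This is a minimal sketch: the function name, the flag names, and the returned action labels are hypothetical, and only the 30-day threshold comes from the text.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=30)  # background reparse threshold from the list above


def refresh_action(last_updated, now, source_changed=False, candidate_applied=False):
    """Pick the refresh action for one resume, highest priority first."""
    if candidate_applied:
        return "priority_update"      # the candidate applied for a vacancy
    if source_changed:
        return "update"               # webhook or hourly poll reported a change
    if now - last_updated >= STALE_AFTER:
        return "background_reparse"   # 30 days without changes
    return "skip"


now = datetime(2025, 3, 10, 12, 0)
print(refresh_action(datetime(2025, 2, 1, 10, 0), now))  # background_reparse
```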

Implementation Stages of Resume Parsing System

Analysis: determine sources, data volumes, ATS requirements. Collect sample resumes for testing.
Design: choose between API and parsing, design normalization schema, deduplication and enrichment pipeline.
Implementation: write parsers (Scrapy/Playwright), connect AI model, set up deduplication and ATS integration.
Testing: run on test data, check extraction accuracy, speed, and reliability.
Deployment: deploy on servers (Docker, Kubernetes), set up monitoring (Grafana, Prometheus) and CI/CD.

What's Included in the Work

Documentation: architecture description, data schemas, operation manual.
Access: to backend (FastAPI), admin panel, Grafana metrics.
Training: 2 sessions for your team (administration and rule configuration).
Support: 2 weeks after launch + 6-month code warranty.

On request, we also add custom enrichment rules: for example, extracting certificates, projects, or soft skills via few-shot prompts for the LLM. We test on a sample of 100 resumes.

Get a consultation on your project today—we will evaluate your project in 1 day and propose the best solution. Order a turnkey system development and automate personnel recruitment.

NLP Development: Text Classification, NER, Embeddings, and Information Extraction

We often receive a task: process 50,000 support tickets — currently all manual. Dataset — 3,000 labeled examples, 12 categories, imbalance: one category occupies 40% of the sample, three at 1-2% each. Baseline accuracy — 78%. Sounds decent until you look at recall for rare classes: 0.31, 0.44, 0.28. These classes — complaints and churn threats — are most important to the business.

This is a typical NLP development project. The problem is not the algorithm but that accuracy is the wrong metric. Our experience across 30+ projects shows: we start by analyzing business metrics and only then choose the model.

Why is accuracy not the right metric for rare classes?

Accuracy ignores imbalance. If the "churn" class appears in 2% of cases, the model can predict "all good" and get 98% accuracy — but the business loses clients. Solution: F1 macro (averaged over all classes) or weighted F1. For NER — strict entity F1 (exact matches only). We guarantee: after choosing the correct metric, model quality becomes measurable and predictable.
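The 98% example can be reproduced in a few lines. The sketch below uses plain Python rather than a metrics library, with a made-up sample of 100 tickets in which 2 belong to the "churn" class and the model always predicts "ok".

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class; 0.0 when the class is never predicted correctly."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


y_true = ["ok"] * 98 + ["churn"] * 2
y_pred = ["ok"] * 100  # the model never predicts churn

print(accuracy(y_true, y_pred))            # 0.98
print(round(macro_f1(y_true, y_pred), 3))  # 0.495
```

Accuracy stays at 0.98, while macro F1 falls to about 0.49 because the "churn" class contributes an F1 of zero.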

Text Classification: From BERT to Distillation

BERT-like models are the standard for classification. ruBERT-base or ruBERT-large from DeepPavlov for Russian. multilingual-e5-large — for multiple languages in one pipeline. XLM-RoBERTa-large — a strong multilingual backbone.

Fine-tuning for classification: add a classification head on top of the [CLS] token, train for 3-5 epochs with lr=2e-5, weight decay=0.01. For imbalance — weighted CrossEntropyLoss or focal loss with gamma=2.0. Contact us — we will show a code snippet.

Imbalance case study. Dataset — 3,000 examples, imbalance 1:20. Solution: class_weight via sklearn + CrossEntropyLoss. Additionally — augmentation of rare classes via backtranslation (ru→en→ru through MarianMT). Recall for rare classes rose from 0.31 to 0.67 with a slight drop in accuracy (76%→74%). Full NLP development end-to-end took 3 weeks.
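The two imbalance tools mentioned above, class weights and focal loss, reduce to short formulas. The sketch below is framework-free and only illustrates the arithmetic: `balanced_class_weights` follows the n_samples / (n_classes * class_count) formula that sklearn uses for its "balanced" mode, and `focal_loss` is the per-example loss given the predicted probability of the true class. The label names and counts are made up.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss for one example; gamma=0 gives weighted cross-entropy."""
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)


labels = ["frequent"] * 2000 + ["rare"] * 100  # 20:1 imbalance
weights = balanced_class_weights(labels)
print(weights)  # {'frequent': 0.525, 'rare': 10.5}

# An easy example (p=0.9) is down-weighted far more than a hard one (p=0.3)
print(focal_loss(0.9) / focal_loss(0.9, gamma=0.0))
print(focal_loss(0.3) / focal_loss(0.3, gamma=0.0))
```

With gamma=2.0, the loss of an example predicted at 0.9 is scaled by (1 - 0.9)^2, about 0.01, while one predicted at 0.3 is scaled by about 0.49, so training focuses on the hard, usually rare, examples.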

Distillation for production. BERT-large gives F1 0.89, but inference on CPU takes 180 ms. Distillation into DistilBERT reduces latency to 25 ms with F1 0.84; ruBERT-tiny2 goes further, to 12 ms with F1 0.81. Export to ONNX Runtime provides an additional 1.5-2x speedup. DistilBERT achieves about 7x lower latency than BERT-large with a 0.05 drop in macro F1 (0.89 to 0.84), a typical production trade-off.

Model	F1 macro	Latency (CPU)	Size
BERT-large	0.89	180 ms	1.3 GB
DistilBERT	0.84	25 ms	250 MB
ruBERT-tiny2	0.81	12 ms	120 MB
DistilBERT + ONNX	0.84	14 ms	150 MB

How to choose between BERT and LLM for your task?

For most classification and extraction tasks, BERT-sized models offer the best trade-off between cost and performance. Shift to LLMs only when the task demands generation, complex reasoning, or zero-shot generalization.

NER: Named Entity Recognition

NER — extracting persons, organizations, locations, dates, amounts, document numbers. For general categories (PER, ORG, LOC), pre-trained models work well. For specialized ones (medical terms, legal concepts) — fine-tuning is needed.

Data annotation. The main cost of an NER project. For a quality model — 500-2,000 labeled sentences per entity type. Tools: Label Studio (open source) or Prodigy (by spaCy creators). IOB2 format — standard.

Architecture. Token classification on top of BERT: each token gets a label (B-PER, I-PER, O). spaCy 3.x with transformer pipeline — a convenient production choice.
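A token classifier outputs one tag per token, so the tags still have to be grouped into entities. The function below is a minimal sketch of IOB2 decoding, assuming word-level tokens that are joined with spaces; the function name and the example sentence are illustrative.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the open one first
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Ivan", "Ivanov", "works", "at", "LLC", "Technologies", "in", "Moscow"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
print(iob2_to_spans(tokens, tags))
# [('PER', 'Ivan Ivanov'), ('ORG', 'LLC Technologies'), ('LOC', 'Moscow')]
```

This version is strict: an I- tag without a matching B- tag is dropped rather than repaired, which is one possible policy among several.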

Nested entities. Standard IOB models cannot handle nested entities (organization inside an address). For such tasks — span-based NER: SpanBERT or SpERT. More complex but correct.

Post-processing is mandatory. The model predicts tokens — normalized entities are needed. Date — dateparser. Amounts — regex + validation. Names — deduplication via rapidfuzz. Included in our standard delivery.

Sentiment Analysis and Opinion Mining

Binary classification positive/negative works out of the box with BERT. Complexity — aspect-based sentiment analysis (ABSA): "the restaurant has good food but terrible service." For ABSA: aspect extraction (NER) + sentiment per aspect. Joint models BERT-for-ABSA — quality on Russian data is lower due to dataset scarcity. RuSentiment, SentiRuEval — main resources.

For production with simple positive/negative/neutral: distil models are enough. Three classes, balanced dataset, 2,000+ examples — F1 macro 0.82-0.87 in 1-2 days.

Text Summarization

Extractive summarization (select sentences) — TextRank or BM25 without training. Fast, no hallucinations. Good for long documents.

Abstractive (generates new text) — seq2seq: mT5, mBART, FRED-T5, ruT5-large. For production via LLM API (GPT-4, Claude) — often the best cost/quality/speed trade-off.

Embeddings: Vector Representations of Text

Embeddings are the foundation of semantic search, deduplication, clustering, RAG. Quality critically affects downstream tasks.

Models. BGE-M3 and multilingual-e5-large — strong multilingual embedders; E5-large-v2 — a strong English-only option. sentence-transformers/paraphrase-multilingual-mpnet-base-v2 — fast option. For Russian: ru-en-RoSBERTa (ai-forever) performs well on semantic textual similarity.

Embedding quality evaluation uses the MTEB benchmark as standard. But top results on MTEB don't guarantee success on a domain dataset — we build domain-specific eval.

Fine-tuning embeddings. If standard models don't give the required Recall@k — contrastive learning on domain pairs with MultipleNegativesRankingLoss. How to perform this for domain data:

Collect 500–2,000 semantically similar pairs from your domain.
Apply MultipleNegativesRankingLoss with a batch size of 32–64.
Train for 1–3 epochs using AdamW (lr=2e-5).
Evaluate Recall@k on a held-out domain test set.

This approach yields a 5–15% improvement in Recall@k in practice.
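The evaluation step relies on Recall@k, which is simple to compute once the retrieval results are available. The sketch below assumes each query comes with a ranked list of document IDs and a set of relevant IDs; the data is made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of relevant documents that appear in the top-k results for one query."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(results, k):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in results) / len(results)


results = [
    (["d3", "d1", "d7"], {"d1"}),        # the relevant document is at rank 2
    (["d2", "d9", "d4"], {"d4", "d5"}),  # one of two relevant documents is at rank 3
]
print(mean_recall_at_k(results, k=1))  # 0.0
print(mean_recall_at_k(results, k=3))  # 0.75
```

Running the same function on the base model and on the fine-tuned model, with the same held-out queries, gives the before/after comparison.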

Dimensionality and storage. E5-large: 1024 dim, float32 — 4KB per vector. For 10M documents — 40GB. Quantization int8 reduces to 10GB. FAISS IVF_PQ — more compact but with losses. Included in our deployment recommendations.
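The storage figures follow from a single multiplication. The helper below is an illustrative calculation for raw vectors only, using decimal gigabytes and ignoring index overhead; the function name is hypothetical.

```python
def index_size_gb(n_vectors, dim, bytes_per_value):
    """Raw vector storage in decimal gigabytes, without index overhead."""
    return n_vectors * dim * bytes_per_value / 1e9


print(index_size_gb(10_000_000, 1024, 4))  # float32: 40.96
print(index_size_gb(10_000_000, 1024, 1))  # int8: 10.24
```

One float32 vector takes 1024 * 4 = 4096 bytes, so 10M documents need about 41 GB, and int8 quantization cuts that to about 10 GB.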

Information Extraction

Structured extraction is a frequent task. Examples: key contract terms, technical characteristics, dates and amounts from invoices.

Regex + rule-based. For INN, OGRN, amounts, dates — more reliable than neural networks. No data required.
NER + post-processing. For variable formats.
LLM with structured output. GPT‑4 / Claude with JSON schema — for complex documents. Cost: minimal per document. For 10k+ documents/day — we calculate the economics.
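For the rule-based option, a regex plus a checksum is often enough. The sketch below extracts 10-digit INN numbers (the legal-entity format) and validates the control digit; 12-digit individual INNs use two control digits and are left out for brevity. The function names and the sample text are illustrative.

```python
import re

# Weights for the control digit of a 10-digit INN
INN_10_WEIGHTS = (2, 4, 10, 3, 5, 9, 4, 6, 8)
INN_10_PATTERN = re.compile(r"\b\d{10}\b")


def is_valid_inn10(value):
    """Check the format and the control digit of a 10-digit INN."""
    if not re.fullmatch(r"\d{10}", value):
        return False
    digits = [int(ch) for ch in value]
    checksum = sum(w * d for w, d in zip(INN_10_WEIGHTS, digits)) % 11 % 10
    return checksum == digits[9]


def extract_inn10(text):
    """Return all 10-digit numbers in the text that pass the INN checksum."""
    return [match for match in INN_10_PATTERN.findall(text) if is_valid_inn10(match)]


text = "Supplier INN 7707083893, contract no. 1234567890"
print(extract_inn10(text))  # ['7707083893']
```

The checksum filters out most random 10-digit numbers, such as the contract number in the example, which a bare regex would accept.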

We use a hybrid approach: regex/NER for typical fields + LLM for edge cases. It is backed by years of production experience and more than 30 projects.

Work Stages

Stage	Duration	What's included
Data and metric analysis	3-5 days	Class distribution, text lengths, baseline
Baseline (TF‑IDF + LogReg)	1 day	Quick estimate of gap with deep models
Training and validation	1-2 weeks	k‑fold, early stopping, error analysis
Deployment (ONNX + FastAPI)	1-2 weeks	REST API, batching, monitoring
Documentation and training	2-3 days	Model card, API docs, team training

Prototype on existing data — 1-3 weeks. Production system with CI/CD — 1.5–2.5 months. Cost is calculated individually — get a consultation for a project estimate.

What's Included

Model and pipeline architecture documentation
Access to the model via REST API (FastAPI + ONNX)
Client team training (2-hour webinar + Q&A)
Accuracy guarantee on the agreed test set
Months of post-delivery support (bug fixes, adaptation to new data)

Our Experience

Years of NLP projects from classification to RAG systems. The team includes ML engineers experienced with Hugging Face, spaCy, LangChain, MLOps. We use vLLM, Kubeflow, Weights & Biases — a production stack, not toys. Contact us to evaluate your NLP project within two days — request a free consultation on your text processing pipeline.