Developing an AI System for Automatic Resume Parsing from Job Sites
Parsing resumes in bulk from hh.ru, SuperJob, and Rabota.ru populates the candidate database automatically, with no manual searching. The system collects, normalizes, and structures data from these sources.
API vs Parsing
In Russia, hh.ru and SuperJob offer official APIs for employers. This is the preferred path: it is officially supported, reliable, and does not violate the sites' terms of service.
- hh.ru API: resume search endpoint and detailed resume data; requires the "Resume Database Access" plan, from 5,000 RUB/month
- SuperJob API: similar functionality
- Rabota.ru: parsing (the API is available only to partners)
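
One way to keep the choice of access path explicit is a small source registry that the collectors consult. The sketch below is only an illustration: the names `AccessMethod`, `SourceConfig`, `SOURCES`, and `sources_using` are hypothetical, and the notes simply restate the list above.

```python
from dataclasses import dataclass
from enum import Enum


class AccessMethod(Enum):
    OFFICIAL_API = "official_api"
    PARSING = "parsing"


@dataclass(frozen=True)
class SourceConfig:
    name: str
    method: AccessMethod
    note: str


# Hypothetical registry; the values mirror the list above, nothing more.
SOURCES = {
    "hh.ru": SourceConfig("hh.ru", AccessMethod.OFFICIAL_API, "paid resume database access"),
    "superjob": SourceConfig("superjob", AccessMethod.OFFICIAL_API, "similar functionality"),
    "rabota.ru": SourceConfig("rabota.ru", AccessMethod.PARSING, "API for partners only"),
}


def sources_using(method: AccessMethod) -> list[str]:
    """Return the names of all sources collected with the given method."""
    return sorted(name for name, cfg in SOURCES.items() if cfg.method is method)
```

A collector can then branch on `SOURCES[name].method` instead of hard-coding the site name in several places.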
Data Normalization from Different Sources
Each job site has its own data structure, so records from every source are normalized to a unified schema:
from datetime import datetime

from pydantic import BaseModel

# WorkExperience, Education, and LanguageSkill are assumed to be
# Pydantic models defined elsewhere.


class NormalizedResume(BaseModel):
    source: str  # "hh.ru" | "superjob" | "rabota.ru"
    source_id: str  # ID on the source site
    full_name: str
    age: int | None
    city: str | None
    desired_position: str
    desired_salary: int | None
    currency: str
    experience: list[WorkExperience]
    education: list[Education]
    skills: list[str]  # normalized skills
    languages: list[LanguageSkill]
    last_updated: datetime
    # AI enrichment
    seniority_level: str  # junior/middle/senior/lead, as assessed by AI
    tech_stack: list[str]  # technology stack, extracted by AI
    experience_years: float  # total experience
Candidate Deduplication
One person may post resumes on multiple sites. Duplicates are detected through:
- Phone/email matching (if visible)
- Semantic similarity of work experience (embeddings)
- Fuzzy matching by name + city + current employer
Rule: at similarity above 0.85, suggest a merge; above 0.95, merge automatically.
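
The signals and thresholds above could be combined roughly as follows. This is a sketch under several assumptions: an exact phone or email match is treated as similarity 1.0, the fuzzy and semantic scores are averaged with equal weights, and each candidate dict carries a precomputed `embedding` of the work experience (from whatever embedding model is in use). None of these choices comes from the original text; only the 0.85 and 0.95 thresholds do.

```python
import math
from difflib import SequenceMatcher

SUGGEST_THRESHOLD = 0.85
AUTO_THRESHOLD = 0.95


def normalize_phone(phone: str) -> str:
    """Keep digits only and compare by the last 10 (drops +7 / 8 prefixes)."""
    return "".join(ch for ch in phone if ch.isdigit())[-10:]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuzzy_identity_score(a: dict, b: dict) -> float:
    """Fuzzy match on name + city + current employer."""
    def key(c: dict) -> str:
        return " ".join(
            str(c.get(field) or "") for field in ("full_name", "city", "employer")
        ).lower()
    return SequenceMatcher(None, key(a), key(b)).ratio()


def candidate_similarity(a: dict, b: dict) -> float:
    # Assumption: an exact contact match is decisive.
    email_a = (a.get("email") or "").lower()
    email_b = (b.get("email") or "").lower()
    if email_a and email_a == email_b:
        return 1.0
    phone_a = normalize_phone(a.get("phone") or "")
    phone_b = normalize_phone(b.get("phone") or "")
    if phone_a and phone_a == phone_b:
        return 1.0
    # Assumption: equal weights for the fuzzy and semantic signals.
    fuzzy = fuzzy_identity_score(a, b)
    semantic = cosine_similarity(a.get("embedding") or [], b.get("embedding") or [])
    return 0.5 * fuzzy + 0.5 * semantic


def merge_decision(score: float) -> str:
    if score > AUTO_THRESHOLD:
        return "auto_merge"
    if score > SUGGEST_THRESHOLD:
        return "suggest_merge"
    return "no_match"
```

For example, `merge_decision(candidate_similarity(a, b))` returns `"suggest_merge"` for a pair scoring 0.9, which leaves the final call to a recruiter.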
Candidate Database Updates
Resumes become outdated. Update triggers:
- The candidate updates the resume on the source site (detected via webhook or periodic polling)
- 30 days pass without changes: verify that the resume is still relevant
- The candidate applies to a job posting: priority update







