Developing an AI System for Disease Risk Prediction
Predictive medicine is a shift from treatment to prevention. AI risk models allow intervention before disease appears, when preventive measures are most effective and least expensive.
Risk Prediction Tasks
Population Screening
Identifying high-risk patients across the clinic's entire registered population so that they can be actively invited to screening. Applications: type 2 diabetes, cardiovascular disease, cancer, chronic kidney disease.
Individual Prediction
10-year risk of a cardiovascular event, where classical models (Framingham, SCORE2) are compared against ML. ML models can outperform classical risk scores through:
- Non-linear feature interactions
- Greater number of predictors
- Training on local population data
Disease Progression
For a patient at an early stage, when will the disease progress to a severe stage? For a diabetic patient, what is the risk of nephropathy or retinopathy? These questions are handled with survival models (Cox PH, Random Survival Forest, DeepHit) that use time-to-event endpoints.
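To make the time-to-event framing concrete, here is a minimal sketch of a Kaplan-Meier estimator. It is not one of the models named above (Cox PH, Random Survival Forest, DeepHit); it is swapped in as the simplest way to show how censored follow-up enters a survival estimate. The function name `kaplan_meier` and the follow-up data are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient (for example, months)
    events:    1 if progression was observed, 0 if the patient was censored
    """
    # Number of observed progressions at each time point
    event_counts = Counter(t for t, e in zip(durations, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(event_counts):
        # Patients still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - event_counts[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical cohort: months of follow-up and whether progression was seen
durations = [6, 12, 12, 18, 24, 30]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"month {t}: progression-free probability {s:.3f}")
```

Censored patients (event flag 0) never count as progressions, but they stay in the at-risk denominator until their follow-up ends. The survival models listed above build on this idea by adding covariates.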
Data Sources
Structured EHR Data
- Diagnoses (ICD-10 codes), procedures (procedure codes)
- Laboratory data: glucose, HbA1c, lipids, CBC, biochemistry
- Medication prescriptions
- Vital signs from visits
- Demographics
Genomic Data
SNPs (single nucleotide polymorphisms) are used to build polygenic risk scores. Examples of known genetic risk factors: BRCA1/2 for breast cancer, ApoE4 for Alzheimer's disease, PCSK9 for cardiovascular disease. A polygenic risk score (PRS) is a weighted sum over thousands of SNPs. The ML task is to find the optimal weighting for a specific population.
Lifestyle and Social Factors
Smoking, alcohol, physical activity, body mass index, diet, psychosocial stress, education level, healthcare access. Sources: EMR, questionnaires, wearables.
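The weighted-sum definition of a PRS can be sketched in a few lines. The SNP identifiers, weights, and reference scores below are made up for illustration. Treating an ungenotyped SNP as zero risk alleles is a simplifying assumption, since real pipelines usually impute missing genotypes.

```python
from statistics import mean, pstdev


def polygenic_risk_score(genotype, weights):
    """Weighted sum of risk-allele counts (0, 1 or 2 per SNP).

    Assumption: a SNP missing from `genotype` contributes 0.
    """
    return sum(w * genotype.get(snp, 0) for snp, w in weights.items())


def standardize(score, reference_scores):
    """Express a raw score as a z-score against a reference population."""
    return (score - mean(reference_scores)) / pstdev(reference_scores)


# Hypothetical per-SNP weights (for example, log odds ratios)
weights = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}
# Hypothetical patient genotype: number of risk alleles per SNP
patient = {"rs0001": 2, "rs0002": 1, "rs0003": 0}

score = polygenic_risk_score(patient, weights)
z = standardize(score, reference_scores=[0.09, 0.19, 0.29])
print(f"raw PRS = {score:.2f}, z-score = {z:.2f}")
```

Standardizing against a local reference population is one simple way to make a raw score comparable across patients. Re-estimating the weights themselves for that population is the harder ML task described above.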
Models and Validation
For Tabular EHR Data
XGBoost and LightGBM are the dominant approaches on real medical data. Advantages: native handling of missing values, interpretability via SHAP, good performance on small datasets.
For Time Series (Longitudinal EHR)
Transformer-based models (BERT applied to medical codes: BEHRT, Med-BERT). A patient is represented as a sequence of medical events over time. The models are pretrained on very large EMR databases and then fine-tuned on specific risk tasks.
Calibration Is Mandatory
A risk score of "68%" must correspond to a 68% probability: among patients given that score, about 68% should actually develop the event. Apply Platt scaling or isotonic regression after training. A calibration plot (reliability diagram) is a required element in papers and validation reports.
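The "patient as a sequence of events" idea can be illustrated with a small tokenization sketch. It borrows the BERT-style `[CLS]` and `[SEP]` special tokens as an assumption and does not reproduce the exact input format of BEHRT or Med-BERT. The visits and helper names are hypothetical.

```python
from datetime import date


def patient_to_sequence(visits):
    """Flatten dated visits into one token sequence, oldest visit first.

    visits: list of (visit_date, [diagnosis codes recorded at that visit])
    """
    tokens = ["[CLS]"]
    for _, codes in sorted(visits, key=lambda v: v[0]):
        tokens.extend(codes)
        tokens.append("[SEP]")  # marks the end of one visit
    return tokens


def build_vocab(sequences):
    """Map every distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


# Hypothetical patient history with ICD-10 codes, deliberately out of order
visits = [
    (date(2023, 5, 2), ["E11.9", "I10"]),
    (date(2021, 3, 14), ["I10"]),
]
tokens = patient_to_sequence(visits)
vocab = build_vocab([tokens])
token_ids = [vocab[t] for t in tokens]
print(tokens)
print(token_ids)
```

The integer ids are what a transformer would consume. Published models typically add further inputs such as age or visit position, which this sketch omits.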
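As an illustration of the two tools named above, the sketch below implements isotonic regression with the pool-adjacent-violators algorithm and computes the points of a reliability diagram. It is a plain-Python teaching version with hypothetical function names and toy data. A production system would more likely rely on an established library implementation.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: non-decreasing calibrated values for sorted scores."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block holds [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return [p for p, _ in pairs], fitted


def reliability_bins(probs, labels, n_bins=5):
    """Return (mean predicted risk, observed event rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    return [
        (sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
        for b in bins
        if b
    ]


# Toy data: raw model scores and observed outcomes
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]
sorted_scores, calibrated = isotonic_fit(raw_scores, outcomes)
diagram = reliability_bins([0.1, 0.15, 0.8, 0.9], [0, 0, 1, 1], n_bins=2)
```

In a well-calibrated model the first two values of each reliability bin are close to each other. Plotting them against the diagonal gives the calibration plot.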
Risk Model Validation
| Metric | Clinical Meaning |
|---|---|
| AUC-ROC | Discrimination: separates sick from healthy |
| AUC-PR | Discrimination under strong class imbalance (rare events) |
| Brier Score | Overall accuracy of probabilistic predictions |
| Net Benefit / Decision Curve | Clinical utility at specific decision thresholds |
| NRI, IDI | Improvement vs. existing risk score |
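Three of the metrics in the table are simple enough to write out directly, which may help clarify what each one measures. The sketch below uses plain Python and toy data. The function names are hypothetical, and the net benefit follows the standard decision-curve formula TP/n - FP/n × t/(1 - t) for a threshold t.

```python
def auc_roc(y_true, y_prob):
    """Probability that a random case scores higher than a random non-case."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties between a case and a non-case count as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is at or above threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Toy validation set: observed outcomes and predicted risks
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc = auc_roc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
nb = net_benefit(y_true, y_prob, threshold=0.5)
print(f"AUC-ROC {auc:.3f}, Brier {brier:.3f}, net benefit at 0.5: {nb:.3f}")
```

Evaluating `net_benefit` over a range of thresholds and plotting the result gives the decision curve.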
External validation on data from a different clinic is mandatory before clinical application.
Implementation in Population Health
Stratification and Outreach
Patients are stratified by risk score. High risk → active outreach (call, screening invitation, intensive monitoring). Medium risk → preventive messages. Low risk → standard care.
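The stratification rule can be sketched as a small lookup. The cut-offs of 20% and 10% are placeholder assumptions, not values from this text. In practice they would be chosen from decision-curve analysis and outreach capacity.

```python
# Actions per tier, as described in the text
ACTIONS = {
    "high": "active outreach (call, screening invitation, intensive monitoring)",
    "medium": "preventive messages",
    "low": "standard care",
}


def stratify(risk, high=0.20, medium=0.10):
    """Assign a risk tier. The default cut-offs are hypothetical."""
    if risk >= high:
        return "high"
    if risk >= medium:
        return "medium"
    return "low"


# Hypothetical patient identifiers and predicted risks
predicted = {"patient_a": 0.23, "patient_b": 0.12, "patient_c": 0.04}
plan = {pid: ACTIONS[stratify(r)] for pid, r in predicted.items()}
for pid, action in plan.items():
    print(pid, "->", action)
```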
Integration in EMR
The risk score is displayed in the patient card during the physician visit. The physician sees: "10-year CVD risk: 23% (high). Main factors: hypertension, dyslipidemia, smoking." A SHAP explanation for the specific patient accompanies the score.
Return on investment comes from reducing hospitalizations through prevention. In a population of 100k: identify 1500–2000 high-risk patients → intervene → prevent 200–400 hospitalizations.
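A message like the one quoted above could be assembled from a predicted risk and per-feature contributions. The sketch below assumes the contributions were already computed elsewhere (for example, as SHAP values) and only handles the formatting. The contribution values, tier cut-offs, and function name are hypothetical.

```python
def format_risk_card(risk, contributions, top_k=3, high=0.20, medium=0.10):
    """Build the one-line summary shown to the physician.

    contributions: feature name -> signed contribution to the predicted risk
                   (assumed to come from a separate explanation step)
    """
    level = "high" if risk >= high else "medium" if risk >= medium else "low"
    # Keep only risk-increasing factors, largest contribution first
    drivers = sorted(
        (name for name, value in contributions.items() if value > 0),
        key=lambda name: contributions[name],
        reverse=True,
    )[:top_k]
    return f"10-year CVD risk: {risk:.0%} ({level}). Main factors: {', '.join(drivers)}."


# Hypothetical per-patient contributions
contributions = {
    "smoking": 0.04,
    "hypertension": 0.08,
    "physical activity": -0.02,
    "dyslipidemia": 0.05,
}
message = format_risk_card(0.23, contributions)
print(message)
```

Protective factors (negative contributions) are left out of this summary line. A fuller view could list them separately.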
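The arithmetic behind this estimate can be written out as a rough sketch. The population size and the patient and hospitalization ranges come from the text. The two unit costs are placeholders chosen only to make the example run and should be replaced with local figures.

```python
def prevention_roi(identified, prevented, cost_per_intervention, cost_per_hospitalization):
    """Rough return-on-investment figures for a prevention program."""
    program_cost = identified * cost_per_intervention
    avoided_cost = prevented * cost_per_hospitalization
    return {
        "identified_per_prevented": identified / prevented,
        "net_savings": avoided_cost - program_cost,
        "roi": (avoided_cost - program_cost) / program_cost,
    }


population = 100_000
# Share of the population flagged as high-risk (ranges from the text)
share_low, share_high = 1500 / population, 2000 / population

# Most conservative pairing from the text: 2000 patients flagged, 200 admissions avoided.
# Unit costs below are hypothetical placeholders, not figures from the text.
result = prevention_roi(
    identified=2000,
    prevented=200,
    cost_per_intervention=300,
    cost_per_hospitalization=5000,
)
print(f"high-risk share: {share_low:.1%} to {share_high:.1%}")
print(result)
```

With these placeholder costs the program still pays for itself in the conservative case. The conclusion is sensitive to both unit costs, so they are worth checking first.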







