Developing an AI System for Molecular Property Prediction (ADMET)
ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) is the set of pharmacokinetic and safety properties that determine a drug's fate in the body. By some estimates, roughly half of clinical-trial failures trace back to ADMET problems that could have been predicted earlier.
Critical ADMET Properties
Absorption
- Aqueous solubility: poor solubility → inconsistent bioavailability
- Lipophilicity (logP/logD): drives membrane permeation and solubility
- Caco-2 / MDCK permeability: in vitro proxies for intestinal absorption
- P-glycoprotein (P-gp) efflux: active export out of cells, reduces bioavailability
- Oral bioavailability (F%): what fraction of dose reaches systemic circulation
Distribution
- Volume of distribution (Vd): how extensively the drug distributes into tissues
- Blood-brain barrier (BBB) permeability: required for CNS drugs, undesirable for peripherally acting drugs
- Plasma protein binding (PPB): binding to albumin and other plasma proteins; only the free (unbound) drug is active
Metabolism
- CYP450 inhibition (CYP3A4, CYP2D6, CYP2C9, CYP2C19, CYP1A2): slows the metabolism of co-administered drugs → drug-drug interactions
- CYP450 substrate: which isoforms metabolize the compound
- Half-life (T½): how quickly the compound is cleared from the body
- Hepatotoxicity (DILI): liver damage, frequently linked to reactive metabolites (also listed under Toxicity)
Excretion
- Renal clearance: rate of kidney elimination
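Several of the distribution, metabolism, and excretion parameters above are linked. As a rough sketch, assuming a one-compartment model with first-order elimination (an assumption added here for illustration), half-life follows from volume of distribution and total clearance, and the free fraction follows from plasma protein binding. The function names and numbers are illustrative only.

```python
import math


def half_life_hours(vd_litres: float, clearance_l_per_h: float) -> float:
    """T1/2 = ln(2) * Vd / CL (one-compartment, first-order elimination)."""
    if vd_litres <= 0 or clearance_l_per_h <= 0:
        raise ValueError("Vd and CL must be positive")
    return math.log(2) * vd_litres / clearance_l_per_h


def fraction_unbound(ppb_percent: float) -> float:
    """Free fraction of drug given plasma protein binding in percent."""
    if not 0 <= ppb_percent <= 100:
        raise ValueError("PPB must be between 0 and 100 percent")
    return 1.0 - ppb_percent / 100.0


# Illustrative values, not measurements of any real compound
t_half = half_life_hours(vd_litres=42.0, clearance_l_per_h=7.0)
fu = fraction_unbound(99.0)
print(f"T1/2 = {t_half:.2f} h, fraction unbound = {fu:.3f}")
```

This is why a model that predicts Vd and clearance separately can also give a first estimate of half-life, provided the one-compartment assumption is acceptable.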
Toxicity
- hERG inhibition: blocks the cardiac K⁺ channel → QT prolongation → potentially fatal arrhythmia. A major cause of post-market drug withdrawals
- Ames test: mutagenicity / genotoxicity
- DILI (Drug-Induced Liver Injury): hepatotoxicity
- Skin sensitization: contact dermatitis
- Reproductive toxicity: teratogenicity
Prediction Models
Molecular Fingerprints + ML
ECFP4/6 circular fingerprints (typically folded to 1024–2048 bits) fed to XGBoost or Random Forest. Fast to train, reasonably interpretable, and strong on small datasets.
Graph Neural Networks
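The molecule is treated as a graph (atoms as nodes, bonds as edges) and a GNN learns structural patterns directly from it. Common architectures: MPNN, AttentiveFP, D-MPNN (chemprop). On the TDC ADMET benchmarks GNNs are competitive with fingerprint + ML baselines, but which approach wins varies by endpoint and dataset size.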
Multitask Learning
One model predicts 20+ ADMET properties simultaneously. Advantage: shared representations improve predictions for properties with small datasets by borrowing information from related tasks.
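```python
import chemprop  # assumes chemprop v1.x (TrainArgs / cross_validate Python API)

# Chemprop: a widely used D-MPNN implementation for molecular ADMET prediction
arguments = [
    '--data_path', 'admet_train.csv',
    # 'regression' assumes every target column is continuous (e.g. hERG as pIC50)
    '--dataset_type', 'regression',
    # list-valued flags take one argument per column name
    '--target_columns', 'solubility', 'logP', 'hERG_inhibition', 'caco2_permeability',
    '--smiles_columns', 'smiles',
    '--epochs', '50',
    '--batch_size', '64',
    '--ffn_num_layers', '3',
    '--dropout', '0.1',
    '--save_dir', 'admet_model',
]
args = chemprop.args.TrainArgs().parse_args(arguments)
# Trains one multitask model per fold and returns the mean/std of the metric
mean_score, std_score = chemprop.train.cross_validate(
    args=args, train_func=chemprop.train.run_training
)
```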
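In multitask ADMET data most molecules have measurements for only a few endpoints, so the label matrix is sparse. A common way to train on it is to mask missing entries out of the loss, so each task contributes only where it has data. The sketch below shows the idea in plain Python; the function name and the use of `None` for missing labels are assumptions for illustration, not chemprop internals.

```python
from typing import Optional, Sequence


def masked_mse(
    predictions: Sequence[Sequence[float]],
    targets: Sequence[Sequence[Optional[float]]],
) -> float:
    """Mean squared error over observed (non-None) targets only."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if target is None:  # endpoint not measured for this molecule
                continue
            total += (pred - target) ** 2
            count += 1
    if count == 0:
        raise ValueError("no observed targets in batch")
    return total / count


# Two molecules, three tasks (solubility, logP, Caco-2); None = not measured
preds = [[-3.0, 2.0, -5.0], [-1.0, 0.5, -4.5]]
labels = [[-2.0, None, -5.0], [None, 1.5, None]]
loss = masked_mse(preds, labels)
print(f"masked MSE = {loss:.4f}")
```

Because unmeasured endpoints are skipped rather than imputed, a molecule with a single label still contributes to the shared representation.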
Uncertainty Quantification
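For ADMET prediction it matters to know not just the predicted value but also how confident the model is. For molecules outside the applicability domain, the system should warn that the prediction is unreliable.
Methods: Monte Carlo Dropout, Deep Ensembles, Conformal Prediction. Conformal Prediction gives prediction intervals with a statistical coverage guarantee, provided calibration and test molecules are exchangeable.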
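As an illustration of the conformal approach, here is a minimal sketch of split conformal prediction for a regression endpoint. It assumes a held-out calibration set that the model was not trained on; the function names and the toy numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def conformal_quantile(
    cal_true: Sequence[float], cal_pred: Sequence[float], alpha: float = 0.1
) -> float:
    """Half-width q so that [pred - q, pred + q] targets 1 - alpha coverage."""
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    if n == 0:
        raise ValueError("calibration set is empty")
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank of the quantile
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def predict_interval(pred: float, q: float) -> Tuple[float, float]:
    """Symmetric prediction interval around a point prediction."""
    return pred - q, pred + q


# Toy calibration data (e.g. measured vs. predicted log-solubility)
cal_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
cal_pred = [1.1, 1.8, 3.3, 3.6, 5.5, 5.4, 7.7, 7.2, 9.9]
q = conformal_quantile(cal_true, cal_pred, alpha=0.2)
low, high = predict_interval(4.0, q)
print(f"q = {q:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

The interval width here is the same for every molecule; variants that scale the residual by a per-molecule uncertainty estimate give adaptive widths, at the cost of an extra model.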
Datasets
| Task | Dataset | Approx. size (compounds) |
|---|---|---|
| Solubility | ESOL, AqSolDB | 1k–10k |
| logP | ChEMBL | 100k+ |
| Caco-2 | Biopharmaceutics DB | ~1k |
| hERG | BindingDB, ChEMBL | 10k+ |
| DILI | DILIrank | ~1k |
| CYP inhibition | ChEMBL | 10k+ |
| Ames | TDC AMES dataset | ~7k |
Data problem: many biological datasets are small and noisy. Transfer learning (pretraining on a large chemical corpus, then fine-tuning on the specific task) helps when labeled data is scarce.
Applicability Domain
A model is reliable only for molecules similar to its training data. Ways to assess the applicability domain (AD):
- Tanimoto similarity to nearest neighbors in training set
- Leverage from the hat matrix (Williams plot)
- k-NN distance in embedding space
When a query molecule falls outside the AD, the system returns an explicit "low confidence prediction" warning.
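The first check in the list can be sketched as follows. Fingerprints are represented as sets of on-bit indices (as one might obtain from ECFP); the value of `k`, the 0.4 threshold, and the function names are illustrative assumptions that would need tuning per endpoint.

```python
from typing import Sequence, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def in_domain(
    query: Set[int],
    training: Sequence[Set[int]],
    k: int = 3,
    threshold: float = 0.4,
) -> Tuple[bool, float]:
    """Flag a query as in-domain if its mean similarity to the k nearest
    training fingerprints reaches the threshold."""
    if not training:
        raise ValueError("training set is empty")
    sims = sorted((tanimoto(query, fp) for fp in training), reverse=True)
    top = sims[:k]
    mean_sim = sum(top) / len(top)
    return mean_sim >= threshold, mean_sim


# Toy fingerprints as on-bit index sets
training_fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11, 12}]
ok_near, sim_near = in_domain({1, 2, 3, 5}, training_fps, k=2)
ok_far, sim_far = in_domain({20, 21, 22}, training_fps, k=2)
for ok, sim in ((ok_near, sim_near), (ok_far, sim_far)):
    print("in domain" if ok else "low confidence prediction", f"(sim={sim:.2f})")
```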
Integration: REST API, a Jupyter-friendly Python API, and KNIME nodes for chemists' workflows. Visualization: a 2D property map color-coded by drug-likeness violations (Lipinski's Rule of 5, Veber rules).
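The drug-likeness color-coding could be driven by a simple violation count. The sketch below assumes descriptors have already been computed elsewhere (for example with a cheminformatics toolkit); the `Descriptors` class, the function names, and the color mapping are hypothetical, while the thresholds are the standard Lipinski and Veber cut-offs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    mol_weight: float      # Da
    logp: float
    h_donors: int
    h_acceptors: int
    rotatable_bonds: int
    tpsa: float            # polar surface area, Å²


def lipinski_violations(d: Descriptors) -> List[str]:
    """Rule of 5: MW <= 500, logP <= 5, H-bond donors <= 5, acceptors <= 10."""
    checks = [
        ("MW > 500", d.mol_weight > 500),
        ("logP > 5", d.logp > 5),
        ("HBD > 5", d.h_donors > 5),
        ("HBA > 10", d.h_acceptors > 10),
    ]
    return [name for name, violated in checks if violated]


def veber_violations(d: Descriptors) -> List[str]:
    """Veber rules: rotatable bonds <= 10 and TPSA <= 140 Å²."""
    checks = [
        ("rotatable bonds > 10", d.rotatable_bonds > 10),
        ("TPSA > 140", d.tpsa > 140),
    ]
    return [name for name, violated in checks if violated]


def colour_for(n_violations: int) -> str:
    """Illustrative mapping from violation count to a plot colour."""
    if n_violations == 0:
        return "green"
    return "orange" if n_violations == 1 else "red"


# Made-up descriptor values for a large, lipophilic, flexible molecule
mol = Descriptors(720.0, 5.8, 2, 9, 12, 95.0)
violations = lipinski_violations(mol) + veber_violations(mol)
colour = colour_for(len(violations))
print(colour, violations)
```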







