Hyperparameter Optimization: Optuna and Ray Tune
A typical scenario: the model is trained and the baseline accuracy looks acceptable. But learning_rate=0.001 was taken "from examples in the documentation," batch_size=32 "because it's standard," and dropout=0.3 was eyeballed. After a proper hyperparameter search on the same dataset and the same architecture, we get +4–8% accuracy, purely from choosing the right hyperparameters. This isn't magic; it's systematic search.
Why Random Search Loses to Bayesian Optimization
Random Search is effective for high-dimensional spaces and small trial budgets. But as soon as there are 3–5 important hyperparameters (the typical case), Bayesian Optimization with TPE (Tree-structured Parzen Estimator) starts to win around the 30th trial. TPE builds separate probability densities for "good" (top 25%) and "bad" configurations, then proposes configurations with high Expected Improvement (EI).
Grid Search in 2025 is viable only for at most two hyperparameters; beyond that, combinatorial explosion makes it impractical.
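The combinatorial explosion is easy to see with back-of-envelope arithmetic (the value 10 per axis is an illustrative assumption, not from the text above):

```python
# Back-of-envelope: grid size grows exponentially with dimensionality.
# 10 candidate values per hyperparameter is a modest resolution.
values_per_param = 10

grid_2d = values_per_param ** 2  # two hyperparameters
grid_5d = values_per_param ** 5  # five hyperparameters

print(grid_2d)  # 100 trials: still feasible
print(grid_5d)  # 100000 trials: impractical for any non-trivial model
```

At five dimensions the grid already costs a thousand times more than at two, which is why random or Bayesian search takes over.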
Deep Dive: Optuna in Production
Optuna is the de facto standard for HPO in the Python ecosystem. Key advantages over competitors include a Pythonic API without YAML configurations, built-in pruning support (trimming bad trials early), and integration with MLflow and Weights & Biases.
Full Example: LightGBM Optimization with Pruning
```python
import optuna
from optuna.integration import LightGBMPruningCallback
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import numpy as np


def objective(trial: optuna.Trial, X, y) -> float:
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1,
        'boosting_type': trial.suggest_categorical('boosting', ['gbdt', 'dart']),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 300),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 10.0, log=True),
    }
    # Kept out of `params`: n_estimators is an alias of num_iterations in
    # LightGBM and would conflict with num_boost_round below
    n_estimators = trial.suggest_int('n_estimators', 100, 2000)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = []
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        dtrain = lgb.Dataset(X_train, label=y_train)
        dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)
        # Pruning callback: cuts off bad trials after each boosting round
        pruning_callback = LightGBMPruningCallback(trial, 'auc', valid_name='valid_1')
        model = lgb.train(
            params,
            dtrain,
            valid_sets=[dtrain, dval],
            num_boost_round=n_estimators,
            callbacks=[
                lgb.early_stopping(stopping_rounds=50, verbose=False),
                lgb.log_evaluation(period=0),  # silence per-round logging
                pruning_callback,
            ],
        )
        y_pred = model.predict(X_val)
        cv_scores.append(roc_auc_score(y_val, y_pred))
    return float(np.mean(cv_scores))


# Create a study with a TPE sampler and a Hyperband pruner
sampler = optuna.samplers.TPESampler(
    n_startup_trials=20,  # random search before the surrogate model kicks in
    multivariate=True,    # models correlations between parameters
    seed=42,
)
pruner = optuna.pruners.HyperbandPruner(
    min_resource=50,
    max_resource=2000,
    reduction_factor=3,
)
study = optuna.create_study(
    direction='maximize',
    sampler=sampler,
    pruner=pruner,
    study_name='lgbm_credit_scoring',
    storage='sqlite:///optuna_studies.db',  # persistence across sessions
    load_if_exists=True,
)
# X, y: your feature matrix and labels
study.optimize(
    lambda trial: objective(trial, X, y),
    n_trials=200,
    n_jobs=4,      # parallel trials
    timeout=3600,  # at most one hour
    show_progress_bar=True,
)
print(f'Best AUC: {study.best_value:.4f}')
print(f'Best params: {study.best_params}')
```
Pruning is the key to saving computation. Hyperband Pruner eliminates bad trials early in training. In practice, out of 200 LightGBM trials, 40–60% are pruned after 50–100 rounds instead of the full 2000. The resulting speedup is 3–5× compared to the same number of full trials.
Visualization and analysis of parameter importance:
```python
import optuna.visualization as vis

# Which hyperparameters actually move the metric
fig = vis.plot_param_importances(study)
fig.show()

# Optimization history: check whether the search has converged
fig = vis.plot_optimization_history(study)
fig.show()

# Contour plot of the interaction between num_leaves and learning_rate
fig = vis.plot_contour(study, params=['num_leaves', 'learning_rate'])
fig.show()
```
Parameter importance analysis using fANOVA often yields unexpected results: num_leaves and min_child_samples are more important than learning_rate for LightGBM on imbalanced data. This changes the strategy—the next search focuses on a narrow range of important parameters.
Ray Tune: Distributed HPO on a Cluster
Ray Tune solves a different problem: parallel search on a GPU cluster. While Optuna with n_jobs=4 parallelizes on a single machine, Ray Tune scales to hundreds of nodes.
```python
import optuna
import torch
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch


def train_transformer(config: dict):
    """Ray Tune expects a trainable that reports metrics via tune.report()."""
    # build_model, train_one_epoch and evaluate are project-specific helpers
    model = build_model(
        hidden_dim=config['hidden_dim'],
        num_heads=config['num_heads'],
        num_layers=config['num_layers'],
        dropout=config['dropout'],
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config['lr'],
        weight_decay=config['weight_decay'],
    )
    for epoch in range(config['max_epochs']):
        train_loss = train_one_epoch(model, optimizer)
        val_loss, val_acc = evaluate(model)
        # Ray Tune forwards the metric to the scheduler / search algorithm
        tune.report(val_loss=val_loss, val_acc=val_acc, epoch=epoch)


# ASHA (Asynchronous Successive Halving): aggressive early stopping
scheduler = ASHAScheduler(
    time_attr='epoch',
    max_t=100,           # epoch budget per trial
    grace_period=10,     # minimum epochs before a trial can be stopped
    reduction_factor=3,  # only the top 1/3 of trials advance past each rung
    metric='val_loss',
    mode='min',
)

# OptunaSearch inside Ray Tune: the best of both worlds
search_alg = OptunaSearch(
    metric='val_loss',
    mode='min',
    sampler=optuna.samplers.TPESampler(seed=42),
)

search_space = {
    'hidden_dim': tune.choice([128, 256, 512]),
    'num_heads': tune.choice([4, 8, 16]),
    'num_layers': tune.randint(2, 8),
    'dropout': tune.uniform(0.0, 0.5),
    'lr': tune.loguniform(1e-5, 1e-2),
    'weight_decay': tune.loguniform(1e-8, 1e-3),
    'max_epochs': 100,
}

analysis = tune.run(
    train_transformer,
    config=search_space,
    num_samples=100,  # total number of trials
    scheduler=scheduler,
    search_alg=search_alg,
    resources_per_trial={'gpu': 1, 'cpu': 4},
    storage_path='s3://my-bucket/ray-results',  # S3 for the distributed setup
    name='transformer_hpo_v2',
)
best_config = analysis.get_best_config(metric='val_loss', mode='min')
```
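ASHA's savings can be estimated with a back-of-envelope calculation. The sketch below assumes idealized synchronous halving (rungs at grace_period × reduction_factor^k epochs, exactly the top 1/3 advancing past each rung); real ASHA is asynchronous, but the budget arithmetic is similar:

```python
# Idealized successive-halving budget for the scheduler settings above
num_trials, grace, rf, max_t = 100, 10, 3, 100

full_budget = num_trials * max_t  # every trial trained to completion

asha_budget, alive, prev = 0, num_trials, 0
rung = grace
while rung < max_t and alive > 0:
    asha_budget += alive * (rung - prev)  # epochs spent reaching this rung
    alive //= rf                          # only the top 1/rf advance
    prev, rung = rung, rung * rf
asha_budget += alive * (max_t - prev)     # survivors train to completion

print(full_budget, asha_budget, round(full_budget / asha_budget, 1))
```

With rungs at 10, 30 and 90 epochs, 100 trials cost 2,350 trial-epochs instead of 10,000, a roughly 4× saving before the search algorithm even starts helping.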
Case Study: HPO for a Fraud Detection Model
Problem: binary transaction classification, 1:340 class imbalance (fraud:normal), 2.1M records. Baseline XGBoost with default parameters: PR-AUC = 0.412.
Optuna, 150 trials, 4 parallel workers, ~2.5 hours:
- search space: 11 XGBoost parameters plus scale_pos_weight (1–350)
- metric: PR-AUC on stratified 5-fold CV
- pruner: MedianPruner (prunes trials scoring below the median at early steps)
Result: PR-AUC = 0.581 (+41% relative to baseline). The most important parameters according to fANOVA: scale_pos_weight (22% importance), min_child_weight (18%), subsample (15%); max_depth and n_estimators account for 14% combined.
| Stage | PR-AUC | Recall at Precision=0.8 |
|---|---|---|
| XGBoost default | 0.412 | 0.34 |
| Random Search (50 trials) | 0.521 | 0.47 |
| Optuna TPE (150 trials) | 0.581 | 0.56 |
| + Feature engineering | 0.634 | 0.62 |
Optuna vs. Ray Tune: When to Choose Which
| Criterion | Optuna | Ray Tune |
|---|---|---|
| One machine, 1–8 GPUs | + | redundant |
| Cluster of 10+ GPUs/nodes | more difficult | + |
| Deep learning (PyTorch/JAX) | + | + |
| Classic ML (sklearn, lgbm) | + | works |
| Integration with distributed training | via callbacks | native |
| Disaster recovery | SQLite/PostgreSQL backend | + |
| Learning curve for a new team | gentle | steeper |
Integration with MLflow and Weights & Biases
```python
import mlflow
import optuna


def objective_with_tracking(trial: optuna.Trial) -> float:
    with mlflow.start_run(nested=True):
        params = {
            'lr': trial.suggest_float('lr', 1e-5, 1e-1, log=True),
            'dropout': trial.suggest_float('dropout', 0.1, 0.5),
        }
        mlflow.log_params(params)
        # ... training
        val_acc = train_and_evaluate(params)
        mlflow.log_metric('val_acc', val_acc)
        return val_acc


# Each trial becomes a nested MLflow run, which makes comparison easy
study = optuna.create_study(direction='maximize')
with mlflow.start_run(run_name='hpo_study'):
    study.optimize(objective_with_tracking, n_trials=100)
    mlflow.log_metric('best_val_acc', study.best_value)
    mlflow.log_params(study.best_params)
```
Typical mistakes. Data leakage in the objective: if preprocessing (StandardScaler, target encoding) is fitted on the entire training set before cross-validation, the HPO results are optimistically biased and degradation in production is guaranteed; the scaler must be fitted only on the training fold inside CV. Another one: optimizing accuracy instead of the business metric on imbalanced classes, which finds a configuration with 98.3% accuracy and a minority-class recall of 0.04.
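The leak-free pattern is to put the preprocessing inside an sklearn Pipeline, so the scaler is re-fitted on each training fold automatically. A minimal sketch on synthetic data (the dataset and classifier are invented for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# The scaler is re-fitted on each training fold inside cross_val_score,
# so validation folds never influence the preprocessing statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
print(f'leak-free CV AUC: {scores.mean():.3f}')
```

Inside an Optuna objective, the same rule applies: build the pipeline from the trial's parameters and cross-validate the whole pipeline, never pre-scaled data.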
Timeframe: basic HPO with Optuna on a single task takes 2–5 days, including environment setup and result analysis. Distributed HPO with Ray Tune on a cluster, with CI/CD pipeline integration, takes 2–4 weeks.