Hyperparameter Optimization Implementation (Optuna, Ray Tune, Hyperopt)


Hyperparameter Optimization: Optuna and Ray Tune

A typical scenario: the model is trained and the baseline accuracy seems acceptable. But learning_rate=0.001 was taken "from examples in the documentation," batch_size=32 "because it's standard," and dropout=0.3 was simply eyeballed. After a proper hyperparameter optimization run on the same dataset and the same architecture, we get +4–8% accuracy purely from better hyperparameters. This isn't magic; it's systematic search.

Why Random Search Loses to Bayesian Optimization

Random Search is effective for high-dimensional spaces and small trial budgets. But as soon as there are 3–5 important hyperparameters (which is typical), Bayesian Optimization with TPE (Tree-structured Parzen Estimator) starts to win, usually around the 30th trial. TPE builds separate probability densities for "good" (top ~25%) and "bad" configurations, then suggests candidates with high Expected Improvement (EI).

Grid Search in 2025 is only applicable to a maximum of two hyperparameters—beyond that, the combinatorial explosion makes it impractical.
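The arithmetic behind that claim, with a hypothetical (and still coarse) five-parameter grid:

```python
from itertools import product

# Hypothetical coarse grid over five LightGBM-style hyperparameters.
grid = {
    'learning_rate': [1e-4, 1e-3, 1e-2, 1e-1],  # 4 values
    'num_leaves':    [31, 63, 127, 255],        # 4 values
    'max_depth':     [4, 6, 8, 10],             # 4 values
    'subsample':     [0.6, 0.8, 1.0],           # 3 values
    'reg_lambda':    [0.0, 0.1, 1.0],           # 3 values
}

combos = list(product(*grid.values()))
print(len(combos))  # 4 * 4 * 4 * 3 * 3 = 576 full training runs

# With 5-fold CV at ~10 minutes per fit, that is 576 * 5 * 10 minutes:
print(576 * 5 * 10 / 60)  # 480.0 hours of compute
```

Two well-chosen hyperparameters at ten values each (100 combinations) is roughly where Grid Search stops being viable.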

Deep Dive: Optuna in Production

Optuna is the de facto standard for HPO in the Python ecosystem. Key advantages over competitors include a Pythonic API without YAML configurations, built-in pruning support (trimming bad trials early), and integration with MLflow and Weights & Biases.

Full Example: LightGBM Optimization with Pruning

import optuna
from optuna.integration import LightGBMPruningCallback
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import numpy as np

def objective(trial: optuna.Trial, X, y) -> float:
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1,
        'boosting_type': trial.suggest_categorical('boosting', ['gbdt', 'dart']),
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 300),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 10.0, log=True),
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = []

    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        dtrain = lgb.Dataset(X_train, label=y_train)
        dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)

        # Pruning callback: stops unpromising trials after each boosting round
        pruning_callback = LightGBMPruningCallback(trial, 'auc', valid_name='valid_1')

        model = lgb.train(
            params,
            dtrain,
            valid_sets=[dtrain, dval],
            num_boost_round=params['n_estimators'],
            callbacks=[
                lgb.early_stopping(stopping_rounds=50, verbose=False),
                lgb.log_evaluation(period=0),  # disable per-round logging
                pruning_callback,
            ],
        )

        y_pred = model.predict(X_val)
        cv_scores.append(roc_auc_score(y_val, y_pred))

    return float(np.mean(cv_scores))


# Create a study with the TPE sampler and the Hyperband pruner
sampler = optuna.samplers.TPESampler(
    n_startup_trials=20,     # random search until the surrogate model is built
    multivariate=True,       # models correlations between parameters
    seed=42
)
pruner = optuna.pruners.HyperbandPruner(
    min_resource=50,
    max_resource=2000,
    reduction_factor=3
)

study = optuna.create_study(
    direction='maximize',
    sampler=sampler,
    pruner=pruner,
    study_name='lgbm_credit_scoring',
    storage='sqlite:///optuna_studies.db',  # persists across sessions
    load_if_exists=True
)

study.optimize(
    lambda trial: objective(trial, X, y),
    n_trials=200,
    n_jobs=4,          # parallel trials
    timeout=3600,      # at most 1 hour
    show_progress_bar=True
)

print(f'Best AUC: {study.best_value:.4f}')
print(f'Best params: {study.best_params}')

Pruning is the key to saving computation. Hyperband Pruner eliminates bad trials early in training. In practice, out of 200 LightGBM trials, 40–60% are pruned after 50–100 rounds instead of the full 2000. The resulting speedup is 3–5× compared to the same number of full trials.

Visualization and analysis of parameter importance:

import optuna.visualization as vis

# Which hyperparameters actually affect the result
fig = vis.plot_param_importances(study)
fig.show()

# Optimization history: check whether the search converged
fig = vis.plot_optimization_history(study)
fig.show()

# Contour plot of the objective over num_leaves vs learning_rate
fig = vis.plot_contour(study, params=['num_leaves', 'learning_rate'])
fig.show()

Parameter importance analysis using fANOVA often yields unexpected results: num_leaves and min_child_samples are more important than learning_rate for LightGBM on imbalanced data. This changes the strategy—the next search focuses on a narrow range of important parameters.

Ray Tune: distributed HPO on a cluster

Ray Tune solves a different problem: parallel search on a GPU cluster. While Optuna with n_jobs=4 parallelizes on a single machine, Ray Tune scales to hundreds of nodes.

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch
import optuna
import torch

def train_transformer(config: dict):
    """Ray Tune expects a function that reports metrics via tune.report()."""
    model = build_model(
        hidden_dim=config['hidden_dim'],
        num_heads=config['num_heads'],
        num_layers=config['num_layers'],
        dropout=config['dropout']
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config['lr'],
        weight_decay=config['weight_decay']
    )

    for epoch in range(config['max_epochs']):
        train_loss = train_one_epoch(model, optimizer)
        val_loss, val_acc = evaluate(model)

        # Ray Tune passes the metric to the scheduler / search algorithm
        tune.report(
            val_loss=val_loss,
            val_acc=val_acc,
            epoch=epoch
        )

# ASHA (Asynchronous Successive Halving) — aggressive early stopping
scheduler = ASHAScheduler(
    time_attr='epoch',
    max_t=100,              # max epochs per trial
    grace_period=10,        # minimum epochs before a trial can be stopped
    reduction_factor=3,     # only the top 1/3 of trials advance at each rung
    metric='val_loss',
    mode='min'
)

# OptunaSearch inside Ray Tune: the best of both worlds
search_alg = OptunaSearch(
    metric='val_loss',
    mode='min',
    sampler=optuna.samplers.TPESampler(seed=42)
)

search_space = {
    'hidden_dim': tune.choice([128, 256, 512]),
    'num_heads': tune.choice([4, 8, 16]),
    'num_layers': tune.randint(2, 8),
    'dropout': tune.uniform(0.0, 0.5),
    'lr': tune.loguniform(1e-5, 1e-2),
    'weight_decay': tune.loguniform(1e-8, 1e-3),
    'max_epochs': 100
}

analysis = tune.run(
    train_transformer,
    config=search_space,
    num_samples=100,        # общее число триалов
    scheduler=scheduler,
    search_alg=search_alg,
    resources_per_trial={'gpu': 1, 'cpu': 4},
    storage_path='s3://my-bucket/ray-results',   # S3 for the distributed setup
    name='transformer_hpo_v2'
)

best_config = analysis.get_best_config(metric='val_loss', mode='min')

Case: HPO for fraud detection model

Problem: binary transaction classification, 1:340 imbalance (fraud:normal), 2.1M records. Baseline XGBoost with default parameters: PR-AUC = 0.412.

Optuna, 150 trials, 4 parallel workers, ~2.5 hours:

  • search space: 11 XGBoost parameters + scale_pos_weight (1–350)
  • metric: PR-AUC on stratified 5-fold CV
  • pruner: MedianPruner (prunes trials below the median in the early stages)

Result: PR-AUC = 0.581 (+41% relative to baseline). The most important parameters in fANOVA: scale_pos_weight (22% importance), min_child_weight (18%), subsample (15%). max_depth and n_estimators — 14% in total.

Stage                       PR-AUC   Recall at Precision=0.8
XGBoost default             0.412    0.34
Random Search (50 trials)   0.521    0.47
Optuna TPE (150 trials)     0.581    0.56
+ Feature engineering       0.634    0.62

Optuna vs. Ray Tune: When to Choose Which

Criterion                               Optuna                      Ray Tune
One machine, 1–8 GPUs                   +                           overkill
Cluster of 10+ GPUs/nodes               harder                      +
Deep learning (PyTorch/JAX)             +                           +
Classic ML (sklearn, lgbm)              +                           works
Integration with distributed training   via callbacks               native
Disaster recovery                       SQLite/PostgreSQL backend   +
Learning curve for a new team           gentle                      steeper

Integration with MLflow and Weights & Biases

import mlflow
import optuna

def objective_with_tracking(trial):
    with mlflow.start_run(nested=True):
        params = {
            'lr': trial.suggest_float('lr', 1e-5, 1e-1, log=True),
            'dropout': trial.suggest_float('dropout', 0.1, 0.5),
        }
        mlflow.log_params(params)
        # ... training
        val_acc = train_and_evaluate(params)
        mlflow.log_metric('val_acc', val_acc)
        return val_acc

# Each trial becomes its own nested MLflow run, convenient for comparison
study = optuna.create_study(direction='maximize')
with mlflow.start_run(run_name='hpo_study'):
    study.optimize(objective_with_tracking, n_trials=100)
    mlflow.log_metric('best_val_acc', study.best_value)
    mlflow.log_params(study.best_params)

Typical mistakes. Data leakage in the objective: if preprocessing (StandardScaler, target encoding) is fitted on the entire training set before CV, the HPO results are optimistically overstated and production degradation is guaranteed. The scaler must be fitted only on the training fold inside each CV split. Another: optimizing accuracy instead of the business metric on imbalanced classes, which yields a configuration with 98.3% accuracy and a minority-class recall of 0.04.
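A minimal sketch of the leak-free variant on synthetic data: wrapping the scaler in a scikit-learn Pipeline guarantees it is re-fitted on the training fold of every CV split.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary task: the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Wrong: scaler.fit(X) before CV leaks validation-fold statistics.
# Right: the Pipeline re-fits the scaler inside each split automatically.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(f'CV ROC-AUC: {scores.mean():.3f}')
```

Inside an Optuna objective, the same Pipeline replaces any manually pre-fitted transformer.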

Timeframe: basic HPO with Optuna on a single task takes 2–5 days, including environment setup and results analysis. Distributed HPO with Ray Tune on a cluster, plus integration with the CI/CD pipeline, takes 2–4 weeks.