Hyperparameter Optimization: Optuna and Ray Tune
A typical scenario: the model is trained and the baseline accuracy looks acceptable. But learning_rate=0.001 was taken "from examples in the documentation," batch_size=32 "because it's standard," and dropout=0.3 was eyeballed. After a proper hyperparameter search on the same dataset and the same architecture, we get +4–8% accuracy, purely from choosing the right hyperparameters. This isn't magic; it's systematic search.
Why Random Search Loses to Bayesian Optimization
Random Search is effective for high-dimensional spaces and small trial budgets. But as soon as there are 3–5 important hyperparameters (the typical case), Bayesian Optimization with TPE (Tree-structured Parzen Estimator) starts to win around the 30th trial. TPE builds separate probability densities for "good" (top 25%) and "bad" configurations, then proposes configurations with high Expected Improvement (EI).
Grid Search in 2025 is viable only for at most two hyperparameters; beyond that, combinatorial explosion makes it impractical.
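The combinatorial explosion is easy to see with back-of-envelope arithmetic (the value 10 per axis is an illustrative assumption, not from the text above):

```python
# Back-of-envelope: grid size grows exponentially with dimensionality.
# 10 candidate values per hyperparameter is a modest resolution.
values_per_param = 10

grid_2d = values_per_param ** 2  # two hyperparameters
grid_5d = values_per_param ** 5  # five hyperparameters

print(grid_2d)  # 100 trials: still feasible
print(grid_5d)  # 100000 trials: impractical for any non-trivial model
```

At five dimensions the grid already costs a thousand times more than at two, which is why random or Bayesian search takes over.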
Deep Dive: Optuna in Production
Optuna is the de facto standard for HPO in the Python ecosystem. Key advantages over competitors include a Pythonic API without YAML configurations, built-in pruning support (trimming bad trials early), and integration with MLflow and Weights & Biases.
Full Example: LightGBM Optimization with Pruning
```python
import optuna
from optuna.integration import LightGBMPruningCallback
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import numpy as np


def objective(trial: optuna.Trial, X, y) -> float:
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1,
        'boosting_type': trial.suggest_categorical('boosting', ['gbdt', 'dart']),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 300),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 10.0, log=True),
    }
    # Kept out of `params`: n_estimators is an alias of num_iterations in
    # LightGBM and would conflict with num_boost_round below
    n_estimators = trial.suggest_int('n_estimators', 100, 2000)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = []
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        dtrain = lgb.Dataset(X_train, label=y_train)
        dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)
        # Pruning callback: cuts off bad trials after each boosting round
        pruning_callback = LightGBMPruningCallback(trial, 'auc', valid_name='valid_1')
        model = lgb.train(
            params,
            dtrain,
            valid_sets=[dtrain, dval],
            num_boost_round=n_estimators,
            callbacks=[
                lgb.early_stopping(stopping_rounds=50, verbose=False),
                lgb.log_evaluation(period=0),  # silence per-round logging
                pruning_callback,
            ],
        )
        y_pred = model.predict(X_val)
        cv_scores.append(roc_auc_score(y_val, y_pred))
    return float(np.mean(cv_scores))


# Create a study with a TPE sampler and a Hyperband pruner
sampler = optuna.samplers.TPESampler(
    n_startup_trials=20,  # random search before the surrogate model kicks in
    multivariate=True,    # models correlations between parameters
    seed=42,
)
pruner = optuna.pruners.HyperbandPruner(
    min_resource=50,
    max_resource=2000,
    reduction_factor=3,
)
study = optuna.create_study(
    direction='maximize',
    sampler=sampler,
    pruner=pruner,
    study_name='lgbm_credit_scoring',
    storage='sqlite:///optuna_studies.db',  # persistence across sessions
    load_if_exists=True,
)
# X, y: your feature matrix and labels
study.optimize(
    lambda trial: objective(trial, X, y),
    n_trials=200,
    n_jobs=4,      # parallel trials
    timeout=3600,  # at most one hour
    show_progress_bar=True,
)
print(f'Best AUC: {study.best_value:.4f}')
print(f'Best params: {study.best_params}')
```
Pruning is the key to saving computation. Hyperband Pruner eliminates bad trials early in training. In practice, out of 200 LightGBM trials, 40–60% are pruned after 50–100 rounds instead of the full 2000. The resulting speedup is 3–5× compared to the same number of full trials.
Visualization and analysis of parameter importance:
```python
import optuna.visualization as vis

# Which hyperparameters actually move the metric
fig = vis.plot_param_importances(study)
fig.show()

# Optimization history: check whether the search has converged
fig = vis.plot_optimization_history(study)
fig.show()

# Contour plot of the interaction between num_leaves and learning_rate
fig = vis.plot_contour(study, params=['num_leaves', 'learning_rate'])
fig.show()
```
Parameter importance analysis using fANOVA often yields unexpected results: num_leaves and min_child_samples are more important than learning_rate for LightGBM on imbalanced data. This changes the strategy—the next search focuses on a narrow range of important parameters.
Ray Tune: Distributed HPO on a Cluster
Ray Tune solves a different problem: parallel search on a GPU cluster. While Optuna with n_jobs=4 parallelizes on a single machine, Ray Tune scales to hundreds of nodes.
```python
import optuna
import torch
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch


def train_transformer(config: dict):
    """Ray Tune expects a trainable that reports metrics via tune.report()."""
    # build_model, train_one_epoch and evaluate are project-specific helpers
    model = build_model(
        hidden_dim=config['hidden_dim'],
        num_heads=config['num_heads'],
        num_layers=config['num_layers'],
        dropout=config['dropout'],
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config['lr'],
        weight_decay=config['weight_decay'],
    )
    for epoch in range(config['max_epochs']):
        train_loss = train_one_epoch(model, optimizer)
        val_loss, val_acc = evaluate(model)
        # Ray Tune forwards the metric to the scheduler / search algorithm
        tune.report(val_loss=val_loss, val_acc=val_acc, epoch=epoch)


# ASHA (Asynchronous Successive Halving): aggressive early stopping
scheduler = ASHAScheduler(
    time_attr='epoch',
    max_t=100,           # epoch budget per trial
    grace_period=10,     # minimum epochs before a trial can be stopped
    reduction_factor=3,  # only the top 1/3 of trials advance past each rung
    metric='val_loss',
    mode='min',
)

# OptunaSearch inside Ray Tune: the best of both worlds
search_alg = OptunaSearch(
    metric='val_loss',
    mode='min',
    sampler=optuna.samplers.TPESampler(seed=42),
)

search_space = {
    'hidden_dim': tune.choice([128, 256, 512]),
    'num_heads': tune.choice([4, 8, 16]),
    'num_layers': tune.randint(2, 8),
    'dropout': tune.uniform(0.0, 0.5),
    'lr': tune.loguniform(1e-5, 1e-2),
    'weight_decay': tune.loguniform(1e-8, 1e-3),
    'max_epochs': 100,
}

analysis = tune.run(
    train_transformer,
    config=search_space,
    num_samples=100,  # total number of trials
    scheduler=scheduler,
    search_alg=search_alg,
    resources_per_trial={'gpu': 1, 'cpu': 4},
    storage_path='s3://my-bucket/ray-results',  # S3 for the distributed setup
    name='transformer_hpo_v2',
)
best_config = analysis.get_best_config(metric='val_loss', mode='min')
```
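ASHA's savings can be estimated with a back-of-envelope calculation. The sketch below assumes idealized synchronous halving (rungs at grace_period × reduction_factor^k epochs, exactly the top 1/3 advancing past each rung); real ASHA is asynchronous, but the budget arithmetic is similar:

```python
# Idealized successive-halving budget for the scheduler settings above
num_trials, grace, rf, max_t = 100, 10, 3, 100

full_budget = num_trials * max_t  # every trial trained to completion

asha_budget, alive, prev = 0, num_trials, 0
rung = grace
while rung < max_t and alive > 0:
    asha_budget += alive * (rung - prev)  # epochs spent reaching this rung
    alive //= rf                          # only the top 1/rf advance
    prev, rung = rung, rung * rf
asha_budget += alive * (max_t - prev)     # survivors train to completion

print(full_budget, asha_budget, round(full_budget / asha_budget, 1))
```

With rungs at 10, 30 and 90 epochs, 100 trials cost 2,350 trial-epochs instead of 10,000, a roughly 4× saving before the search algorithm even starts helping.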
Case Study: HPO for a Fraud Detection Model
Problem: binary transaction classification, 1:340 class imbalance (fraud:normal), 2.1M records. Baseline XGBoost with default parameters: PR-AUC = 0.412.
Optuna, 150 trials, 4 parallel workers, ~2.5 hours:
- search space: 11 XGBoost parameters plus scale_pos_weight (1–350)
- metric: PR-AUC on stratified 5-fold CV
- pruner: MedianPruner (prunes trials scoring below the median at early steps)
Result: PR-AUC = 0.581 (+41% relative to baseline). The most important parameters according to fANOVA: scale_pos_weight (22% importance), min_child_weight (18%), subsample (15%); max_depth and n_estimators account for 14% combined.
| Stage | PR-AUC | Recall at Precision=0.8 |
|---|---|---|
| XGBoost default | 0.412 | 0.34 |
| Random Search (50 trials) | 0.521 | 0.47 |
| Optuna TPE (150 trials) | 0.581 | 0.56 |
| + Feature engineering | 0.634 | 0.62 |
Optuna vs. Ray Tune: When to Choose Which
| Criterion | Optuna | Ray Tune |
|---|---|---|
| One machine, 1–8 GPUs | + | redundant |
| Cluster of 10+ GPUs/nodes | more difficult | + |
| Deep learning (PyTorch/JAX) | + | + |
| Classic ML (sklearn, lgbm) | + | works |
| Integration with distributed training | via callbacks | native |
| Disaster recovery | SQLite/PostgreSQL backend | + |
| Learning curve for a new team | gentle | steeper |
Integration with MLflow and Weights & Biases
```python
import mlflow
import optuna


def objective_with_tracking(trial: optuna.Trial) -> float:
    with mlflow.start_run(nested=True):
        params = {
            'lr': trial.suggest_float('lr', 1e-5, 1e-1, log=True),
            'dropout': trial.suggest_float('dropout', 0.1, 0.5),
        }
        mlflow.log_params(params)
        # ... training
        val_acc = train_and_evaluate(params)
        mlflow.log_metric('val_acc', val_acc)
        return val_acc


# Each trial becomes a nested MLflow run, which makes comparison easy
study = optuna.create_study(direction='maximize')
with mlflow.start_run(run_name='hpo_study'):
    study.optimize(objective_with_tracking, n_trials=100)
    mlflow.log_metric('best_val_acc', study.best_value)
    mlflow.log_params(study.best_params)
```
Typical mistakes. Data leakage in the objective: if preprocessing (StandardScaler, target encoding) is fitted on the entire training set before cross-validation, the HPO results are optimistically biased and degradation in production is guaranteed; the scaler must be fitted only on the training fold inside CV. Another one: optimizing accuracy instead of the business metric on imbalanced classes, which finds a configuration with 98.3% accuracy and a minority-class recall of 0.04.
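The leak-free pattern is to put the preprocessing inside an sklearn Pipeline, so the scaler is re-fitted on each training fold automatically. A minimal sketch on synthetic data (the dataset and classifier are invented for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# The scaler is re-fitted on each training fold inside cross_val_score,
# so validation folds never influence the preprocessing statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
print(f'leak-free CV AUC: {scores.mean():.3f}')
```

Inside an Optuna objective, the same rule applies: build the pipeline from the trial's parameters and cross-validate the whole pipeline, never pre-scaled data.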
Timeframe: basic HPO with Optuna on a single task takes 2–5 days, including environment setup and result analysis. Distributed HPO with Ray Tune on a cluster, with CI/CD pipeline integration, takes 2–4 weeks.