H2O.ai AutoML Integration for Automatic Model Training
H2O.ai AutoML is one of the most mature industrial AutoML platforms, with built-in stacking, a model leaderboard, and distributed training support on Spark/Hadoop clusters.
H2O AutoML Key Features
What H2O AutoML does:
- Automatically runs algorithms: GBM, XGBoost, Random Forest, Deep Learning, GLM, Stacked Ensembles
- Builds a Stacked Ensemble from the best models
- Leaderboard with sorting by selected metric
- Cross-validation is built-in by default
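The leaderboard idea behind the features above can be sketched in plain Python: candidate models ranked by a chosen metric, best first. The model records below are hypothetical; H2O builds this table automatically after training.

```python
# Toy leaderboard: rank candidate models by a chosen metric, best first.
# Records are illustrative, not real H2O output.
def build_leaderboard(models, metric='auc', higher_is_better=True):
    """Sort model records by `metric`, mimicking the leaderboard ordering."""
    return sorted(models, key=lambda m: m[metric], reverse=higher_is_better)

candidates = [
    {'model_id': 'GBM_1', 'auc': 0.91},
    {'model_id': 'StackedEnsemble_BestOfFamily', 'auc': 0.94},
    {'model_id': 'GLM_1', 'auc': 0.87},
]

leaderboard = build_leaderboard(candidates)
leader = leaderboard[0]  # analogous to aml.leader in the H2O API
```

In real H2O the same ordering is controlled by the `sort_metric` argument shown in the pipeline below.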
Basic integration
Python client:
import h2o
from h2o.automl import H2OAutoML
import pandas as pd
def run_h2o_automl(train_df: pd.DataFrame,
                   target_col: str,
                   max_models: int = 20,
                   max_runtime_secs: int = 600) -> dict:
    """
    Full H2O AutoML pipeline: train, rank models, export the leader as a MOJO.
    """
    # Initialize H2O (locally or on a cluster)
    h2o.init(nthreads=-1, max_mem_size='8G')

    # Convert to H2OFrame
    h2o_train = h2o.H2OFrame(train_df)

    # Column types: categorical features and (for classification) the target
    for col in train_df.select_dtypes(include=['object']).columns:
        h2o_train[col] = h2o_train[col].asfactor()
    if train_df[target_col].nunique() <= 20:
        h2o_train[target_col] = h2o_train[target_col].asfactor()

    feature_cols = [c for c in train_df.columns if c != target_col]

    # Run AutoML
    aml = H2OAutoML(
        max_models=max_models,
        max_runtime_secs=max_runtime_secs,
        seed=42,
        sort_metric='AUC',
        balance_classes=True,
        stopping_metric='AUC',
        stopping_rounds=5
    )
    aml.train(x=feature_cols, y=target_col, training_frame=h2o_train)

    # Leaderboard as a pandas DataFrame
    lb = aml.leaderboard.as_data_frame()

    # Best model (the leader, typically a Stacked Ensemble)
    best_model = aml.leader

    # MOJO export for production deployment
    mojo_path = best_model.save_mojo(path='/tmp/h2o_mojo/')

    return {
        'leaderboard': lb,
        'best_model_id': best_model.model_id,
        'best_auc': lb.iloc[0]['auc'],
        'mojo_path': mojo_path
    }
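The target-type heuristic used in the pipeline above (a target with at most 20 distinct values is treated as categorical and converted with asfactor()) can be isolated into a small, testable helper:

```python
def is_classification_target(values, max_unique=20):
    """Heuristic from the pipeline above: a target column with few distinct
    values is treated as categorical (classification), otherwise as numeric
    (regression). The cutoff of 20 mirrors the nunique() check in the code."""
    return len(set(values)) <= max_unique

# A binary label column is treated as classification;
# a continuous column with many distinct values is not.
binary_labels = [0, 1, 1, 0, 1, 0]
continuous_values = [i / 7 for i in range(100)]
```

Whether 20 is the right cutoff depends on the data; ordinal integer targets with many levels may need an explicit override.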
Production deployment of H2O MOJO
Java-based inference without an H2O server:
def deploy_h2o_mojo_rest_api(mojo_path: str, port: int = 8080) -> dict:
    """
    An H2O MOJO compiles into a Java artifact and scores without Python
    or an H2O server. Suitable for embedding in Java/Scala microservices.
    """
    # Illustrative command: batch scoring with the genmodel tools.
    # In production, the h2o-mojo-scoring-server Docker image exposes a REST endpoint instead.
    cmd = [
        'java', '-cp', 'h2o-genmodel.jar:scoring-server.jar',
        'hex.genmodel.tools.PredictCsv',
        '--mojo', mojo_path,
        '--input', '/dev/stdin'
    ]
    return {'endpoint': f'http://localhost:{port}/predict',
            'format': 'CSV/JSON',
            'scoring_cmd': cmd}

def predict_with_mojo_api(endpoint: str, features: dict) -> dict:
    import requests
    response = requests.post(endpoint, json={'features': features})
    response.raise_for_status()
    return response.json()
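If the requests dependency is unwanted, the same client can be built on the standard library. Note that the /predict route and the {'features': ...} payload shape are assumptions carried over from the sketch above, not a fixed H2O contract:

```python
import json
import urllib.request

def build_predict_request(endpoint: str, features: dict) -> urllib.request.Request:
    """Build a POST request for the scoring endpoint. The payload shape
    {'features': ...} is an assumption; match it to your scoring server."""
    body = json.dumps({'features': features}).encode('utf-8')
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

req = build_predict_request('http://localhost:8080/predict',
                            {'age': 42, 'income': 55000})
# Sending it would be: urllib.request.urlopen(req)
```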
Integration with Spark (H2O Sparkling Water)
Distributed training on Spark cluster:
# pysparkling — H2O на Spark
from pysparkling import H2OContext
from pysparkling.ml import H2OAutoML as SparkH2OAutoML
from pyspark.sql import SparkSession
def h2o_sparkling_automl(spark_df, target_col: str):
    """
    H2O Sparkling Water: AutoML on a Spark DataFrame.
    Suitable for datasets with more than 10 million rows.
    """
    spark = SparkSession.builder.getOrCreate()
    hc = H2OContext.getOrCreate()

    automl = SparkH2OAutoML(
        maxModels=30,
        labelCol=target_col,
        maxRuntimeSecs=3600
    )
    model = automl.fit(spark_df)

    # Leaderboard as a Spark DataFrame
    leaderboard = automl.getLeaderboard()
    return model, leaderboard
Timeframe: an H2O AutoML baseline with leaderboard and MOJO export takes 3-5 days; a Sparkling Water cluster launch, custom metrics, and a continuous retraining pipeline take 2-3 weeks.
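For the continuous retraining pipeline, the core decision is when to trigger a new AutoML run. A minimal sketch, assuming a monitored live AUC compared against the AUC recorded at deployment time (the threshold value is illustrative):

```python
def should_retrain(current_auc: float, baseline_auc: float,
                   max_degradation: float = 0.02) -> bool:
    """Trigger retraining when live AUC drops more than `max_degradation`
    below the AUC recorded at deployment time. The 0.02 threshold is a
    placeholder; tune it to the business cost of a stale model."""
    return (baseline_auc - current_auc) > max_degradation

# Deployed at AUC 0.94; monitoring now reports 0.90 -> retrain.
```

In a real pipeline this check would run on a schedule (e.g. nightly), and a positive result would kick off run_h2o_automl on fresh data, followed by a leaderboard comparison before promoting the new MOJO.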