AutoML implementation for automatic model and hyperparameter selection
AutoML automates the full ML cycle: from data preprocessing to algorithm selection and hyperparameter tuning. It's not a replacement for ML engineers, but a tool for accelerating prototyping and routine tasks.
AutoML Pipeline – What is automated?
Full AutoML Cycle:
automl_components = {
    '1_data_preprocessing': [
        'imputation (median, mode, KNN)',
        'encoding (OHE, target encoding, embeddings)',
        'scaling (standard, robust, log-transform)',
        'feature selection (mutual info, boruta)'
    ],
    '2_feature_engineering': [
        'polynomial features',
        'interaction terms',
        'temporal features (lag, rolling stats)',
        'text features (TF-IDF, embeddings)'
    ],
    '3_model_selection': [
        'linear models', 'tree-based (RF, XGBoost, LightGBM, CatBoost)',
        'neural networks (TabNet, NODE)',
        'ensembles (stacking, blending)'
    ],
    '4_hyperparameter_optimization': [
        'Bayesian optimization (Optuna, SMAC)',
        'random search', 'CMA-ES'
    ],
    '5_model_evaluation': [
        'cross-validation (stratified, time-series)',
        'learning curves', 'holdout validation'
    ]
}
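As a minimal illustration of stages 3–5 in miniature, here is a hand-rolled random search over model family and hyperparameters using only scikit-learn. This is a toy sketch of what AutoML frameworks automate, not code from any AutoML library; the dataset and search budget are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy dataset standing in for real tabular data
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

rng = np.random.default_rng(42)
candidates = []
for _ in range(10):
    # Stages 3-4 in miniature: sample a model family and its hyperparameters
    if rng.random() < 0.5:
        model = LogisticRegression(C=10 ** rng.uniform(-3, 2), max_iter=1000)
    else:
        model = RandomForestClassifier(
            n_estimators=int(rng.integers(50, 200)),
            max_depth=int(rng.integers(3, 10)),
            random_state=0,
        )
    # Stage 5: cross-validated evaluation
    score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
    candidates.append((score, model))

best_score, best_model = max(candidates, key=lambda t: t[0])
print(type(best_model).__name__, round(best_score, 3))
```

Real frameworks replace the uniform sampling with smarter strategies (Bayesian optimization, cost-aware search) and add preprocessing and feature-engineering stages to the search space.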
FLAML — Fast AutoML from Microsoft
Minimal example for tabular data:
from flaml import AutoML
import pandas as pd
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.model_selection import train_test_split

def run_automl_classification(X: pd.DataFrame, y: pd.Series,
                              time_budget: int = 300) -> dict:
    """
    FLAML: cost-frugal AutoML with low-cost trial estimation.
    time_budget: seconds allotted to the optimization.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    automl = AutoML()
    automl.fit(
        X_train, y_train,
        task='classification',
        time_budget=time_budget,
        metric='roc_auc',
        eval_method='cv',
        n_splits=5,
        verbose=1
    )
    y_pred = automl.predict(X_test)
    y_proba = automl.predict_proba(X_test)[:, 1]
    return {
        'best_model': automl.best_estimator,
        'best_config': automl.best_config,
        'roc_auc': roc_auc_score(y_test, y_proba),
        'classification_report': classification_report(y_test, y_pred),
        'time_to_best_model_s': automl.time_to_find_best_model
    }
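FLAML's cost-frugal idea can be sketched without the library: start every candidate configuration at its cheapest size and only grow the budget of the configuration that keeps winning. The doubling scheme below is a simplified illustration of that idea, not FLAML's actual algorithm; the dataset and candidate grid are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Candidate learning rates; each starts at the cheapest size (few estimators)
configs = {lr: 10 for lr in (0.01, 0.1, 0.3)}  # lr -> current n_estimators
best = (0.0, None)

for _ in range(3):  # three rounds of budget doubling
    scores = {}
    for lr, n_est in configs.items():
        model = GradientBoostingClassifier(learning_rate=lr, n_estimators=n_est)
        scores[lr] = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
        if scores[lr] > best[0]:
            best = (scores[lr], {'learning_rate': lr, 'n_estimators': n_est})
    # Only the current winner earns a doubled training budget (cost-frugal idea)
    winner = max(scores, key=scores.get)
    configs = {winner: configs[winner] * 2}

print(best)
```

The point is that weak configurations are eliminated after only a cheap evaluation, so most of the compute budget goes to promising ones.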
Auto-sklearn — Meta-Learning
Using meta-knowledge about tasks:
import autosklearn.classification
from sklearn.metrics import roc_auc_score

def run_autosklearn(X_train, y_train, X_test, y_test,
                    time_left: int = 600) -> dict:
    """
    Auto-sklearn uses meta-learning: it guesses good initial configurations
    from a database of results on similar datasets.
    """
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=time_left,
        per_run_time_limit=30,
        memory_limit=4096,
        ensemble_size=10,
        ensemble_nbest=10,
        metric=autosklearn.metrics.roc_auc
    )
    automl.fit(X_train, y_train)
    # Summary statistics of the search run
    print(automl.sprint_statistics())
    y_proba = automl.predict_proba(X_test)[:, 1]
    return {
        'model': automl,
        'leaderboard': automl.leaderboard(),
        'ensemble_models': automl.show_models(),
        'roc_auc': roc_auc_score(y_test, y_proba)
    }
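The meta-learning step can be illustrated as a nearest-neighbour lookup over dataset meta-features: choose the starting configuration that worked best on the most similar previously seen dataset. The meta-database and configurations below are invented for illustration; auto-sklearn's real database holds results from hundreds of OpenML datasets.

```python
import numpy as np

# Hypothetical meta-database: dataset meta-features -> best-known starting config
meta_db = [
    ({'n_rows': 1_000, 'n_cols': 20, 'class_balance': 0.5},
     {'model': 'random_forest', 'max_depth': 8}),
    ({'n_rows': 100_000, 'n_cols': 300, 'class_balance': 0.05},
     {'model': 'lightgbm', 'num_leaves': 127}),
]

def meta_features(n_rows, n_cols, class_balance):
    # Log-scale row/column counts so size differences compare sensibly
    return np.array([np.log10(n_rows), np.log10(n_cols), class_balance])

def warm_start_config(n_rows, n_cols, class_balance):
    """Return the stored config of the nearest dataset in meta-feature space."""
    query = meta_features(n_rows, n_cols, class_balance)
    dists = [np.linalg.norm(query - meta_features(**mf)) for mf, _ in meta_db]
    return meta_db[int(np.argmin(dists))][1]

# A small, balanced dataset matches the first (small, balanced) entry
print(warm_start_config(n_rows=2_000, n_cols=15, class_balance=0.4))
```

Warm-starting this way means the optimizer spends its first trials on configurations that are already plausible instead of random ones.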
Optuna + LightGBM — Advanced Optimization
Full pipeline with preprocessing:
import optuna
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

def create_lgbm_pipeline_study(X, y, n_trials=100):
    def objective(trial):
        # Preprocessing hyperparameters
        imputer_strategy = trial.suggest_categorical(
            'imputer_strategy', ['mean', 'median', 'most_frequent'])
        # LightGBM hyperparameters (sklearn-API names; note that
        # subsample only takes effect with subsample_freq > 0)
        lgbm_params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 500),
            'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
            'max_depth': trial.suggest_int('max_depth', 3, 10),
            'num_leaves': trial.suggest_int('num_leaves', 15, 255),
            'min_child_samples': trial.suggest_int('min_child_samples', 5, 200),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1.0),
            'subsample': trial.suggest_float('subsample', 0.4, 1.0),
            'subsample_freq': 1,
            'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10, log=True),
            'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10, log=True),
            'class_weight': 'balanced'
        }
        pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy=imputer_strategy)),
            ('scaler', StandardScaler()),
            ('model', LGBMClassifier(**lgbm_params, verbose=-1))
        ])
        scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc', n_jobs=-1)
        return scores.mean() - scores.std()  # mean AUC minus a penalty for instability
    study = optuna.create_study(direction='maximize',
                                sampler=optuna.samplers.TPESampler())
    study.optimize(objective, n_trials=n_trials, n_jobs=1)
    return study
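The `scores.mean() - scores.std()` objective rewards configurations that are both accurate and stable across folds: a model scoring 0.90 ± 0.01 beats one scoring 0.91 ± 0.05. A small sklearn-only illustration of the aggregation (no Optuna or LightGBM required; the dataset here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring='roc_auc')
# Subtracting the fold-to-fold std penalizes configurations whose
# high mean comes from one lucky fold rather than consistent quality
robust_score = scores.mean() - scores.std()
print(round(robust_score, 3))
```

Without the penalty, the optimizer tends to overfit the cross-validation split itself, picking configurations whose scores vary wildly across folds.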
When to Use AutoML vs. Manual Development:
| Scenario | AutoML | Manual development |
|---|---|---|
| Prototype in a day | + | - |
| Standard binary classification | + | overkill |
| Non-standard features (text + graph + numbers) | partially | + |
| Strict inference latency requirements | - | + |
| Regulatory requirements for interpretation | - | + |
Timeframe: a FLAML/Optuna baseline with a CV pipeline and a results report takes 1-2 weeks. Custom metrics, ensemble stacking, and feature engineering inside the AutoML loop add up to 3-4 weeks.