Auto-sklearn Integration for Automated ML Pipeline Selection

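Auto-sklearn is an open-source AutoML framework built on scikit-learn. It searches over preprocessing steps, algorithms, and hyperparameters with Bayesian optimization (SMAC), warm-starts the search with meta-learning from previously seen datasets, and builds a weighted ensemble of the best models it evaluated. The experimental Auto-sklearn 2.0 can additionally use successive halving to stop unpromising configurations early.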
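
The ensemble step can be illustrated with a minimal sketch of greedy ensemble selection, the general technique auto-sklearn builds on. This is not the library's code: the function names, the dictionary of validation predictions, and the log-loss metric are assumptions chosen for the illustration. Models are added with replacement, and a model's final weight is the fraction of slots it won.

```python
import math
from typing import Dict, List, Sequence


def log_loss(y_true: Sequence[int], p_pos: Sequence[float]) -> float:
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(y_true)


def greedy_ensemble_selection(
    val_preds: Dict[str, List[float]],
    y_val: Sequence[int],
    ensemble_size: int = 20,
) -> Dict[str, float]:
    """Fill ensemble slots one by one, each time adding the model (with
    replacement) whose inclusion gives the lowest validation loss."""
    if not val_preds:
        raise ValueError("val_preds must contain at least one model")
    running_sum = [0.0] * len(y_val)
    counts: Dict[str, int] = {}
    for step in range(1, ensemble_size + 1):
        best_name, best_loss = None, float("inf")
        for name, preds in val_preds.items():
            # Averaged prediction if this model took the next slot.
            candidate = [(s + p) / step for s, p in zip(running_sum, preds)]
            loss = log_loss(y_val, candidate)
            if loss < best_loss:
                best_name, best_loss = name, loss
        running_sum = [s + p for s, p in zip(running_sum, val_preds[best_name])]
        counts[best_name] = counts.get(best_name, 0) + 1
    # Weight = share of ensemble slots won by each model.
    return {name: c / ensemble_size for name, c in counts.items()}
```

In this sketch a model that never wins a slot simply gets no weight, which is why the final ensemble is usually much smaller than the number of evaluated configurations.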

Installation and basic use

Install the package with pip install auto-sklearn, then run a basic classification search:

import autosklearn.classification
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def run_autosklearn_classification(X: pd.DataFrame, y: pd.Series,
                                    time_budget: int = 600) -> dict:
    """
    Run an Auto-sklearn search on a binary classification task.

    This uses the classic AutoSklearnClassifier. The experimental
    AutoSklearn2Classifier (autosklearn.experimental.askl2) supports a more
    limited set of algorithms than v1 but is considerably faster.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=time_budget,  # total search budget, seconds
        per_run_time_limit=60,         # limit per configuration, seconds
        memory_limit=4096,             # MB
        n_jobs=4,
        ensemble_size=20,              # recent releases prefer ensemble_kwargs
        ensemble_nbest=15,
        seed=42,
        resampling_strategy='cv',
        resampling_strategy_arguments={'folds': 5}
    )

    automl.fit(X_train, y_train, dataset_name='my_dataset')

    # With 'cv' the search trains models per fold; refit retrains the
    # selected ensemble members on the full training set before predicting.
    automl.refit(X_train.copy(), y_train.copy())

    y_proba = automl.predict_proba(X_test)

    # Text summary of the search (number of runs, failures, best score)
    stats = automl.sprint_statistics()

    return {
        # Binary task: column 1 holds the positive-class probability
        'roc_auc': roc_auc_score(y_test, y_proba[:, 1]),
        'leaderboard': automl.leaderboard(),
        'sprint_stats': stats,
        'models_count': len(automl.get_models_with_weights())
    }
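
The time_left_for_this_task, per_run_time_limit and n_jobs settings interact, so it can help to sanity-check a budget before starting a long run. The sketch below is a hypothetical helper, not part of auto-sklearn. It assumes the documented default of one tenth of the total budget when no per-run limit is given, and it ignores overhead such as meta-learning and ensemble building, so the run count is only a rough figure.

```python
from typing import Optional


def plan_search_budget(time_left_for_this_task: int,
                       per_run_time_limit: Optional[int] = None,
                       n_jobs: int = 1) -> dict:
    """Rough estimate of how many configurations fit into a time budget
    if every configuration runs until its per-run limit."""
    if time_left_for_this_task <= 0 or n_jobs < 1:
        raise ValueError("budget and n_jobs must be positive")
    if per_run_time_limit is None:
        # Assumed default: one tenth of the total budget.
        per_run_time_limit = max(1, time_left_for_this_task // 10)
    if per_run_time_limit >= time_left_for_this_task:
        # Sanity check of this sketch: one run would consume the whole budget.
        raise ValueError("per_run_time_limit should be below the total budget")
    sequential_slots = time_left_for_this_task // per_run_time_limit
    return {
        "per_run_time_limit": per_run_time_limit,
        "approx_runs_if_all_hit_limit": sequential_slots * n_jobs,
    }
```

For the settings used above (600 seconds total, 60 seconds per run, 4 workers) this gives about 40 configurations in the slowest case. Fast models finish well before the limit, so real searches typically evaluate more.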

Search customization

Restricting the algorithm space:

def run_autosklearn_limited(X_train, y_train):
    """
    Restrict the search to tree-based models: faster and easier to interpret.
    """
    from autosklearn.classification import AutoSklearnClassifier

    automl = AutoSklearnClassifier(
        time_left_for_this_task=300,
        # The allow-list already keeps slow algorithms such as 'libsvm_svc'
        # and 'mlp' out of the search. Do not also pass exclude for the same
        # component: auto-sklearn rejects include and exclude used together.
        include={
            'classifier': ['random_forest', 'gradient_boosting', 'extra_trees'],
            'feature_preprocessor': ['no_preprocessing', 'pca', 'select_percentile_classification']
        },
        seed=42
    )
    automl.fit(X_train, y_train)
    return automl
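
As far as we know, auto-sklearn raises an error when include and exclude are both given for the same pipeline step, and that error only appears once the search space is built. A small pre-flight check can catch the conflict earlier. The helper name below is hypothetical and the check is a sketch of the rule, not the library's own validation.

```python
from typing import Dict, List, Optional


def check_include_exclude(include: Optional[Dict[str, List[str]]],
                          exclude: Optional[Dict[str, List[str]]]) -> None:
    """Raise if the same pipeline step appears in both include and exclude."""
    overlap = sorted(set(include or {}) & set(exclude or {}))
    if overlap:
        raise ValueError(
            "include and exclude both set for step(s): " + ", ".join(overlap)
        )
```

An allow-list (include) is usually the simpler choice when only a handful of algorithms are wanted; a deny-list (exclude) fits better when only a few slow ones should be dropped.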

Auto-sklearn for time series

Important: use time-ordered cross-validation:

import pandas as pd
from autosklearn.classification import AutoSklearnClassifier
from sklearn.model_selection import TimeSeriesSplit


def run_autosklearn_timeseries(X: pd.DataFrame, y: pd.Series) -> AutoSklearnClassifier:
    """
    Ordinary CV must not be used for time series: it mixes past and future rows.
    Use a TimeSeriesSplit splitter instead; rows must be in chronological order.
    """
    # Each fold trains on the past and validates on the block that follows.
    tscv = TimeSeriesSplit(n_splits=5)

    # Assumption: a recent auto-sklearn release, where resampling_strategy
    # accepts a scikit-learn splitter object in addition to the strings
    # 'holdout' and 'cv'. There is no 'custom' string option.
    automl = AutoSklearnClassifier(
        time_left_for_this_task=300,
        resampling_strategy=tscv,
        seed=42
    )

    # Note: if the installed version does not accept splitter objects,
    # proper time-series CV requires monkey-patching auto-sklearn
    # or switching to FLAML/Optuna.
    automl.fit(X.values, y.values)

    # Retrain the selected models on the full history before predicting.
    automl.refit(X.values, y.values)
    return automl
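
To see why time-ordered splits matter, the sketch below generates expanding-window folds in plain Python. It follows, as far as we understand it, the default behavior of scikit-learn's TimeSeriesSplit (no gap, no cap on the training window); the function name is an assumption made for this illustration.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs in which training rows always
    precede test rows, so no future information leaks into training."""
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(test_start))  # everything before the test block
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx
```

With 12 rows and 3 splits, the folds train on rows 0-2, 0-5 and 0-8 and validate on rows 3-5, 6-8 and 9-11. A shuffled K-fold would instead place later rows in the training set of earlier test rows.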

Export and deploy

Saving the model:

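import os

import joblib


def export_autosklearn_model(automl, output_path: str) -> dict:
    """
    Auto-sklearn uses sklearn Pipeline under the hood.
    Saving with joblib is the standard sklearn route.
    """
    os.makedirs(output_path, exist_ok=True)
    ensemble_path = os.path.join(output_path, 'autosklearn_ensemble.pkl')
    single_path = os.path.join(output_path, 'best_single_model.pkl')

    # Full model (including the ensemble)
    joblib.dump(automl, ensemble_path)

    # Single model only: the ensemble member with the highest weight.
    # get_models_with_weights() yields (weight, model) pairs. The artifact
    # is smaller, but unpickling it still needs auto-sklearn installed.
    _, best_model = max(automl.get_models_with_weights(), key=lambda pair: pair[0])
    joblib.dump(best_model, single_path)

    return {'ensemble_path': ensemble_path, 'single_model_path': single_path}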
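
Which member counts as the "best single model" depends on the ensemble weights, so it is worth inspecting them before exporting. The helpers below are a minimal sketch with hypothetical names. They assume only that the input is an iterable of (weight, model) pairs, as returned by get_models_with_weights(), and they do not rely on the pairs being sorted.

```python
from typing import Any, Iterable, List, Tuple


def summarize_ensemble(
    models_with_weights: Iterable[Tuple[float, Any]]
) -> List[Tuple[float, str]]:
    """Return (weight, model class name) pairs, heaviest member first."""
    ranked = sorted(models_with_weights, key=lambda pair: pair[0], reverse=True)
    return [(round(float(w), 4), type(m).__name__) for w, m in ranked]


def heaviest_member(models_with_weights: Iterable[Tuple[float, Any]]) -> Any:
    """Return the model that carries the largest ensemble weight."""
    pairs = list(models_with_weights)
    if not pairs:
        raise ValueError("ensemble is empty")
    _, model = max(pairs, key=lambda pair: pair[0])
    return model
```

The highest-weighted member is not guaranteed to be the best stand-alone model, so its hold-out score is worth checking against the full ensemble before deploying it alone.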

Timeframe: basic AutoSklearn run plus evaluation: 1-2 days. Search-space customization, custom preprocessors, correct time-series CV, and MLflow tracking: 1-2 weeks.