Pipeliner Class Usage

The task is simple: find the best combination of preprocessing steps and predictive models with respect to an objective criterion. Logistically this can be problematic: even a small example might involve three classification models and two preprocessing steps with two possible variations each, for a total of 12 combinations. For each of these combinations we would like to run a grid search over predefined hyperparameters on a fixed cross-validation split, computing a performance metric (for example ROC AUC) for each option. Clearly this becomes complicated quickly. On the other hand, many of these combinations share substeps, and re-running such shared steps wastes compute time.
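The growth is easy to see: the number of pipelines is just the Cartesian product of the step variants. A quick sketch with Python's itertools (the step names mirror the example below):

```python
from itertools import product

# Hypothetical step variants matching the example below:
# 2 feature-engineering options x 2 scalers x 3 classifiers.
feature_engineering = ['VT', 'PCA']
scalers = ['standard', 'minmax']
classifiers = ['LR', 'SVC', 'SGD']

combinations = list(product(feature_engineering, scalers, classifiers))
print(len(combinations))  # 12 pipelines from just three small steps
```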

1. Defining Pipeline Steps and Grid Search Parameters

The researcher specifies the possible processing steps and the scikit-learn objects involved, and Reskit expands these steps into every possible pipeline. Reskit represents the pipelines in a convenient pandas dataframe, so the researcher can directly visualize and manipulate the planned experiments.

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

from sklearn.model_selection import StratifiedKFold

from reskit.core import Pipeliner

# Feature selection and feature extraction step variants (1st step)
feature_engineering = [('VT', VarianceThreshold()),
                       ('PCA', PCA())]

# Preprocessing step variants (2nd step)
scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]

# Models (3rd step)
classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC()),
               ('SGD', SGDClassifier())]

# Reskit requires steps to be defined in this manner
steps = [('feature_engineering', feature_engineering),
         ('scaler', scalers),
         ('classifier', classifiers)]

# Grid search parameters for our models
param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
              'SGD': {'penalty': ['elasticnet'],
                      'l1_ratio': [0.1, 0.2, 0.3]}}

# Quality metric that we want to optimize
scoring = 'roc_auc'

# Setting cross-validations
grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipe.plan_table
   feature_engineering    scaler classifier
0                   VT  standard          LR
1                   VT  standard         SVC
2                   VT  standard         SGD
3                   VT    minmax          LR
4                   VT    minmax         SVC
5                   VT    minmax         SGD
6                  PCA  standard          LR
7                  PCA  standard         SVC
8                  PCA  standard         SGD
9                  PCA    minmax          LR
10                 PCA    minmax         SVC
11                 PCA    minmax         SGD
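Since plan_table is an ordinary pandas DataFrame, it can be inspected and filtered with standard pandas operations. A sketch, using a hand-built stand-in with the same columns as the table above:

```python
import pandas as pd

# A stand-in for pipe.plan_table, rebuilt by hand with the same columns.
plan_table = pd.DataFrame(
    [(fe, sc, clf)
     for fe in ('VT', 'PCA')
     for sc in ('standard', 'minmax')
     for clf in ('LR', 'SVC', 'SGD')],
    columns=['feature_engineering', 'scaler', 'classifier'])

# Ordinary pandas operations apply, e.g. keep only the PCA pipelines:
pca_only = plan_table[plan_table['feature_engineering'] == 'PCA']
print(len(pca_only))  # 6
```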

2. Forbidden combinations

If you don’t want to use the minmax scaler together with SVC, you can define a banned combination:

banned_combos = [('minmax', 'SVC')]
pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid, banned_combos=banned_combos)
pipe.plan_table
  feature_engineering    scaler classifier
0                  VT  standard          LR
1                  VT  standard         SVC
2                  VT  standard         SGD
3                  VT    minmax          LR
4                  VT    minmax         SGD
5                 PCA  standard          LR
6                 PCA  standard         SVC
7                 PCA  standard         SGD
8                 PCA    minmax          LR
9                 PCA    minmax         SGD
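Conceptually, a banned combination simply removes every expanded pipeline that contains all of its members. A minimal illustration of that filtering with itertools (this is not Reskit's actual implementation):

```python
from itertools import product

# Step variants as in the example, plus the banned pair.
steps = [('VT', 'PCA'), ('standard', 'minmax'), ('LR', 'SVC', 'SGD')]
banned_combos = [('minmax', 'SVC')]

# Keep only combinations that do not contain any banned pair in full.
plans = [combo for combo in product(*steps)
         if not any(set(banned) <= set(combo) for banned in banned_combos)]
print(len(plans))  # 10 of the original 12 combinations remain
```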

3. Launching Experiment

Reskit then runs each experiment and presents the results to the user in a pandas dataframe. For each pipeline’s classifier, Reskit performs a grid search on cross-validation to find the best classifier parameters, then reports the metric’s mean and standard deviation for each tested pipeline (ROC AUC in this case).

from sklearn.datasets import make_classification


X, y = make_classification()
pipe.get_results(X, y, scoring=['roc_auc'])
Line: 1/10
Line: 2/10
Line: 3/10
Line: 4/10
Line: 5/10
Line: 6/10
Line: 7/10
Line: 8/10
Line: 9/10
Line: 10/10
  feature_engineering scaler classifier grid_roc_auc_mean grid_roc_auc_std grid_roc_auc_best_params eval_roc_auc_mean eval_roc_auc_std eval_roc_auc_scores
0 VT standard LR 0.98 0.0109544511501 {'penalty': 'l1'} 0.978 0.024 [ 0.99 1. 1. 0.96 0.94]
1 VT standard SVC 0.97 0.0289827534924 {'kernel': 'sigmoid'} 0.972 0.036551333765 [ 1. 1. 1. 0.95 0.91]
2 VT standard SGD 0.968 0.0203960780544 {'l1_ratio': 0.3, 'penalty': 'elasticnet'} 0.958 0.0213541565041 [ 0.98 0.92 0.97 0.97 0.95]
3 VT minmax LR 0.98 0.0141421356237 {'penalty': 'l1'} 0.978 0.0203960780544 [ 0.96 1. 1. 0.98 0.95]
4 VT minmax SGD 0.968 0.0193907194297 {'l1_ratio': 0.2, 'penalty': 'elasticnet'} 0.966 0.0422374241639 [ 0.99 1. 1. 0.95 0.89]
5 PCA standard LR 0.978 0.0116619037897 {'penalty': 'l1'} 0.982 0.0193907194297 [ 1. 1. 0.99 0.95 0.97]
6 PCA standard SVC 0.958 0.0263818119165 {'kernel': 'sigmoid'} 0.956 0.054258639865 [ 1. 1. 1. 0.88 0.9 ]
7 PCA standard SGD 0.918 0.0426145515053 {'l1_ratio': 0.3, 'penalty': 'elasticnet'} 0.94 0.0433589667774 [ 0.98 0.96 0.97 0.86 0.93]
8 PCA minmax LR 0.97 0.0352136337233 {'penalty': 'l2'} 0.936 0.0705974503789 [ 1. 1. 0.97 0.82 0.89]
9 PCA minmax SGD 0.946 0.032 {'l1_ratio': 0.1, 'penalty': 'elasticnet'} 0.934 0.0697423830967 [ 1. 1. 0.97 0.84 0.86]
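Each row above corresponds to standard scikit-learn machinery. For the first pipeline (VT, standard, LR), the computation is roughly equivalent to the following sketch (not Reskit's actual code; solver='liblinear' is set so that the l1 penalty works in current scikit-learn versions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

pipeline = Pipeline([('VT', VarianceThreshold()),
                     ('standard', StandardScaler()),
                     ('LR', LogisticRegression(solver='liblinear'))])

grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Grid search on grid_cv for the classifier's best parameters ...
grid = GridSearchCV(pipeline, {'LR__penalty': ['l1', 'l2']},
                    scoring='roc_auc', cv=grid_cv).fit(X, y)

# ... then evaluation of the best configuration on eval_cv.
scores = cross_val_score(grid.best_estimator_, X, y,
                         scoring='roc_auc', cv=eval_cv)
print(grid.best_params_, scores.mean(), scores.std())
```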

4. Caching intermediate steps

Reskit also allows you to cache intermediate calculations to avoid unnecessary recomputation. (Note that in this example get_results is called without a scoring argument, so the default accuracy metric is reported.)

from sklearn.preprocessing import Binarizer

# Simple binarization step that we want to cache
binarizer = [('binarizer', Binarizer())]

# Reskit requires steps to be defined in this manner
steps = [('binarizer', binarizer),
         ('classifier', classifiers)]

pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipe.plan_table
  binarizer classifier
0 binarizer          LR
1 binarizer         SVC
2 binarizer         SGD
pipe.get_results(X, y, caching_steps=['binarizer'])
Line: 1/3
Line: 2/3
Line: 3/3
  binarizer classifier grid_accuracy_mean grid_accuracy_std grid_accuracy_best_params eval_accuracy_mean eval_accuracy_std eval_accuracy_scores
0 binarizer LR 0.92 0.0244948974278 {'penalty': 'l1'} 0.92 0.0244948974278 [ 0.95 0.9 0.95 0.9 0.9 ]
1 binarizer SVC 0.92 0.0244948974278 {'kernel': 'rbf'} 0.92 0.0244948974278 [ 0.95 0.9 0.95 0.9 0.9 ]
2 binarizer SGD 0.85 0.0894427191 {'l1_ratio': 0.2, 'penalty': 'elasticnet'} 0.82 0.0812403840464 [ 0.9 0.85 0.9 0.75 0.7 ]

The last cached calculations are stored in the _cached_X attribute:

pipe._cached_X
OrderedDict([('init',
              array([[-0.34004591,  0.07223225, -0.10297704, ...,  1.55809216,
                      -1.84967225,  1.20716726],
                     [-0.61534739, -0.2666859 , -1.21834152, ..., -1.31814689,
                       0.97544639, -1.21321157],
                     [ 1.08934663,  0.12345205,  0.09360395, ..., -0.50379748,
                      -0.03416718,  1.51609726],
                     ...,
                     [-1.06428161, -0.22220536, -2.87462458, ..., -0.17236827,
                      -0.22141068,  2.76238087],
                     [ 0.40555432,  0.12063241,  1.1565546 , ...,  1.71135941,
                       0.29149897, -0.67978708],
                     [-0.47521282,  0.11614697,  0.45649735, ..., -0.15355913,
                       0.19643313,  0.67876913]])),
             ('binarizer', array([[ 0.,  1.,  0., ...,  1.,  0.,  1.],
                     [ 0.,  0.,  0., ...,  0.,  1.,  0.],
                     [ 1.,  1.,  1., ...,  0.,  0.,  1.],
                     ...,
                     [ 0.,  0.,  0., ...,  0.,  0.,  1.],
                     [ 1.,  1.,  1., ...,  1.,  1.,  0.],
                     [ 0.,  1.,  1., ...,  0.,  1.,  1.]]))])
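For comparison, plain scikit-learn offers a similar effect through Pipeline's memory argument, which caches fitted transformers on disk (it caches per fit rather than exposing intermediate arrays the way _cached_X does):

```python
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

X, y = make_classification(random_state=0)
cachedir = mkdtemp()

# Transformer fits are cached in cachedir and reused on repeated fits,
# so the binarization is not recomputed for every classifier tried.
pipe = Pipeline([('binarizer', Binarizer()),
                 ('clf', LogisticRegression(solver='liblinear'))],
                memory=cachedir)
pipe.fit(X, y)
acc = pipe.score(X, y)
print(acc)
```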