Pipeliner Class Usage¶
The task is simple: find the best combination of preprocessing steps and predictive models with respect to an objective criterion. Logistically this can be problematic: even a small example might involve three classification models and two preprocessing steps with two possible variations each — 12 combinations overall. For each of these combinations we would like to grid-search predefined hyperparameters on a fixed cross-validation split, computing a performance metric for each option (for example, ROC AUC). Clearly this can become complicated quickly. On the other hand, many of these combinations share substeps, and re-running such shared steps wastes compute time.
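The combinatorial growth described above is easy to check with a quick enumeration (plain Python, independent of Reskit):

```python
from itertools import product

# Step variants from the example: 2 feature-engineering options,
# 2 scalers, and 3 classifiers
feature_engineering = ['VT', 'PCA']
scalers = ['standard', 'minmax']
classifiers = ['LR', 'SVC', 'SGD']

# Every pipeline is one element of the Cartesian product of the step variants
pipelines = list(product(feature_engineering, scalers, classifiers))
print(len(pipelines))  # 2 * 2 * 3 = 12 combinations
```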
1. Defining Pipeline Steps and Grid Search Parameters¶
The researcher specifies the possible processing steps and the scikit-learn objects involved, then Reskit expands these steps into every possible pipeline. Reskit represents these pipelines in a convenient pandas DataFrame, so the researcher can directly inspect and manipulate the experiments.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from reskit.core import Pipeliner
# Feature selection and feature extraction step variants (1st step)
feature_engineering = [('VT', VarianceThreshold()),
                       ('PCA', PCA())]

# Preprocessing step variants (2nd step)
scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]

# Models (3rd step)
classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC()),
               ('SGD', SGDClassifier())]

# Reskit requires steps to be defined in this manner
steps = [('feature_engineering', feature_engineering),
         ('scaler', scalers),
         ('classifier', classifiers)]

# Grid search parameters for our models
param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
              'SGD': {'penalty': ['elasticnet'],
                      'l1_ratio': [0.1, 0.2, 0.3]}}

# Quality metric that we want to optimize
scoring = 'roc_auc'

# Setting cross-validation schemes (one for grid search, one for evaluation)
grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipe.plan_table
|    | feature_engineering | scaler   | classifier |
|----|---------------------|----------|------------|
| 0  | VT                  | standard | LR         |
| 1  | VT                  | standard | SVC        |
| 2  | VT                  | standard | SGD        |
| 3  | VT                  | minmax   | LR         |
| 4  | VT                  | minmax   | SVC        |
| 5  | VT                  | minmax   | SGD        |
| 6  | PCA                 | standard | LR         |
| 7  | PCA                 | standard | SVC        |
| 8  | PCA                 | standard | SGD        |
| 9  | PCA                 | minmax   | LR         |
| 10 | PCA                 | minmax   | SVC        |
| 11 | PCA                 | minmax   | SGD        |
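Since the plan table is an ordinary pandas DataFrame, it can be filtered and manipulated like any other frame. A sketch (reconstructing an equivalent frame here rather than calling Reskit):

```python
from itertools import product

import pandas as pd

# Rebuild a frame equivalent to pipe.plan_table above
rows = list(product(['VT', 'PCA'], ['standard', 'minmax'], ['LR', 'SVC', 'SGD']))
plan = pd.DataFrame(rows, columns=['feature_engineering', 'scaler', 'classifier'])

# Keep only the logistic-regression pipelines via boolean indexing
lr_plan = plan[plan['classifier'] == 'LR']
print(len(lr_plan))  # 4 of the 12 pipelines end with LR
```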
2. Forbidden Combinations¶
If you don’t want to use the MinMax scaler with SVC, you can define a banned combination:
banned_combos = [('minmax', 'SVC')]
pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid, banned_combos=banned_combos)
pipe.plan_table
|   | feature_engineering | scaler   | classifier |
|---|---------------------|----------|------------|
| 0 | VT                  | standard | LR         |
| 1 | VT                  | standard | SVC        |
| 2 | VT                  | standard | SGD        |
| 3 | VT                  | minmax   | LR         |
| 4 | VT                  | minmax   | SGD        |
| 5 | PCA                 | standard | LR         |
| 6 | PCA                 | standard | SVC        |
| 7 | PCA                 | standard | SGD        |
| 8 | PCA                 | minmax   | LR         |
| 9 | PCA                 | minmax   | SGD        |
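The filtering behavior can be sketched in plain Python: a pipeline is dropped when it contains every name of a banned tuple. This is a hypothetical re-implementation for illustration, not Reskit’s actual code:

```python
from itertools import product

banned_combos = [('minmax', 'SVC')]

def allowed(pipeline, banned):
    # Reject a pipeline if any banned tuple is fully contained in it
    return not any(all(name in pipeline for name in combo) for combo in banned)

rows = list(product(['VT', 'PCA'], ['standard', 'minmax'], ['LR', 'SVC', 'SGD']))
kept = [row for row in rows if allowed(row, banned_combos)]
print(len(kept))  # 12 - 2 = 10 pipelines survive
```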
3. Launching the Experiment¶
Reskit then runs each experiment and presents the results to the user through a pandas DataFrame. For each pipeline’s classifier, Reskit performs a grid search on the cross-validation split to find the best classifier parameters, then reports the mean and standard deviation of the metric for each tested pipeline (ROC AUC in this case).
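Conceptually, each row of the plan table amounts to roughly the following scikit-learn calls — a sketch of the two-stage procedure, not Reskit’s actual code, and the hyperparameter grid here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(random_state=0)
grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Stage 1: grid search on grid_cv picks hyperparameters (the grid_* columns)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.1, 1.0, 10.0]},
                      scoring='roc_auc', cv=grid_cv)
search.fit(X, y)

# Stage 2: the tuned model is scored on eval_cv (the eval_* columns)
scores = cross_val_score(search.best_estimator_, X, y,
                         scoring='roc_auc', cv=eval_cv)
print(search.best_params_, scores.mean(), scores.std())
```

Evaluating on a second, differently seeded split reduces the optimistic bias of reporting the grid-search score itself.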
from sklearn.datasets import make_classification
X, y = make_classification()
pipe.get_results(X, y, scoring=['roc_auc'])
Line: 1/10
Line: 2/10
Line: 3/10
Line: 4/10
Line: 5/10
Line: 6/10
Line: 7/10
Line: 8/10
Line: 9/10
Line: 10/10
|   | feature_engineering | scaler | classifier | grid_roc_auc_mean | grid_roc_auc_std | grid_roc_auc_best_params | eval_roc_auc_mean | eval_roc_auc_std | eval_roc_auc_scores |
|---|---|---|---|---|---|---|---|---|---|
| 0 | VT | standard | LR | 0.98 | 0.0109544511501 | {'penalty': 'l1'} | 0.978 | 0.024 | [ 0.99 1. 1. 0.96 0.94] |
| 1 | VT | standard | SVC | 0.97 | 0.0289827534924 | {'kernel': 'sigmoid'} | 0.972 | 0.036551333765 | [ 1. 1. 1. 0.95 0.91] |
| 2 | VT | standard | SGD | 0.968 | 0.0203960780544 | {'l1_ratio': 0.3, 'penalty': 'elasticnet'} | 0.958 | 0.0213541565041 | [ 0.98 0.92 0.97 0.97 0.95] |
| 3 | VT | minmax | LR | 0.98 | 0.0141421356237 | {'penalty': 'l1'} | 0.978 | 0.0203960780544 | [ 0.96 1. 1. 0.98 0.95] |
| 4 | VT | minmax | SGD | 0.968 | 0.0193907194297 | {'l1_ratio': 0.2, 'penalty': 'elasticnet'} | 0.966 | 0.0422374241639 | [ 0.99 1. 1. 0.95 0.89] |
| 5 | PCA | standard | LR | 0.978 | 0.0116619037897 | {'penalty': 'l1'} | 0.982 | 0.0193907194297 | [ 1. 1. 0.99 0.95 0.97] |
| 6 | PCA | standard | SVC | 0.958 | 0.0263818119165 | {'kernel': 'sigmoid'} | 0.956 | 0.054258639865 | [ 1. 1. 1. 0.88 0.9 ] |
| 7 | PCA | standard | SGD | 0.918 | 0.0426145515053 | {'l1_ratio': 0.3, 'penalty': 'elasticnet'} | 0.94 | 0.0433589667774 | [ 0.98 0.96 0.97 0.86 0.93] |
| 8 | PCA | minmax | LR | 0.97 | 0.0352136337233 | {'penalty': 'l2'} | 0.936 | 0.0705974503789 | [ 1. 1. 0.97 0.82 0.89] |
| 9 | PCA | minmax | SGD | 0.946 | 0.032 | {'l1_ratio': 0.1, 'penalty': 'elasticnet'} | 0.934 | 0.0697423830967 | [ 1. 1. 0.97 0.84 0.86] |
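Because get_results returns an ordinary DataFrame, selecting the winning pipeline is one sort away. Sketched here with a few eval_roc_auc_mean values copied from the table above:

```python
import pandas as pd

# A few rows copied from the results table above
results = pd.DataFrame({
    'feature_engineering': ['VT', 'VT', 'PCA', 'PCA'],
    'scaler': ['standard', 'minmax', 'standard', 'minmax'],
    'classifier': ['LR', 'LR', 'LR', 'LR'],
    'eval_roc_auc_mean': [0.978, 0.978, 0.982, 0.936],
})

# Rank pipelines by evaluation score and take the top row
best = results.sort_values('eval_roc_auc_mean', ascending=False).iloc[0]
print(best['feature_engineering'], best['scaler'], best['classifier'])
```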
4. Caching Intermediate Steps¶
Reskit also allows you to cache intermediate calculations to avoid unnecessary recomputation.
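The idea behind step caching can be sketched in a few lines of plain Python: keep each step’s output keyed by the step name and recompute only on a cache miss. This is a hypothetical helper for illustration, not Reskit’s implementation:

```python
cache = {}
calls = {'count': 0}

def binarize(values, threshold=0.0):
    # Toy stand-in for a preprocessing step; counts how often it runs
    calls['count'] += 1
    return [1.0 if v > threshold else 0.0 for v in values]

def cached_transform(name, func, data):
    # Recompute only when the step's output is not cached yet
    if name not in cache:
        cache[name] = func(data)
    return cache[name]

X = [-0.34, 0.07, -0.10, 1.56]
a = cached_transform('binarizer', binarize, X)
b = cached_transform('binarizer', binarize, X)  # served from cache
print(a, calls['count'])  # the step ran only once
```

Pipelines that share a prefix of steps can then reuse the cached output instead of recomputing it.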
from sklearn.preprocessing import Binarizer
# Simple binarization step that we want to cache
binarizer = [('binarizer', Binarizer())]

# Reskit requires steps to be defined in this manner
steps = [('binarizer', binarizer),
         ('classifier', classifiers)]
pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipe.plan_table
|   | binarizer | classifier |
|---|-----------|------------|
| 0 | binarizer | LR         |
| 1 | binarizer | SVC        |
| 2 | binarizer | SGD        |
pipe.get_results(X, y, caching_steps=['binarizer'])
Line: 1/3
Line: 2/3
Line: 3/3
|   | binarizer | classifier | grid_accuracy_mean | grid_accuracy_std | grid_accuracy_best_params | eval_accuracy_mean | eval_accuracy_std | eval_accuracy_scores |
|---|---|---|---|---|---|---|---|---|
| 0 | binarizer | LR | 0.92 | 0.0244948974278 | {'penalty': 'l1'} | 0.92 | 0.0244948974278 | [ 0.95 0.9 0.95 0.9 0.9 ] |
| 1 | binarizer | SVC | 0.92 | 0.0244948974278 | {'kernel': 'rbf'} | 0.92 | 0.0244948974278 | [ 0.95 0.9 0.95 0.9 0.9 ] |
| 2 | binarizer | SGD | 0.85 | 0.0894427191 | {'l1_ratio': 0.2, 'penalty': 'elasticnet'} | 0.82 | 0.0812403840464 | [ 0.9 0.85 0.9 0.75 0.7 ] |
The last cached calculations are stored in _cached_X:
pipe._cached_X
OrderedDict([('init',
              array([[-0.34004591,  0.07223225, -0.10297704, ...,  1.55809216,
                      -1.84967225,  1.20716726],
                     [-0.61534739, -0.2666859 , -1.21834152, ..., -1.31814689,
                       0.97544639, -1.21321157],
                     [ 1.08934663,  0.12345205,  0.09360395, ..., -0.50379748,
                      -0.03416718,  1.51609726],
                     ...,
                     [-1.06428161, -0.22220536, -2.87462458, ..., -0.17236827,
                      -0.22141068,  2.76238087],
                     [ 0.40555432,  0.12063241,  1.1565546 , ...,  1.71135941,
                       0.29149897, -0.67978708],
                     [-0.47521282,  0.11614697,  0.45649735, ..., -0.15355913,
                       0.19643313,  0.67876913]])),
             ('binarizer',
              array([[ 0.,  1.,  0., ...,  1.,  0.,  1.],
                     [ 0.,  0.,  0., ...,  0.,  1.,  0.],
                     [ 1.,  1.,  1., ...,  0.,  0.,  1.],
                     ...,
                     [ 0.,  0.,  0., ...,  0.,  0.,  1.],
                     [ 1.,  1.,  1., ...,  1.,  1.,  0.],
                     [ 0.,  1.,  1., ...,  0.,  1.,  1.]]))])