reskit.core

Core classes.

class reskit.core.Pipeliner(steps, grid_cv, eval_cv, param_grid={}, banned_combos=[])

An object which allows you to test different data preprocessing pipelines and prediction models at once.

You specify a name for each preprocessing and prediction step, along with the candidate objects that can perform it. Pipeliner then combines these steps into distinct pipelines, excluding forbidden combinations, runs an experiment for each pipeline, and presents the results in a convenient CSV table. For example, for each pipeline’s classifier, Pipeliner runs a cross-validated grid search to find the best classifier parameters and reports the mean and standard deviation of the metric for each tested pipeline. Pipeliner can also cache intermediate calculations to avoid unnecessary recomputation.

Parameters:

steps : list of tuples

List of (step_name, transformers) tuples, where transformers is a list of (step_transformer_name, transformer) tuples. Pipeliner creates plan_table from these steps by enumerating every combination that takes one transformer from each step.
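The combination step can be pictured as a Cartesian product over the transformer names of each step. The following is a minimal sketch of that idea in plain Python, not reskit's actual implementation: build_plan is a hypothetical helper, and strings stand in for transformer objects.

```python
from itertools import product


def build_plan(steps):
    """Return one row per pipeline: a tuple of transformer names, one per step.

    Hypothetical helper for illustration only; not part of reskit.
    """
    # Keep only the names of the transformers available at each step.
    names_per_step = [[name for name, _ in transformers]
                      for _, transformers in steps]
    # Every row picks exactly one transformer from every step.
    return list(product(*names_per_step))


# Strings stand in for real transformer objects in this sketch.
steps = [
    ('Scaler', [('minmax', 'MinMaxScaler()'), ('standard', 'StandardScaler()')]),
    ('Classifier', [('LR', 'LogisticRegression()'), ('SVC', 'SVC()')]),
]
plan = build_plan(steps)
for row in plan:
    print(row)
```

With two scalers and two classifiers, the sketch yields four rows, one per candidate pipeline.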

eval_cv : int, cross-validation generator or an iterable, optional

Determines the evaluation cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross-validation,
  • an integer, to specify the number of folds in a (Stratified)KFold,
  • an object to be used as a cross-validation generator,
  • a list or iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

Refer to the scikit-learn User Guide for the various cross-validation strategies that can be used here.

grid_cv : int, cross-validation generator or an iterable, optional

Determines the grid search cross-validation splitting strategy. Possible inputs for cv are the same as for eval_cv.

param_grid : dict of dictionaries

Dictionary with classifier names (strings) as keys. Each key must be one of the classifier names used in steps, and its value is a dictionary of grid search parameters for that classifier.
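To make the shape of param_grid concrete, the sketch below expands each classifier's grid into the individual parameter settings a grid search would try. It is an illustration in plain Python under the assumption that every combination of listed values is a candidate; expand_grid is a hypothetical helper, not a reskit or scikit-learn function.

```python
from itertools import product


def expand_grid(grid):
    """List every parameter setting described by one classifier's grid.

    Hypothetical helper for illustration only.
    """
    keys = sorted(grid)
    # One candidate per combination of the listed values.
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


param_grid = {
    'LR': {'penalty': ['l1', 'l2']},
    'SVC': {'C': [1, 10], 'kernel': ['linear', 'rbf']},
}
candidates = {name: expand_grid(grid) for name, grid in param_grid.items()}
for name, settings in candidates.items():
    print(name, len(settings), settings)
```

Here 'LR' has two candidate settings and 'SVC' has four, since its two parameters each list two values.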

banned_combos : list of tuples

List of (transformer_name_1, transformer_name_2) tuples. Each row with both transformers will be removed from plan_table.
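The filtering rule can be sketched as follows. This is a plain-Python illustration under the assumption that a row is represented as a tuple of transformer names; drop_banned is a hypothetical helper, not reskit's code.

```python
def drop_banned(rows, banned_combos):
    """Keep only the rows that do not contain both names of any banned pair.

    Hypothetical helper for illustration only.
    """
    return [row for row in rows
            if not any(first in row and second in row
                       for first, second in banned_combos)]


rows = [('minmax', 'LR'), ('minmax', 'SVC'),
        ('standard', 'LR'), ('standard', 'SVC')]
kept = drop_banned(rows, banned_combos=[('minmax', 'SVC')])
print(kept)
```

Only the row that contains both 'minmax' and 'SVC' is dropped; rows containing just one of the two names are kept.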

Examples

>>> from sklearn.datasets import make_classification
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn.model_selection import StratifiedKFold
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.svm import SVC
>>> from reskit.core import Pipeliner
>>> X, y = make_classification()
>>> scalers = [('minmax', MinMaxScaler()), ('standard', StandardScaler())]
>>> # liblinear supports both the 'l1' and 'l2' penalties searched below
>>> classifiers = [('LR', LogisticRegression(solver='liblinear')), ('SVC', SVC())]
>>> steps = [('Scaler', scalers), ('Classifier', classifiers)]
>>> grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
>>> eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
>>> param_grid = {'LR': {'penalty': ['l1', 'l2']},
...               'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}}
>>> pipe = Pipeliner(steps, eval_cv=eval_cv, grid_cv=grid_cv, param_grid=param_grid)
>>> pipe.get_results(X=X, y=y, scoring=['roc_auc'])

Attributes

plan_table : pandas DataFrame

Plan of pipeline evaluation. Created from steps.

named_steps : dict of dictionaries

Dictionary with step names as keys. Each value is a dictionary that maps the transformer names from steps to the transformer objects, so you can retrieve any transformer object from it.

get_grid_search_results(X, y, row_keys, scoring)

Runs a grid search for the pipeline created from row_keys, using the given scoring.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

row_keys : list of strings

List of transformer names. Pipeliner looks up the transformers in named_steps using the keys in row_keys and creates the pipeline from them.

scoring : string, callable or None, default=None

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.

Returns:

results : dict

Dictionary with the keys 'grid_{}_mean', 'grid_{}_std' and 'grid_{}_best_params', where {} is replaced by the name of the corresponding scoring.
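As a small illustration of the key naming, the snippet below fills the three templates for one scoring name. It only demonstrates the string pattern described above; the variable names are chosen for this example.

```python
# Key templates as given in the description of the returned dictionary.
templates = ('grid_{}_mean', 'grid_{}_std', 'grid_{}_best_params')

scoring = 'roc_auc'
keys = [template.format(scoring) for template in templates]
print(keys)
```

For scoring='roc_auc' the keys are 'grid_roc_auc_mean', 'grid_roc_auc_std' and 'grid_roc_auc_best_params'.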

get_results(X, y=None, caching_steps=[], scoring='accuracy', logs_file='results.log', collect_n=None)

Returns a DataFrame with the results for the defined pipelines.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

caching_steps : list of strings

Steps that are not recalculated for each new pipeline. If the previous pipeline contains the same steps, Pipeliner reuses their results and continues from that point.
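The idea behind caching can be sketched as memoising the output of each shared prefix of steps. The code below is a simplified illustration under stated assumptions, not reskit's implementation: run_cached, make_step and the toy transformers are hypothetical, the cache is a plain dictionary keyed by the sequence of step names, and the same input data is assumed for every pipeline.

```python
calls = []  # records which transformers actually ran


def make_step(name, func):
    """Wrap func so that every real execution is recorded in calls."""
    def step(data):
        calls.append(name)
        return func(data)
    return step


# Toy transformers standing in for real preprocessing objects.
transformers = {
    'double': make_step('double', lambda xs: [2 * x for x in xs]),
    'plus_one': make_step('plus_one', lambda xs: [x + 1 for x in xs]),
    'negate': make_step('negate', lambda xs: [-x for x in xs]),
}


def run_cached(X, row_keys, cache):
    """Apply the steps named in row_keys, reusing any cached prefix."""
    data = X
    for i, key in enumerate(row_keys):
        prefix = tuple(row_keys[:i + 1])
        if prefix not in cache:
            # Only steps whose prefix was never seen are executed.
            cache[prefix] = transformers[key](data)
        data = cache[prefix]
    return data


cache = {}
first = run_cached([1, 2], ['double', 'plus_one'], cache)
second = run_cached([1, 2], ['double', 'negate'], cache)
print(first, second, calls)
```

The second pipeline shares its first step with the first one, so 'double' runs only once and only 'negate' is computed anew.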

scoring : string, callable or None, default='accuracy'

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.

logs_file : string

File name where logs will be saved.

collect_n : int

If not None, scores are calculated as follows: each collected score is the average of the scores from one cross-validation run. The only thing that changes between runs is random_state, which is shifted for each run.
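One possible reading of this behaviour is sketched below. It assumes that collect_n is the number of repeated cross-validation runs and that random_state is shifted by incrementing it by one per run; neither detail is confirmed by this page. cv_scores is a hypothetical stand-in that produces synthetic per-fold scores instead of fitting a model.

```python
import random
from statistics import mean


def cv_scores(random_state, n_folds=5):
    """Stand-in for one cross-validation run: one synthetic score per fold."""
    rng = random.Random(random_state)
    return [0.8 + rng.uniform(-0.05, 0.05) for _ in range(n_folds)]


def collect_scores(collect_n, base_random_state=0):
    """Return one averaged score per run; only random_state differs between runs.

    Assumption: the shift is an increment of one per run.
    """
    return [mean(cv_scores(base_random_state + shift))
            for shift in range(collect_n)]


scores = collect_scores(collect_n=3)
print(scores)
```

Each element of scores summarises one full cross-validation run, so the spread of the list reflects sensitivity to the random split rather than to individual folds.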

Returns:

results : DataFrame

DataFrame with the results for all pipelines.

get_scores(X, y, row_keys, scoring, collect_n=None)

Returns the prediction scores obtained on cross-validation.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

row_keys : list of strings

List of transformer names. Pipeliner looks up the transformers in named_steps using the keys in row_keys and creates the pipeline from them.

scoring : string, callable or None, default=None

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.

collect_n : int, optional, default: None

If not None, each returned score is the average of the scores from one cross-validation run, and random_state is shifted between runs.

Returns:

scores : array-like

Scores calculated on cross-validation.

transform_with_caching(X, y, row_keys)

Transforms X with caching.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

row_keys : list of strings

List of transformer names. Pipeliner looks up the transformers in named_steps using the keys in row_keys and creates the pipeline used for the transformation.

Returns:

transformed_data : (X, y) tuple, where X and y are array-like

The data transformed by the pipeline created from row_keys, returned as an (X, y) tuple.

class reskit.core.MatrixTransformer(func, **params)

Helps you add your own transformation in the form of an ordinary function.

Parameters:

func : function

A function that transforms input data.

params : dict

Parameters for the function.
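The pattern of wrapping a function as a transformer can be sketched as below. This is a hypothetical stand-in written in plain Python, not reskit's class: it assumes the function is applied separately to each 2D matrix of a 3D input, which the 3D-array requirement of fit suggests but this page does not state. PerMatrixTransformer and scale are names invented for the example.

```python
class PerMatrixTransformer:
    """Hypothetical stand-in that applies func to each matrix in X."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params  # keyword arguments forwarded to func

    def fit(self, X, y=None, **fit_params):
        # Nothing is learned from the data; fit exists for pipeline compatibility.
        return self

    def transform(self, X, y=None):
        # Assumption: X is a sequence of matrices and func handles one matrix.
        return [self.func(matrix, **self.params) for matrix in X]


def scale(matrix, factor=1):
    """Example function: multiply every entry of one matrix by factor."""
    return [[factor * value for value in row] for row in matrix]


X = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]
out = PerMatrixTransformer(scale, factor=2).fit(X).transform(X)
print(out)
```

Keeping the parameters as keyword arguments lets the same function be reused with different settings as separate transformers in a step.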

fit(X, y=None, **fit_params)

Fits the data.

Parameters:

X : array-like

The data to fit. Should be a 3D array.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

transform(X, y=None)

Transforms the data according to the function you set.

Parameters:

X : array-like

The data to transform. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

class reskit.core.DataTransformer(func, **params)

Helps you add your own transformation in the form of an ordinary function.

Parameters:

func : function

A function that transforms input data.

params : dict

Parameters for the function.
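For comparison, a transformer that hands the whole data object to the function in one call might look like the sketch below. It is a hypothetical stand-in, not reskit's class, and assumes the function receives the entire input (here a dictionary) together with the stored parameters. WholeDataTransformer and select_key are names invented for the example.

```python
class WholeDataTransformer:
    """Hypothetical stand-in that passes the entire input to func at once."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params  # keyword arguments forwarded to func

    def fit(self, X, y=None, **fit_params):
        # Nothing is learned from the data; fit exists for pipeline compatibility.
        return self

    def transform(self, X, y=None):
        # Assumption: func takes the whole data object, for example a dictionary.
        return self.func(X, **self.params)


def select_key(data, key):
    """Example function: pick one entry out of a data dictionary."""
    return data[key]


data = {'matrices': [[1, 2], [3, 4]], 'labels': [0, 1]}
X = WholeDataTransformer(select_key, key='matrices').fit(data).transform(data)
print(X)
```

A wrapper of this kind suits steps that reshape or select from a container, where there is no per-sample structure to iterate over.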

fit(X, y=None, **fit_params)

Fits the data.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

transform(X, y=None)

Transforms the data according to the function you set.

Parameters:

X : array-like

The data to transform. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.