`reskit.core`

Core classes.

class reskit.core.Pipeliner(steps, grid_cv, eval_cv, param_grid={}, banned_combos=[])

An object which allows you to test different data preprocessing pipelines and prediction models at once.

You specify a name for each preprocessing and prediction step, together with the candidate objects that can perform it. Pipeliner then combines these steps into different pipelines, excluding forbidden combinations, runs the experiments, and presents the results in a convenient CSV table. For example, for each pipeline’s classifier, Pipeliner runs a cross-validated grid search to find the best classifier parameters and reports the mean and standard deviation of the metric for each tested pipeline. Pipeliner can also cache intermediate calculations to avoid unnecessary recomputation.

Parameters:

steps : list of tuples

List of (step_name, transformers) tuples, where transformers is a list of (step_transformer_name, transformer) tuples. Pipeliner creates plan_table from these steps by forming every possible combination of transformers, with one transformer chosen per step.
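The combination rule can be pictured with a small standard-library sketch. It is not Pipeliner's actual implementation: the transformer objects are replaced by placeholder strings, and `plan_rows` is a hypothetical name used only for illustration.

```python
from itertools import product

# Placeholder strings stand in for real transformer objects.
scalers = [('minmax', 'MinMaxScaler()'), ('standard', 'StandardScaler()')]
classifiers = [('LR', 'LogisticRegression()'), ('SVC', 'SVC()')]
steps = [('Scaler', scalers), ('Classifier', classifiers)]

step_names = [step_name for step_name, _ in steps]
# For every step keep only the transformer names, then take the Cartesian product.
name_choices = [[name for name, _ in transformers] for _, transformers in steps]
plan_rows = [dict(zip(step_names, combo)) for combo in product(*name_choices)]

for row in plan_rows:
    print(row)
```

Two scalers and two classifiers give four rows, one per candidate pipeline.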

eval_cv : int, cross-validation generator or an iterable, optional

Determines the evaluation cross-validation splitting strategy. Possible inputs for eval_cv are:

None, to use the default 3-fold cross validation,

integer, to specify the number of folds in a (Stratified)KFold,

An object to be used as cross-validation generator.

A list or iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

Refer to the scikit-learn User Guide for the various cross-validation strategies that can be used here.
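The last option, an iterable yielding train, test splits, can be as simple as a list of index pairs. The sketch below builds such a list with the standard library only. `simple_kfold_splits` is a hypothetical helper, not part of reskit or scikit-learn, and it does not stratify by class.

```python
def simple_kfold_splits(n_samples, n_splits):
    """Return a list of (train_indices, test_indices) pairs (hypothetical helper)."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    splits, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits

splits = simple_kfold_splits(n_samples=10, n_splits=3)
for train, test in splits:
    print(len(train), test)
```

Every sample index appears in exactly one test fold, which is the property a cross-validation split list is expected to have.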

grid_cv : int, cross-validation generator or an iterable, optional

Determines the grid search cross-validation splitting strategy. Possible inputs are the same as for eval_cv.

param_grid : dict of dictionaries

Dictionary with classifier names (strings) as keys. The keys must match classifier names used in steps. Each value is the dictionary of grid search parameters for that classifier.
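To see how many candidates a grid implies, the per-classifier dictionaries can be expanded by hand. This is a plain-Python sketch of the usual grid expansion (every combination of the listed values). `expand_grid` is a hypothetical helper rather than a reskit function.

```python
from itertools import product

param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}}

def expand_grid(grid):
    """List every parameter combination of one classifier's grid (hypothetical helper)."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

for classifier_name, grid in param_grid.items():
    candidates = expand_grid(grid)
    print(classifier_name, len(candidates), candidates[0])
```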

banned_combos : list of tuples

List of (transformer_name_1, transformer_name_2) tuples. Each row that contains both transformers is removed from plan_table.
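The filtering rule can be sketched on a list of plan rows. The rows, the banned pair, and the `is_banned` helper are illustrative assumptions. The real plan_table is a pandas DataFrame, which this snippet replaces with plain dictionaries.

```python
plan_rows = [
    {'Scaler': 'minmax', 'Classifier': 'LR'},
    {'Scaler': 'minmax', 'Classifier': 'SVC'},
    {'Scaler': 'standard', 'Classifier': 'LR'},
    {'Scaler': 'standard', 'Classifier': 'SVC'},
]
banned_combos = [('minmax', 'SVC')]

def is_banned(row, banned):
    """True if the row uses both transformers of any banned pair."""
    used = set(row.values())
    return any(first in used and second in used for first, second in banned)

allowed_rows = [row for row in plan_rows if not is_banned(row, banned_combos)]
print(allowed_rows)
```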

Examples

>>> from sklearn.datasets import make_classification
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn.model_selection import StratifiedKFold
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.svm import SVC
>>> from reskit.core import Pipeliner

>>> X, y = make_classification()

>>> scalers = [('minmax', MinMaxScaler()), ('standard', StandardScaler())]
>>> # liblinear supports both the 'l1' and 'l2' penalties searched below
>>> classifiers = [('LR', LogisticRegression(solver='liblinear')), ('SVC', SVC())]
>>> steps = [('Scaler', scalers), ('Classifier', classifiers)]

>>> grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
>>> eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

>>> param_grid = {'LR' : {'penalty' : ['l1', 'l2']},
...               'SVC' : {'kernel' : ['linear', 'poly', 'rbf', 'sigmoid']}}

>>> pipe = Pipeliner(steps, eval_cv=eval_cv, grid_cv=grid_cv, param_grid=param_grid)
>>> pipe.get_results(X=X, y=y, scoring=['roc_auc'])

Attributes

plan_table : pandas DataFrame

Plan of pipelines evaluation. Created from `steps`.

named_steps : dict of dictionaries

Dictionary with step names as keys. Each value is a dictionary with the transformer names from `steps` as keys, so you can retrieve any transformer object from it.

get_grid_search_results(X, y, row_keys, scoring)

Runs a grid search for the pipeline created from row_keys, using the given scoring.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

row_keys : list of strings

List of transformer names. Pipeliner looks up the transformers in named_steps using the keys from row_keys and builds the pipeline from them.
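The lookup described for row_keys can be sketched with plain dictionaries. The snippet assumes that row_keys lists one transformer name per step, in step order. The placeholder strings and the `build_pipeline_steps` helper are hypothetical and only illustrate the idea.

```python
# named_steps maps step name -> {transformer name -> transformer object}.
# Insertion order of the outer dict is taken as the step order (assumption).
named_steps = {
    'Scaler': {'minmax': 'MinMaxScaler()', 'standard': 'StandardScaler()'},
    'Classifier': {'LR': 'LogisticRegression()', 'SVC': 'SVC()'},
}

def build_pipeline_steps(named_steps, row_keys):
    """Pair each step with the transformer selected by the matching row key."""
    return [(key, transformers[key])
            for transformers, key in zip(named_steps.values(), row_keys)]

pipeline_steps = build_pipeline_steps(named_steps, ['standard', 'SVC'])
print(pipeline_steps)
```

A list of (name, transformer) pairs in this shape is what scikit-learn's `Pipeline` constructor takes as its `steps` argument.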

scoring : string, callable or None

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.

Returns:

results : dict

Dictionary with the keys ‘grid_{}_mean’, ‘grid_{}_std’ and ‘grid_{}_best_params’, where {} is replaced by the name of the corresponding scoring.
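The key layout can be illustrated as follows. The fold scores and best parameters are made-up values used only to show how the keys are formed, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

scoring = 'roc_auc'
# Made-up fold scores and parameters, used only to show the key layout.
fold_scores = [0.90, 0.85, 0.95]
best_params = {'kernel': 'rbf'}

results = {
    'grid_{}_mean'.format(scoring): mean(fold_scores),
    'grid_{}_std'.format(scoring): pstdev(fold_scores),
    'grid_{}_best_params'.format(scoring): best_params,
}
print(sorted(results))
```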

get_results(X, y=None, caching_steps=[], scoring='accuracy', logs_file='results.log', collect_n=None)

Returns a DataFrame with the results for all defined pipelines.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

caching_steps : list of strings

Steps that are not recalculated for each new pipeline. If the previous pipeline contains the same steps, Pipeliner reuses their cached output and continues from the first step that differs.
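The idea behind caching can be sketched with a dictionary keyed by the sequence of transformer names applied so far. This is a simplified illustration under the assumption that every pipeline receives the same input data. The names `transformers`, `cache` and `transform_with_cache` are hypothetical and do not describe reskit's internals.

```python
call_log = []  # records which transformations were actually computed

def make_step(name, func):
    """Wrap func so that every real computation is recorded in call_log."""
    def step(data):
        call_log.append(name)
        return func(data)
    return step

transformers = {
    'double': make_step('double', lambda xs: [2 * x for x in xs]),
    'shift': make_step('shift', lambda xs: [x + 1 for x in xs]),
    'negate': make_step('negate', lambda xs: [-x for x in xs]),
}

cache = {}  # maps a prefix of row_keys to the data produced by that prefix

def transform_with_cache(data, row_keys):
    """Apply the named transformers in order, reusing cached prefixes."""
    for end in range(1, len(row_keys) + 1):
        prefix = tuple(row_keys[:end])
        if prefix not in cache:
            cache[prefix] = transformers[row_keys[end - 1]](data)
        data = cache[prefix]
    return data

first = transform_with_cache([1, 2, 3], ['double', 'shift'])
second = transform_with_cache([1, 2, 3], ['double', 'negate'])
print(first, second, call_log)
```

The second call shares the `'double'` prefix with the first, so that step is computed only once.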

scoring : string, callable or None, default='accuracy'

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.

logs_file : string

File name where logs will be saved.

collect_n : int

If not None, scores are calculated as follows: cross-validation is repeated collect_n times, and each collected score is the average of the cross-validation scores from one repetition. The only thing that changes between repetitions is random_state, which is shifted each time.
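A toy version of this procedure, written with the standard library only, might look as follows. The majority-class "model", the way folds are formed, and the shift of random_state by one per repetition are all assumptions made for illustration. `cv_scores` is a hypothetical helper.

```python
import random
from statistics import mean

y = [0, 1] * 10  # toy labels, 20 samples

def cv_scores(y, n_splits, random_state):
    """Accuracy of a majority-class predictor on each shuffled fold (toy stand-in)."""
    indices = list(range(len(y)))
    random.Random(random_state).shuffle(indices)
    scores = []
    for fold in range(n_splits):
        test = indices[fold::n_splits]
        test_set = set(test)
        train_labels = [y[i] for i in indices if i not in test_set]
        majority = max(set(train_labels), key=train_labels.count)
        scores.append(mean(1.0 if y[i] == majority else 0.0 for i in test))
    return scores

collect_n = 3
initial_random_state = 0
# One averaged score per repetition; only random_state changes between repetitions.
collected = [mean(cv_scores(y, n_splits=5, random_state=initial_random_state + i))
             for i in range(collect_n)]
print(collected)
```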

Returns:

results : DataFrame

DataFrame with the results for all pipelines.

get_scores(X, y, row_keys, scoring, collect_n=None)

Returns the cross-validation scores for the pipeline created from row_keys.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

row_keys : list of strings

List of transformer names. Pipeliner looks up the transformers in named_steps using the keys from row_keys and builds the pipeline from them.

scoring : string, callable or None

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.

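collect_n : int, optional, default: None

Same meaning as collect_n in get_results: if not None, each collected score is the average of the cross-validation scores from one repetition, with random_state shifted between repetitions.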

Returns:

scores : array-like

Scores calculated on cross-validation.

transform_with_caching(X, y, row_keys)

Transforms X with caching.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

row_keys : list of strings

List of transformer names. Pipeliner looks up the transformers in named_steps using the keys from row_keys and builds the pipeline from them.

Returns:

transformed_data : (X, y) tuple, where X and y are array-like

The data transformed by the pipeline created from row_keys, returned as an (X, y) tuple.

class reskit.core.MatrixTransformer(func, **params)

Helps you add your own transformation through ordinary functions.

Parameters:

func : function

A function that transforms input data.

params : dict

Parameters for the function.
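One way to read this, assuming the function is applied to each matrix of the input separately (the fit method expects a 3D array, that is, a stack of matrices), is the following sketch. `FunctionPerMatrix` and `scale_matrix` are hypothetical names, and this is not the reskit implementation.

```python
class FunctionPerMatrix:
    """Hypothetical stand-in: applies func(matrix, **params) to every matrix in X."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params

    def fit(self, X, y=None, **fit_params):
        # Nothing is learned from the data in this sketch; fit returns the object itself.
        return self

    def transform(self, X, y=None):
        return [self.func(matrix, **self.params) for matrix in X]

def scale_matrix(matrix, factor=1.0):
    """Multiply every entry of a nested-list matrix by factor."""
    return [[factor * value for value in row] for row in matrix]

X = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]  # two 2x2 matrices
scaled = FunctionPerMatrix(scale_matrix, factor=0.5).fit(X).transform(X)
print(scaled)
```

Keyword arguments given at construction time (`factor=0.5` here) play the role of `params` and are forwarded to the function on every call.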

fit(X, y=None, **fit_params)

Fits the data.

Parameters:

X : array-like

The data to fit. Should be a 3D array.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

transform(X, y=None)

Transforms the data according to the function you set.

Parameters:

X : array-like

The data to transform. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

class reskit.core.DataTransformer(func, **params)

Helps you add your own transformation through ordinary functions.

Parameters:

func : function

A function that transforms input data.

params : dict

Parameters for the function.
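A wrapper that hands the whole input object to the function might look as follows. That DataTransformer works this way is an assumption made for illustration (the accepted inputs include dictionaries). `FunctionOnData` and `add_row_sums` are hypothetical names.

```python
class FunctionOnData:
    """Hypothetical stand-in: applies func(data, **params) to the whole input."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None):
        # A shallow copy keeps the caller's dictionary unchanged.
        return self.func(dict(X), **self.params)

def add_row_sums(data, source='matrix', target='row_sums'):
    """Store the row sums of data[source] under data[target]."""
    data[target] = [sum(row) for row in data[source]]
    return data

data = {'matrix': [[1, 2], [3, 4]]}
result = FunctionOnData(add_row_sums, target='sums').fit(data).transform(data)
print(result)
```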

fit(X, y=None, **fit_params)

Fits the data.

Parameters:

X : array-like

The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.

transform(X, y=None)

Transforms the data according to the function you set.

Parameters:

X : array-like

The data to transform. Can be, for example, a list, an array with at least two dimensions, or a dictionary.

y : array-like, optional, default: None

The target variable to try to predict in the case of supervised learning.
