reskit.core
Core classes.
class reskit.core.Pipeliner(steps, grid_cv, eval_cv, param_grid={}, banned_combos=[])
An object that lets you test different data preprocessing pipelines and prediction models at once.
You specify a name for each preprocessing and prediction step, along with the candidate objects that can perform it. Pipeliner then combines these steps into different pipelines, excluding forbidden combinations, runs the experiments, and presents the results in a convenient CSV table. For example, for each pipeline's classifier, Pipeliner runs a cross-validated grid search to find the best classifier parameters and reports the mean and standard deviation of the metric for each tested pipeline. Pipeliner can also cache interim calculations to avoid unnecessary recalculation.
Parameters:
steps : list of tuples
    List of (step_name, transformers) tuples, where transformers is a list of (step_transformer_name, transformer) tuples. Pipeliner creates plan_table from these steps by combining all possible combinations of transformers, switching transformers on each step.
eval_cv : int, cross-validation generator or an iterable, optional
    Determines the evaluation cross-validation splitting strategy. Possible inputs for cv are:
    - None, to use the default 3-fold cross-validation,
    - an integer, to specify the number of folds in a (Stratified)KFold,
    - an object to be used as a cross-validation generator,
    - a list or iterable yielding train, test splits.
    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
    Refer to the scikit-learn User Guide for the various cross-validation strategies that can be used here.
grid_cv : int, cross-validation generator or an iterable, optional
    Determines the grid search cross-validation splitting strategy. Possible inputs for cv are the same as for eval_cv.
param_grid : dict of dictionaries
    Dictionary with classifier names (strings) as keys. The keys must be classifier names that appear in steps. Each key maps to the grid search parameters for that classifier.
banned_combos : list of tuples
    List of (transformer_name_1, transformer_name_2) tuples. Each row that contains both transformers is removed from plan_table.
Examples
>>> from sklearn.datasets import make_classification
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn.model_selection import StratifiedKFold
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.svm import SVC
>>> from reskit.core import Pipeliner

>>> X, y = make_classification()

>>> scalers = [('minmax', MinMaxScaler()), ('standard', StandardScaler())]
>>> # 'liblinear' supports both the 'l1' and 'l2' penalties searched below
>>> classifiers = [('LR', LogisticRegression(solver='liblinear')), ('SVC', SVC())]
>>> steps = [('Scaler', scalers), ('Classifier', classifiers)]

>>> grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
>>> eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

>>> param_grid = {'LR': {'penalty': ['l1', 'l2']},
...               'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}}

>>> pipe = Pipeliner(steps, eval_cv=eval_cv, grid_cv=grid_cv, param_grid=param_grid)
>>> pipe.get_results(X=X, y=y, scoring=['roc_auc'])
Attributes
plan_table : pandas DataFrame
    Plan of pipeline evaluation. Created from steps.
named_steps : dict of dictionaries
    Dictionary with step names as keys. Each key maps to a dictionary whose keys are the transformer names from steps. You can get any transformer object from this dictionary.
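The way steps and banned_combos translate into rows of plan_table can be sketched in plain Python. This is only an illustration of the combination rule described above, not reskit's implementation: the helper name build_plan is hypothetical, the transformer objects are replaced by None, and the plan is a list of dicts rather than a DataFrame.

```python
from itertools import product


def build_plan(steps, banned_combos=()):
    """Return one dict per pipeline: {step_name: transformer_name}."""
    step_names = [step_name for step_name, _ in steps]
    # Only the transformer names matter for the plan, not the objects.
    name_lists = [[t_name for t_name, _ in transformers]
                  for _, transformers in steps]
    plan = []
    for combo in product(*name_lists):
        # Drop a row when it contains both members of a banned pair.
        if any(a in combo and b in combo for a, b in banned_combos):
            continue
        plan.append(dict(zip(step_names, combo)))
    return plan


# Placeholder steps: None stands in for real transformer objects.
steps = [
    ("Scaler", [("minmax", None), ("standard", None)]),
    ("Classifier", [("LR", None), ("SVC", None)]),
]
plan = build_plan(steps, banned_combos=[("minmax", "SVC")])
for row in plan:
    print(row)
```

With two scalers and two classifiers there are four candidate rows; banning the pair ('minmax', 'SVC') leaves three.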
get_grid_search_results(X, y, row_keys, scoring)
Runs a grid search for the pipeline created from row_keys, using the given scoring.
Parameters:
X : array-like
    The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
row_keys : list of strings
    List of transformer names. Pipeliner takes the transformers from named_steps using the keys in row_keys and creates the pipeline from them.
scoring : string, callable or None, default=None
    A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.
Returns:
results : dict
    Dictionary with keys 'grid_{}_mean', 'grid_{}_std' and 'grid_{}_best_params'. The {} placeholder in each key is replaced by the corresponding scoring name.
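To make the shape of the returned dictionary concrete, here is a minimal sketch that expands a param_grid entry into candidate settings and fills in the three documented key patterns. The helper names expand_grid and summarize are hypothetical, the fold scores and the choice of "best" parameters are made-up placeholders, and the use of the population standard deviation is an assumption; the real method runs an actual grid search.

```python
from itertools import product
from statistics import mean, pstdev


def expand_grid(grid):
    """List every parameter setting in a {param_name: [values]} grid."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def summarize(scoring, fold_scores, best_params):
    """Build a result dict with the three documented key patterns."""
    return {
        "grid_{}_mean".format(scoring): mean(fold_scores),
        # Population standard deviation is an assumption of this sketch.
        "grid_{}_std".format(scoring): pstdev(fold_scores),
        "grid_{}_best_params".format(scoring): best_params,
    }


# param_grid is keyed by classifier name, as in the Pipeliner constructor.
param_grid = {"SVC": {"kernel": ["linear", "rbf"], "C": [1, 10]}}
candidates = expand_grid(param_grid["SVC"])

# Placeholder fold scores; candidates[0] stands in for the best setting.
results = summarize("roc_auc", [0.8, 0.9, 1.0], candidates[0])
print(sorted(results))
```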
get_results(X, y=None, caching_steps=[], scoring='accuracy', logs_file='results.log', collect_n=None)
Returns a DataFrame of results for the defined pipelines.
Parameters:
X : array-like
    The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
caching_steps : list of strings
    Steps that are not recalculated for each new pipeline. If the previous pipeline contains the same steps, Pipeliner starts from this step.
scoring : string, callable or None, default='accuracy'
    A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.
logs_file : string
    Name of the file where logs are saved.
collect_n : int
    If not None, scores are calculated as follows: each score is the average of the cross-validation scores from one run, and the only thing that changes from one score to the next is random_state, which is shifted.
Returns:
results : DataFrame
    DataFrame with all results for the pipelines.
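The collect_n description can be read as "repeat the evaluation cross-validation collect_n times, shifting random_state each time, and keep the mean score of each run". The sketch below follows that reading using scikit-learn directly. The helper name collect_scores, the shift of one per run, and the default arguments are assumptions made for illustration; this is not reskit's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def collect_scores(estimator, X, y, collect_n, n_splits=5, random_state=0,
                   scoring="accuracy"):
    """Return collect_n scores, one per shifted random_state (illustrative)."""
    scores = []
    for shift in range(collect_n):
        # Same splitter settings each time; only random_state changes.
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                             random_state=random_state + shift)
        fold_scores = cross_val_score(estimator, X, y, cv=cv, scoring=scoring)
        # Each collected score is the average over the folds of one run.
        scores.append(fold_scores.mean())
    return np.array(scores)


X, y = make_classification(random_state=0)
scores = collect_scores(LogisticRegression(), X, y, collect_n=3)
print(scores.shape)
```

Repeating the split with different seeds gives a rough picture of how much the reported score depends on the particular fold assignment.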
get_scores(X, y, row_keys, scoring, collect_n=None)
Gives scores for prediction on cross-validation.
Parameters:
X : array-like
    The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
row_keys : list of strings
    List of transformer names. Pipeliner takes the transformers from named_steps using the keys in row_keys and creates the pipeline from them.
scoring : string, callable or None, default=None
    A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.
collect_n : list of strings
    List of keys from data dictionary you want to collect and create feature vectors.
Returns:
scores : array-like
    Scores calculated on cross-validation.
transform_with_caching(X, y, row_keys)
Transforms X with caching.
Parameters:
X : array-like
    The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
row_keys : list of strings
    List of transformer names. Pipeliner takes the transformers from named_steps using the keys in row_keys and creates the pipeline used for the transformation.
Returns:
transformed_data : (X, y) tuple, where X and y are array-like
    The data transformed by the pipeline created from row_keys, returned as an (X, y) tuple.
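One way to picture caching across pipelines is to store the output of each prefix of row_keys, so that a later pipeline sharing the same leading steps skips them. The sketch below illustrates that idea with plain functions. The helper cached_transform and its cache layout are hypothetical and are not taken from reskit, and the cache is only valid while the same input data is used.

```python
def cached_transform(X, row_keys, named_funcs, cache):
    """Apply named_funcs[key] in order, reusing results for shared prefixes."""
    data = X
    for i, key in enumerate(row_keys):
        prefix = tuple(row_keys[: i + 1])
        if prefix not in cache:
            # First time this sequence of steps is seen: compute and store.
            cache[prefix] = named_funcs[key](data)
        data = cache[prefix]
    return data


calls = []  # Records which steps actually ran.


def make_step(name, func):
    def step(data):
        calls.append(name)
        return func(data)
    return step


named_funcs = {
    "double": make_step("double", lambda xs: [2 * x for x in xs]),
    "inc": make_step("inc", lambda xs: [x + 1 for x in xs]),
    "neg": make_step("neg", lambda xs: [-x for x in xs]),
}

cache = {}
first = cached_transform([1, 2, 3], ["double", "inc"], named_funcs, cache)
second = cached_transform([1, 2, 3], ["double", "neg"], named_funcs, cache)
print(first, second, calls)
```

The second call reuses the cached output of the "double" step, so that step runs only once across both pipelines.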
class reskit.core.MatrixTransformer(func, **params)
Helps you add your own transformation through ordinary functions.
Parameters:
func : function
    A function that transforms input data.
params : dict
    Parameters for the function.

fit(X, y=None, **fit_params)
Fits the data.
Parameters:
X : array-like
    The data to fit. Should be a 3D array.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.

transform(X, y=None)
Transforms the data according to the function you set.
Parameters:
X : array-like
    The data to transform. Can be, for example, a list, an array with at least two dimensions, or a dictionary.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
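The note that fit expects a 3D array suggests the wrapped function is applied to each 2D matrix in turn. Assuming that reading, a function-wrapping transformer could look like the sketch below. PerMatrixWrapper and threshold are hypothetical names used for illustration only; this is not reskit's MatrixTransformer, and nested lists stand in for arrays.

```python
class PerMatrixWrapper:
    """Hypothetical sketch of a transformer that maps func over matrices."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn: the wrapped function is stateless.
        return self

    def transform(self, X, y=None):
        # Assumption: X is a sequence of 2D matrices (a 3D structure).
        return [self.func(matrix, **self.params) for matrix in X]


def threshold(matrix, cutoff=0.5):
    """Zero out every entry below cutoff."""
    return [[v if v >= cutoff else 0.0 for v in row] for row in matrix]


X = [
    [[0.1, 0.9], [0.6, 0.2]],
    [[0.7, 0.3], [0.4, 0.8]],
]
out = PerMatrixWrapper(threshold, cutoff=0.5).fit(X).transform(X)
print(out)
```

Keyword arguments given at construction time are forwarded to the function on every call, which mirrors the (func, **params) signature.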
class reskit.core.DataTransformer(func, **params)
Helps you add your own transformation through ordinary functions.
Parameters:
func : function
    A function that transforms input data.
params : dict
    Parameters for the function.

fit(X, y=None, **fit_params)
Fits the data.
Parameters:
X : array-like
    The data to fit. Can be, for example, a list, an array with at least two dimensions, or a dictionary.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.

transform(X, y=None)
Transforms the data according to the function you set.
Parameters:
X : array-like
    The data to transform. Can be, for example, a list, an array with at least two dimensions, or a dictionary.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
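Because X may be a dictionary here, a natural use is a function that receives the whole dataset at once, for instance to pull one entry out of it. The sketch below assumes that the function is called once on the full input; WholeDataWrapper and select_key are hypothetical names, and this is not reskit's DataTransformer.

```python
class WholeDataWrapper:
    """Hypothetical sketch: calls func once on the entire input."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params

    def fit(self, X, y=None, **fit_params):
        # Stateless: fitting only returns the object for chaining.
        return self

    def transform(self, X, y=None):
        # Assumption: func receives the full dataset, not one item at a time.
        return self.func(X, **self.params)


def select_key(data, key):
    """Pull one entry out of a dictionary-shaped dataset."""
    return data[key]


data = {"matrices": [[1, 2], [3, 4]], "ids": ["a", "b"]}
features = WholeDataWrapper(select_key, key="matrices").fit(data).transform(data)
print(features)
```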