Overview¶
Reskit (researcher’s kit) is a library for creating and curating reproducible
pipelines for scientific and industrial machine learning. A natural extension
of scikit-learn Pipelines to more general classes of pipelines, Reskit
allows for the efficient and transparent optimization of each pipeline step.
Main features include data caching, compatibility with most scikit-learn
objects, optimization constraints (e.g. forbidden combinations), and table
generation for quality metrics. Reskit also allows custom metrics to be
injected into the underlying scikit-learn machinery. Reskit is intended for
researchers who need pipelines amenable to versioning and reproducibility,
yet who also have a large volume of experiments to run.
Features¶
- Combining pipelines that share the same sequence of steps into a list of experiments, running them all, and returning the results in a format convenient for analysis (a Pandas DataFrame).
- Caching of preprocessing steps. Ordinary scikit-learn pipelines cannot cache intermediate steps; Reskit lets you save fixed steps, so steps already computed for one pipeline are not recalculated for the next.
- “Forbidden combinations” for chosen steps of a pipeline, so that only the pipelines you need are tested rather than every possible combination.
- Full compatibility with scikit-learn objects: any scikit-learn data transformer or predictive model can be used in Reskit.
- Evaluation of experiments using several performance metrics.
- Custom transformers for your own tasks through the DataTransformer class, which lets you use your own functions as data-processing steps in pipelines.
- Tools for machine learning on networks, in particular for connectomics: you can normalize adjacency matrices of graphs and calculate state-of-the-art local metrics using DataTransformer and BCTpy (the Python version of the Brain Connectivity Toolbox), or use metrics implemented in Reskit [3].
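The custom-transformer idea above can be sketched in plain scikit-learn. Reskit’s DataTransformer plays this role and its exact signature may differ; the sketch below uses scikit-learn’s FunctionTransformer, which expresses the same pattern of turning an arbitrary function into a pipeline step. The `degrees` function is a hypothetical example of a graph feature.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def degrees(matrices):
    # Hypothetical graph feature: node degrees computed from a stack
    # of adjacency matrices, one matrix per graph.
    return matrices.sum(axis=-1)

# Wrap the function so it can serve as a data-processing pipeline step.
step = FunctionTransformer(degrees)

A = np.ones((5, 4, 4))            # 5 graphs with 4 nodes each
features = step.fit_transform(A)  # shape (5, 4)
```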
Getting started: A Short Introduction to Reskit¶
Let’s say we want to prepare data and try some scalers and classifiers for a classification problem, tuning the classifiers’ parameters by grid search.
Preparing the data:
from sklearn.datasets import make_classification
X, y = make_classification()
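With its default arguments, make_classification yields 100 samples with 20 features and binary labels; a quick sanity check:

```python
from sklearn.datasets import make_classification

# Default parameters: 100 samples, 20 features, 2 classes.
X, y = make_classification()
print(X.shape)           # (100, 20)
print(set(y) == {0, 1})  # True
```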
Setting steps for our pipelines and parameters for grid search:
from reskit.core import Pipeliner
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC())]
scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]
steps = [('scaler', scalers),
         ('classifier', classifiers)]
param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}}
Setting cross-validation schemes for the grid search of hyperparameters and for the evaluation of models with the obtained hyperparameters:
from sklearn.model_selection import StratifiedKFold
grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
Creating a plan of our research:
pipeliner = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv,
                      param_grid=param_grid)
pipeliner.plan_table
     scaler classifier
0  standard         LR
1  standard        SVC
2    minmax         LR
3    minmax        SVC
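The plan table enumerates the Cartesian product of the options at each step, which is why two scalers and two classifiers give four pipelines:

```python
from itertools import product

scalers = ['standard', 'minmax']
classifiers = ['LR', 'SVC']

# Every scaler is paired with every classifier.
plan = list(product(scalers, classifiers))
print(plan)
# [('standard', 'LR'), ('standard', 'SVC'),
#  ('minmax', 'LR'), ('minmax', 'SVC')]
```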
To tune the models’ parameters and evaluate the tuned models, run:
pipeliner.get_results(X, y, scoring=['roc_auc'])
Line: 1/4
Line: 2/4
Line: 3/4
Line: 4/4
     scaler classifier  grid_roc_auc_mean grid_roc_auc_std grid_roc_auc_best_params  eval_roc_auc_mean eval_roc_auc_std       eval_roc_auc_scores
0  standard         LR              0.956  0.0338230690506        {'penalty': 'l1'}              0.968  0.0324961536185  [ 0.92 1.   1.   0.94 0.98]
1  standard        SVC              0.962  0.0278567765544       {'kernel': 'poly'}              0.976  0.0300665927567  [ 0.95 1.   1.   0.93 1.  ]
2    minmax         LR              0.964  0.0412795348811        {'penalty': 'l1'}              0.966  0.0377359245282  [ 0.92 1.   1.   0.92 0.99]
3    minmax        SVC              0.958  0.0411825205639        {'kernel': 'rbf'}              0.962  0.0401995024845  [ 0.93 1.   1.   0.9  0.98]
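Since the results come back as a pandas DataFrame, the usual pandas tools apply. For instance, picking the pipeline with the best evaluation score (the values below are copied from the table above):

```python
import pandas as pd

# Evaluation means from the results table above.
results = pd.DataFrame({
    'scaler': ['standard', 'standard', 'minmax', 'minmax'],
    'classifier': ['LR', 'SVC', 'LR', 'SVC'],
    'eval_roc_auc_mean': [0.968, 0.976, 0.966, 0.962],
})

# Row with the highest evaluation ROC AUC.
best = results.loc[results['eval_roc_auc_mean'].idxmax()]
print(best['scaler'], best['classifier'])  # standard SVC
```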
Installation¶
Reskit currently requires Python 3.4 or later to run.
Please install Python
and pip
via your operating system’s package manager if they are not already included.
Reskit’s dependencies are listed in its requirements file. To install them, run:
pip install -r https://raw.githubusercontent.com/neuro-ml/reskit/master/requirements.txt
To install the stable version, run:
pip install -U https://github.com/neuro-ml/reskit/archive/master.zip
To install the latest development version of Reskit, run:
pip install https://github.com/neuro-ml/reskit/archive/master.zip
Some Reskit functions depend on additional packages, which you may install via:
pip install -r https://raw.githubusercontent.com/neuro-ml/reskit/master/requirements_additional.txt
Docker¶
If you just want to try Reskit, or you don’t want to install Python, you can build a Docker image and do everything Reskit-related inside a container. This also gives you a simple way to reproduce your experiments. To run Reskit in Docker, use the following commands.
- Clone:
git clone https://github.com/neuro-ml/reskit.git
cd reskit
- Build:
docker build -t docker-reskit -f Dockerfile .
- Run container.
- If you want to run bash in container:
docker run -it docker-reskit bash
- If you want to run bash in the container with a shared directory:
docker run -v $PWD/scripts:/reskit/scripts -it -p 8809:8888 docker-reskit bash
Note: files saved in the shared directory won’t be deleted after the container stops.
- If you want to start a Jupyter Notebook server in the container, reachable at
http://localhost:8809:
docker run -v $PWD/scripts:/reskit/scripts -it -p 8809:8888 docker-reskit jupyter notebook --no-browser --ip="*"
Then open http://localhost:8809 in a web browser on your local machine.