Overview

Reskit (researcher’s kit) is a library for creating and curating reproducible pipelines for scientific and industrial machine learning. A natural extension of scikit-learn Pipelines to more general classes of pipelines, Reskit allows efficient and transparent optimization of each pipeline step. Main features include data caching, compatibility with most scikit-learn objects, optimization constraints (e.g. forbidden combinations of steps), and table generation for quality metrics. Reskit also allows the injection of custom metrics into the underlying scikit-learn machinery. Reskit is intended for researchers who need pipelines amenable to versioning and reproducibility, yet who also have a large volume of experiments to run.

Features

  • Ability to combine pipelines with an equal number of steps into a list of experiments, run them, and return the results in a format convenient for analysis (a pandas DataFrame).
  • Caching of preprocessing steps. Ordinary scikit-learn pipelines cannot cache intermediate steps; Reskit can save the output of fixed steps so that already computed steps are not recalculated in the next pipeline.
  • Ability to set “forbidden combinations” for chosen steps of a pipeline, so that only the pipelines you actually need are tested rather than every possible combination (a sketch appears after the plan table in the Getting started section below).
  • Full compatibility with scikit-learn objects: any scikit-learn data transformer or predictive model can be used as a step in Reskit.
  • Evaluation of experiments with several performance metrics.
  • Creation of transformers for your own tasks through the DataTransformer class, which lets you use your own functions as data-processing steps in pipelines (see the sketch after this list).
  • Tools for machine learning on networks, in particular for connectomics: you can normalize adjacency matrices of graphs and compute state-of-the-art local metrics using DataTransformer and BCTpy (the Python version of the Brain Connectivity Toolbox), or use the metrics implemented in Reskit [3].
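
As a quick illustration of the DataTransformer idea, here is a minimal sketch. The import path and the func keyword argument are assumptions here, so check the Reskit API reference for the exact signature:

import numpy as np

from reskit.core import DataTransformer  # import path assumed


def log_transform(X):
    # Element-wise log(1 + x): a simple custom preprocessing step.
    return np.log1p(X)

# Wrap the function so it can appear as a step option in Pipeliner's steps
# list (the func keyword is an assumption; see the documentation).
custom_scalers = [('log', DataTransformer(func=log_transform))]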

Getting started: A Short Introduction to Reskit

Let’s say we want to prepare data and try some scalers and classifiers for a classification problem. We will tune the classifiers’ hyperparameters with a grid search.

Preparing the data:

from sklearn.datasets import make_classification


X, y = make_classification()
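
With its default arguments, make_classification produces a small synthetic binary classification dataset of 100 samples with 20 features, which is enough to demonstrate the workflow:

print(X.shape, y.shape)  # (100, 20) (100,)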

Setting steps for our pipelines and parameters for grid search:

from reskit.core import Pipeliner

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC())]

scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]

steps = [('scaler', scalers),
         ('classifier', classifiers)]

param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}}
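
One caveat: in newer scikit-learn releases (0.22 and later) the default LogisticRegression solver is 'lbfgs', which does not support the 'l1' penalty, so the grid above may raise an error there. If that happens, a solver that handles both penalties can be specified explicitly, for example:

# 'liblinear' supports both the 'l1' and 'l2' penalties, unlike the
# newer default solver 'lbfgs'.
classifiers = [('LR', LogisticRegression(solver='liblinear')),
               ('SVC', SVC())]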

Setting up cross-validation: one splitter for the grid search over hyperparameters and another for evaluating the models with the hyperparameters found. Different random seeds keep the two sets of splits independent.

from sklearn.model_selection import StratifiedKFold


grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

Creating a plan of our experiments:

pipeliner = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipeliner.plan_table
     scaler classifier
0  standard         LR
1  standard        SVC
2    minmax         LR
3    minmax        SVC
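
The plan is the full Cartesian product of the step options. If some of these combinations should not be run, they can be pruned with the “forbidden combinations” feature mentioned above. The sketch below assumes the constraint is passed as a list of step-name tuples through a banned_combos argument; check the Pipeliner reference for the exact parameter name:

# Assumption: banned_combos lists step-name tuples that must never occur
# together in a single pipeline.
pipeliner = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv,
                      param_grid=param_grid,
                      banned_combos=[('minmax', 'SVC')])
pipeliner.plan_table  # the minmax + SVC row is now excluded from the plan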

To tune the models’ hyperparameters and evaluate the resulting models, run:

pipeliner.get_results(X, y, scoring=['roc_auc'])
Line: 1/4
Line: 2/4
Line: 3/4
Line: 4/4
  scaler classifier grid_roc_auc_mean grid_roc_auc_std grid_roc_auc_best_params eval_roc_auc_mean eval_roc_auc_std eval_roc_auc_scores
0 standard LR 0.956 0.0338230690506 {'penalty': 'l1'} 0.968 0.0324961536185 [ 0.92 1. 1. 0.94 0.98]
1 standard SVC 0.962 0.0278567765544 {'kernel': 'poly'} 0.976 0.0300665927567 [ 0.95 1. 1. 0.93 1. ]
2 minmax LR 0.964 0.0412795348811 {'penalty': 'l1'} 0.966 0.0377359245282 [ 0.92 1. 1. 0.92 0.99]
3 minmax SVC 0.958 0.0411825205639 {'kernel': 'rbf'} 0.962 0.0401995024845 [ 0.93 1. 1. 0.9 0.98]
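
get_results returns a pandas DataFrame (the grid_* columns summarize the cross-validated grid search, the eval_* columns the evaluation cross-validation), so the usual pandas tools can be used to inspect or store the results, for example:

results = pipeliner.get_results(X, y, scoring=['roc_auc'])

# Sort the pipelines by evaluation score and keep the table for later analysis.
best_first = results.sort_values('eval_roc_auc_mean', ascending=False)
best_first.to_csv('roc_auc_results.csv', index=False)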

Installation

Reskit currently requires Python 3.4 or later. If Python and pip are not already included with your operating system, install them via its package manager.

Reskit’s dependencies are listed in its requirements.txt; to install them, run:

pip install -r https://raw.githubusercontent.com/neuro-ml/reskit/master/requirements.txt

To install the stable version, run the following command:

pip install -U https://github.com/neuro-ml/reskit/archive/master.zip

To install the latest development version of Reskit, run the following command:

pip install https://github.com/neuro-ml/reskit/archive/master.zip

Some Reskit functions depend on additional packages, which you can install via:

pip install -r https://raw.githubusercontent.com/neuro-ml/reskit/master/requirements_additional.txt

Docker

If you just want to try Reskit or don’t want to install Python, you can build a Docker image and do all your Reskit work inside it. This also gives you a simple way to reproduce your experiments. To run Reskit in Docker, use the following commands.

  1. Clone the repository:
git clone https://github.com/neuro-ml/reskit.git
cd reskit
  2. Build the image:
docker build -t docker-reskit -f Dockerfile .
  3. Run a container.
  • If you want to run bash in the container:
docker run -it docker-reskit bash
  • If you want to run bash in the container with a shared directory:
docker run -v $PWD/scripts:/reskit/scripts -it -p 8809:8888 docker-reskit bash

Note

Files won’t be deleted after stopping the container if you save them in the shared directory.

  • If you want to start a Jupyter Notebook server at http://localhost:8809 in the container:
docker run -v $PWD/scripts:/reskit/scripts -it -p 8809:8888 docker-reskit jupyter notebook --no-browser --ip="*"

Open http://localhost:8809 on your local machine in a web browser.