Machine Learning on Graphs

We already used some graph metrics in the previous tutorial. Here we will cover graph metrics and features in detail. We will also cover Brain Connectivity Toolbox usage.

1. Real-world dataset

Here we use the UCLA autism dataset, publicly available at the UCLA Multimodal Connectivity Database. The data includes DTI-based connectivity matrices of 51 high-functioning ASD subjects (6 females) and 43 TD subjects (7 females).

from reskit.datasets import load_UCLA_data


X, y = load_UCLA_data()
X = X['matrices']
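
Before building features, it is worth checking what we loaded. Below is a minimal sanity-check sketch; it assumes each element of X is a square connectivity matrix and y holds integer group labels.

import numpy as np

print(len(X))                  # number of subjects (94 = 51 ASD + 43 TD)
print(np.asarray(X[0]).shape)  # one DTI connectivity matrix, n_ROI x n_ROI
print(np.bincount(y))          # class balance, assuming 0/1 integer labels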

2. Normalizations and Graph Metrics

We can normalize the matrices and compute some graph features:

from reskit.normalizations import mean_norm
from reskit.features import bag_of_edges
from reskit.core import MatrixTransformer


normalized_X = MatrixTransformer(
    func=mean_norm).fit_transform(X)

featured_X = MatrixTransformer(
    func=bag_of_edges).fit_transform(normalized_X)
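
Under the hood these are simple matrix operations. The sketch below shows plausible NumPy equivalents, assuming mean_norm rescales a matrix by its mean edge weight and bag_of_edges vectorizes the upper triangle of a symmetric matrix (check the reskit source for the exact definitions):

import numpy as np

def mean_norm_sketch(matrix):
    # Rescale every edge by the matrix's mean weight (assumed behaviour).
    return matrix / matrix.mean()

def bag_of_edges_sketch(matrix):
    # Keep the upper triangle (excluding the diagonal) as a feature vector,
    # since an undirected connectivity matrix is symmetric.
    return matrix[np.triu_indices_from(matrix, k=1)]

A = np.random.rand(4, 4)
A = (A + A.T) / 2                        # symmetrize, like a connectome
print(bag_of_edges_sketch(mean_norm_sketch(A)).shape)  # (6,) = 4 * 3 / 2 edges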

3. Brain Connectivity Toolbox

We provide some basic graph metrics in Reskit. To access a wider range of state-of-the-art graph metrics, you can use the Brain Connectivity Toolbox. You should install it via pip:

sudo pip install bctpy

Let's calculate the PageRank centrality of a random graph using the BCT Python library.

from bct.algorithms.centrality import pagerank_centrality
import numpy as np


pagerank_centrality(np.random.rand(3,3), d=0.85)
array([ 0.46722034,  0.33387522,  0.19890444])
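
Note that the values above will differ from run to run, since the input graph is random. To see what pagerank_centrality is doing, here is a minimal power-iteration sketch of PageRank on a weighted graph. This is an illustrative re-derivation under standard assumptions, not bctpy's exact implementation, so small numerical differences are possible.

import numpy as np

def pagerank_sketch(W, d=0.85, n_iter=100):
    # Damped random walk on the row-normalized transition matrix.
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transitions
    r = np.ones(n) / n                    # start from the uniform distribution
    for _ in range(n_iter):
        r = (1 - d) / n + d * P.T @ r     # teleport + follow weighted edges
    return r / r.sum()                    # normalize to sum to 1

W = np.random.rand(3, 3)
print(pagerank_sketch(W, d=0.85))  # compare with pagerank_centrality(W, d=0.85)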

Now we calculate this metric for the UCLA dataset. Here d is a pagerank_centrality parameter called the damping factor (see the bctpy documentation for more info); MatrixTransformer passes such extra keyword arguments on to func.

featured_X = MatrixTransformer(
    d=0.85,
    func=pagerank_centrality).fit_transform(X)
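
A quick shape check, assuming MatrixTransformer stacks one feature vector per subject:

import numpy as np

# One pagerank value per brain region, one row per subject.
print(np.asarray(featured_X).shape)  # expected: (n_subjects, n_ROI)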

Now suppose we want to try both pagerank_centrality and degrees features with SVM and LogisticRegression classifiers. Reskit's Pipeliner lets us define these step variants and evaluate every combination:

from bct.algorithms.degree import degrees_und

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

from reskit.core import Pipeliner

# Feature extraction step variants (1st step)
featurizers = [('pagerank', MatrixTransformer(
                                d=0.85,
                                func=pagerank_centrality)),
               ('degrees', MatrixTransformer(
                                func=degrees_und))]

# Models (3rd step)
classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC())]

# Reskit needs to define steps in this manner
steps = [('featurizer', featurizers),
         ('classifier', classifiers)]

# Grid search parameters for our models
param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}}

# Quality metric that we want to optimize
scoring='roc_auc'

# Setting cross-validations
grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipe.plan_table
  featurizer classifier
0   pagerank         LR
1   pagerank        SVC
2    degrees         LR
3    degrees        SVC
pipe.get_results(X, y, scoring=scoring, caching_steps=['featurizer'])
Line: 1/4
Line: 2/4
Line: 3/4
Line: 4/4
  featurizer classifier  grid_roc_auc_mean  grid_roc_auc_std  grid_roc_auc_best_params  eval_roc_auc_mean  eval_roc_auc_std  eval_roc_auc_scores
0   pagerank         LR     0.584141951429   0.0942090541588         {'penalty': 'l2'}     0.639191919192   0.0805917875518  [ 0.5959596   0.76666667  0.63333333  0.675   0.525 ]
1   pagerank        SVC     0.605372877713    0.144537957686      {'kernel': 'linear'}     0.611919191919    0.104864911084  [ 0.62626263  0.75555556  0.57777778  0.6625  0.4375 ]
2    degrees         LR     0.622343111971   0.0883996599293         {'penalty': 'l1'}     0.567676767677   0.0721669280455  [ 0.61616162  0.55555556  0.46666667  0.675   0.525 ]
3    degrees        SVC     0.572662798195   0.0409233652853        {'kernel': 'poly'}     0.542752525253   0.0751127269022  [ 0.62626263  0.5         0.5         0.6375  0.45  ]
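
Since get_results returns the filled plan table as a pandas DataFrame, you can rank the pipelines directly. A sketch, assuming the return value above was assigned to a variable named results:

# `results` is the DataFrame returned by pipe.get_results(...) above.
# Cast the score column to float in case it is stored as strings.
order = results['eval_roc_auc_mean'].astype(float).sort_values(ascending=False).index
best = results.loc[order[0]]
print(best['featurizer'], best['classifier'], best['eval_roc_auc_mean'])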

These are the main things about machine learning on graphs. You can now try a large number of normalizations, features, and classifiers for graph classification. If you need something specific, you can implement a temporary pipeline step to figure out the influence of that step on the result.