DataManagers

Overview

skater contains an abstraction for data, the DataManager. The DataManager is initialized with some data, and optionally feature names and an index. Once created, the DataManager offers a generate_sample method, which supports several sampling algorithms. All handling, accessing, manipulating, saving, and loading of data is routed through the DataManager to keep it isolated from the rest of the code base.

Currently, skater supports numpy ndarrays and pandas DataFrames, with plans to support sparse arrays in future versions.

from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
features = breast_cancer.feature_names


from skater.data import DataManager
data = DataManager(X, feature_names=features)

data.generate_sample(n_samples=1000, strategy='random-choice')
data.generate_grid(['mean radius', 'mean texture'], grid_resolution=100)

API

DataManager.__init__(X, y=None, feature_names=None, index=None, log_level=30)

The abstraction around using, accessing, and sampling data for interpretation purposes. Used by interpretation objects to grab data, collect samples, and handle feature names and row indices.

Parameters:
X: 1D/2D numpy array, or pandas DataFrame

Raw data.

y: 1D/2D numpy array, or pandas DataFrame

Ground truth labels for X.

feature_names: iterable of feature names

Optional keyword containing names of features.

index: iterable of row names

Optional keyword containing names of indexes (rows).
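As a point of reference, the positional fallbacks can be sketched in plain numpy. This is an assumed behavior inferred from the generate_column_sample note below (features become addressable by column index when no names are passed), not skater's actual internals:

```python
import numpy as np

# Sketch (assumed behavior, not skater internals): when feature_names or
# index are omitted, columns and rows fall back to positional integers.
X = np.zeros((5, 3))

feature_names = list(range(X.shape[1]))  # [0, 1, 2]
index = list(range(X.shape[0]))          # [0, 1, 2, 3, 4]

print(feature_names, index)
```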

DataManager.generate_sample(sample=True, include_y=False, strategy='random-choice', n_samples=1000, replace=True, bin_count=50)

Method for generating data from the dataset.

Parameters:
sample : boolean

If False, the full dataset is used; otherwise a sample is drawn.

include_y: boolean (default=False)

If True, the corresponding y values are returned along with the sampled X.

strategy: string (default='random-choice')

Supported strategies: 'random-choice', 'uniform-from-percentile', 'uniform-over-similarity-ranks'.

n_samples : int (default=1000)

Specifies the number of samples to return. Only used if strategy is 'random-choice'.

replace : boolean (default=True)

Whether to sample with or without replacement.

bin_count : int

If strategy is 'uniform-over-similarity-ranks', this is the number of samples to take from each discrete rank.
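To make the default 'random-choice' strategy concrete, here is a minimal numpy-only sketch of what a row sample with replacement looks like. This illustrates the semantics of n_samples and replace, not skater's actual implementation:

```python
import numpy as np

# Sketch (not skater internals): 'random-choice' draws row indices
# uniformly at random, with replacement when replace=True.
rng = np.random.RandomState(0)
X = np.arange(20).reshape(10, 2)  # 10 rows, 2 features

n_samples = 4
row_idx = rng.choice(X.shape[0], size=n_samples, replace=True)
sample = X[row_idx]

print(sample.shape)  # (4, 2)
```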

DataManager.generate_grid(feature_ids, grid_resolution=100, grid_range=(0.05, 0.95))

Generates a grid of values on which to compute partial dependence plots (PDP). For each feature x_i and each grid value y_j of x_i, we fix x_i = y_j for every observation in X.

Parameters:
feature_ids(list):

Feature names for which to generate a grid. Must be contained in self.feature_ids.

grid_resolution(int):

The number of unique values to choose for each feature.

grid_range(tuple):

The percentile bounds of the grid. For instance, (.05, .95) corresponds to the 5th and 95th percentiles, respectively.

Returns:
grid(numpy.ndarray): A 2D array with one row per entry in feature_ids and as many columns as specified by grid_resolution.
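The grid construction described above can be sketched as follows. This is an assumption about the approach (evenly spaced values between the grid_range percentiles of each feature), not skater's actual implementation; make_grid is a hypothetical helper:

```python
import numpy as np

# Sketch (assumed approach): for each feature, take grid_resolution
# evenly spaced values between the grid_range percentiles of that
# feature's observed values.
def make_grid(X, feature_cols, grid_resolution=100, grid_range=(0.05, 0.95)):
    lo, hi = (100 * p for p in grid_range)  # percentile bounds, e.g. 5 and 95
    rows = []
    for col in feature_cols:
        low, high = np.percentile(X[:, col], [lo, hi])
        rows.append(np.linspace(low, high, grid_resolution))
    return np.array(rows)

X = np.random.RandomState(0).normal(size=(500, 3))
grid = make_grid(X, feature_cols=[0, 2], grid_resolution=100)
print(grid.shape)  # (2, 100): one row per feature, one column per grid value
```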

DataManager.generate_column_sample(feature_id, *args, **kwargs)

Sample a single feature from the data set.

Parameters:
feature_id: hashable

Name of the feature to sample. If no feature names were passed, the features are accessible via their column index.
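The name-or-index lookup can be sketched in plain numpy. column_sample is a hypothetical helper illustrating the resolution described above, not skater's implementation:

```python
import numpy as np

# Sketch (assumed behavior): sample a single column, addressed by name
# when feature names exist, otherwise by positional index.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
feature_names = ['mean radius', 'mean texture']

def column_sample(X, feature_names, feature_id, n_samples=10):
    # Resolve a feature name to a column index; fall back to treating
    # feature_id as a positional index.
    col = feature_names.index(feature_id) if feature_id in feature_names else feature_id
    idx = rng.choice(X.shape[0], size=n_samples, replace=True)
    return X[idx, col]

s = column_sample(X, feature_names, 'mean texture', n_samples=10)
print(s.shape)  # (10,)
```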