skater contains an abstraction for data, the DataManager. The DataManager is initialized with some data, and optionally feature names and an index. Once created, the DataManager offers a generate_sample method, which includes options for several sampling algorithms. All handling, accessing, manipulating, saving, and loading of data is routed through the DataManager to keep data concerns isolated from the rest of the code base.
Currently, skater supports numpy ndarrays and pandas DataFrames, with plans to support sparse arrays in future versions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
features = breast_cancer.feature_names

from skater.data import DataManager
data = DataManager(X, feature_names=features)
data.generate_sample(n_samples=1000, strategy='random-choice')
data.generate_grid(features[0:2], grid_resolution=100)
__init__(X, y=None, feature_names=None, index=None, log_level=30)
The abstraction around using, accessing, and sampling data for interpretation purposes. Used by interpretation objects to grab data, collect samples, and handle feature names and row indices.
- X: 1D/2D numpy array, or pandas DataFrame
- y: 1D/2D numpy array, or pandas DataFrame
ground truth labels for X
- feature_names: iterable of feature names
Optional keyword containing names of features.
- index: iterable of row names
Optional keyword containing names of indexes (rows).
generate_sample(sample=True, include_y=False, strategy='random-choice', n_samples=1000, replace=True, bin_count=50)
Method for generating a sample from the dataset.
- sample: boolean (default=True)
If False, the full dataset is returned; otherwise a sample is drawn.
- include_y: boolean (default=False)
If True, the corresponding y values are returned along with the sample.
- strategy: string (default=’random-choice’)
Supported strategies: ‘random-choice’, ‘uniform-from-percentile’, ‘uniform-over-similarity-ranks’.
- n_samples : int (default=1000)
Specifies the number of samples to return. Only used when strategy is “random-choice”.
- replace : boolean (default=True)
Bool for sampling with or without replacement
- bin_count : int
If strategy is “uniform-over-similarity-ranks”, then this is the number of samples to take from each discrete rank.
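The default ‘random-choice’ strategy amounts to drawing row indices at random and indexing into the dataset. A minimal numpy sketch of that sampling behavior (an illustration of the semantics, not skater’s internal implementation; the toy dataset shapes are assumptions):

```python
import numpy as np

# Toy dataset: 200 rows, 4 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))

# 'random-choice': draw n_samples row indices, with replacement by default.
n_samples, replace = 50, True
idx = rng.choice(X.shape[0], size=n_samples, replace=replace)
sample = X[idx]

print(sample.shape)  # (50, 4)
```

With replace=False, n_samples may not exceed the number of rows; with replacement, the same row can appear in the sample more than once.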
generate_grid(feature_ids, grid_resolution=100, grid_range=(0.05, 0.95))
Generates a grid of values on which to compute partial dependence plots (pdp). For each feature xi and each grid value yj of xi, we fix xi = yj for every observation in X.
- feature_ids: iterable of strings
Feature names for which we’ll generate a grid. Must be contained in self.feature_ids.
- grid_resolution: int (default=100)
The number of unique values to choose for each feature.
- grid_range: tuple (default=(0.05, 0.95))
The percentile bounds of the grid. For instance, (.05, .95) corresponds to the 5th and 95th percentiles, respectively.
- grid (numpy.ndarray): There are as many rows as there are feature_ids, and as many columns as specified by grid_resolution.
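Conceptually, the grid can be thought of as grid_resolution evenly spaced values between the requested percentile bounds of each selected feature. A rough numpy sketch under that assumption (the dataset and the evenly-spaced choice of values are illustrative, not skater’s exact implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))

feature_ids = [0, 2]            # columns to build the grid over
grid_resolution = 100
grid_range = (0.05, 0.95)

# Percentile bounds per selected feature, then evenly spaced values between them.
lower = np.percentile(X[:, feature_ids], grid_range[0] * 100, axis=0)
upper = np.percentile(X[:, feature_ids], grid_range[1] * 100, axis=0)
grid = np.array([np.linspace(lo, hi, grid_resolution)
                 for lo, hi in zip(lower, upper)])

print(grid.shape)  # (2, 100): one row per feature_id, one column per grid value
```

This reproduces the documented shape contract: as many rows as feature_ids, as many columns as grid_resolution.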
generate_column_sample(feature_id, *args, **kwargs)
Sample a single feature from the data set.
- feature_id: hashable
Name of the feature to sample. If no feature names were passed, then features are accessible via their column index.
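Sampling a single feature reduces to selecting that column and drawing values from it. A sketch with pandas, where the feature names and sample size are illustrative (sampling a named column when feature names exist, or a positional column otherwise):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(200, 2)),
                  columns=['mean radius', 'mean texture'])

# With feature names, address the column by name; without, by integer index.
feature_id = 'mean radius'
column = df[feature_id].values
column_sample = rng.choice(column, size=30, replace=True)

print(column_sample.shape)  # (30,)
```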