Interpretation Objects¶
Overview¶
Interpretation objects are initialized with a DataManager object and expose interpretation algorithms as methods. For instance:
from skater.core.explanations import Interpretation
interpreter = Interpretation()
interpreter.load_data(data)
interpreter.feature_importance.feature_importance(model)

Loading Data¶
Before running interpretation algorithms on a model, the Interpretation object usually needs data, either to learn about the distribution of the training set or to pass inputs into a prediction function.
When calling Interpretation.load_data, the object creates a DataManager object, which handles the data, keeping track of feature and observation names, as well as providing various sampling algorithms.
Currently load_data requires a numpy ndarray or pandas DataFrame, though we may add support for additional data structures in the future. For more details on what the DataManager does, please see the relevant documentation [PROVIDE LINK].

Interpretation.load_data(training_data, training_labels=None, feature_names=None, index=None)¶
Creates a DataManager object from the inputs and ties it to the Interpretation object. This will be exposed to all submodules.
Parameters:
training_data: numpy.ndarray, pandas.DataFrame
    the dataset; can be 1D or 2D
feature_names: array type
    names to call features.
index: array type
    names to call rows.
Returns: None
Global Interpretations¶
A predictive model is a mapping from an input space to an output space. Global interpretation algorithms offer statistics and metrics on regions of the domain, such as the marginal distribution of a feature or the joint distribution of the entire training set. In an ideal world there would exist some representation that would allow a human to interpret a decision function in any number of dimensions. Given that we can generally only intuit visualizations of a few dimensions at a time, global interpretation algorithms either aggregate or subset the feature space.
Currently, model agnostic global interpretation algorithms supported by skater include partial dependence and feature importance.
Feature Importance¶
Feature importance is a generic term for the degree to which a predictive model relies on a particular feature. Skater's feature importance implementation is based on an information-theoretic criterion, measuring the entropy in the change of predictions given a perturbation of a given feature. The intuition is that the more a model's decision criteria depend on a feature, the more predictions change as a function of perturbing that feature.
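The perturb-and-measure intuition can be sketched with plain numpy and scikit-learn. This is a hypothetical illustration of the idea, not Skater's actual implementation (which uses an entropy-based criterion); here we simply shuffle one column at a time and record the mean absolute change in predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

rng = np.random.RandomState(0)
baseline = model.predict_proba(X)
importances = []
for j in range(X.shape[1]):
    X_perturbed = X.copy()
    rng.shuffle(X_perturbed[:, j])                 # perturb one feature only
    delta = model.predict_proba(X_perturbed) - baseline
    importances.append(np.abs(delta).mean())       # mean absolute prediction change
importances = np.array(importances) / np.sum(importances)
```

Features the model ignores leave predictions nearly unchanged and so receive importance near zero.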
Jupyter Notebooks

class skater.core.global_interpretation.feature_importance.FeatureImportance(interpreter)¶
Contains methods for feature importance. Subclass of BaseGlobalInterpretation.
Attributes:
data_set
    data_set routes to the Interpreter's dataset
training_labels
    training_labels routes to the Interpreter's training labels
Methods
feature_importance(model_instance[, …])
    Computes feature importance of all features related to a model instance.
load_data(training_data[, index, feature_names])
    Routes to the Interpreter's load_data.
plot_feature_importance(modelinstance[, …])
    Computes feature importance of all features related to a model instance, then plots the results.
data_set¶
    data_set routes to the Interpreter's dataset

feature_importance(model_instance, ascending=True, filter_classes=None, n_jobs=-1, progressbar=True, n_samples=5000, method='prediction-variance', scorer_type='default', use_scaling=False)¶
Computes feature importance of all features related to a model instance. Supports classification, multiclass classification, and regression.
Parameters:
model_instance: skater.model.model.Model subtype
    the machine learning model "prediction" function to explain, such that predictions = predict_fn(data).
ascending: boolean, default True
    whether to sort the importances in ascending order.
filter_classes: array type
    The classes to compute importance on. Default None invokes all classes. Only used in classification models.
n_jobs: int
    How many concurrent processes to use. Defaults to -1, which grabs as many as are available. Use 1 to avoid multiprocessing altogether.
progressbar: bool
    Whether to display progress. This affects which function is used to operate on the pool of processes; including the progress bar results in 10-20% slowdowns.
n_samples: int
    How many samples to use when computing importance.
method: string (default 'prediction-variance'; 'model-scoring' for an estimator-specific scoring metric)
    How to compute feature importance. 'model-scoring' requires Interpretation.training_labels. Note this choice only rarely makes a significant difference.
    prediction-variance: mean absolute value of changes in predictions, given perturbations.
    model-scoring: difference in log_loss or MAE of training_labels, given perturbations.
scorer_type: string
    Only used when method='model-scoring'; defines which scoring function to use. Default value is 'default', which evaluates to:
    regressors: mean absolute error
    classifiers with probabilities: cross entropy
    classifiers without probabilities: f1 score
    See skater.model.scorers for details.
use_scaling: bool
    Whether to weight the importance values by the strength of the perturbations. Generally doesn't affect results unless n_samples is very small.
Returns:
importances: sorted Series
References
Wei, Pengfei, Zhenzhou Lu, and Jingwen Song. "Variable Importance Analysis: A Comprehensive Review". Reliability Engineering & System Safety 142 (2015): 399-432.
Examples
>>> from skater.model import InMemoryModel
>>> from skater.core.explanations import Interpretation
>>> from sklearn.ensemble import RandomForestClassifier
>>> rf = RandomForestClassifier()
>>> rf.fit(X, y)
>>> model = InMemoryModel(rf.predict_proba, examples=X)
>>> interpreter = Interpretation()
>>> interpreter.load_data(X)
>>> interpreter.feature_importance.feature_importance(model)

load_data(training_data, index=None, feature_names=None)¶
Routes to the Interpreter's load_data.

plot_feature_importance(modelinstance, filter_classes=None, ascending=True, ax=None, progressbar=True, n_jobs=-1, n_samples=5000, method='prediction-variance', scorer_type='default', use_scaling=False)¶
Computes feature importance of all features related to a model instance, then plots the results. Supports classification, multiclass classification, and regression.
Parameters:
modelinstance: skater.model.model.Model subtype
    the estimator "prediction" function to explain. Could return probability scores or target values.
filter_classes: array type
    The classes to compute importance on. Default None invokes all classes. Only used in classification models.
ascending: boolean, default True
    whether to sort the importances in ascending order.
ax: matplotlib.axes._subplots.AxesSubplot
    existing subplot on which to plot feature importance. If none is provided, one will be created.
progressbar: bool
    Whether to display progress. This affects which function is used to operate on the pool of processes; including the progress bar results in 10-20% slowdowns.
n_jobs: int
    How many concurrent processes to use. Defaults to -1, which grabs as many as are available. Use 1 to avoid multiprocessing altogether.
n_samples: int
    How many samples to use when computing importance.
method: string (default 'prediction-variance')
    How to compute feature importance. 'model-scoring' requires Interpretation.training_labels. Note this choice only rarely makes a significant difference.
    prediction-variance: mean absolute value of changes in predictions, given perturbations.
    model-scoring: difference in log_loss or MAE of training_labels, given perturbations.
scorer_type: string
    Only used when method='model-scoring'; defines which scoring function to use. Default value is 'default', which evaluates to:
    regressors: mean absolute error
    classifiers with probabilities: cross entropy
    classifiers without probabilities: f1 score
    See skater.model.scorers for details.
use_scaling: bool
    Whether to weight the importance values by the strength of the perturbations. Generally doesn't affect results unless n_samples is very small.
Returns:
f: figure instance
ax: matplotlib.axes._subplots.AxesSubplot
    can be used for further modification of the plots
Examples
>>> from skater.model import InMemoryModel
>>> from skater.core.explanations import Interpretation
>>> from sklearn.ensemble import RandomForestClassifier
>>> rf = RandomForestClassifier()
>>> rf.fit(X, y)
>>> model = InMemoryModel(rf.predict_proba, examples=X)
>>> interpreter = Interpretation()
>>> interpreter.load_data(X)
>>> interpreter.feature_importance.plot_feature_importance(model, ascending=True, ax=ax)

training_labels¶
    training_labels routes to the Interpreter's training labels
Partial Dependence¶
Partial Dependence describes the marginal impact of a feature on model prediction, holding other features in the model constant. The derivative of partial dependence describes the impact of a feature (analogous to a feature coefficient in a regression model).
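The marginal-impact definition above can be sketched by hand: for each grid value of a feature, overwrite that feature for every row and average the model's predictions. This is a simplified illustration of the averaging idea, not Skater's implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=3, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

feature = 0
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)
pdp = []
for value in grid:
    X_synthetic = X.copy()
    X_synthetic[:, feature] = value                # hold the feature fixed
    pdp.append(model.predict(X_synthetic).mean())  # average over the other features
pdp = np.array(pdp)
# plotting grid vs. pdp gives the one-dimensional partial dependence curve
```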
Jupyter Notebooks

class skater.core.global_interpretation.partial_dependence.PartialDependence(interpreter)¶
Contains methods for partial dependence. Subclass of BaseGlobalInterpretation.
Partial dependence adapted from:
T. Hastie, R. Tibshirani and J. Friedman, Elements of Statistical Learning Ed. 2, Springer, 2009.
Attributes:
data_set
    data_set routes to the Interpreter's dataset
training_labels
    training_labels routes to the Interpreter's training labels
Methods
compute_3d_gradients(pdp, mean_col, …[, …])
    Computes component-wise gradients of a pdp DataFrame.
load_data(training_data[, index, feature_names])
    Routes to the Interpreter's load_data.
partial_dependence(feature_ids, modelinstance)
    Approximates the partial dependence of the predict_fn with respect to the variables passed.
plot_partial_dependence(feature_ids, …[, …])
    Computes partial dependence of a set of variables, then plots the results.
feature_column_name_formatter
static compute_3d_gradients(pdp, mean_col, feature_1, feature_2, scaled=True)¶
Computes component-wise gradients of a pdp DataFrame.
Parameters:  pdp: pandas.DataFrame
DataFrame containing partial dependence values
 mean_col: string
column name corresponding to pdp value
 feature_1: string
column name corresponding to feature 1
 feature_2: string
column name corresponding to feature 2
 scaled: bool
Whether to scale the x1 and x2 gradients relative to x1 and x2 bin sizes
Returns:  dx, dy, x_matrix, y_matrix, z_matrix
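What component-wise gradients of a two-feature partial dependence surface look like can be illustrated with numpy.gradient on a pivoted grid. This is a hypothetical sketch of the idea on fabricated data, not the method's actual internals.

```python
import numpy as np
import pandas as pd

# fabricated pdp-style output: mean prediction over a 2-feature grid,
# lying on the plane mean = 2*feature_1 + feature_2
f1 = np.repeat(np.arange(0.0, 3.0), 3)
f2 = np.tile(np.arange(0.0, 3.0), 3)
pdp = pd.DataFrame({'feature_1': f1, 'feature_2': f2, 'mean': 2.0 * f1 + f2})

# pivot to a grid and take component-wise gradients
z_matrix = pdp.pivot(index='feature_2', columns='feature_1', values='mean').values
dy, dx = np.gradient(z_matrix)   # dy: along feature_2 rows, dx: along feature_1 columns
# on this plane the gradients recover the slopes of 2 and 1
```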

data_set¶
    data_set routes to the Interpreter's dataset

load_data(training_data, index=None, feature_names=None)¶
Routes to the Interpreter's load_data.

partial_dependence(feature_ids, modelinstance, filter_classes=None, grid=None, grid_resolution=30, n_jobs=-1, grid_range=None, sample=True, sampling_strategy='random-choice', n_samples=1000, bin_count=50, return_metadata=False, progressbar=True, variance_type='estimate')¶
Approximates the partial dependence of the predict_fn with respect to the variables passed.

plot_partial_dependence(feature_ids, modelinstance, filter_classes=None, grid=None, grid_resolution=30, grid_range=None, n_jobs=-1, sample=True, sampling_strategy='random-choice', n_samples=1000, bin_count=50, with_variance=False, figsize=(16, 10), progressbar=True, variance_type='estimate')¶
Computes partial dependence of a set of variables, then plots the results. Essentially approximates the partial dependence of the predict_fn with respect to the variables passed.
Examples
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.datasets.california_housing import fetch_california_housing
>>> from sklearn.model_selection import train_test_split
>>> cal_housing = fetch_california_housing()
>>> # split 80/20 train-test
>>> x_train, x_test, y_train, y_test = train_test_split(cal_housing.data,
>>>     cal_housing.target, test_size=0.2, random_state=1)
>>> names = cal_housing.feature_names
>>> print("Training the estimator...")
>>> estimator = GradientBoostingRegressor(n_estimators=10, max_depth=4,
>>>     learning_rate=0.1, loss='huber', random_state=1)
>>> estimator.fit(x_train, y_train)
>>> from skater.core.explanations import Interpretation
>>> interpreter = Interpretation()
>>> print("Feature names: {}".format(names))
>>> interpreter.load_data(x_train, feature_names=names)
>>> print("Input feature names: {}".format([names[1], names[5]]))
>>> from skater.model import InMemoryModel
>>> model = InMemoryModel(estimator.predict, examples=x_train)
>>> interpreter.partial_dependence.plot_partial_dependence([names[1], names[5]], model,
>>>     n_samples=100, n_jobs=1)

training_labels¶
    training_labels routes to the Interpreter's training labels
Local Interpretations¶
Local interpretation can be achieved in two ways. First, one can approximate the behavior of a complex predictive model in the vicinity of a single input using a simple, interpretable auxiliary or surrogate model (e.g., a linear regressor). Second, one can use the base estimator to understand the behavior of a single prediction using intuitive approximate functions based on its inputs and outputs.
Local Interpretable Model-Agnostic Explanations (LIME)¶
LIME is a novel algorithm designed by Marco Ribeiro, Sameer Singh, and Carlos Guestrin to assess the behavior of any base estimator (model) using interpretable surrogate models (e.g., a linear classifier/regressor). This form of evaluation produces explanations that are locally faithful but may not align with the model's global behavior.
Reference: Ribeiro M, Singh S, Guestrin C (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier (arXiv:1602.04938v3)


class skater.core.local_interpretation.lime.lime_tabular.LimeTabularExplainer(training_data, mode='classification', training_labels=None, feature_names=None, categorical_features=None, categorical_names=None, kernel_width=None, verbose=False, class_names=None, feature_selection='auto', discretize_continuous=True, discretizer='quartile')¶
Explains predictions on tabular (i.e. matrix) data. For numerical features, perturb them by sampling from a Normal(0,1) and doing the inverse operation of mean-centering and scaling, according to the means and stds in the training data. For categorical features, perturb by sampling according to the training distribution, and making a binary feature that is 1 when the value is the same as the instance being explained.
Methods
explain_instance(data_row, predict_fn[, …])
    Generates explanations for a prediction.
convert_and_round
explain_instance(data_row, predict_fn, labels=(1, ), top_labels=None, num_features=10, num_samples=5000, distance_metric='euclidean', model_regressor=None)¶
Generates explanations for a prediction.
First, we generate neighborhood data by randomly perturbing features from the instance (see __data_inverse). We then learn locally weighted linear models on this neighborhood data to explain each of the classes in an interpretable way (see lime_base.py).
Args:
data_row: 1d numpy array, corresponding to a row
predict_fn: prediction function. For classifiers, this should be a function that takes a numpy array and outputs prediction probabilities. For regressors, this takes a numpy array and returns the predictions. For ScikitClassifiers, this is classifier.predict_proba(). For ScikitRegressors, this is regressor.predict().
labels: iterable with labels to be explained.
top_labels: if not None, ignore labels and produce explanations for the K labels with highest prediction probabilities, where K is this parameter.
num_features: maximum number of features present in explanation
num_samples: size of the neighborhood to learn the linear model
distance_metric: the distance metric to use for weights.
model_regressor: sklearn regressor to use in explanation. Defaults to Ridge regression in LimeBase. Must have model_regressor.coef_ and 'sample_weight' as a parameter to model_regressor.fit()
Returns:
An Explanation object (see explanation.py) with the corresponding explanations.
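The neighborhood-sampling procedure described above can be sketched in a few lines. This is a toy stand-in for the lime library, not its actual code; the predict_fn, sampling scale, and kernel width here are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

def predict_fn(X):
    # stand-in black box: a smooth nonlinear function of two features
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x = np.array([0.5, 1.0])                            # instance to explain
neighborhood = x + rng.normal(scale=0.1, size=(5000, 2))

# weight neighbors by an exponential kernel on distance to x
distances = np.linalg.norm(neighborhood - x, axis=1)
kernel_width = 0.25
weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# the locally weighted linear surrogate is the explanation
surrogate = Ridge(alpha=1.0)
surrogate.fit(neighborhood, predict_fn(neighborhood), sample_weight=weights)
# surrogate.coef_ approximates the local gradient, roughly [cos(0.5), 2.0]
```

The coefficients of the weighted linear model are locally faithful near x even though a single linear model could never describe predict_fn globally.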

DNNs: DeepInterpreter¶
Helps interpret deep neural network models by computing the relevance/attribution of the output prediction of a deep network to its input features. The intention is to understand the input-output behavior of the complex network in terms of the most relevant contributing features.
Defining relevance (also known as attribution or contribution): let X = \([x_1, x_2, ... x_n] \in R^{n}\) be an input to a deep neural network F trained for binary classification (\(F(x) \mapsto [0, 1]\)). The goal of a relevance/attribution method is to compute the contribution score of each input feature \(x_{i}\) to the output prediction. For example, for an image classification network where each input \(x_{i}\) is a pixel of the image, the attribution scores \((a_1, ..., a_n) \in R^{n}\) tell us which pixels of the image contributed to the selection of the particular class label.

class skater.core.local_interpretation.dnni.deep_interpreter.DeepInterpreter(graph=None, session=None, log_level=30)¶
:: Experimental :: The implementation is currently experimental and might change in the future.
Interpreter for inferring Deep Learning models. Given a trained NN model and an input vector X, DeepInterpreter provides relevance scores w.r.t. a target class to analyze the features contributing most to the estimator's decision for or against that class.
Frameworks supported: TensorFlow (>=1.4.0) and Keras (>=2.0.8)
Parameters:
graph : tensorflow.Graph instance
session : tensorflow.Session used to execute the graph (default session: tf.get_default_session())
log_level : int (default: _WARNING)
    The log_level can be adjusted to other values; see ./skater/util/logger.py
References
[1] Ancona M, Ceolini E, Öztireli C, Gross M (ICLR, 2018). Towards better understanding of gradient-based attribution methods for Deep Neural Networks. https://arxiv.org/abs/1711.06104 (https://github.com/marcoancona/DeepExplain/blob/master/deepexplain/tensorflow/methods.py)
Methods
explain(relevance_type, output_tensor, …)
    Computes relevance scores for DNNs to understand the input-output behavior of the network.
explain(relevance_type, output_tensor, input_tensor, samples, use_case=None, **kwargs)¶
Computes relevance scores for DNNs to understand the input-output behavior of the network.
Parameters:
relevance_type: str
    Currently, relevance scores can be computed using eLRP ('elrp') or Integrated Gradients ('ig'). Other algorithms are under development.
    epsilon-LRP ('elrp'): recommended with the activation ops 'ReLU' and 'Tanh'. The current implementation of LRP works only for images and uses epsilon (default: 0.0001) as a stabilizer.
    Integrated Gradients ('ig'): recommended with the activation ops 'Relu', 'Elu', 'Softplus', 'Tanh', 'Sigmoid'. It works for images and text. Optional parameters include steps (default: 100) and baseline (default: {'image': a black image}; {'txt': zero input embedding vector}). The gradient is computed by varying the input along the straight path from the baseline (x') to the provided input (x), with x, x' ∈ R^n.
output_tensor: tensorflow.python.framework.ops.Tensor
    Specify the output layer to start from
input_tensor: tensorflow.python.framework.ops.Tensor
    Specify the input layer to reach
samples: numpy.array
    Batch of inputs for which explanations are desired. Note: the first dimension of the array specifies the batch size. For example,
    for an image input of batch size 2: (2, 150, 150, 3) <batch_size, image_width, image_height, no_of_channels>
    for a text input of batch size 1: (1, 80) <batch_size, embedding_dimensions>
use_case: str
    Options: 'image' or 'txt'
kwargs: optional
Returns:
result: numpy.ndarray
    Computed relevance (contribution) score for the given input
References
[1] Bach S, Binder A, Montavon G, Klauschen F, Müller KR, Samek W (2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE 10(7): e0130140. https://doi.org/10.1371/journal.pone.0130140
[2] Sundararajan M, Taly A, Yan Q (ICML, 2017). Axiomatic Attribution for Deep Networks. http://arxiv.org/abs/1703.01365
[3] Ancona M, Ceolini E, Öztireli C, Gross M (ICLR, 2018). Towards better understanding of gradient-based attribution methods for Deep Neural Networks. https://arxiv.org/abs/1711.06104
Examples
>>> from skater.core.local_interpretation.dnni.deep_interpreter import DeepInterpreter
>>> ...
>>> import keras
>>> from keras.datasets import mnist
>>> from keras.models import Sequential, Model, load_model, model_from_yaml
>>> from keras.layers import Dense, Dropout, Flatten, Activation
>>> from keras.layers import Conv2D, MaxPooling2D
>>> from keras import backend as K
>>> import tensorflow as tf
>>> import matplotlib.pyplot as plt
>>> sess = tf.Session()
>>> K.set_session(sess)
>>> ...  # Load dataset
>>> # A simple network for the MNIST dataset using Keras
>>> model = Sequential()
>>> model.add(Conv2D(32, kernel_size=(3, 3),
>>>           activation='relu',
>>>           input_shape=input_shape))
>>> model.add(Conv2D(64, (3, 3), activation='relu'))
>>> model.add(MaxPooling2D(pool_size=(2, 2)))
>>> model.add(Dropout(0.25))
>>> model.add(Flatten())
>>> model.add(Dense(128, activation='relu'))
>>> model.add(Dropout(0.5))
>>> model.add(Dense(num_classes))
>>> model.add(Activation('softmax'))
>>> ...  # Compile and train the model
>>> K.set_learning_phase(0)
>>> with DeepInterpreter(session=K.get_session()) as di:
>>>     # 1. Load the persisted model
>>>     # 2. Retrieve the input tensor from the loaded model
>>>     yaml_file = open('model_sample.yaml', 'r')
>>>     loaded_model_yaml = yaml_file.read()
>>>     yaml_file.close()
>>>     loaded_model = model_from_yaml(loaded_model_yaml)
>>>     # load weights into the new model
>>>     loaded_model.load_weights("model_mnist_cnn_3.h5")
>>>     print("Loaded model from disk")
>>>     input_tensor = loaded_model.layers[0].input
>>>     output_tensor = loaded_model.layers[2].output
>>>     # 3. We will be using the last dense layer (pre-softmax) as the output layer
>>>     # 4. Instantiate a model with the new input and output tensors
>>>     new_model = Model(inputs=input_tensor, outputs=output_tensor)
>>>     target_tensor = new_model(input_tensor)
>>>     xs = input_x
>>>     ys = input_y
>>>     print("X shape: {}".format(xs.shape))
>>>     print("Y shape: {}".format(ys.shape))
>>>     # Original predictions
>>>     print(loaded_model.predict_classes(xs))
>>>     relevance_scores = di.explain('elrp', output_tensor=target_tensor * ys, input_tensor=input_tensor,
>>>                                   samples=xs, use_case='image')
DNNs: Layer-wise Relevance Propagation (eLRP)¶

class skater.core.local_interpretation.dnni.gradient_relevance_scorer.LRP(output_tensor, input_tensor, samples, session, epsilon=0.0001)¶
LRP is a technique to decompose the prediction (output) of a deep neural network (DNN) by computing relevance at each layer in a backward pass. The current implementation computes relevance via backpropagation, applying the chain rule on a modified gradient function. LRP can be implemented in different ways; this version implements epsilon-LRP (Eq. (58) in [1], or Eq. (2) in [2]). Epsilon acts as a numerical stabilizer.
References
[1] Bach S, Binder A, Montavon G, Klauschen F, Müller KR, Samek W (2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE 10(7): e0130140. https://doi.org/10.1371/journal.pone.0130140
[2] Ancona M, Ceolini E, Öztireli C, Gross M (ICLR, 2018). Towards better understanding of gradient-based attribution methods for Deep Neural Networks.
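The epsilon rule can be sketched for a tiny fully connected ReLU network in plain numpy. This is a toy illustration with made-up weights, not the TensorFlow-based implementation above: each layer's relevance is redistributed to the layer below in proportion to each input's contribution to the stabilized pre-activation.

```python
import numpy as np

eps = 1e-4

def lrp_layer(a, W, z, R_out):
    # redistribute relevance R_out through one linear layer:
    # a: layer inputs, W: weights, z: pre-activations
    z_stab = z + eps * np.sign(z)        # epsilon stabilizer
    s = R_out / z_stab
    return a * (W @ s)                   # R_j = a_j * sum_k w_jk * s_k

# made-up 3-2-1 network with ReLU hidden units and no biases
W1 = np.array([[1.0, -0.5],
               [0.5,  1.0],
               [-1.0, 0.5]])
W2 = np.array([[1.0],
               [0.5]])
x = np.array([1.0, 2.0, 0.5])

z1 = x @ W1
a1 = np.maximum(z1, 0.0)                 # ReLU
z2 = a1 @ W2                             # network output score

R_hidden = lrp_layer(a1, W2, z2, z2)     # start with the output as its own relevance
R_input = lrp_layer(x, W1, z1, R_hidden)
# with no biases, relevance is (approximately) conserved: R_input.sum() ≈ z2
```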
DNNs: Integrated Gradient¶

class skater.core.local_interpretation.dnni.gradient_relevance_scorer.IntegratedGradients(output_tensor, input_tensor, samples, session, steps=100, baseline=None)¶
Integrated Gradients is a relevance scoring algorithm for deep networks that attributes the final prediction to the network's input features. The algorithm satisfies two fundamental axioms of relevance/attribution computation:
1. Sensitivity: for every input and baseline, if a change in one feature causes the prediction to change, then that feature should have a non-zero relevance score.
2. Implementation Invariance: computed relevance (attribution) should be identical for functionally equivalent networks.
References
[1] Sundararajan M, Taly A, Yan Q (ICML, 2017).
[2] Ancona M, Ceolini E, Öztireli C, Gross M (ICLR, 2018).
[3] Taly A (2017). http://theory.stanford.edu/~ataly/Talks/sri_attribution_talk_jun_2017.pdf
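The path-integral idea can be sketched in numpy for a function whose gradient is known analytically. This is a toy illustration; the real implementation obtains gradients from the TensorFlow graph. Here f(v) = sum(v**2), so its gradient is 2v, and the attributions satisfy the completeness property: they sum to f(x) - f(baseline).

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=100):
    # average the gradient along the straight-line path from baseline to x
    alphas = (np.arange(steps) + 0.5) / steps       # midpoint Riemann sum
    grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

grad_fn = lambda v: 2.0 * v                          # gradient of f(v) = sum(v**2)
x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline, grad_fn)
# completeness: attributions sum to f(x) - f(baseline)
```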
DNNs: Occlusion¶

class skater.core.local_interpretation.dnni.perturbation_relevance_scorer.Occlusion(output_tensor, input_tensor, samples, current_session, **kwargs)¶
Occlusion is a perturbation-based inference algorithm. Such algorithms directly compute the relevance/attribution of the input features \((X_{i})\) by systematically occluding different portions of the image (by removing, masking, or altering them), running a forward pass on the new input to produce a new output, and measuring the difference between the original and new outputs. Perturbation-based interpretation gives a direct estimate of the marginal effect of a feature, but the inference can be computationally expensive depending on the cardinality of the feature space. The baseline value used while perturbing the feature space can be set to 0, as explained in detail by Zeiler & Fergus, 2014 [2].
References
[1] Ancona M, Ceolini E, Öztireli C, Gross M (ICLR, 2018).
[2] Zeiler M, Fergus R (Springer, 2014). Visualizing and understanding convolutional networks.
[3] https://github.com/marcoancona/DeepExplain/blob/master/deepexplain/tensorflow/methods.py
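The occlude-and-measure procedure can be sketched in numpy. This is a toy illustration, not Skater's implementation: the "model" here is made up and only reads a fixed 2x2 region of the image, so only that region should accumulate relevance.

```python
import numpy as np

def model(image):
    # made-up "model": only a fixed 2x2 region of the image matters
    return image[2:4, 2:4].sum()

image = np.ones((6, 6))
patch = 2
base = model(image)
relevance = np.zeros_like(image)

# slide the occluding patch over the image, zero it out, and accumulate
# the drop in the model's output over every pixel the patch covered
for i in range(image.shape[0] - patch + 1):
    for j in range(image.shape[1] - patch + 1):
        occluded = image.copy()
        occluded[i:i + patch, j:j + patch] = 0.0
        relevance[i:i + patch, j:j + patch] += base - model(occluded)
# pixels the model actually uses accumulate positive relevance; others stay 0
```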
Global And Local Interpretations¶
Tree Surrogates (using Decision Trees)¶

class skater.core.global_interpretation.tree_surrogate.TreeSurrogate(oracle=None, splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, seed=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight='balanced', presort=False, impurity_threshold=0.01)¶
:: Experimental :: The implementation is currently experimental and might change in the future.
The idea of using TreeSurrogates as a means of explaining a model's (the Oracle's, or base model's) learned decision policies (for inductive learning tasks) is inspired by the work of Mark W. Craven, described as the TREPAN algorithm. In this explanation-learning hypothesis, the base estimator (Oracle) can be any form of supervised learning predictive model. The explanations are approximated using decision trees (for both classification and regression) by learning decision boundaries similar to those learned by the Oracle (predictions from the base model are used for learning the decision-tree representation). The implementation also generates a fidelity score to quantify the tree-based surrogate model's approximation of the Oracle. Ideally, the score should be 0 for a truthful explanation, both globally and locally.
Parameters:  oracle : InMemory instance type
model instance having access to the base estimator(InMemory/DeployedModel). Currently, only InMemory is supported.
splitter : str (default="best")
    Strategy used to split at each node. Supported strategies: "best" or "random".
 max_depth : int (default=None)
Defines the maximum depth of a tree. If ‘None’ then nodes are expanded till all leaves are pure or contain less than min_samples_split samples. Deeper trees are prone to be more expensive and tend to overfit. Pruning is a technique which could be applied to avoid overfitting.
 min_samples_split : int/float (default=2)
Defines the minimum number of samples required to split an internal node:
 int, specifies the minimum number of samples
 float, then represents a percentage. Minimum number of samples is computed as ceil(min_samples_split*n_samples)
min_samples_leaf : int/float (default=1)
    Defines the requirement for a leaf node. The minimum number of samples needed for a node to be a leaf:
    int, specifies the minimum number of samples
    float, represents a percentage; the minimum number of samples is computed as ceil(min_samples_leaf*n_samples)
 min_weight_fraction_leaf : float (default=0.0)
Defines requirement for a leaf node. The minimum weight percentage of the sum total of the weights of all input samples.
 max_features : int, float, string or None (default=None)
Defines number of features to consider for the best possible split:
 None, all specified features are used (oracle.feature_names)
 int, uses specified values as max_features at each split.
 float, as a percentage. Value for split is computed as int(max_features * n_features).
 “auto”, max_features=sqrt(n_features).
 “sqrt”, max_features=sqrt(n_features).
 “log2”, max_features=log2(n_features).
 seed : int, (default=None)
seed for random number generator
max_leaf_nodes : int or None (default=None)
    TreeSurrogates are constructed top-down in a best-first manner (best decrease in relative impurity). If None, results in the maximum possible number of leaf nodes, which tends to overfit.
 min_impurity_decrease : float (default=0.0)
Tree node is considered for splitting if relative decrease in impurity is >= min_impurity_decrease.
 class_weight : dict, list of dicts, str (“balanced” or None) (default=”balanced”)
Weights associated with classes for handling data imbalance:
 None, all classes have equal weights
 “balanced”, adjusts the class weights automatically. Weights are assigned inversely proportional to class frequencies
n_samples / (n_classes * np.bincount(y))
 presort : bool (default=False)
Sorts the data before building surrogates trees to find the best splits. When dealing with larger datasets, setting it to True might result in increasing computation time because of the pre sorting operation.
impurity_threshold : float (default=0.01)
    Specifies the acceptable disparity between the Oracle and TreeSurrogates. The higher the disparity between the Oracle and the TreeSurrogate, the less faithful the generated explanations.
References
[1] Mark W. Craven (1996). Extracting Comprehensible Models from Trained Neural Networks (http://ftp.cs.wisc.edu/machine-learning/shavlik-group/craven.thesis.pdf)
[2] Mark W. Craven and Jude W. Shavlik (NIPS, 1996). Extracting Tree-Structured Representations of Trained Networks (https://papers.nips.cc/paper/1152-extracting-tree-structured-representations-of-trained-networks.pdf)
[3] DecisionTreeClassifier: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
[4] DecisionTreeRegressor: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Examples
>>> from skater.core.explanations import Interpretation
>>> from skater.model import InMemoryModel
>>> from skater.util.logger import _INFO
>>> interpreter = Interpretation(X_train, feature_names=iris.feature_names)
>>> model_inst = InMemoryModel(clf.predict, examples=X_train, model_type='classifier', unique_values=[0, 1],
>>>                            feature_names=iris.feature_names, target_names=iris.target_names, log_level=_INFO)
>>> # Using the interpreter instance invoke call to the TreeSurrogate
>>> surrogate_explainer = interpreter.tree_surrogate(oracle=model_inst, seed=5)
>>> surrogate_explainer.fit(X_train, y_train, use_oracle=True, prune='post', scorer_type='default')
>>> surrogate_explainer.plot_global_decisions(colors=['coral', 'lightsteelblue', 'darkkhaki'],
>>>                                           file_name='simple_tree_pre.png')
>>> show_in_notebook('simple_tree_pre.png', width=400, height=300)
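The underlying TREPAN idea can also be sketched with scikit-learn alone. This is a hypothetical illustration, not Skater's TreeSurrogate: fit a shallow decision tree on the Oracle's predictions rather than the ground truth, and measure fidelity as agreement between the two models (here 1.0 is ideal, whereas Skater's disparity-based score is ideally 0).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
oracle = RandomForestClassifier(random_state=0).fit(X, y)

# train the interpretable surrogate on the Oracle's predictions, not on y
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, oracle.predict(X))

# fidelity here: how often the shallow tree agrees with the Oracle
fidelity = (surrogate.predict(X) == oracle.predict(X)).mean()
```

The shallow tree can then be inspected directly (depth-3 rules are human-readable) while the fidelity score tells you how much of the Oracle's behavior those rules actually capture.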
Attributes:
oracle : skater.model.local_model.InMemoryModel
    The fitted base model with the prediction function
feature_names : list of str
    Names of the features considered.
estimator_ : DecisionTreeClassifier/DecisionTreeRegressor
    Learned approximate surrogate estimator
estimator_type_ : str
    Estimator type
best_score_ : numpy.float64
    Best score post pre-pruning
scorer_name_ : str
    Cost function used for optimization
Methods
decisions_as_txt([scope, X])
    Retrieve the decision policies as text.
fit(X, Y[, use_oracle, prune, cv, …])
    Learn an approximate representation by constructing a decision tree based on the results retrieved by querying the Oracle (base model).
plot_global_decisions([colors, …])
    Visualizes the decision policies of the surrogate tree.
predict(X[, prob_score])
    Predict for input X.
best_score_
¶ Best score after pre-pruning

decisions_as_txt
(scope='global', X=None)¶ Retrieve the decision policies as text

estimator_
¶ Learned approximate surrogate estimator

estimator_type_
¶ Estimator type

fit
(X, Y, use_oracle=True, prune='post', cv=5, n_iter_search=10, scorer_type='default', n_jobs=1, param_grid=None, impurity_threshold=0.01, verbose=False)¶ Learn an approximate representation by constructing a Decision Tree based on the results retrieved by querying the Oracle (base model). Instances used for training should belong to the base learner’s instance space.
Parameters:  X : numpy.ndarray, pandas.DataFrame
Training input samples
 Y : numpy.ndarray
target values (ground truth)
 use_oracle : bool (default=True)
Using the Oracle helps the surrogate model train on the decision boundaries learned by the base model. The closer the surrogate model is to the Oracle, the more faithful the explanations.
 True: builds a surrogate model against the predictions of the base model (Oracle).
 False: learns an interpretable tree-based model using the supplied training examples and ground truth.
 prune : None, str (default=”post”)
Pruning is a useful technique to control the complexity of the tree (keeping the tree comprehensible and interpretable) without compromising the model’s accuracy. Avoiding large and deep trees also helps prevent overfitting.
 “pre”
Also known as forward/online pruning. This pruning process uses a termination condition (high and low thresholds) to prematurely terminate some of the branches and nodes. Cross-validation is applied to measure the goodness of fit while the tree is pruned.
 “post”
Also known as backward pruning. The pruning process is applied after the construction of the tree, using the specified model parameters. This involves reducing the branches and nodes using a cost function. The current implementation supports cost optimization using the model’s scoring metrics (e.g. r2, log-loss, f1, …).
 cv : int, (default=5)
Randomized cross-validation, used only for pre-pruning right now.
 n_iter_search : int (default=10)
Number of parameter-setting combinations that are sampled for pre-pruning.
 scorer_type : str (default=”default”)
 n_jobs : int (default=1)
Number of jobs to run in parallel.
 param_grid : dict
Dictionary of parameters to specify the termination condition for prepruning.
 impurity_threshold : float (default=0.01)
Specifies the acceptable performance drop when using tree-based surrogates to replicate the decision policies learned by the Oracle.
 verbose : bool (default=False)
Helps control the verbosity.
References
[1] Nikita Patel and Saurabh Upadhyay(2012) Study of Various Decision Tree Pruning Methods with their Empirical Comparison in WEKA (https://pdfs.semanticscholar.org/025b/8c109c38dc115024e97eb0ede5ea873fffdb.pdf)
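As a concrete illustration of post-pruning (this sketch uses scikit-learn's minimal cost-complexity pruning, which is analogous to, but not identical with, what `prune='post'` does internally): grow a full tree, then refit at increasing pruning strengths and keep the tree with the best held-out score.

```python
# Illustrative post-pruning sketch: shrink a fully grown tree while monitoring
# a scoring metric, keeping the best-scoring pruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)

# Refit with increasing ccp_alpha: a larger alpha prunes more aggressively.
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
          for a in path.ccp_alphas]
scores = [tree.score(X_te, y_te) for tree in pruned]

# Keep the tree whose pruned structure scores best on held-out data.
best = pruned[max(range(len(scores)), key=scores.__getitem__)]
```

The pruned tree is never larger than the full one, and the held-out score guards against over-pruning.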

plot_global_decisions
(colors=None, enable_node_id=True, random_state=0, file_name='interpretable_tree.png', show_img=False, fig_size=(20, 8))¶ Visualizes the decision policies of the surrogate tree.

predict
(X, prob_score=False)¶ Predict for input X

scorer_name_
¶ Cost function used for optimization
Bayesian Rule Lists (BRL)¶

class
skater.core.global_interpretation.interpretable_models.brlc.
BRLC
(iterations=30000, pos_sign=1, neg_sign=0, min_rule_len=1, max_rule_len=8, min_support_pos=0.1, min_support_neg=0.1, eta=1.0, n_chains=10, alpha=1, lambda_=10, discretize=True, drop_features=False)¶ :: Experimental :: The implementation is currently experimental and might change in the future
BRLC (Bayesian Rule List Classifier) is a Python wrapper for SBRL (Scalable Bayesian Rule Lists), a generative estimator that builds hierarchical, interpretable decision lists. This Python wrapper extends the work done by Professor Cynthia Rudin, Benjamin Letham, Hongyu Yang, Margo Seltzer and others. For more information, check out the References section below.
Parameters:  iterations : int (default=30000)
number of iterations for each MCMC chain.
 pos_sign : int (default=1)
sign for the positive labels in the “label” column.
 neg_sign : int (default=0)
sign for the negative labels in the “label” column.
 min_rule_len : int (default=1)
minimum cardinality of the rules to be mined from the dataframe.
 max_rule_len : int (default=8)
maximum cardinality of the rules to be mined from the dataframe.
 min_support_pos : float (default=0.1)
a number between 0 and 1, for the minimum percentage support for the positive observations.
 min_support_neg : float (default 0.1)
a number between 0 and 1, for the minimum percentage support for the negative observations.
 eta : int (default=1)
a hyperparameter for the expected cardinality of the rules in the optimal rule list.
 n_chains : int (default=10)
number of chains
 alpha : int (default=1)
a prior pseudo-count for the positive (alpha1) and negative (alpha0) classes. Default values (1, 1)
 lambda_ : int (default=10)
a hyperparameter for the expected length of the rule list.
 discretize : bool (default=True)
apply discretizer to handle continuous features.
 drop_features : bool (default=False)
once continuous features are discretized, use this flag to either retain or drop them from the dataframe.
References
[1] Letham et al. (2015). Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model (https://arxiv.org/abs/1511.01644)
[2] Yang et al. (2016). Scalable Bayesian Rule Lists (https://arxiv.org/abs/1602.08610)
[3] https://github.com/Hongyuy/sbrl-python-wrapper/blob/master/sbrl/C_sbrl.py
Examples
>>> from skater.core.global_interpretation.interpretable_models.brlc import BRLC
>>> import pandas as pd
>>> from sklearn.datasets.mldata import fetch_mldata
>>> input_df = fetch_mldata("diabetes")
...
>>> Xtrain, Xtest, ytrain, ytest = train_test_split(input_df, y, test_size=0.20, random_state=0)
>>> sbrl_model = BRLC(min_rule_len=1, max_rule_len=10, iterations=10000, n_chains=20, drop_features=True)
>>> # Train a model; the discretizer is enabled by default. If you wish to exclude features
>>> # from discretization, exclude them using the undiscretize_feature_list parameter
>>> model = sbrl_model.fit(Xtrain, ytrain, bin_labels="default")
>>> # print the learned model
>>> sbrl_model.print_model()
>>> features_to_discretize = Xtrain.columns
>>> Xtrain_filtered = sbrl_model.discretizer(Xtrain, features_to_discretize, labels_for_bin="default")
>>> predict_scores = sbrl_model.predict_proba(Xtest)
>>> _, y_hat = sbrl_model.predict(Xtest)
>>> # save and reload the model and continue with evaluation
>>> sbrl_model.save_model("model.pkl")
>>> sbrl_model.load_model("model.pkl")
>>> # to access all the learned rules
>>> sbrl_model.access_learned_rules("all")
For a complete example, refer to the rule_lists_continuous_features.ipynb or rule_lists_titanic_dataset.ipynb notebook.
Methods
access_learned_rules([rule_indexes])
Access all learned decision rules.
discretizer(X, column_list[, …])
A discretizer for continuous features
filter_to_be_discretize(clmn_list, unwanted_list)
fit(X, y_true[, n_quantiles, bin_labels, …])
Fit the estimator.
load_model(serialized_model_name)
Load a serialized model
predict([X, prob_score, threshold, pos_label])
Predict the class for input ‘X’. The predicted class is determined by setting a threshold.
predict_proba(X)
Computes possible class probabilities for the input ‘X’
print_model()
Print the decision stumps of the learned estimator
save_model(model_name[, compress])
Persist the model for future use
set_params(params)
Set model hyperparameters
access_learned_rules
(rule_indexes='all')¶ Access all learned decision rules. This is useful for building and developing intuition
Parameters:  rule_indexes: str (default=”all”, retrieves all the rules)
Specify the index of the rules to be retrieved. The index could be set to ‘all’, or a range could be specified, e.g. ‘(1:3)’ will retrieve rules 1 and 2.

discretizer
(X, column_list, no_of_quantiles=None, labels_for_bin=None, precision=3)¶ A discretizer for continuous features
Parameters:  X : pandas.DataFrame
Dataframe containing continuous features
 column_list : list/tuple
 no_of_quantiles : int or list
Number of quantiles, e.g. deciles (10), quartiles (4), or a list of quantiles [0, .25, .5, .75, 1.]. If None, [0, .25, .5, .75, 1.] is used.
 labels_for_bin : labels for the resulting bins
 precision : int
precision for storing and creating bins
Returns:  new_X: pandas.DataFrame
Contains discretized features
Examples
>>> sbrl_model = BRLC(min_rule_len=1, max_rule_len=10, iterations=10000, n_chains=20, drop_features=True)
>>> ...
>>> features_to_discretize = Xtrain.columns
>>> Xtrain_discretized = sbrl_model.discretizer(Xtrain, features_to_discretize, labels_for_bin="default")
>>> predict_scores = sbrl_model.predict_proba(Xtrain_discretized)

fit
(X, y_true, n_quantiles=None, bin_labels='default', undiscretize_feature_list=None, precision=3)¶ Fit the estimator.
Parameters:  X : pandas.DataFrame object, that could be used by the model for training.
It must not have a column named ‘label’
y_true : pandas.Series, 1D array to store ground truth labels
Returns:  SBRL model instance: rpy2.robjects.vectors.ListVector
Examples
>>> from skater.core.global_interpretation.interpretable_models.brlc import BRLC
>>> sbrl_model = BRLC(min_rule_len=1, max_rule_len=10, iterations=10000, n_chains=20, drop_features=True)
>>> # Train a model; the discretizer is enabled by default. If you wish to exclude features
>>> # from discretization, exclude them using the undiscretize_feature_list parameter
>>> model = sbrl_model.fit(Xtrain, ytrain, bin_labels="default")

load_model
(serialized_model_name)¶ Load a serialized model

predict
(X=None, prob_score=None, threshold=0.5, pos_label=1)¶ Predict the class for input ‘X’. The predicted class is determined by setting a threshold. Adjust the threshold to balance between sensitivity and specificity.
Parameters:  X: pandas.DataFrame
input examples to be scored
 prob_score: pandas.DataFrame or None (default=None)
If set to None, predict_proba is called before computing the class labels. If you have access to probability scores already, use the dataframe of probability scores to compute the final class label
 threshold: float (default=0.5)
 pos_label: int (default=1)
specify how to identify positive label
Returns:  y_prob, y_prob[‘label’]: pandas.Series, numpy.ndarray
Contains the probability score for the input ‘X’
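The thresholding step itself can be sketched as follows (the probability scores below are hypothetical, and this is not skater's internal code): positive-class scores are compared against the cutoff, and lowering the cutoff labels more examples positive, trading specificity for sensitivity.

```python
# Sketch of threshold-based class assignment from probability scores.
import numpy as np

prob_pos = np.array([0.10, 0.45, 0.55, 0.90])   # hypothetical P(class == pos_label)

threshold, pos_label, neg_label = 0.5, 1, 0
y_hat = np.where(prob_pos > threshold, pos_label, neg_label)

# A lower threshold flags more examples as positive (higher sensitivity,
# lower specificity).
y_hat_sensitive = np.where(prob_pos > 0.3, pos_label, neg_label)
```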

predict_proba
(X)¶ Computes possible class probabilities for the input ‘X’
Parameters:  X: pandas.DataFrame object
Returns:  pandas.DataFrame of shape (#datapoints, 2), the possible probability of each class for each observation

print_model
()¶ print the decision stumps of the learned estimator

save_model
(model_name, compress=True)¶ Persist the model for future use

set_params
(params)¶ Set model hyperparameters

class
skater.core.global_interpretation.interpretable_models.bigdatabrlc.
BigDataBRLC
(sub_sample_percentage=0.1, iterations=30000, pos_sign=1, neg_sign=0, min_rule_len=1, max_rule_len=8, min_support_pos=0.1, min_support_neg=0.1, eta=1.0, n_chains=10, alpha=1, lambda_=8, discretize=True, drop_features=False, threshold=0.5, penalty_param_svm=0.01, calibration_type='sigmoid', cv_calibration=3, random_state=0, surrogate_estimator='SVM')¶ :: Experimental :: The implementation is currently experimental and might change in future
BigDataBRLC is a BRLC for handling large datasets; it is advisable when the number of input examples > 1k. It approximates large datasets with the help of surrogate (meta-model) estimators. By default it uses a surrogate estimator such as SVC (Support Vector Classifier) or RandomForest to filter the data points closest to the decision boundary. The idea is to identify the minimum training-set size (controlled by the parameter sub_sample_percentage) with the goal of maximizing accuracy. This helps reduce the computation time needed to build the final BRL.
Parameters:  sub_sample_percentage : float (default=0.1)
specify the fraction of the training sample to be retained for training BRL.
 iterations : int (default=30000)
number of iterations for each MCMC chain.
 pos_sign : int (default=1)
sign for the positive labels in the “label” column.
 neg_sign : int (default=0)
sign for the negative labels in the “label” column.
 min_rule_len : int (default=1)
minimum cardinality of the rules to be mined from the dataframe.
 max_rule_len : int (default=8)
maximum cardinality of the rules to be mined from the dataframe.
 min_support_pos : float (default=0.1)
a number between 0 and 1, for the minimum percentage support for the positive observations.
 min_support_neg : float (default 0.1)
a number between 0 and 1, for the minimum percentage support for the negative observations.
 eta : int (default=1)
 n_chains: int (default=10)
 alpha : int (default=1)
a prior pseudo-count for the positive (alpha1) and negative (alpha0) classes. Default values (1, 1)
 lambda_ : int (default=8)
a hyperparameter for the expected length of the rule list.
 discretize : bool (default=True)
apply discretizer to handle continuous features.
 drop_features : bool (default=False)
once continuous features are discretized, use this flag to either retain or drop them from the dataframe
 threshold : float (default=0.5)
specify the threshold for the decision boundary. This is the probability level to compute distance of the predictions(for input examples) from the decision boundary. Input examples closest to the decision boundary are subsampled. Size of subsampled data is controlled using ‘sub_sample_percentage’.
 penalty_param_svm : float (default=0.01)
Regularization parameter (‘C’) for the Linear Support Vector Classifier. A lower regularization value forces the optimizer to maximize the margin of the separating hyperplane.
Reference: https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel
 calibration_type : string (default=’sigmoid’)
Calibrate the base estimator’s predictions (currently, all the base estimators are calibrated; that might change in the future with more experimentation). Calibration can be performed in 2 ways: 1. a parametric approach using Platt Scaling (‘sigmoid’); 2. a non-parametric approach using isotonic regression (‘isotonic’). Avoid using isotonic regression for input examples << 1k because it tends to overfit.
References:
[1] A. Niculescu-Mizil & R. Caruana (ICML 2005). Predicting Good Probabilities With Supervised Learning
[2] https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
[3] http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/
 cv_calibration : int (default=3)
specify the number of folds for the cross-validation splitting strategy
 random_state: int (default=0)
 surrogate_estimator: string (default=’SVM’; ‘RF’: RandomForest)
Surrogate model used to build the initial model for handling large datasets. Currently, SVM and RandomForest are supported.
References
[1] Dr. Tamas Madl, https://github.com/tmadl/sklearn-expertsys/blob/master/BigDataRuleListClassifier.py
[2] https://pdfs.semanticscholar.org/e44c/9dcf90d5a9a7e74a1d74c9900ff69142c67f.pdf
[3] Surrogate model: https://en.wikipedia.org/wiki/Surrogate_model
[4] W. Andrew Pruett & Robert L. Hester (2016). The Creation of Surrogate Models for Fast Estimation of Complex Model Outcomes (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0156574)
Examples
>>> from skater.core.global_interpretation.interpretable_models.brlc import BRLC
>>> from skater.core.global_interpretation.interpretable_models.bigdatabrlc import BigDataBRLC
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
...
>>> Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
>>> input_df = pd.read_csv('input_data.csv', skiprows=1)
>>> sbrl_big = BigDataBRLC(sub_sample_percentage=0.1, min_rule_len=1, max_rule_len=3, iterations=10000,
...                        n_chains=3, surrogate_estimator="SVM", drop_features=True)
>>> n_x, n_y = sbrl_big.subsample(Xtrain, ytrain, pos_label=1)
>>> model = sbrl_big.fit(n_x, n_y, bin_labels='default')
For a complete example, refer to the credit_analysis_rule_lists.ipynb notebook in the `examples` section.
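The calibration of the surrogate estimator described above maps roughly onto scikit-learn's `CalibratedClassifierCV`. The sketch below (illustrative, not skater's code; the dataset is synthetic) mirrors `calibration_type='sigmoid'` (Platt scaling) applied to a linear SVM with a small regularization parameter.

```python
# Sketch of Platt-scaling calibration of an SVM's decision scores.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)

base = LinearSVC(C=0.01)                     # analogous to penalty_param_svm
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3).fit(X, y)

# Calibrated per-class probabilities, usable for boundary-distance filtering.
proba = calibrated.predict_proba(X)
```

`method="isotonic"` would give the non-parametric alternative, which the docs above advise against for small sample sizes.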
Methods
access_learned_rules([rule_indexes])
Access all learned decision rules.
discretizer(X, column_list[, …])
A discretizer for continuous features
filter_to_be_discretize(clmn_list, unwanted_list)
fit(X, y_true[, n_quantiles, bin_labels, …])
Fit the estimator.
load_model(serialized_model_name)
Load a serialized model
predict([X, prob_score, threshold, pos_label])
Predict the class for input ‘X’. The predicted class is determined by setting a threshold.
predict_proba(X)
Computes possible class probabilities for the input ‘X’
print_model()
Print the decision stumps of the learned estimator
save_model(model_name[, compress])
Persist the model for future use
set_params(params)
Set model hyperparameters
subsample(X, y[, pos_label, neg_label])
Subsampler to filter the input examples closest to the decision boundary
access_learned_rules
(rule_indexes='all')¶ Access all learned decision rules. This is useful for building and developing intuition
Parameters:  rule_indexes: str (default=”all”, retrieves all the rules)
Specify the index of the rules to be retrieved. The index could be set to ‘all’, or a range could be specified, e.g. ‘(1:3)’ will retrieve rules 1 and 2.

discretizer
(X, column_list, no_of_quantiles=None, labels_for_bin=None, precision=3)¶ A discretizer for continuous features
Parameters:  X : pandas.DataFrame
Dataframe containing continuous features
 column_list : list/tuple
 no_of_quantiles : int or list
Number of quantiles, e.g. deciles (10), quartiles (4), or a list of quantiles [0, .25, .5, .75, 1.]. If None, [0, .25, .5, .75, 1.] is used.
 labels_for_bin : labels for the resulting bins
 precision : int
precision for storing and creating bins
Returns:  new_X: pandas.DataFrame
Contains discretized features
Examples
>>> sbrl_model = BRLC(min_rule_len=1, max_rule_len=10, iterations=10000, n_chains=20, drop_features=True)
>>> ...
>>> features_to_discretize = Xtrain.columns
>>> Xtrain_discretized = sbrl_model.discretizer(Xtrain, features_to_discretize, labels_for_bin="default")
>>> predict_scores = sbrl_model.predict_proba(Xtrain_discretized)

fit
(X, y_true, n_quantiles=None, bin_labels='default', undiscretize_feature_list=None, precision=3)¶ Fit the estimator.
Parameters:  X : pandas.DataFrame object, that could be used by the model for training.
It must not have a column named ‘label’
y_true : pandas.Series, 1D array to store ground truth labels
Returns:  SBRL model instance: rpy2.robjects.vectors.ListVector
Examples
>>> from skater.core.global_interpretation.interpretable_models.brlc import BRLC
>>> sbrl_model = BRLC(min_rule_len=1, max_rule_len=10, iterations=10000, n_chains=20, drop_features=True)
>>> # Train a model; the discretizer is enabled by default. If you wish to exclude features
>>> # from discretization, exclude them using the undiscretize_feature_list parameter
>>> model = sbrl_model.fit(Xtrain, ytrain, bin_labels="default")

load_model
(serialized_model_name)¶ Load a serialized model

predict
(X=None, prob_score=None, threshold=0.5, pos_label=1)¶ Predict the class for input ‘X’. The predicted class is determined by setting a threshold. Adjust the threshold to balance between sensitivity and specificity.
Parameters:  X: pandas.DataFrame
input examples to be scored
 prob_score: pandas.DataFrame or None (default=None)
If set to None, predict_proba is called before computing the class labels. If you have access to probability scores already, use the dataframe of probability scores to compute the final class label
 threshold: float (default=0.5)
 pos_label: int (default=1)
specify how to identify positive label
Returns:  y_prob, y_prob[‘label’]: pandas.Series, numpy.ndarray
Contains the probability score for the input ‘X’

predict_proba
(X)¶ Computes possible class probabilities for the input ‘X’
Parameters:  X: pandas.DataFrame object
Returns:  pandas.DataFrame of shape (#datapoints, 2), the possible probability of each class for each observation

print_model
()¶ print the decision stumps of the learned estimator

save_model
(model_name, compress=True)¶ Persist the model for future use

set_params
(params)¶ Set model hyperparameters

subsample
(X, y, pos_label=1, neg_label=0)¶ Subsampler to filter the input examples closest to the decision boundary
Parameters:  X : pandas.DataFrame
input examples representing the training set
 y : pandas.DataFrame
target labels associated with the training set
 pos_label : int
 neg_label : int
Returns:  X_, y_ : pandas.DataFrame
 subsampled input examples
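The boundary-subsampling idea can be sketched as follows (a conceptual illustration, not skater's exact routine; the surrogate here is a logistic regression on synthetic data): score every example with the surrogate, measure its distance from the decision boundary (probability 0.5), and keep only the closest fraction.

```python
# Conceptual sketch of boundary subsampling: retain the sub_sample_percentage
# of examples whose predicted probability lies closest to the 0.5 cutoff.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
X, y = pd.DataFrame(X), pd.Series(y)

clf = LogisticRegression(max_iter=1000).fit(X, y)      # surrogate estimator
dist = np.abs(clf.predict_proba(X)[:, 1] - 0.5)        # distance from boundary

sub_sample_percentage = 0.1
keep = np.argsort(dist)[: int(len(X) * sub_sample_percentage)]
X_, y_ = X.iloc[keep], y.iloc[keep]                    # hardest 10% of examples
```

The retained examples carry most of the information about the decision boundary, so the final BRL can be trained on far fewer rows.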