pip install ydf scikit-learn umap-learn plotly -U -q
Note: you may need to restart the kernel to use updated packages.
import ydf # Yggdrasil Decision Forests
import numpy as np
import pandas as pd # We use Pandas to load small datasets
What are neighbor examples and counterfactual examples?
Neighbor examples are examples that are similar to one another according to the model, that is, examples for which the model gives the same predictions for the same reasons. Looking at the similarities and differences between the neighbors of an example is a great way to understand the model's predictions for this example.
Counterfactual examples are neighbor examples that don't have the same labels as an example of interest. When the model predictions are unexpected, looking at its counterfactual examples is a great way to understand why.
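To make the definition concrete, here is a minimal sketch that selects counterfactual examples from a list of distances and labels. The function name `counterfactual_indices` and the toy values are hypothetical and only illustrate the idea; they are not part of YDF.

```python
def counterfactual_indices(distances, labels, example_label, k=3):
    """Return the indices of the k nearest examples whose label differs.

    distances[i] is the distance between the example of interest and
    candidate i; labels[i] is the label of candidate i.
    """
    # Rank all candidates from nearest to farthest.
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    # Keep only the candidates that carry a different label.
    different = [i for i in ranked if labels[i] != example_label]
    return different[:k]


# Hypothetical toy values, not taken from the Adult dataset.
toy_distances = [0.10, 0.95, 0.20, 0.05, 0.60]
toy_labels = ["<=50K", ">50K", ">50K", "<=50K", ">50K"]

print(counterfactual_indices(toy_distances, toy_labels, "<=50K", k=2))
```

Candidates 3 and 0 are the nearest overall, but they share the label of the example of interest, so they count as plain neighbors rather than counterfactual examples.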
What is an example distance?
The distance between two examples is a numerical value between 0 and 1 indicating how different two examples are. The neighbors of an example of interest are the examples with the smallest distances.
Decision forest models define an implicit measure of proximity or similarity between two examples, referred to as distance. The distance represents how similarly the model treats two examples. Informally, two examples are close if the model assigns them to the same class for the same reasons.
This distance is useful for understanding models and their predictions. For example, we can use it for clustering, manifold learning, or simply to look at the training examples that are nearest to a test example (its neighbor examples, including any counterfactual examples among them). This can help us understand why the model made its predictions.
Keep in mind that a decision forest's distance measure is just one of many reasonable distance metrics on a dataset. One of its advantages is that it allows comparing features on different scales and with different semantics.
In this notebook, we will train a model and use its distance to:
- Find training examples that are neighbors of a test example and use them to explain the model's predictions.
- Map all the examples onto an interactive two-dimensional plot (also known as a 2D manifold) and automatically detect two-dimensional clusters of examples that behave similarly.
- Apply hierarchical clustering to explain how the model works as a whole.
The More You Know: Leo Breiman, the author of the random forest learning algorithm, proposed a method to measure the proximity between two examples using a pre-trained Random Forest (RF) model. He described this method as "[...] one of the most useful tools in random forests." When using Random Forest models, this is the distance used by YDF.
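Breiman's proximity is usually described as the fraction of trees in which two examples reach the same leaf; one minus that fraction gives a distance between 0 and 1. The sketch below illustrates the idea on hypothetical leaf identifiers. It is not YDF's implementation, and `breiman_distance` is an illustrative name.

```python
def breiman_distance(leaves_a, leaves_b):
    """Distance = 1 - fraction of trees where both examples share a leaf.

    leaves_a[t] and leaves_b[t] are the identifiers of the leaves reached
    by the two examples in tree t.
    """
    if len(leaves_a) != len(leaves_b) or not leaves_a:
        raise ValueError("Expected one leaf id per tree for both examples.")
    # Count the trees in which both examples end up in the same leaf.
    same_leaf = sum(a == b for a, b in zip(leaves_a, leaves_b))
    proximity = same_leaf / len(leaves_a)
    return 1.0 - proximity


# Hypothetical leaf ids for a forest of four trees.
example_1 = [3, 7, 2, 5]
example_2 = [3, 1, 2, 8]

# The two examples share a leaf in 2 of the 4 trees.
print(breiman_distance(example_1, example_2))
```

Under this definition, an example is at distance 0 from itself, and two examples that never share a leaf are at distance 1.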
Find closest training examples to a test example
Let's download a classification dataset.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Print the first 5 training examples
train_ds.head(5)
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
We train a random forest on this dataset.
model = ydf.RandomForestLearner(label="income").train(train_ds)
Train model on 22792 examples
Model trained in 0:00:00.922407
We need to select an example to explain. Let's select the first example of the testing dataset.
selected_example_idx = 0 # Change to select another example
selected_example = test_ds[selected_example_idx:(selected_example_idx+1)]
selected_example
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
On this example, the model predicts:
model.predict(selected_example)
array([0.01], dtype=float32)
In other words, the model predicts the negative class <=50K with $1-0.01=99\%$ probability.
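The arithmetic above can be sketched as a small helper. It assumes that the predicted value is the probability of the positive class, that the class order is ("<=50K", ">50K"), and that the decision threshold is 0.5; all three are assumptions for illustration, and `predicted_class` is a hypothetical name.

```python
def predicted_class(positive_probability, classes=("<=50K", ">50K"), threshold=0.5):
    """Map the probability of the positive class to a label and its probability."""
    negative, positive = classes
    if positive_probability >= threshold:
        return positive, positive_probability
    # Below the threshold, the negative class wins with the complementary probability.
    return negative, 1.0 - positive_probability


label, confidence = predicted_class(0.01)
print(label, round(confidence, 2))
```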
Now, we compute the distance between the selected test example and all the training examples.
distances = model.distance(train_ds, selected_example).squeeze()
print("distances:", distances)
distances: [1. 1. 1. ... 0.99333334 0.99666667 1. ]
Let's find the five training examples with the smallest distance to our chosen example.
close_train_idxs = np.argsort(distances)[:5]
print("close_train_idxs:", close_train_idxs)
print("Closest training examples:")
train_ds.iloc[close_train_idxs]
close_train_idxs: [16596 21845 10321 7299 14721]
Closest training examples:
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16596 | 41 | State-gov | 26892 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
21845 | 37 | State-gov | 60227 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 38 | United-States | <=50K |
10321 | 40 | Private | 82161 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
7299 | 30 | State-gov | 158291 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
14721 | 32 | State-gov | 171111 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 37 | United-States | <=50K |
Observations:
- For the chosen example, the model predicted class <=50K. For the five closest examples, the model had the same prediction.
- The closest examples share many feature values, such as education, marital status, occupation, race, and working between 37 and 40 hours per week. This explains why these examples are close to each other.
- The examples' ages range between 30 and 41, meaning the model sees this age range as equivalent for those examples.
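The observations above can also be checked programmatically. The following sketch copies a subset of the columns from the tables above into plain tuples and lists which features every neighbor shares with the selected example. The variable names are illustrative only.

```python
columns = ["age", "workclass", "education", "occupation", "capital_gain", "hours_per_week"]

# Selected test example (row 0 of the testing dataset).
selected = (39, "State-gov", "Bachelors", "Adm-clerical", 2174, 40)

# The five closest training examples, in the order printed above.
neighbors = [
    (41, "State-gov", "Bachelors", "Adm-clerical", 0, 40),  # 16596
    (37, "State-gov", "Bachelors", "Adm-clerical", 0, 38),  # 21845
    (40, "Private", "Bachelors", "Adm-clerical", 0, 40),    # 10321
    (30, "State-gov", "Bachelors", "Adm-clerical", 0, 40),  # 7299
    (32, "State-gov", "Bachelors", "Adm-clerical", 0, 37),  # 14721
]

# A feature is "shared" if every neighbor has the same value as the selected example.
shared = [
    name
    for i, name in enumerate(columns)
    if all(row[i] == selected[i] for row in neighbors)
]
differing = [name for name in columns if name not in shared]

print("shared:", shared)
print("differing:", differing)
```

Features that differ, such as age or capital_gain here, are the ones the model appears to treat as interchangeable within this neighborhood.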
import plotly.graph_objs as go
from plotly.offline import iplot # For interactive plots
import plotly.io as pio
pio.renderers.default="colab"
# Pairwise distance between all testing examples
distances = model.distance(test_ds, test_ds)
# Organize the examples on a 2D manifold.
# Select the method you want to use.
# Using different methods and different parameters will change the projection.
manifold_lib = "UMAP" # "UMAP" or "TSNE"
if manifold_lib == "TSNE":
    from sklearn.manifold import TSNE

    manifold = TSNE(
        # Number of dimensions to display. 3d is also possible.
        n_components=2,
        # Control the shape of the projection. Higher values create more
        # distinct but also more collapsed clusters. Can be in 5-50.
        perplexity=20,
        metric="precomputed",
        init="random",
        verbose=1,
        learning_rate="auto",
    ).fit_transform(distances)
elif manifold_lib == "UMAP":
    import umap

    manifold = umap.UMAP(
        # Number of dimensions to display. 3d is also possible.
        n_components=2,
        # Balance local versus global structure in the data.
        n_neighbors=15,
        metric="precomputed",
    ).fit_transform(distances)
else:
    raise ValueError(f"Unknown lib: {manifold_lib}")
/usr/local/google/home/gbm/my_venv/lib/python3.11/site-packages/umap/umap_.py:1858: UserWarning: using precomputed metric; inverse_transform will be unavailable
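Both projection methods are used here with metric="precomputed", so they expect a square matrix of pairwise distances. Below is a minimal sanity check, assuming the matrix should be symmetric, have a zero diagonal, and contain values between 0 and 1 as described earlier. `check_precomputed` is a hypothetical helper that operates on nested lists; it is not part of YDF, scikit-learn, or UMAP.

```python
def check_precomputed(matrix, tolerance=1e-6):
    """Raise ValueError if `matrix` is not a plausible pairwise distance matrix."""
    n = len(matrix)
    # First make sure the matrix is square, so that matrix[j][i] always exists.
    for row in matrix:
        if len(row) != n:
            raise ValueError("The matrix must be square.")
    for i, row in enumerate(matrix):
        if abs(row[i]) > tolerance:
            raise ValueError(f"Non-zero self-distance at index {i}.")
        for j, value in enumerate(row):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"Distance outside [0, 1] at ({i}, {j}).")
            if abs(value - matrix[j][i]) > tolerance:
                raise ValueError(f"Asymmetric entry at ({i}, {j}).")
    return True


# Hypothetical 3x3 distance matrix.
toy_matrix = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.25],
    [1.0, 0.25, 0.0],
]
print(check_precomputed(toy_matrix))
```

The double loop is quadratic in the number of examples, so on a large dataset it is probably best applied to a small slice of the matrix.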
Let's create an interactive plot with the example features.
def example_to_html(example):
    return "<br>".join([f"<b>{k}:</b> {v}" for k, v in example.items()])


def interactive_plot(dataset, projections):
    colors = (dataset["income"] == ">50K").map(lambda x: ["red", "blue"][x])
    labels = list(dataset.apply(example_to_html, axis=1).values)
    args = {
        "data": [
            go.Scatter(
                x=projections[:, 0],
                y=projections[:, 1],
                text=labels,
                mode="markers",
                marker={"color": colors, "size": 3},
            )
        ],
        "layout": go.Layout(width=500, height=500, template="simple_white"),
    }
    iplot(args)


interactive_plot(test_ds, manifold)