Uplifting¶
Setup¶
pip install ydf -U
What is Uplifting?¶
Uplift modeling is a statistical modeling technique to predict the incremental impact of an action on a subject. The action is often referred to as a treatment that may or may not be applied.
Uplift modeling is often used in targeted marketing campaigns to predict the increase in the likelihood of a person making a purchase (or taking any other desired action) based on the marketing exposure they receive.
For example, uplift modeling can predict the effect of an email. The effect is defined as the difference between two conditional probabilities \begin{align} \text{effect}(\text{email}) = &\Pr(\text{outcome}=\text{purchase}\ \vert\ \text{treatment}=\text{with email})\\ &- \Pr(\text{outcome}=\text{purchase} \ \vert\ \text{treatment}=\text{no email}), \end{align} where $\Pr(\text{outcome}=\text{purchase}\ \vert\ ...)$ is the probability of a purchase given whether or not the customer received an email.
Compare this to a classification model: With a classification model, one can predict the probability of a purchase. However, customers with a high probability are likely to spend money in the store regardless of whether or not they received an email.
Similarly, one can use numerical uplifting to predict the numerical increase in spend when receiving an email. In comparison, a regression model can only predict the expected spend, which is a less useful metric in many cases.
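As a rough illustration of the effect formula above, the following sketch estimates the average effect of an email from a small, made-up table of customers. The data and the column names are hypothetical and chosen only for this example. Note that this is a population-level average, whereas an uplift model estimates the effect for each example given its features.

```python
import pandas as pd

# Hypothetical toy data (not from any real campaign):
# treatment: 1 = email sent, 0 = no email
# outcome:   1 = purchase,   0 = no purchase
toy = pd.DataFrame({
    "treatment": [1, 1, 1, 1, 0, 0, 0, 0],
    "outcome":   [1, 1, 1, 0, 1, 0, 0, 0],
})

# Pr(outcome = purchase | treatment) for each treatment group.
purchase_rate = toy.groupby("treatment")["outcome"].mean()

# effect(email) = Pr(purchase | with email) - Pr(purchase | no email)
effect = purchase_rate.loc[1] - purchase_rate.loc[0]

print(f"Pr(purchase | with email) = {purchase_rate.loc[1]:.2f}")
print(f"Pr(purchase | no email)   = {purchase_rate.loc[0]:.2f}")
print(f"effect(email)             = {effect:.2f}")
```

In this toy table, a classifier would assign a high purchase probability to the emailed group, but only the difference between the two groups measures what the email itself contributed.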
Defining uplift models in YDF¶
YDF expects uplifting datasets to be presented in a "flat" format, with one row per example. A dataset of customers might look like this:
treatment | outcome | feature_1 | feature_2 |
---|---|---|---|
0 | 1 | 0.1 | blue |
0 | 0 | 0.2 | blue |
1 | 1 | 0.3 | blue |
1 | 1 | 0.4 | blue |
The treatment is a binary variable indicating whether or not the example has received treatment. In the above example, the treatment indicates if the customer has received an email or not. The outcome (label) indicates the status of the example after receiving the treatment (or not). YDF supports categorical outcomes for categorical uplifting and numerical outcomes for numerical uplifting.
Note: Uplifting is also frequently used in medical contexts. Here the treatment can be a medical treatment (e.g. administering a vaccine), and the label can be an indicator of quality of life (e.g. whether the patient got sick). This also explains the nomenclature of uplift modeling.
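The flat layout described above can be reproduced as a Pandas DataFrame. This is a minimal sketch that copies the values from the example table; the sanity checks are an illustration and not something YDF requires you to run.

```python
import pandas as pd

# The example table from above, one row per customer.
flat_ds = pd.DataFrame({
    "treatment": [0, 0, 1, 1],          # 1 = received the email
    "outcome":   [1, 0, 1, 1],          # 1 = made a purchase
    "feature_1": [0.1, 0.2, 0.3, 0.4],  # numerical feature
    "feature_2": ["blue", "blue", "blue", "blue"],  # categorical feature
})

# Illustrative sanity checks: the treatment column is binary and both
# the treated and the control group are present.
treatment_values = set(flat_ds["treatment"].unique())
n_treated = int((flat_ds["treatment"] == 1).sum())
n_control = int((flat_ds["treatment"] == 0).sum())

print(flat_ds)
print(f"treated: {n_treated}, control: {n_control}")
```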
Training an Uplifting model¶
In this example, we will use an instance of the Simulations for Personalized Treatment Effects.
# Load libraries
import ydf # Yggdrasil Decision Forests
import pandas as pd # We use Pandas to load small datasets
import numpy as np
# Download and load an uplifting dataset as Pandas DataFrames
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/sim_pte_train.csv")
test_ds = pd.read_csv(f"{ds_path}/sim_pte_test.csv")
# Print the first 5 examples
train_ds.head(5)
| y | treat | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | ... | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2.027911 | 0.278222 | 0.716672 | -1.092175 | -1.353849 | -0.910061 | -1.410070 | -0.150630 | ... | 1.931576 | 0.511000 | -1.618037 | -0.699228 | -0.494174 | 0.196550 | -0.150307 | -0.511604 | -0.995799 | -0.560476 |
1 | 2 | 2 | -1.494750 | -1.602538 | -0.283501 | -1.337542 | -0.579377 | 0.280663 | -1.721265 | 0.800941 | ... | -0.616475 | 1.807993 | 0.379181 | 0.996452 | 1.127593 | 0.650113 | -0.327757 | 0.236938 | -1.039955 | -0.230177 |
2 | 1 | 2 | -1.572949 | -0.320900 | -1.135464 | 1.109242 | -0.861044 | -1.035670 | 0.665445 | -1.186718 | ... | -0.562567 | -1.702615 | 1.902250 | -0.692745 | -1.146950 | 0.671004 | -1.448165 | -0.541589 | -0.017980 | 1.558708 |
3 | 1 | 2 | -0.300212 | -1.226114 | -0.632817 | 0.810701 | 0.972678 | 0.273049 | -0.430807 | 0.430636 | ... | -0.989963 | 0.287449 | 0.601874 | -0.103483 | 1.481019 | -1.284158 | -0.697285 | 1.219228 | -0.132175 | 0.070508 |
4 | 1 | 1 | -0.764373 | -0.776658 | 1.351161 | -0.875981 | 0.619146 | 0.537798 | -0.329039 | 0.216747 | ... | 2.731228 | -0.269114 | 1.732350 | 0.603866 | 0.916191 | -2.026110 | 2.598490 | 0.174136 | -2.549343 | 0.129288 |
5 rows × 22 columns
In this dataset, the treatment (treat) and the outcome (y) are binary variables represented as "1" or "2" (instead of "0" and "1").
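If a 0/1 encoding is more convenient for manual analysis, the columns can be recoded by subtracting one. The sketch below uses the y and treat values of the five rows printed above, and it assumes that "1" corresponds to "0" and "2" corresponds to "1". The training call in this tutorial uses the columns as they are, so this step is optional.

```python
import pandas as pd

# y and treat values of the first five training examples shown above.
sample = pd.DataFrame({
    "y":     [1, 2, 1, 1, 1],
    "treat": [1, 2, 2, 2, 1],
})

# Assumption: 1 -> 0 and 2 -> 1 for both columns.
recoded = sample.assign(
    y=sample["y"] - 1,
    treat=sample["treat"] - 1,
)

print(recoded)
```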
We can train an uplifting model:
model = ydf.RandomForestLearner(
    label="y",
    uplift_treatment="treat",
    task=ydf.Task.CATEGORICAL_UPLIFT,
).train(train_ds)
Train model on 1000 examples
Model trained in 0:00:00.075023
Uplifting models are evaluated using the Qini coefficient (area under the Qini curve) and the AUUC (Area Under the Uplift Curve).
evaluation = model.evaluate(test_ds)
print(evaluation)
QINI: 0.106807
AUUC: 0.120807
num examples: 2000
num examples (weighted): 2000
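To give some intuition for the Qini curve mentioned above, here is a minimal NumPy sketch of one common textbook definition: examples are ranked by predicted uplift, and the curve tracks the cumulative number of positive outcomes in the treated group minus the control count rescaled to the treated group size. The function names `qini_curve` and `qini_area` are hypothetical, the toy inputs are made up, and YDF may use a different normalization, so the values produced here are not expected to match the evaluation output above.

```python
import numpy as np

def qini_curve(uplift_pred, outcome, treatment):
    """Qini curve values after each ranked example (with a leading 0).

    outcome and treatment are expected to be 0/1 arrays.
    """
    # Rank examples from highest to lowest predicted uplift.
    order = np.argsort(-np.asarray(uplift_pred, dtype=float), kind="stable")
    y = np.asarray(outcome, dtype=float)[order]
    t = np.asarray(treatment, dtype=float)[order]

    cum_y_treated = np.cumsum(y * t)        # positive outcomes, treated
    cum_y_control = np.cumsum(y * (1 - t))  # positive outcomes, control
    cum_n_treated = np.cumsum(t)
    cum_n_control = np.cumsum(1 - t)

    # Rescale the control count to the treated group size; use 0 while
    # no control example has been seen yet.
    ratio = np.divide(
        cum_n_treated,
        cum_n_control,
        out=np.zeros_like(cum_n_treated),
        where=cum_n_control > 0,
    )
    curve = cum_y_treated - cum_y_control * ratio
    return np.concatenate([[0.0], curve])

def qini_area(curve):
    """Area between the curve and the straight line of a random ranking,
    with the x-axis rescaled to [0, 1] (trapezoidal rule)."""
    n = len(curve) - 1
    baseline = np.linspace(0.0, curve[-1], n + 1)
    diff = curve - baseline
    return float(np.sum((diff[1:] + diff[:-1]) / 2.0)) / n

# Made-up toy inputs, 0/1 encoded.
outcome = [1, 0, 0, 1]
treatment = [1, 0, 1, 0]
good_ranking = [0.9, 0.8, 0.2, 0.1]  # treated responder ranked first
bad_ranking = [0.1, 0.2, 0.8, 0.9]   # the reverse order

good_area = qini_area(qini_curve(good_ranking, outcome, treatment))
bad_area = qini_area(qini_curve(bad_ranking, outcome, treatment))
print(good_area, bad_area)
```

A ranking that places the examples with a positive treatment response first yields a larger area than the reversed ranking, which is the property the Qini coefficient is meant to capture.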