# Uplifting

## Setup

```
pip install ydf -U
```

## What is Uplifting?

Uplift modeling is a statistical modeling technique to predict the **incremental impact of an action** on a subject. The action is often referred to as a **treatment** that may or may not be applied.

Uplift modeling is often used in targeted marketing campaigns to predict the increase in the likelihood of a person making a purchase (or taking any other desired action) based on the marketing exposure they receive.

For example, uplift modeling can predict the **effect** of an email. The effect is defined as the **difference between two conditional probabilities**
\begin{align}
\text{effect}(\text{email}) = &\Pr(\text{outcome}=\text{purchase}\ \vert\ \text{treatment}=\text{with email})\\
&- \Pr(\text{outcome}=\text{purchase} \ \vert\ \text{treatment}=\text{no email}),
\end{align}
where $\Pr(\text{outcome}=\text{purchase}\ \vert\ ...)$ is the probability of a purchase given whether or not the customer received an email.
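As a minimal sketch of this definition, the effect can be estimated from a campaign log by subtracting the purchase rate of customers without an email from the purchase rate of customers with one. The `log` records and the `purchase_rate` helper below are made up for illustration, and the estimate is only meaningful under the assumption that emails were assigned at random.

```python
# Toy campaign log of (received_email, purchased) pairs.
# The values are invented for illustration only.
log = [
    (1, 1), (1, 1), (1, 0), (1, 1),  # treated: 3 of 4 purchased
    (0, 1), (0, 0), (0, 0), (0, 1),  # control: 2 of 4 purchased
]


def purchase_rate(records, treatment):
    """Empirical Pr(outcome = purchase | treatment)."""
    outcomes = [purchased for received, purchased in records if received == treatment]
    return sum(outcomes) / len(outcomes)


p_email = purchase_rate(log, 1)
p_no_email = purchase_rate(log, 0)

# The effect is the difference between the two conditional probabilities.
effect = p_email - p_no_email
print(f"with email: {p_email:.2f}, no email: {p_no_email:.2f}, effect: {effect:.2f}")
```

An uplift model predicts this difference per customer, as a function of the customer's features, rather than as a single average over the whole population.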

Compare this to a classification model: With a classification model, one can predict the probability of a purchase. However, customers with a high probability are likely to spend money in the store regardless of whether or not they received an email.

Similarly, one can use **numerical uplifting** to predict the numerical **increase in spend** when receiving an email. In comparison, a regression model can only predict the expected spend, which is a less useful metric in many cases.
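The difference between expected spend and uplift can be illustrated with a small, hypothetical example. The customer segments, the spend values, and the `mean_spend` helper are assumptions made for this sketch. The "loyal" segment spends the most but does not react to the email, while the "occasional" segment spends less but reacts strongly.

```python
from statistics import mean

# Hypothetical spend records: (segment, received_email, spend).
records = [
    ("loyal", 1, 100.0), ("loyal", 1, 120.0),
    ("loyal", 0, 105.0), ("loyal", 0, 115.0),
    ("occasional", 1, 30.0), ("occasional", 1, 50.0),
    ("occasional", 0, 10.0), ("occasional", 0, 20.0),
]


def mean_spend(segment, treatment=None):
    """Average spend in a segment, optionally restricted to one treatment group."""
    return mean(
        spend
        for seg, treated, spend in records
        if seg == segment and (treatment is None or treated == treatment)
    )


results = {}
for segment in ("loyal", "occasional"):
    results[segment] = {
        # What a regression model targets: the expected spend.
        "expected": mean_spend(segment),
        # What numerical uplifting targets: the increase in spend due to the email.
        "uplift": mean_spend(segment, 1) - mean_spend(segment, 0),
    }
    print(segment, results[segment])
```

Ranking by expected spend would favor the "loyal" segment, while ranking by uplift favors the "occasional" segment, where the email actually changes behavior.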

### Defining uplift models in YDF

YDF expects uplifting datasets to be presented in a "flat" format. A dataset of customers might look like this:

treatment | outcome | feature_1 | feature_2 |
---|---|---|---|
0 | 1 | 0.1 | blue |
0 | 0 | 0.2 | blue |
1 | 1 | 0.3 | blue |
1 | 1 | 0.4 | blue |

The **treatment** is a binary variable indicating whether or not the example received the treatment. In the above example, the treatment indicates whether the customer received an email. The **outcome** (label) indicates the status of the example after receiving the treatment (or not). YDF supports categorical outcomes for categorical uplifting and numerical outcomes for numerical uplifting.

**Note**: Uplifting is also frequently used in medical contexts. Here the *treatment* can be a medical treatment (e.g. administering a vaccine), and the label can be an indicator of quality of life (e.g. whether the patient got sick). This also explains the nomenclature of uplift modeling.

## Training an Uplifting model

In this example, we will use an instance of the Simulations for Personalized Treatment Effects.

```
# Load libraries
import ydf # Yggdrasil Decision Forests
import pandas as pd # We use Pandas to load small datasets
import numpy as np
# Download and load an uplift dataset as Pandas DataFrames
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/sim_pte_train.csv")
test_ds = pd.read_csv(f"{ds_path}/sim_pte_test.csv")
# Print the first 5 examples
train_ds.head(5)
```

| | y | treat | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | ... | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2.027911 | 0.278222 | 0.716672 | -1.092175 | -1.353849 | -0.910061 | -1.410070 | -0.150630 | ... | 1.931576 | 0.511000 | -1.618037 | -0.699228 | -0.494174 | 0.196550 | -0.150307 | -0.511604 | -0.995799 | -0.560476 |
| 1 | 2 | 2 | -1.494750 | -1.602538 | -0.283501 | -1.337542 | -0.579377 | 0.280663 | -1.721265 | 0.800941 | ... | -0.616475 | 1.807993 | 0.379181 | 0.996452 | 1.127593 | 0.650113 | -0.327757 | 0.236938 | -1.039955 | -0.230177 |
| 2 | 1 | 2 | -1.572949 | -0.320900 | -1.135464 | 1.109242 | -0.861044 | -1.035670 | 0.665445 | -1.186718 | ... | -0.562567 | -1.702615 | 1.902250 | -0.692745 | -1.146950 | 0.671004 | -1.448165 | -0.541589 | -0.017980 | 1.558708 |
| 3 | 1 | 2 | -0.300212 | -1.226114 | -0.632817 | 0.810701 | 0.972678 | 0.273049 | -0.430807 | 0.430636 | ... | -0.989963 | 0.287449 | 0.601874 | -0.103483 | 1.481019 | -1.284158 | -0.697285 | 1.219228 | -0.132175 | 0.070508 |
| 4 | 1 | 1 | -0.764373 | -0.776658 | 1.351161 | -0.875981 | 0.619146 | 0.537798 | -0.329039 | 0.216747 | ... | 2.731228 | -0.269114 | 1.732350 | 0.603866 | 0.916191 | -2.026110 | 2.598490 | 0.174136 | -2.549343 | 0.129288 |

5 rows × 22 columns

In this dataset, the treatments (`treat`) and outcome (`y`) are binary variables represented as "1" or "2" (instead of "0" and "1").

We can train an uplifting model:

```
model = ydf.RandomForestLearner(
    label="y",
    uplift_treatment="treat",
    task=ydf.Task.CATEGORICAL_UPLIFT,
).train(train_ds)
```

Train model on 1000 examples Model trained in 0:00:00.075023

Uplifting models are evaluated using the Qini coefficient (area under the Qini curve) and the AUUC (Area Under the Uplift Curve).

```
evaluation = model.evaluate(test_ds)
print(evaluation)
```

QINI: 0.106807 AUUC: 0.120807 num examples: 2000 num examples (weighted): 2000
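To build intuition for the Qini metric, the sketch below computes a Qini curve on a handful of made-up examples. It follows one common textbook formulation: examples are sorted by decreasing predicted uplift, and the curve tracks treated positives minus control positives rescaled to the treated group size. The final score is the area between the curve and the straight line of random targeting. This is an illustration under those assumptions, not YDF's implementation; YDF may normalize differently, so the numbers are not comparable to the values it reports. The helpers `qini_curve` and `area_above_random` are hypothetical names, and the toy data encodes treatment and outcome as 0/1.

```python
def qini_curve(predicted_uplift, outcomes, treatments):
    """Cumulative incremental positives when targeting by decreasing predicted uplift."""
    order = sorted(range(len(predicted_uplift)), key=lambda i: -predicted_uplift[i])
    n_treated = n_control = pos_treated = pos_control = 0
    curve = [0.0]
    for i in order:
        if treatments[i] == 1:
            n_treated += 1
            pos_treated += outcomes[i]
        else:
            n_control += 1
            pos_control += outcomes[i]
        # Rescale control positives to the number of treated examples seen so far.
        scaled_control = pos_control * n_treated / n_control if n_control else 0.0
        curve.append(pos_treated - scaled_control)
    return curve


def area_above_random(curve):
    """Trapezoidal area under the curve minus the area under random targeting."""
    n = len(curve) - 1
    area = sum((curve[k] + curve[k + 1]) / 2 for k in range(n)) / n
    # Random targeting is a straight line from 0 to the final value.
    return area - curve[-1] / 2


# Made-up predictions, outcomes and treatments (0 = control, 1 = treated).
predictions = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1, 0.0, -0.1]
treatments = [1, 0, 1, 0, 1, 0, 1, 0]
outcomes = [1, 0, 1, 0, 0, 1, 0, 1]

good_score = area_above_random(qini_curve(predictions, outcomes, treatments))
# Reversing the ranking simulates a model that orders customers the wrong way.
bad_score = area_above_random(
    qini_curve([-p for p in predictions], outcomes, treatments)
)
print(f"good ranking: {good_score:.3f}, reversed ranking: {bad_score:.3f}")
```

A ranking that places responsive customers first yields a positive score, while the reversed ranking yields a negative one, which is why higher values indicate a better uplift model.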