Classification
Setup
pip install ydf -U
What is classification?
Classification is the task of predicting a categorical value, such as an enum, type, or class from a finite set of possible values. For instance, predicting a color from the set of possible colors RED, BLUE, GREEN is a classification task. The output of classification models is a probability distribution over the possible classes. The predicted class is the one with the highest probability.
When there are only two classes, the task is called binary classification. In this case, models return a single probability; the probability of the other class is its complement.
Classification labels can be strings, integers, or boolean values.
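As a minimal sketch of how a probability distribution becomes a predicted class, the snippet below uses hand-written toy probabilities rather than the output of a trained model. The helper names `predicted_class` and `expand_binary`, and the class names `NEGATIVE` and `POSITIVE`, are hypothetical and not part of YDF. The binary case assumes the single probability refers to the second class.

```python
# Toy values for illustration only; they are not produced by any model.
classes = ["RED", "BLUE", "GREEN"]
probabilities = [0.2, 0.7, 0.1]  # one probability per class, summing to 1


def predicted_class(classes, probabilities):
    """Return the class with the highest probability."""
    best_index = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best_index]


def expand_binary(p):
    """Expand a single probability into a two-class distribution.

    Assumption: p is the probability of the second class.
    """
    return [1.0 - p, p]


multi_class_prediction = predicted_class(classes, probabilities)

# Binary case: one probability is enough to recover the full distribution.
binary_classes = ["NEGATIVE", "POSITIVE"]  # hypothetical class names
binary_prediction = predicted_class(binary_classes, expand_binary(0.3))

print(multi_class_prediction)
print(binary_prediction)
```

Taking the most probable class is equivalent to thresholding the single probability at 0.5 in the binary case. A different threshold can be preferable when the two kinds of errors have different costs.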
Training a classification model
The task of a model (e.g., classification, regression) is determined by the task argument of the learner. The default value of this argument is ydf.Task.CLASSIFICATION, which means that by default, YDF trains classification models.
# Load libraries
import ydf # Yggdrasil Decision Forests
import pandas as pd # We use Pandas to load small datasets
# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Print the first 5 training examples
train_ds.head(5)
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
The label column is:
train_ds["income"]
0        <=50K
1        <=50K
2        <=50K
3        <=50K
4         >50K
         ...
22787    <=50K
22788     >50K
22789    <=50K
22790    <=50K
22791    <=50K
Name: income, Length: 22792, dtype: object
We can train a classification model:
model = ydf.RandomForestLearner(label="income",
                                task=ydf.Task.CLASSIFICATION).train(train_ds)
# Note: ydf.Task.CLASSIFICATION is the default value of "task"
assert model.task() == ydf.Task.CLASSIFICATION
Train model on 22792 examples
Model trained in 0:00:01.179527
Classification models are evaluated using accuracy, confusion matrices, ROC-AUC and PR-AUC.
evaluation = model.evaluate(test_ds)
print(evaluation)
accuracy: 0.866005
confusion matrix:
    label (row) \ prediction (col)
    +-------+-------+-------+
    |       | <=50K |  >50K |
    +-------+-------+-------+
    | <=50K |  6976 |   873 |
    +-------+-------+-------+
    |  >50K |   436 |  1484 |
    +-------+-------+-------+
characteristics:
    name: '>50K' vs others
    ROC AUC: 0.908676
    PR AUC: 0.790029
    Num thresholds: 302
loss: 0.394958
num examples: 9769
num examples (weighted): 9769
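To connect the reported accuracy to the confusion matrix, here is a small sketch that recomputes it from the four counts printed above. The function name `accuracy_from_confusion` is a hypothetical helper, not a YDF API. Accuracy depends only on the diagonal and the total, so the result is the same whichever axis holds the labels.

```python
# Counts copied from the confusion matrix printed above.
confusion = [
    [6976, 873],
    [436, 1484],
]


def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified examples (the diagonal) / all examples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


total_examples = sum(sum(row) for row in confusion)
accuracy = accuracy_from_confusion(confusion)

print(f"num examples: {total_examples}")
print(f"accuracy: {accuracy:.6f}")
```

The recomputed values agree with the "num examples" and "accuracy" fields of the printed evaluation.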
In a notebook, displaying the evaluation object shows a rich report that includes ROC and PR plots.
evaluation
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 6976 | 436 |
>50K | 873 | 1484 |