Train & Test
A simple approach to estimating the quality of a model is the train-and-test protocol: train a model on a training dataset, then evaluate it on a separate test dataset. A model is trained with the `train` method and evaluated with the `evaluate` method.
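The adult dataset used below already ships as separate train and test files. If you start from a single table instead, here is a minimal sketch of a random split with pandas (the 20% holdout fraction and the toy DataFrame are illustrative choices, not part of YDF):

```python
import pandas as pd

# Toy dataset standing in for a single, unsplit CSV (illustrative only).
df = pd.DataFrame({"x": range(100), "income": ["<=50K", ">50K"] * 50})

# Hold out 20% of the examples for testing; train on the rest.
# A fixed random_state makes the split reproducible.
test_ds = df.sample(frac=0.2, random_state=42)
train_ds = df.drop(test_ds.index)

print(len(train_ds), len(test_ds))  # 80 20
```

The two resulting DataFrames are disjoint, so no test example leaks into training.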
```python
import ydf
import pandas as pd

# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples.
train_ds.head(5)
```
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
| 1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
| 3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
```python
model = ydf.RandomForestLearner(label="income").train(train_ds)
evaluation = model.evaluate(test_ds)
evaluation
```
Train model on 22792 examples
Model trained in 0:00:01.169327
| Label \ Pred | <=50K | >50K |
|---|---|---|
| <=50K | 6976 | 436 |
| >50K | 873 | 1484 |
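As a sanity check (plain arithmetic, not part of the YDF API), the model's accuracy can be recomputed from the confusion matrix above: accuracy is the fraction of examples on the diagonal, i.e. those predicted correctly.

```python
# Confusion matrix from the Random Forest evaluation above.
# Rows are true labels, columns are predictions.
tn, fp = 6976, 436   # true label <=50K
fn, tp = 873, 1484   # true label >50K

# Accuracy = correctly classified examples / all examples.
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(f"{accuracy:.4f}")  # 0.8660
```

This matches the accuracy reported by `evaluation.accuracy` below.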
Test metrics are a useful way to compare the quality of models. We have already trained a Random Forest model. Now, let's train a Gradient Boosted Trees model and see which one performs better.
Note: In practice, we may want to tune the hyperparameters of the model. However, for the sake of simplicity, we will use the default hyperparameters of the learners.
```python
model_2 = ydf.GradientBoostedTreesLearner(label="income").train(train_ds)
evaluation_2 = model_2.evaluate(test_ds)
```
Train model on 22792 examples
Model trained in 0:00:03.561738
Let's look at the test metrics:
```python
print(f"""\
Test accuracy:
  Random Forest:          {evaluation.accuracy:.4f}
  Gradient Boosted Trees: {evaluation_2.accuracy:.4f}
Test AUC:
  Random Forest:          {evaluation.characteristics[0].auc:.4f}
  Gradient Boosted Trees: {evaluation_2.characteristics[0].auc:.4f}
""")
```
Test accuracy:
  Random Forest:          0.8660
  Gradient Boosted Trees: 0.8739
Test AUC:
  Random Forest:          0.9087
  Gradient Boosted Trees: 0.9296
For this dataset, the Gradient Boosted Trees model with default hyperparameters outperforms the Random Forest on both test accuracy and test AUC.