Classification

Setup

In [ ]:

pip install ydf -U

What is classification?

Classification is the task of predicting a categorical value, such as an enum, type, or class from a finite set of possible values. For instance, predicting a color from the set of possible colors RED, BLUE, GREEN is a classification task. The output of classification models is a probability distribution over the possible classes. The predicted class is the one with the highest probability.
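As a minimal sketch of the last two sentences, the snippet below picks the predicted class from a probability distribution. The probability values are made up for illustration and do not come from a trained model.

```python
# Hypothetical class probabilities for a single example (illustrative values,
# not the output of a real model).
probabilities = {"RED": 0.2, "BLUE": 0.7, "GREEN": 0.1}

# A probability distribution over the classes sums to 1.
assert abs(sum(probabilities.values()) - 1.0) < 1e-9

# The predicted class is the one with the highest probability.
predicted_class = max(probabilities, key=probabilities.get)
print(predicted_class)  # BLUE
```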

When there are only two classes, the task is called binary classification. In this case, a model only needs to return one probability, since the probability of the other class is its complement.
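The sketch below shows how a single probability can be turned back into a class. It assumes that the returned value is the probability of one designated "positive" class. The helper `to_class`, the class names, and the 0.5 threshold are hypothetical choices for illustration, not part of the YDF API.

```python
def to_class(p_positive, negative="NO", positive="YES", threshold=0.5):
    """Maps the probability of the positive class to a class name.

    With two classes, the negative class has probability 1 - p_positive, so
    choosing the most probable class amounts to thresholding at 0.5.
    """
    return positive if p_positive >= threshold else negative


print(to_class(0.8))  # YES
print(to_class(0.3))  # NO
```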

Classification labels can be strings, integers, or boolean values.

Training a classification model

The task a model solves (e.g., classification, regression) is determined by the learner's task argument. Its default value is ydf.Task.CLASSIFICATION, so YDF trains classification models by default.

In [2]:

# Load libraries
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets

# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)

Out[2]:

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	44	Private	228057	7th-8th	4	Married-civ-spouse	Machine-op-inspct	Wife	White	Female	0	40	Dominican-Republic	<=50K
1	20	Private	299047	Some-college	10	Never-married	Other-service	Not-in-family	White	Female	0	20	United-States	<=50K
2	40	Private	342164	HS-grad	9	Separated	Adm-clerical	Unmarried	White	Female	0	37	United-States	<=50K
3	30	Private	361742	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	50	United-States	<=50K
4	67	Self-emp-inc	171564	HS-grad	9	Married-civ-spouse	Prof-specialty	Wife	White	Female	20051	30	England	>50K

The label column is:

In [3]:

train_ds["income"]

Out[3]:

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4         >50K
         ...  
22787    <=50K
22788     >50K
22789    <=50K
22790    <=50K
22791    <=50K
Name: income, Length: 22792, dtype: object

We can train a classification model:

In [4]:

model = ydf.RandomForestLearner(label="income",
                                task=ydf.Task.CLASSIFICATION).train(train_ds)
# Note: ydf.Task.CLASSIFICATION is the default value of "task"

assert model.task() == ydf.Task.CLASSIFICATION

Train model on 22792 examples
Model trained in 0:00:01.179527

Classification models are evaluated using accuracy, confusion matrices, ROC-AUC and PR-AUC.

In [5]:

evaluation = model.evaluate(test_ds)

print(evaluation)

accuracy: 0.866005
confusion matrix:
    label (row) \ prediction (col)
    +-------+-------+-------+
    |       | <=50K |  >50K |
    +-------+-------+-------+
    | <=50K |  6976 |   873 |
    +-------+-------+-------+
    |  >50K |   436 |  1484 |
    +-------+-------+-------+
characteristics:
    name: '>50K' vs others
    ROC AUC: 0.908676
    PR AUC: 0.790029
    Num thresholds: 302
loss: 0.394958
num examples: 9769
num examples (weighted): 9769
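As a quick check, the reported accuracy can be recomputed from the confusion matrix printed above: it is the sum of the diagonal (correct predictions) divided by the total number of examples. The sketch below copies the four counts by hand and uses plain Python rather than any YDF call.

```python
# Counts copied from the confusion matrix printed above.
confusion = [
    [6976, 873],
    [436, 1484],
]

# Correct predictions lie on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

accuracy = correct / total
print(total)               # 9769
print(f"{accuracy:.6f}")   # 0.866005
```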

Displaying the evaluation object in a notebook shows a rich report with ROC and PR plots.

In [6]:

evaluation

Out[6]:

accuracy: 0.866005
AUC: '>50K' vs others: 0.908676
PR-AUC: '>50K' vs others: 0.790029
loss: 0.394958
num examples: 9769
num examples (weighted): 9769
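One way to read the AUC value: the ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The sketch below computes it that way on made-up labels and scores. The function `roc_auc` is a hypothetical helper for illustration and is not necessarily how YDF computes the metric.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels (True = positive class) and model scores.
labels = [True, False, True, False, False]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

# 5 of the 6 (positive, negative) pairs are ranked correctly.
auc = roc_auc(labels, scores)
print(round(auc, 4))  # 0.8333
```

The pairwise loop is quadratic in the number of examples, which is fine for a small illustration but not for a full test set.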

Confusion matrix

Label \ Pred	<=50K	>50K
<=50K	6976	436
>50K	873	1484