Categorical
How a feature is treated depends on its semantic: numerical, categorical, boolean, or text. If the semantic is not specified, it is inferred automatically. For example, float and integer features are detected as numerical, while strings are detected as categorical.
A categorical feature represents a type or class in a finite set of possible values without ordering. As an example, consider the color RED in the set {RED, BLUE, GREEN}.
Categorical features can be strings, bytes literals, integers, or booleans. Missing values are represented by "" (empty string).
Let's train a model on categorical string features.
import ydf
import pandas as pd
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": ["red", "red", "blue", "green"],
    "feature_2": ["hot", "hot", "cold", ""],  # "" is a missing value
})
model = ydf.RandomForestLearner(label="label").train(dataset)
Train model on 4 examples
Model trained in 0:00:00.008941
We can see that the features are detected as categorical in the "Dataspec" tab of the model description.
model.describe()
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    CATEGORICAL: 3 (100%)

Columns:

CATEGORICAL: 3 (100%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)
    1: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
    2: "feature_2" CATEGORICAL num-nas:1 (25%) has-dict vocab-size:1 num-oods:3 (100%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.57374
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
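As a quick usage check, we can ask the trained model for predictions on the same dataframe. The snippet below is a minimal sketch that reuses the model and dataset variables from the example above; for a binary classification model, the output is typically an array with one probability (of the positive class) per example.

# Request predictions on the training dataframe.
# Each value is the predicted probability of the positive label ("true").
predictions = model.predict(dataset)
print(predictions)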
Sometimes, you might want to force a feature's semantic to be categorical.
In the next example, "feature_1" and "feature_2" contain integers, so they would automatically be detected as numerical. However, we want "feature_1" to be treated as categorical.
In the model description, notice that "feature_1" is categorical, while "feature_2" is numerical.
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [5, 6, 7, 6],
})
model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
    include_all_columns=True,
).train(dataset)
# Note: include_all_columns=True allows the model to use all the
# columns as features, not just the ones in "features".
model.describe()
Train model on 4 examples
Model trained in 0:00:00.004352
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    CATEGORICAL: 2 (66.6667%)
    NUMERICAL: 1 (33.3333%)

Columns:

CATEGORICAL: 2 (66.6667%)
    0: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
    1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

NUMERICAL: 1 (33.3333%)
    2: "feature_2" NUMERICAL mean:6 min:5 max:7 sd:0.707107

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.57805
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
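If you prefer to check the resolved semantics programmatically instead of reading the output of describe(), you can list the model's input features. The sketch below assumes that model.input_features() is available in your version of YDF and that each entry exposes a name and a semantic; if your version differs, the "Dataspec" tab of model.describe() shows the same information.

# List the input features with the semantics the model actually used.
# For this example, "feature_1" should be CATEGORICAL and "feature_2" NUMERICAL.
for feature in model.input_features():
    print(feature.name, feature.semantic)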