Categorical
How a feature is treated depends on its semantic: numerical, categorical, boolean, or text. If the semantic is not specified, it is inferred automatically. For example, float and integer features are detected as numerical, while strings are detected as categorical.
A categorical feature represents a type or class in a finite set of possible values without ordering. As an example, consider the color RED in the set {RED, BLUE, GREEN}.
Categorical features can be strings, bytes literals, integers, or booleans. Missing values are represented by "" (empty string).
Let's train a model on categorical string features.
import ydf
import pandas as pd
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": ["red", "red", "blue", "green"],
    "feature_2": ["hot", "hot", "cold", ""],  # "" is a missing value
})
model = ydf.RandomForestLearner(label="label").train(dataset)
Train model on 4 examples
Model trained in 0:00:00.008941
We can see that the features are detected as categorical in the "Dataspec" tab of the model description.
model.describe()
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    CATEGORICAL: 3 (100%)

Columns:

CATEGORICAL: 3 (100%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)
    1: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
    2: "feature_2" CATEGORICAL num-nas:1 (25%) has-dict vocab-size:1 num-oods:3 (100%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.57374
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
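As a quick usage check, we can ask the trained model for predictions on the same dataframe. The snippet below is a minimal sketch that reuses the model and dataset variables from the example above; for a binary classification model, the output is typically an array with one probability (of the positive class) per example.

# Request predictions on the training dataframe.
# Each value is the predicted probability of the positive label ("true").
predictions = model.predict(dataset)
print(predictions)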
Sometimes, you might want to force a feature's semantic to be categorical.
In the next example, "feature_1" and "feature_2" contain integers, so they would automatically be detected as numerical. However, we want "feature_1" to be treated as categorical.
In the model description, notice that "feature_1" is categorical, while "feature_2" is numerical.
dataset = pd.DataFrame({
    "label": [True, False, True, False],
    "feature_1": [1, 2, 2, 1],
    "feature_2": [5, 6, 7, 6],
})
model = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.CATEGORICAL)],
    include_all_columns=True,
).train(dataset)
# Note: include_all_columns=True allows the model to use all the
# columns as features, not just the ones in "features".
model.describe()
Train model on 4 examples
Model trained in 0:00:00.004352
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 57 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    CATEGORICAL: 2 (66.6667%)
    NUMERICAL: 1 (33.3333%)

Columns:

CATEGORICAL: 2 (66.6667%)
    0: "feature_1" CATEGORICAL has-dict vocab-size:1 num-oods:4 (100%)
    1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

NUMERICAL: 1 (33.3333%)
    2: "feature_2" NUMERICAL mean:6 min:5 max:7 sd:0.707107

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.57805
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
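If you prefer to check the resolved semantics programmatically instead of reading the output of describe(), you can list the model's input features. The sketch below assumes that model.input_features() is available in your version of YDF and that each entry exposes a name and a semantic; if your version differs, the "Dataspec" tab of model.describe() shows the same information.

# List the input features with the semantics the model actually used.
# For this example, "feature_1" should be CATEGORICAL and "feature_2" NUMERICAL.
for feature in model.input_features():
    print(feature.name, feature.semantic)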