Numerical
The way a feature is treated depends on its semantic: numerical, categorical, boolean, or text. If the semantic is not specified, it is inferred automatically. For example, float and integer columns are detected as numerical, while strings are detected as categorical.
A numerical feature represents a quantity or a count, for example the age of a person or the number of items in a bag. Missing numerical values are represented by math.nan.
Let's train a model on a floating-point feature.
import ydf
import pandas as pd
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": [1, 2, 2, 1],
"feature_2": [0.1, 0.8, 0.9, 0.1],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
Train model on 4 examples
Model trained in 0:00:00.009728
We can see in the Dataspec tab that both features are detected as numerical.
model.describe()
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    NUMERICAL: 2 (66.6667%)
    CATEGORICAL: 1 (33.3333%)

Columns:

NUMERICAL: 2 (66.6667%)
    1: "feature_1" NUMERICAL mean:1.5 min:1 max:2 sd:0.5
    2: "feature_2" NUMERICAL mean:0.475 min:0.1 max:0.9 sd:0.376663

CATEGORICAL: 1 (33.3333%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.59508
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
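The detected semantics can also be checked programmatically. The following is a minimal sketch, assuming the model exposes the input_features() accessor; it should report both features as NUMERICAL.
# Print the name and detected semantic of each input feature.
for feature in model.input_features():
    print(feature.name, feature.semantic)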
Sometimes, you might want to force a feature's semantic to be numerical.
In the next example, "feature_1" and "feature_2" look boolean. However, we want "feature_1" to be numerical.
In the model description, notice that "feature_1" is numerical, while "feature_2" is boolean.
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": [True, True, False, False],
"feature_2": [True, False, True, False],
})
model = ydf.RandomForestLearner(label="label",
features=[ydf.Feature("feature_1", ydf.Semantic.NUMERICAL)],
include_all_columns=True,
).train(dataset)
# Note: include_all_columns=True allows the model to use all the
# columns as features, not just the ones in "features".
model.describe()
Train model on 4 examples
Model trained in 0:00:00.004133
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    BOOLEAN: 1 (33.3333%)
    CATEGORICAL: 1 (33.3333%)
    NUMERICAL: 1 (33.3333%)

Columns:

BOOLEAN: 1 (33.3333%)
    2: "feature_2" BOOLEAN true_count:2 false_count:2

CATEGORICAL: 1 (33.3333%)
    1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

NUMERICAL: 1 (33.3333%)
    0: "feature_1" NUMERICAL mean:0.5 min:0 max:1 sd:0.5

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.66759
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
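As a side note on include_all_columns: without it, only the columns listed in "features" are used as inputs. The following sketch illustrates this; the model_restricted name is made up for illustration, and the input_feature_names() accessor is assumed to be available.
# Without include_all_columns=True, only the explicitly listed features are
# used; "feature_2" would be ignored here.
model_restricted = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.NUMERICAL)],
).train(dataset)
print(model_restricted.input_feature_names())  # Expected: ['feature_1']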
Let's create some missing values.
In the Dataspec tab, notice num-nas:2 (50%) for "feature_2": it means that "feature_2" contains two missing values (i.e., 50% of the values are missing).
import math
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": [1, 2, 2, 1],
"feature_2": [0.1, 0.8, math.nan, math.nan],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
model.describe()
Train model on 4 examples
Model trained in 0:00:00.005587
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    NUMERICAL: 2 (66.6667%)
    CATEGORICAL: 1 (33.3333%)

Columns:

NUMERICAL: 2 (66.6667%)
    1: "feature_1" NUMERICAL mean:1.5 min:1 max:2 sd:0.5
    2: "feature_2" NUMERICAL num-nas:2 (50%) mean:0.45 min:0.1 max:0.8 sd:0.35

CATEGORICAL: 1 (33.3333%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.6165
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
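Since missing numerical values are represented by math.nan, they can also appear at prediction time. Here is a small sketch; the test_examples DataFrame is made up for illustration.
# Predict on examples where "feature_2" is sometimes missing.
test_examples = pd.DataFrame({
    "feature_1": [1, 2],
    "feature_2": [0.5, math.nan],  # The second example has a missing value.
})
print(model.predict(test_examples))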