Numerical
The way a feature is treated depends on its semantic: numerical, categorical, boolean, or text. If the semantic is not specified, it is inferred automatically. For example, float and integer columns are detected as numerical, while strings are detected as categorical.
A numerical feature represents a quantity or a count, for example the age of a person or the number of items in a bag. Missing numerical values are represented by math.nan.
Let's train a model on a floating-point feature.
import ydf
import pandas as pd
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": [1, 2, 2, 1],
"feature_2": [0.1, 0.8, 0.9, 0.1],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
Train model on 4 examples
Model trained in 0:00:00.009728
We can see in the Dataspec tab that both features are detected as numerical.
model.describe()
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    NUMERICAL: 2 (66.6667%)
    CATEGORICAL: 1 (33.3333%)

Columns:

NUMERICAL: 2 (66.6667%)
    1: "feature_1" NUMERICAL mean:1.5 min:1 max:2 sd:0.5
    2: "feature_2" NUMERICAL mean:0.475 min:0.1 max:0.9 sd:0.376663

CATEGORICAL: 1 (33.3333%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.59508
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
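The detected semantics can also be checked programmatically. The following is a minimal sketch, assuming the model exposes the input_features() accessor; it should report both features as NUMERICAL.
# Print the name and detected semantic of each input feature.
for feature in model.input_features():
    print(feature.name, feature.semantic)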
Sometimes, you might want to force a feature's semantic to be numerical.
In the next example, "feature_1" and "feature_2" look boolean. However, we want "feature_1" to be numerical.
In the model description, notice that "feature_1" is numerical, while "feature_2" is boolean.
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": [True, True, False, False],
"feature_2": [True, False, True, False],
})
model = ydf.RandomForestLearner(label="label",
features=[ydf.Feature("feature_1", ydf.Semantic.NUMERICAL)],
include_all_columns=True,
).train(dataset)
# Note: include_all_columns=True allows the model to use all the
# columns as features, not just the ones in "features".
model.describe()
Train model on 4 examples
Model trained in 0:00:00.004133
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    BOOLEAN: 1 (33.3333%)
    CATEGORICAL: 1 (33.3333%)
    NUMERICAL: 1 (33.3333%)

Columns:

BOOLEAN: 1 (33.3333%)
    2: "feature_2" BOOLEAN true_count:2 false_count:2

CATEGORICAL: 1 (33.3333%)
    1: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

NUMERICAL: 1 (33.3333%)
    0: "feature_1" NUMERICAL mean:0.5 min:0 max:1 sd:0.5

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.66759
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
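As a side note on include_all_columns: without it, only the columns listed in "features" are used as inputs. The following sketch illustrates this; the model_restricted name is made up for illustration, and the input_feature_names() accessor is assumed to be available.
# Without include_all_columns=True, only the explicitly listed features are
# used; "feature_2" would be ignored here.
model_restricted = ydf.RandomForestLearner(
    label="label",
    features=[ydf.Feature("feature_1", ydf.Semantic.NUMERICAL)],
).train(dataset)
print(model_restricted.input_feature_names())  # Expected: ['feature_1']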
Let's create some missing values.
In the Dataspec tab, notice num-nas:2 (50%) for "feature_2": it means that "feature_2" contains two missing values (i.e., 50% of the values are missing).
import math
dataset = pd.DataFrame({
"label": [True, False, True, False],
"feature_1": [1, 2, 2, 1],
"feature_2": [0.1, 0.8, math.nan, math.nan],
})
model = ydf.RandomForestLearner(label="label").train(dataset)
model.describe()
Train model on 4 examples
Model trained in 0:00:00.005587
Task : CLASSIFICATION
Label : label
Features (2) : feature_1 feature_2
Weights : None
Trained with tuner : No
Model size : 56 kB
Number of records: 4
Number of columns: 3

Number of columns by type:
    NUMERICAL: 2 (66.6667%)
    CATEGORICAL: 1 (33.3333%)

Columns:

NUMERICAL: 2 (66.6667%)
    1: "feature_1" NUMERICAL mean:1.5 min:1 max:2 sd:0.5
    2: "feature_2" NUMERICAL num-nas:2 (50%) mean:0.45 min:0.1 max:0.8 sd:0.35

CATEGORICAL: 1 (33.3333%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"false" 2 (50%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Number of predictions (without weights): 4
Number of predictions (with weights): 4
Task: CLASSIFICATION
Label: label

Accuracy: 0  CI95[W][0 0.527129]
LogLoss: 1.6165
ErrorRate: 1

Default Accuracy: 0.5
Default LogLoss: 0.693147
Default ErrorRate: 0.5

Confusion Table:
truth\prediction  false  true
           false      0     2
            true      2     0
Total: 4
Variable importances measure the importance of an input feature for a model.
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: val:"false" prob:[0.5, 0.5]
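Since missing numerical values are represented by math.nan, they can also appear at prediction time. Here is a small sketch; the test_examples DataFrame is made up for illustration.
# Predict on examples where "feature_2" is sometimes missing.
test_examples = pd.DataFrame({
    "feature_1": [1, 2],
    "feature_2": [0.5, math.nan],  # The second example has a missing value.
})
print(model.predict(test_examples))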