PandasĀ¶
SetupĀ¶
pip install ydf -U
PandasĀ¶
YDF can train directly on Pandas dataframes. YDF tries to infer column semantics automatically. For more fine-grained control, YDF offers advanced options for specifying column semantics.
# Load libraries
import ydf # Yggdrasil Decision Forests
import pandas as pd
import numpy as np
# Create a small dataframe with different column types.
df = pd.DataFrame(
{"col_cat_1": ["a", "b", "c"]*20,
"col_cat_2": ["x", "x", "x", "y", "y", "y"]*9 + ["q", "q", "w", "w", "r", "r"],
"col_int": list(range(60)),
"col_float": np.linspace(0,1,60),
"col_bool": [True, False]*30
})
df.head()
col_cat_1 | col_cat_2 | col_int | col_float | col_bool | |
---|---|---|---|---|---|
0 | a | x | 0 | 0.000000 | True |
1 | b | x | 1 | 0.016949 | False |
2 | c | x | 2 | 0.033898 | True |
3 | a | y | 3 | 0.050847 | False |
4 | b | y | 4 | 0.067797 | True |
We can directly train a model on this dataframe.
model_1 = ydf.RandomForestLearner(label="col_cat_1", num_trees=10).train(df)
# See the data specification in the dataspec tab.
model_1.describe()
WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus. [INFO 23-10-31 18:13:26.2598 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size. [INFO 23-10-31 18:13:26.2622 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features. [INFO 23-10-31 18:13:26.2623 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set. [INFO 23-10-31 18:13:26.2623 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s). [INFO 23-10-31 18:13:26.2694 UTC random_forest.cc:802] Training of tree 1/10 (tree index:0) done accuracy:0.0952381 logloss:32.6109 [INFO 23-10-31 18:13:26.2696 UTC random_forest.cc:802] Training of tree 10/10 (tree index:8) done accuracy:0.166667 logloss:17.4497 [INFO 23-10-31 18:13:26.2708 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.166667 logloss:17.4497 [INFO 23-10-31 18:13:26.2721 UTC abstract_model.cc:881] Model self evaluation: Number of predictions (without weights): 60 Num
'Type: "RANDOM_FOREST"\nTask: CLASSIFICATION\nLabel: "col_cat_1"\n\nInput Features (4):\n\tcol_cat_2\n\tcol_int\n\tcol_float\n\tcol_bool\n\nNo weights\n\nVariable Importance: INV_MEAN_MIN_DEPTH:\n 1. "col_int" 0.532151 ################\n 2. "col_float" 0.411677 #########\n 3. "col_cat_2" 0.254803 \n 4. "col_bool" 0.239653 \n\nVariable Importance: NUM_AS_ROOT:\n 1. "col_int" 6.000000 ################\n 2. "col_float" 3.000000 ######\n 3. "col_cat_2" 1.000000 \n\nVariable Importance: NUM_NODES:\n 1. "col_float" 29.000000 ################\n 2. "col_int" 26.000000 #############\n 3. "col_bool" 12.000000 \n 4. "col_cat_2" 11.000000 \n\nVariable Importance: SUM_SCORE:\n 1. "col_bool" 73.616851 ################\n 2. "col_float" 70.501937 ##############\n 3. "col_int" 59.014978 ##########\n 4. "col_cat_2" 32.388708 \n\n\n\nWinner takes all: true\nOut-of-bag evaluation: accuracy:0.166667 logloss:17.4497\nNumber of trees: 10\nTotal number of nodes: 166\n\nNumber of nodes by tree:\nCount: 10 Average: 16.6 StdDev: 1.74356\nMin: 13 Max: 19 Ignored: 0\n----------------------------------------------\n[ 13, 14) 1 10.00% 10.00% ##\n[ 14, 15) 0 0.00% 10.00%\n[ 15, 16) 2 20.00% 30.00% ####\n[ 16, 17) 0 0.00% 30.00%\n[ 17, 18) 5 50.00% 80.00% ##########\n[ 18, 19) 0 0.00% 80.00%\n[ 19, 19] 2 20.00% 100.00% ####\n\nDepth by leafs:\nCount: 88 Average: 3.65909 StdDev: 1.22369\nMin: 1 Max: 6 Ignored: 0\n----------------------------------------------\n[ 1, 2) 4 4.55% 4.55% #\n[ 2, 3) 11 12.50% 17.05% ####\n[ 3, 4) 23 26.14% 43.18% ########\n[ 4, 5) 29 32.95% 76.14% ##########\n[ 5, 6) 15 17.05% 93.18% #####\n[ 6, 6] 6 6.82% 100.00% ##\n\nNumber of training obs by leaf:\nCount: 88 Average: 6.81818 StdDev: 1.79358\nMin: 5 Max: 14 Ignored: 0\n----------------------------------------------\n[ 5, 6) 28 31.82% 31.82% ##########\n[ 6, 7) 17 19.32% 51.14% ######\n[ 7, 8) 11 12.50% 63.64% ####\n[ 8, 9) 19 21.59% 85.23% #######\n[ 9, 10) 7 7.95% 93.18% ###\n[ 10, 11) 4 4.55% 97.73% #\n[ 11, 12) 0 0.00% 97.73%\n[ 12, 13) 1 1.14% 98.86%\n[ 13, 14) 0 0.00% 98.86%\n[ 14, 14] 1 1.14% 100.00%\n\nAttribute in nodes:\n\t29 : col_float [NUMERICAL]\n\t26 : col_int [NUMERICAL]\n\t12 : col_bool [BOOLEAN]\n\t11 : col_cat_2 [CATEGORICAL]\n\nAttribute in nodes with depth <= 0:\n\t6 : col_int [NUMERICAL]\n\t3 : col_float [NUMERICAL]\n\t1 : col_cat_2 [CATEGORICAL]\n\nAttribute in nodes with depth <= 1:\n\t12 : col_int [NUMERICAL]\n\t9 : col_float [NUMERICAL]\n\t4 : col_bool [BOOLEAN]\n\t1 : col_cat_2 [CATEGORICAL]\n\nAttribute in nodes with depth <= 2:\n\t20 : col_float [NUMERICAL]\n\t16 : col_int [NUMERICAL]\n\t7 : col_cat_2 [CATEGORICAL]\n\t4 : col_bool [BOOLEAN]\n\nAttribute in nodes with depth <= 3:\n\t28 : col_float [NUMERICAL]\n\t19 : col_int [NUMERICAL]\n\t10 : col_bool [BOOLEAN]\n\t9 : col_cat_2 [CATEGORICAL]\n\nAttribute in nodes with depth <= 5:\n\t29 : col_float [NUMERICAL]\n\t26 : col_int [NUMERICAL]\n\t12 : col_bool [BOOLEAN]\n\t11 : col_cat_2 [CATEGORICAL]\n\nCondition type in nodes:\n\t55 : HigherCondition\n\t12 : TrueValueCondition\n\t11 : ContainsBitmapCondition\nCondition type in nodes with depth <= 0:\n\t9 : HigherCondition\n\t1 : ContainsBitmapCondition\nCondition type in nodes with depth <= 1:\n\t21 : HigherCondition\n\t4 : TrueValueCondition\n\t1 : ContainsBitmapCondition\nCondition type in nodes with depth <= 2:\n\t36 : HigherCondition\n\t7 : ContainsBitmapCondition\n\t4 : TrueValueCondition\nCondition type in nodes with depth <= 3:\n\t47 : HigherCondition\n\t10 : TrueValueCondition\n\t9 : ContainsBitmapCondition\nCondition type in nodes with depth <= 5:\n\t55 : HigherCondition\n\t12 : TrueValueCondition\n\t11 : ContainsBitmapCondition\nNode format: NOT_SET\n\nTraining OOB:\n\ttrees: 1, Out-of-bag evaluation: accuracy:0.0952381 logloss:32.6109\n\ttrees: 10, Out-of-bag evaluation: accuracy:0.166667 logloss:17.4497\n'
ber of predictions (with weights): 60 Task: CLASSIFICATION Label: col_cat_1 Accuracy: 0.166667 CI95[W][0.0933069 0.266291] LogLoss: : 17.4497 ErrorRate: : 0.833333 Default Accuracy: : 0.333333 Default LogLoss: : 1.09861 Default ErrorRate: : 0.666667 Confusion Table: truth\prediction <OOD> a b c <OOD> 0 0 0 0 a 0 5 5 10 b 0 12 4 4 c 0 18 1 1 Total: 60 One vs other classes:
Feature SelectionĀ¶
YDF offers many ways to customize which features to use and how to use them.
When specifying the learner, we can manually select a subset of the features.
model_2 = ydf.RandomForestLearner(label="col_cat_1", num_trees=10, features=["col_int", "col_bool"]).train(df)
print(model_2)
WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.
Model: RANDOM_FOREST Task: CLASSIFICATION Class: ydf.RandomForestModel Use `model.describe()` for more details
[INFO 23-10-31 18:13:26.2825 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size. [INFO 23-10-31 18:13:26.2829 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features. [INFO 23-10-31 18:13:26.2829 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set. [INFO 23-10-31 18:13:26.2829 UTC random_forest.cc:416] Training random forest on 60 example(s) and 2 feature(s). [INFO 23-10-31 18:13:26.2853 UTC random_forest.cc:802] Training of tree 1/10 (tree index:0) done accuracy:0.142857 logloss:30.8946 [INFO 23-10-31 18:13:26.2855 UTC random_forest.cc:802] Training of tree 10/10 (tree index:9) done accuracy:0.15 logloss:21.5379 [INFO 23-10-31 18:13:26.2866 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.15 logloss:21.5379
Forcing a semanticĀ¶
We can also force a semantic on a certain feature. Here, we force the integer column to be treated as categorical. Note that we set include_all_columns
to make sure even columns not explicitly listed are used.
It is not possible to force arbitrary semantics to the columns. Categorical features must be integer or string, while numerical columns must be float or integer.
Note: Internally, YDF converts all numerical columns to 32-bit floats. It is therefore not necessary to perform conversions between numerical formats.
model_3 = ydf.RandomForestLearner(
label="col_cat_1",
num_trees=10, # Compute only 10 trees
features=[ydf.Feature("col_int", semantic=ydf.Semantic.CATEGORICAL)],
include_all_columns=True # Include all columns, not just the ones listed in "features"
).train(df)
print(model_3)
WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.
Model: RANDOM_FOREST Task: CLASSIFICATION Class: ydf.RandomForestModel Use `model.describe()` for more details
[INFO 23-10-31 18:13:26.2941 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size. [INFO 23-10-31 18:13:26.2947 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features. [INFO 23-10-31 18:13:26.2948 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set. [INFO 23-10-31 18:13:26.2948 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s). [INFO 23-10-31 18:13:26.2974 UTC random_forest.cc:802] Training of tree 1/10 (tree index:0) done accuracy:0.0952381 logloss:32.6109 [INFO 23-10-31 18:13:26.2976 UTC random_forest.cc:802] Training of tree 10/10 (tree index:7) done accuracy:0.101695 logloss:21.2695 [INFO 23-10-31 18:13:26.2987 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.101695 logloss:21.2695
Fine-grained semanticsĀ¶
YDF creates a dictionary for processing categorical features quickly. It has been shown that models often generalize better when rare features subsumed as "Out-of-dictionary" (OOD) values. As usual, YDF provides sensible default values: Each value appearing less than 5 times is considered OOD, and there can be at most 2000 non-OOD values. These default values can be changed in the model constructor.
model_4 = ydf.RandomForestLearner(
label="col_cat_1",
num_trees=10, # Compute only 10 trees
max_vocab_count=300, # Allow at most 300 non-OOD values.
min_vocab_frequency=3, # Any value appearing less than 3 times is considered OOD.
features=[ydf.Feature("col_int", semantic=ydf.Semantic.CATEGORICAL)],
include_all_columns=True
).train(df)
print(model_4)
WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.
Model: RANDOM_FOREST Task: CLASSIFICATION Class: ydf.RandomForestModel Use `model.describe()` for more details
[INFO 23-10-31 18:13:26.3056 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size. [INFO 23-10-31 18:13:26.3061 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features. [INFO 23-10-31 18:13:26.3061 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set. [INFO 23-10-31 18:13:26.3061 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s). [INFO 23-10-31 18:13:26.3084 UTC random_forest.cc:802] Training of tree 1/10 (tree index:1) done accuracy:0.136364 logloss:31.1286 [INFO 23-10-31 18:13:26.3086 UTC random_forest.cc:802] Training of tree 10/10 (tree index:8) done accuracy:0.133333 logloss:19.1921 [INFO 23-10-31 18:13:26.3098 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.133333 logloss:19.1921
Fine-grained semantics can even be specified on individual features
explicit_features = [
ydf.Feature("col_cat_1",
min_vocab_frequency=1, # No minimum frequency for elements of this feature.
semantic=ydf.Semantic.CATEGORICAL # Required when setting min_vocab_frequency.
),
"col_cat_2", # It is not necessary to provide detailed semantics for all features.
"col_bool"
]
model_explicit_semantics = ydf.RandomForestLearner(
label="col_int",
num_trees=10, # Compute only 10 trees
min_vocab_frequency=3, # Any value appearing less than 3 times is considered OOD by default.
features=explicit_features,
include_all_columns=False
).train(df)
print(model_explicit_semantics)
WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.
Model: RANDOM_FOREST Task: CLASSIFICATION Class: ydf.RandomForestModel Use `model.describe()` for more details
[INFO 23-10-31 18:13:26.3164 UTC dataset.cc:299] max_vocab_count = -1 for column col_int, the dictionary will not be pruned by size. [INFO 23-10-31 18:13:26.3169 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features. [INFO 23-10-31 18:13:26.3169 UTC abstract_learner.cc:141] The label "col_int" was removed from the input feature set. [INFO 23-10-31 18:13:26.3169 UTC random_forest.cc:416] Training random forest on 60 example(s) and 3 feature(s). [INFO 23-10-31 18:13:26.3194 UTC random_forest.cc:802] Training of tree 1/10 (tree index:0) done accuracy:0 logloss:36.0437 [INFO 23-10-31 18:13:26.3199 UTC random_forest.cc:802] Training of tree 10/10 (tree index:9) done accuracy:0 logloss:36.0437 [INFO 23-10-31 18:13:26.3208 UTC random_forest.cc:882] Final OOB metrics: accuracy:0 logloss:36.0437
Advanced: Creating a VerticalDatasetĀ¶
Internally, YDF uses a data structure called VerticalDataset
for storing training dataset. Normally, the VerticalDataset is created automatically during training. It is also possible to explicitly create the VerticalDataset. This can be useful when re-using the same dataset multiple times, since we can save re-converting the dataset from pandas.
vds = ydf.create_vertical_dataset(
df,
# Columns and their semantics can be specified the same way
# features are specified for learners
columns=["col_cat_1", "col_int", "col_bool"]
)
vds.memory_usage() # Prints memory usage in bytes.
540
A VerticalDataset also contains a DataSpecification, which collects all information about the dataset that is used during training: Semantics for each column, dictionary of categorical features, statistical information about numerical features and more.
vds.data_spec() # Print the data spec.
columns { type: CATEGORICAL name: "col_cat_1" categorical { number_of_unique_values: 4 items { key: "c" value { index: 3 count: 20 } } items { key: "b" value { index: 2 count: 20 } } items { key: "a" value { index: 1 count: 20 } } items { key: "<OOD>" value { index: 0 count: 0 } } } count_nas: 0 } columns { type: NUMERICAL name: "col_int" numerical { mean: 29.5 min_value: 0 max_value: 59 standard_deviation: 17.318102282486574 } count_nas: 0 } columns { type: BOOLEAN name: "col_bool" count_nas: 0 boolean { count_true: 30 count_false: 30 } } created_num_rows: 60