Pandas

Setup

In [1]:

pip install ydf -U

Pandas

YDF can train directly on Pandas dataframes. It infers column semantics automatically, and offers options to specify them explicitly when finer-grained control is needed.

In [2]:

# Load libraries
import ydf  # Yggdrasil Decision Forests
import pandas as pd
import numpy as np

# Create a small dataframe with different column types.
df = pd.DataFrame(
    {"col_cat_1": ["a", "b", "c"]*20,
     "col_cat_2": ["x", "x", "x", "y", "y", "y"]*9 + ["q", "q", "w", "w", "r", "r"],
     "col_int": list(range(60)),
     "col_float": np.linspace(0,1,60),
     "col_bool": [True, False]*30
})
df.head()

Out[2]:

	col_cat_1	col_cat_2	col_int	col_float	col_bool
0	a	x	0	0.000000	True
1	b	x	1	0.016949	False
2	c	x	2	0.033898	True
3	a	y	3	0.050847	False
4	b	y	4	0.067797	True
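To make the idea of automatic semantic inference concrete, here is a minimal sketch in plain Python. It is not YDF's implementation: the helper `guess_semantic` is hypothetical and only reproduces the mapping visible in the data specifications printed on this page (strings become CATEGORICAL, integers and floats become NUMERICAL, booleans become BOOLEAN).

```python
from typing import Dict, Sequence


def guess_semantic(values: Sequence[object]) -> str:
    """Hypothetical helper: guess a semantic from the type of the first value."""
    first = values[0]
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(first, bool):
        return "BOOLEAN"
    if isinstance(first, (int, float)):
        return "NUMERICAL"
    if isinstance(first, str):
        return "CATEGORICAL"
    raise TypeError(f"Unsupported value type: {type(first).__name__}")


# Same column layout as the dataframe above, stored as plain Python lists.
columns: Dict[str, list] = {
    "col_cat_1": ["a", "b", "c"] * 20,
    "col_int": list(range(60)),
    "col_float": [i / 59 for i in range(60)],
    "col_bool": [True, False] * 30,
}

semantics = {name: guess_semantic(values) for name, values in columns.items()}
print(semantics)
```

A real inference step would also have to handle missing values and mixed types, which this sketch ignores.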

We can directly train a model on this dataframe.

In [3]:

model_1 = ydf.RandomForestLearner(label="col_cat_1", num_trees=10).train(df)
# See the data specification in the dataspec tab.
model_1.describe()

WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.
[INFO 23-10-31 18:13:26.2598 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.2622 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.2623 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set.
[INFO 23-10-31 18:13:26.2623 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s).
[INFO 23-10-31 18:13:26.2694 UTC random_forest.cc:802] Training of tree  1/10 (tree index:0) done accuracy:0.0952381 logloss:32.6109
[INFO 23-10-31 18:13:26.2696 UTC random_forest.cc:802] Training of tree  10/10 (tree index:8) done accuracy:0.166667 logloss:17.4497
[INFO 23-10-31 18:13:26.2708 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.166667 logloss:17.4497
[INFO 23-10-31 18:13:26.2721 UTC abstract_model.cc:881] Model self evaluation:
Number of predictions (without weights): 60

Out[3]:

'Type: "RANDOM_FOREST"\nTask: CLASSIFICATION\nLabel: "col_cat_1"\n\nInput Features (4):\n\tcol_cat_2\n\tcol_int\n\tcol_float\n\tcol_bool\n\nNo weights\n\nVariable Importance: INV_MEAN_MIN_DEPTH:\n    1.   "col_int"  0.532151 ################\n    2. "col_float"  0.411677 #########\n    3. "col_cat_2"  0.254803 \n    4.  "col_bool"  0.239653 \n\nVariable Importance: NUM_AS_ROOT:\n    1.   "col_int"  6.000000 ################\n    2. "col_float"  3.000000 ######\n    3. "col_cat_2"  1.000000 \n\nVariable Importance: NUM_NODES:\n    1. "col_float" 29.000000 ################\n    2.   "col_int" 26.000000 #############\n    3.  "col_bool" 12.000000 \n    4. "col_cat_2" 11.000000 \n\nVariable Importance: SUM_SCORE:\n    1.  "col_bool" 73.616851 ################\n    2. "col_float" 70.501937 ##############\n    3.   "col_int" 59.014978 ##########\n    4. "col_cat_2" 32.388708 \n\n\n\nWinner takes all: true\nOut-of-bag evaluation: accuracy:0.166667 logloss:17.4497\nNumber of trees: 10\nTotal number of nodes: 166\n\nNumber of nodes by tree:\nCount: 10 Average: 16.6 StdDev: 1.74356\nMin: 13 Max: 19 Ignored: 0\n----------------------------------------------\n[ 13, 14) 1  10.00%  10.00% ##\n[ 14, 15) 0   0.00%  10.00%\n[ 15, 16) 2  20.00%  30.00% ####\n[ 16, 17) 0   0.00%  30.00%\n[ 17, 18) 5  50.00%  80.00% ##########\n[ 18, 19) 0   0.00%  80.00%\n[ 19, 19] 2  20.00% 100.00% ####\n\nDepth by leafs:\nCount: 88 Average: 3.65909 StdDev: 1.22369\nMin: 1 Max: 6 Ignored: 0\n----------------------------------------------\n[ 1, 2)  4   4.55%   4.55% #\n[ 2, 3) 11  12.50%  17.05% ####\n[ 3, 4) 23  26.14%  43.18% ########\n[ 4, 5) 29  32.95%  76.14% ##########\n[ 5, 6) 15  17.05%  93.18% #####\n[ 6, 6]  6   6.82% 100.00% ##\n\nNumber of training obs by leaf:\nCount: 88 Average: 6.81818 StdDev: 1.79358\nMin: 5 Max: 14 Ignored: 0\n----------------------------------------------\n[  5,  6) 28  31.82%  31.82% ##########\n[  6,  7) 17  19.32%  51.14% ######\n[  7,  8) 11  12.50%  63.64% 
####\n[  8,  9) 19  21.59%  85.23% #######\n[  9, 10)  7   7.95%  93.18% ###\n[ 10, 11)  4   4.55%  97.73% #\n[ 11, 12)  0   0.00%  97.73%\n[ 12, 13)  1   1.14%  98.86%\n[ 13, 14)  0   0.00%  98.86%\n[ 14, 14]  1   1.14% 100.00%\n\nAttribute in nodes:\n\t29 : col_float [NUMERICAL]\n\t26 : col_int [NUMERICAL]\n\t12 : col_bool [BOOLEAN]\n\t11 : col_cat_2 [CATEGORICAL]\n\nAttribute in nodes with depth <= 0:\n\t6 : col_int [NUMERICAL]\n\t3 : col_float [NUMERICAL]\n\t1 : col_cat_2 [CATEGORICAL]\n\nAttribute in nodes with depth <= 1:\n\t12 : col_int [NUMERICAL]\n\t9 : col_float [NUMERICAL]\n\t4 : col_bool [BOOLEAN]\n\t1 : col_cat_2 [CATEGORICAL]\n\nAttribute in nodes with depth <= 2:\n\t20 : col_float [NUMERICAL]\n\t16 : col_int [NUMERICAL]\n\t7 : col_cat_2 [CATEGORICAL]\n\t4 : col_bool [BOOLEAN]\n\nAttribute in nodes with depth <= 3:\n\t28 : col_float [NUMERICAL]\n\t19 : col_int [NUMERICAL]\n\t10 : col_bool [BOOLEAN]\n\t9 : col_cat_2 [CATEGORICAL]\n\nAttribute in nodes with depth <= 5:\n\t29 : col_float [NUMERICAL]\n\t26 : col_int [NUMERICAL]\n\t12 : col_bool [BOOLEAN]\n\t11 : col_cat_2 [CATEGORICAL]\n\nCondition type in nodes:\n\t55 : HigherCondition\n\t12 : TrueValueCondition\n\t11 : ContainsBitmapCondition\nCondition type in nodes with depth <= 0:\n\t9 : HigherCondition\n\t1 : ContainsBitmapCondition\nCondition type in nodes with depth <= 1:\n\t21 : HigherCondition\n\t4 : TrueValueCondition\n\t1 : ContainsBitmapCondition\nCondition type in nodes with depth <= 2:\n\t36 : HigherCondition\n\t7 : ContainsBitmapCondition\n\t4 : TrueValueCondition\nCondition type in nodes with depth <= 3:\n\t47 : HigherCondition\n\t10 : TrueValueCondition\n\t9 : ContainsBitmapCondition\nCondition type in nodes with depth <= 5:\n\t55 : HigherCondition\n\t12 : TrueValueCondition\n\t11 : ContainsBitmapCondition\nNode format: NOT_SET\n\nTraining OOB:\n\ttrees: 1, Out-of-bag evaluation: accuracy:0.0952381 logloss:32.6109\n\ttrees: 10, Out-of-bag evaluation: accuracy:0.166667 logloss:17.4497\n'

Number of predictions (with weights): 60
Task: CLASSIFICATION
Label: col_cat_1

Accuracy: 0.166667  CI95[W][0.0933069 0.266291]
LogLoss: : 17.4497
ErrorRate: : 0.833333

Default Accuracy: : 0.333333
Default LogLoss: : 1.09861
Default ErrorRate: : 0.666667

Confusion Table:
truth\prediction
       <OOD>   a  b   c
<OOD>      0   0  0   0
    a      0   5  5  10
    b      0  12  4   4
    c      0  18  1   1
Total: 60

One vs other classes:
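The "Default" metrics in the self evaluation above appear consistent with a baseline that ignores the features and only uses the label distribution. The following sketch, written under that assumption, recomputes them for the balanced three-class label `col_cat_1`.

```python
import math
from collections import Counter

labels = ["a", "b", "c"] * 20  # Same values as the col_cat_1 column.
counts = Counter(labels)
n = len(labels)
priors = {value: count / n for value, count in counts.items()}

# Baseline accuracy: always predict the most frequent class.
default_accuracy = max(priors.values())

# Baseline log loss: always predict the prior distribution, then average
# -log(probability assigned to the true class) over all examples.
default_logloss = -sum(
    count / n * math.log(priors[value]) for value, count in counts.items()
)

print(round(default_accuracy, 6), round(default_logloss, 5))
```

With three equally frequent classes, the baseline accuracy is 1/3 and the baseline log loss is ln(3), which matches the 0.333333 and 1.09861 reported above. The trained model's out-of-bag accuracy (0.166667) is below this baseline, as expected on a toy dataset whose features carry no usable signal for the label.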

Feature Selection

YDF offers many ways to customize which features to use and how to use them.

When specifying the learner, we can manually select a subset of the features.

In [4]:

model_2 = ydf.RandomForestLearner(label="col_cat_1", num_trees=10, features=["col_int", "col_bool"]).train(df)
print(model_2)

WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.

Model: RANDOM_FOREST
Task: CLASSIFICATION
Class: ydf.RandomForestModel
Use `model.describe()` for more details

[INFO 23-10-31 18:13:26.2825 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.2829 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.2829 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set.
[INFO 23-10-31 18:13:26.2829 UTC random_forest.cc:416] Training random forest on 60 example(s) and 2 feature(s).
[INFO 23-10-31 18:13:26.2853 UTC random_forest.cc:802] Training of tree  1/10 (tree index:0) done accuracy:0.142857 logloss:30.8946
[INFO 23-10-31 18:13:26.2855 UTC random_forest.cc:802] Training of tree  10/10 (tree index:9) done accuracy:0.15 logloss:21.5379
[INFO 23-10-31 18:13:26.2866 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.15 logloss:21.5379

Forcing a semantic

We can also force a semantic on a specific feature. Here, we force the integer column to be treated as categorical. Note that we set include_all_columns=True so that columns not explicitly listed in features are still used.

Not every semantic can be forced on every column. Categorical features must be integers or strings, while numerical features must be floats or integers.

Note: Internally, YDF converts all numerical columns to 32-bit floats. It is therefore not necessary to perform conversions between numerical formats.
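One practical consequence of 32-bit storage is that very large integers are no longer distinguishable. The sketch below uses only the standard library to round-trip values through a 32-bit float; it assumes the internal conversion behaves like a plain cast, which this page does not state explicitly.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a 64-bit Python float through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


exact_limit = float(2**24)  # 16777216.0, last point where every integer is exact.

still_exact = to_float32(exact_limit) == exact_limit
next_is_exact = to_float32(exact_limit + 1) == exact_limit + 1
collapsed = to_float32(exact_limit + 1) == to_float32(exact_limit)

print(still_exact, next_is_exact, collapsed)  # True False True
```

Under that assumption, integer identifiers above 2**24 would be better treated as categorical than numerical.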

In [5]:

model_3 = ydf.RandomForestLearner(
    label="col_cat_1",
    num_trees=10,  # Compute only 10 trees
    features=[ydf.Feature("col_int", semantic=ydf.Semantic.CATEGORICAL)],
    include_all_columns=True  # Include all columns, not just the ones listed in "features"
).train(df)
print(model_3)

WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.

Model: RANDOM_FOREST
Task: CLASSIFICATION
Class: ydf.RandomForestModel
Use `model.describe()` for more details

[INFO 23-10-31 18:13:26.2941 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.2947 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.2948 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set.
[INFO 23-10-31 18:13:26.2948 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s).
[INFO 23-10-31 18:13:26.2974 UTC random_forest.cc:802] Training of tree  1/10 (tree index:0) done accuracy:0.0952381 logloss:32.6109
[INFO 23-10-31 18:13:26.2976 UTC random_forest.cc:802] Training of tree  10/10 (tree index:7) done accuracy:0.101695 logloss:21.2695
[INFO 23-10-31 18:13:26.2987 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.101695 logloss:21.2695
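The interaction between `features` and `include_all_columns` can be summarized with a small sketch. The function `resolve_input_features` is hypothetical and works on column names only; it is not part of YDF, but it reproduces the feature counts reported in the training logs on this page (4, 2 and 4 features for the first three models).

```python
from typing import List, Optional, Sequence


def resolve_input_features(
    columns: Sequence[str],
    label: str,
    features: Optional[Sequence[str]] = None,
    include_all_columns: bool = False,
) -> List[str]:
    """Hypothetical helper: list the columns used as input features."""
    listed = [name for name in (features or []) if name != label]
    unknown = [name for name in listed if name not in columns]
    if unknown:
        raise ValueError(f"Unknown feature column(s): {unknown}")
    # Without an explicit list, or when all columns are requested,
    # every column except the label is used.
    if features is None or include_all_columns:
        return [name for name in columns if name != label]
    # Otherwise only the explicitly listed features are used.
    return listed


columns = ["col_cat_1", "col_cat_2", "col_int", "col_float", "col_bool"]

all_features = resolve_input_features(columns, "col_cat_1")
subset = resolve_input_features(columns, "col_cat_1", ["col_int", "col_bool"])
forced = resolve_input_features(
    columns, "col_cat_1", ["col_int"], include_all_columns=True
)
print(len(all_features), len(subset), len(forced))  # 4 2 4
```

In the real API, entries of `features` can also be `ydf.Feature` objects that carry a semantic; the sketch ignores that and only tracks which columns end up selected.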

Fine-grained semantics

YDF creates a dictionary for processing categorical features quickly. Models often generalize better when rare values are subsumed as "Out-of-dictionary" (OOD) values. As usual, YDF provides sensible defaults: each value appearing fewer than 5 times is considered OOD, and there can be at most 2000 non-OOD values. These defaults can be changed in the learner constructor.

In [6]:

model_4 = ydf.RandomForestLearner(
    label="col_cat_1", 
    num_trees=10,  # Compute only 10 trees
    max_vocab_count=300,  # Allow at most 300 non-OOD values.
    min_vocab_frequency=3,  # Any value appearing less than 3 times is considered OOD.
    features=[ydf.Feature("col_int", semantic=ydf.Semantic.CATEGORICAL)],
    include_all_columns=True
).train(df)
print(model_4)

WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.

Model: RANDOM_FOREST
Task: CLASSIFICATION
Class: ydf.RandomForestModel
Use `model.describe()` for more details

[INFO 23-10-31 18:13:26.3056 UTC dataset.cc:299] max_vocab_count = -1 for column col_cat_1, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.3061 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.3061 UTC abstract_learner.cc:141] The label "col_cat_1" was removed from the input feature set.
[INFO 23-10-31 18:13:26.3061 UTC random_forest.cc:416] Training random forest on 60 example(s) and 4 feature(s).
[INFO 23-10-31 18:13:26.3084 UTC random_forest.cc:802] Training of tree  1/10 (tree index:1) done accuracy:0.136364 logloss:31.1286
[INFO 23-10-31 18:13:26.3086 UTC random_forest.cc:802] Training of tree  10/10 (tree index:8) done accuracy:0.133333 logloss:19.1921
[INFO 23-10-31 18:13:26.3098 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.133333 logloss:19.1921
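The effect of `min_vocab_frequency` and `max_vocab_count` can be illustrated without YDF. The sketch below is a simplified reading of the rules described in this section, not YDF's actual dictionary code; the function `build_vocabulary` is hypothetical. It is applied to the values of `col_cat_2`, where "q", "w" and "r" each appear only twice.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def build_vocabulary(
    values: Sequence[str],
    min_vocab_frequency: int = 5,
    max_vocab_count: int = 2000,
) -> Tuple[List[str], int]:
    """Hypothetical helper: return (kept values, number of OOD examples)."""
    counts = Counter(values)
    # Keep values that are frequent enough, most frequent first.
    frequent = [
        (value, count)
        for value, count in counts.most_common()
        if count >= min_vocab_frequency
    ]
    # Cap the number of non-OOD values.
    kept = frequent[:max_vocab_count]
    vocabulary = [value for value, _ in kept]
    ood_examples = len(values) - sum(count for _, count in kept)
    return vocabulary, ood_examples


col_cat_2 = ["x", "x", "x", "y", "y", "y"] * 9 + ["q", "q", "w", "w", "r", "r"]

default_vocab, default_ood = build_vocabulary(col_cat_2)
relaxed_vocab, relaxed_ood = build_vocabulary(col_cat_2, min_vocab_frequency=2)
print(default_vocab, default_ood)  # ['x', 'y'] 6
print(relaxed_vocab, relaxed_ood)  # ['x', 'y', 'q', 'w', 'r'] 0
```

With the default threshold, the six rare examples are merged into a single OOD bucket; lowering the threshold to 2 keeps all five values in the dictionary.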

Fine-grained semantics can also be specified on individual features.

In [7]:

explicit_features = [
    ydf.Feature("col_cat_1", 
                min_vocab_frequency=1,  # No minimum frequency for elements of this feature.
                semantic=ydf.Semantic.CATEGORICAL  # Required when setting min_vocab_frequency.
               ),
    "col_cat_2",  # It is not necessary to provide detailed semantics for all features.
    "col_bool"
]
model_explicit_semantics = ydf.RandomForestLearner(
    label="col_int", 
    num_trees=10,  # Compute only 10 trees
    min_vocab_frequency=3,  # Any value appearing less than 3 times is considered OOD by default.
    features=explicit_features,
    include_all_columns=False
).train(df)
print(model_explicit_semantics)

WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=128 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.

Model: RANDOM_FOREST
Task: CLASSIFICATION
Class: ydf.RandomForestModel
Use `model.describe()` for more details

[INFO 23-10-31 18:13:26.3164 UTC dataset.cc:299] max_vocab_count = -1 for column col_int, the dictionary will not be pruned by size.
[INFO 23-10-31 18:13:26.3169 UTC abstract_learner.cc:127] No input feature explicitly specified. Using all the available input features.
[INFO 23-10-31 18:13:26.3169 UTC abstract_learner.cc:141] The label "col_int" was removed from the input feature set.
[INFO 23-10-31 18:13:26.3169 UTC random_forest.cc:416] Training random forest on 60 example(s) and 3 feature(s).
[INFO 23-10-31 18:13:26.3194 UTC random_forest.cc:802] Training of tree  1/10 (tree index:0) done accuracy:0 logloss:36.0437
[INFO 23-10-31 18:13:26.3199 UTC random_forest.cc:802] Training of tree  10/10 (tree index:9) done accuracy:0 logloss:36.0437
[INFO 23-10-31 18:13:26.3208 UTC random_forest.cc:882] Final OOB metrics: accuracy:0 logloss:36.0437

Advanced: Creating a VerticalDataset

Internally, YDF uses a data structure called VerticalDataset to store the training dataset. Normally, the VerticalDataset is created automatically during training, but it can also be created explicitly. This is useful when the same dataset is used several times, because it avoids converting the data from Pandas again on each use.

In [8]:

vds = ydf.create_vertical_dataset(
    df,
    # Columns and their semantics can be specified the same way
    # features are specified for learners
    columns=["col_cat_1", "col_int", "col_bool"]
)
vds.memory_usage()  # Prints memory usage in bytes.

Out[8]:

A VerticalDataset also contains a DataSpecification, which collects all the information about the dataset that is used during training: the semantic of each column, the dictionaries of categorical features, statistics about numerical features, and more.

In [9]:

vds.data_spec()  # Print the data spec.

Out[9]:

columns {
  type: CATEGORICAL
  name: "col_cat_1"
  categorical {
    number_of_unique_values: 4
    items {
      key: "c"
      value {
        index: 3
        count: 20
      }
    }
    items {
      key: "b"
      value {
        index: 2
        count: 20
      }
    }
    items {
      key: "a"
      value {
        index: 1
        count: 20
      }
    }
    items {
      key: "<OOD>"
      value {
        index: 0
        count: 0
      }
    }
  }
  count_nas: 0
}
columns {
  type: NUMERICAL
  name: "col_int"
  numerical {
    mean: 29.5
    min_value: 0
    max_value: 59
    standard_deviation: 17.318102282486574
  }
  count_nas: 0
}
columns {
  type: BOOLEAN
  name: "col_bool"
  count_nas: 0
  boolean {
    count_true: 30
    count_false: 30
  }
}
created_num_rows: 60
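The numerical statistics in the data specification above can be cross-checked with the standard library. This sketch recomputes them for `col_int` and compares the two usual definitions of standard deviation; the conclusion about which formula is used is an observation from the printed numbers, not a documented guarantee.

```python
import statistics

col_int = list(range(60))  # Same values as the col_int column.

summary = {
    "mean": statistics.mean(col_int),
    "min_value": min(col_int),
    "max_value": max(col_int),
    # Population formula: divide by n.
    "population_std": statistics.pstdev(col_int),
    # Sample formula: divide by n - 1.
    "sample_std": statistics.stdev(col_int),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The population value agrees with the reported `standard_deviation` of 17.318102282486574, while the sample value (about 17.464) does not, so the data specification appears to use the population formula.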