Feature Semantics¶
When training a model, YDF needs to understand how to interpret the feature values in the training data. A feature might, for example, be a numerical quantity, a category, or a set of tags. The interpretation of a feature's values is called the feature semantic.
The semantic of a feature is related to, but different from, the feature's representation, that is, the (technical) data type of the feature. For example, a feature represented by 64-bit integers might have a numerical semantic or a categorical semantic.
In basic cases, YDF detects the semantics of a feature automatically, so you only need to check them after training (e.g., with model.describe()). If YDF does not detect the correct semantic, you can manually override it. Using the wrong semantic negatively impacts the training speed and the quality of a model. Also, YDF cannot consume all types of features natively. In such cases, features need to be preprocessed into a supported semantic.
This guide explains the different semantics available in YDF, how to check / select them, and gives recommendations on how to feed different types of features into the model.
This guide assumes basic familiarity with YDF, e.g. the Getting Started tutorial.
Introduction: How to specify feature semantics¶
Unless it is given additional information, YDF automatically determines the feature semantics when training the model:
model = ydf.RandomForestLearner(label="label").train(ds)
# The "Dataspec" tab of the model description shows the feature semantics
# used for this model.
model.describe()
It is possible to override the feature semantics manually using the ydf.Semantic enum:
model = ydf.RandomForestLearner(
features=[("f1", ydf.Semantic.NUMERICAL), ("f2", ydf.Semantic.CATEGORICAL)],
include_all_columns=True, # Also use not explicitly defined features.
label="label"
).train(ds)
model.describe()
Currently, YDF supports 5 input feature semantics. New semantics are added from time to time following our research. The semantics are:
ydf.Semantic.NUMERICAL
ydf.Semantic.CATEGORICAL
ydf.Semantic.BOOLEAN
ydf.Semantic.CATEGORICAL_SET
ydf.Semantic.DISCRETIZED_NUMERICAL
The next section will explain the individual semantics in more detail.
Feature semantics¶
ydf.Semantic.NUMERICAL¶
NUMERICAL features represent quantities, amounts, or more generally, any ordered values. For example, age (in years), duration (in seconds), net worth (in dollars), number of requests (in count), scores (in points), and even the median of a distribution, are NUMERICAL features.
YDF automatically recognizes integer and floating-point values as NUMERICAL.
# A dataset with 4 numerical features.
dataset = {
"age": np.array([1, 55, 24, 8]),
"number of cats": np.array([0, 10, 4, 2]),
"net worth": np.array([0.0, 123456.78, -4000.0, 315.42]),
"score": np.array([1.1, 5.2, math.nan, 1.2]),
}
ydf.Semantic.CATEGORICAL¶
CATEGORICAL features represent categories, enum values, tags, or more generally, any unordered values. For example, species (among cat, bird, or fish), blood type (among A, B, AB, O), country, language, and project status (in planning, in progress, done, canceled). YDF automatically recognizes string features as CATEGORICAL.
Additional considerations for categorical features are:
- Don't bucket: With neural networks, numerical features are sometimes bucketed into categorical brackets (e.g., 0-5, 5-10, 10-20). This is not beneficial for YDF. If you can, feed the value directly as numerical. If you only have the bucketed data, also feed it as a numerical feature.
- Don't use one-hot encoding: One-hot encoding of categorical features consistently underperforms with tree algorithms and should not be used in YDF.
- Preprocessing: To avoid overfitting, YDF automatically replaces rare categorical values with OOD ("out of dictionary"). This generally leads to better models. This behavior is controlled by the hyperparameters max_vocab_count (to limit the number of categories) and min_vocab_frequency (to prune rare categories).
- Unknown Values: During inference, any unknown categorical value is treated as OOD, which is distinct from a missing value. For example, consider a model with a feature "species" that has the values "cat", "bird" and "fish" in the training dataset. If, during model inference, an instance has the value "tiger" for the "species" feature, the model implicitly transforms "tiger" to the OOD token.
- Python Enums: Numerical Python enums should generally have a categorical semantic but, since they are integers, they are automatically recognized as NUMERICAL. Enums should therefore be specified manually as CATEGORICAL.
# A dataset with 4 categorical features.
dataset = {
    "species": np.array(["cat", "bird", "bird", "fish", "fish"]),
    "country": np.array(["US", "US", "Switzerland", "India", "India"]),
    "month": np.array([1, 1, 4, 6, 1]),  # CATEGORICAL features can be integer.
    "blood type": np.array(["A", "B", "AB", "B", ""]),  # Missing values are empty.
}
model = ydf.RandomForestLearner(
    label="blood type",
    # Since integers are auto-detected as NUMERICAL, specify the semantic manually.
    features=[("month", ydf.Semantic.CATEGORICAL)],
    min_vocab_frequency=2,  # Prune vocabulary items that appear only once.
).train(dataset)
# Check that all features and the label are CATEGORICAL.
# Check the model.data_spec() columns to see each feature's categories.
model.describe()
ydf.Semantic.BOOLEAN¶
The value of a BOOLEAN feature can only be true, false, or missing. Examples are "has subscribed", "is spam", "is in stock", etc. Boolean features are a special case of categorical features. YDF automatically recognizes boolean features as BOOLEAN for most dataset formats.
The label of the model can never be BOOLEAN. Note that binary classification uses CATEGORICAL labels.
# A dataset with 3 boolean features.
dataset = {
"has subscribed": np.array([True, False, False, True]),
"spam": np.array([1,0,1,1], dtype=bool), # Ensure the dtype is not integer.
"happy": np.array([True, False, False, True]),
}
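Since a BOOLEAN column cannot be used as a label, a boolean label for binary classification can first be converted to strings. A minimal sketch (the label column and its values are illustrative, building on the dataset above):
# Convert a boolean column to a CATEGORICAL (string) label for binary classification.
dataset["label"] = np.where(np.array([True, False, False, True]), "true", "false")
model = ydf.RandomForestLearner(label="label").train(dataset)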
Warning
Avoid IDs as features: Many datasets have features of type "ID", "identifier", "unique_hash", etc. that are (nearly) unique for each example in the dataset. These features are not useful for training a machine learning model; they slow down model training and might increase the model size. It is therefore important to remove these features from the dataset before training.
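For example, a minimal sketch (the column names are illustrative) of removing an identifier column before training:
import numpy as np
import ydf

dataset = {
    "user_id": np.array(["id_1", "id_2", "id_3", "id_4"]),  # (Nearly) unique per example.
    "age": np.array([12, 45, 33, 27]),
    "label": np.array(["yes", "no", "no", "yes"]),
}
del dataset["user_id"]  # Remove the identifier before training.
model = ydf.RandomForestLearner(label="label").train(dataset)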
Special semantics¶
ydf.Semantic.DISCRETIZED_NUMERICAL¶
DISCRETIZED_NUMERICAL is not really a new semantic. Instead, it is used to tell the learning algorithm to optimize training with a special discretization algorithm. Any NUMERICAL feature can be configured as DISCRETIZED_NUMERICAL. Training will be faster (generally ~2x), but it can hurt the model quality. Setting all the NUMERICAL features as DISCRETIZED_NUMERICAL is equivalent to setting the hyperparameter discretize_numerical_columns=True.
data = {
"age": np.array([1, 55, 24, 8]),
"net worth": np.array([0.0, 123456.78, -4000.0, 315.42]),
"weight": np.array([9, 63, 70, np.nan]),
}
model = ydf.RandomForestLearner(
label="weight",
discretize_numerical_columns=False, # Default
features=[("net worth", ydf.Semantic.DISCRETIZED_NUMERICAL)],
task=ydf.Task.REGRESSION,
).train(data)
# `net worth` is DISCRETIZED_NUMERICAL, `age` and `weight` are NUMERICAL.
model.describe()
ydf.Semantic.CATEGORICAL_SET¶
The value of a categorical-set feature is a set of categorical values. In other words, while a ydf.Semantic.CATEGORICAL feature can only have one value, a ydf.Semantic.CATEGORICAL_SET feature can have none, one, or many categorical values. Use this semantic for sets of discrete values, such as tokenized text or tag sets (e.g., a webpage talks about {politics, elections, united-states}).
For text features, the tokenization is important to consider. Splitting on spaces works okay in English, but poorly in Chinese. For example, "A cat sits on a tree" becomes {a, cat, on, sits, tree}. Using a more powerful tokenizer might be better. Since a set does not encode position, {a, cat, on, sits, tree} is equivalent to {a, tree, sits, on, cat}. One solution is to use multi-grams (e.g., bi-grams) that encode consecutive words. For example, the bi-grams in our example are {a_cat, cat_sits, sits_on, on_a, a_tree}.
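For illustration, a minimal sketch (these helper functions are not part of YDF) of whitespace tokenization and bi-gram generation:
def tokenize(text: str) -> list[str]:
    # Split on whitespace and lowercase the tokens.
    return text.lower().split()

def bigrams(tokens: list[str]) -> list[str]:
    # Concatenate consecutive tokens into bi-grams.
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

tokens = tokenize("A cat sits on a tree")  # ['a', 'cat', 'sits', 'on', 'a', 'tree']
print(bigrams(tokens))  # ['a_cat', 'cat_sits', 'sits_on', 'on_a', 'a_tree']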
- Preprocessing: As with CATEGORICAL features, YDF automatically replaces rare categorical values with OOD ("out of dictionary") for CATEGORICAL_SET. This behavior is controlled by the hyperparameters max_vocab_count (to limit the number of categories) and min_vocab_frequency (to prune rare categories).
- Training speed: CATEGORICAL_SET features are slower to train than NUMERICAL or CATEGORICAL features. Don't use CATEGORICAL_SET instead of CATEGORICAL features (the result will be the same, but the model will train more slowly).
- Tokenization: When using CATEGORICAL_SET for text features, the text must be tokenized before it is fed to YDF. CSV files are tokenized by whitespace automatically; see the section on CSV files for details.
# A dataset with 2 categorical set features and one categorical feature.
dataset = {
"title": [["Next", "week", "are", "us", "elections"], ["Reform", "started", "this", "month"], ["Funniest", "politics", "speeches"]],
"tags": [["politics", "election"], ["politics"], ["funny", "politics"]],
"interesting": ["yes", "yes", "no"],
}
model = ydf.RandomForestLearner(
label="interesting",
min_vocab_frequency=1, # Don't prune the categories.
features=[("title", ydf.Semantic.CATEGORICAL_SET), ("tags", ydf.Semantic.CATEGORICAL_SET)],
).train(dataset)
Multi-dimensional features¶
YDF supports constant-size vectors as features. This is typically used when dealing with vector embeddings. For example, consider a text feature text that is transformed (during preprocessing) with a Universal Sentence Encoder model to a numerical vector of 512 entries. This vector can be fed directly to YDF.
YDF "unrolls" each entry of the vector to an individual feature. These features are named text.0_of_512, text.1_of_512, etc. Note that all vectors must have exactly the same size; if the vectors have different sizes, consider the CATEGORICAL_SET semantic. See here for a more detailed example.
# A dataset with a two-dimensional numerical feature
# and a two-dimensional categorical feature.
dataset = {
"categorical_vector": np.array([["a", "b"], ["a", "c"], ["b", "c"]]),
"numerical_embeeding": np.array([[1, 2], [3, 4], [5, 6]]),
"label": np.array([1, 2, 1]),
}
model = ydf.RandomForestLearner(
label="label",
).train(dataset)
# `categorical_vector` is unrolled to two CATEGORICAL features:
# `categorical_vector.0_of_2` and `categorical_vector.1_of_2`.
# `numerical_embedding` is unrolled to two NUMERICAL features:
# `numerical_embedding.0_of_2` and `numerical_embedding.1_of_2`.
model.describe()
Note
ydf.Semantic.HASH is used internally only and cannot be used for decision tree training.
Not natively supported semantics¶
There are some features that YDF cannot consume natively; these need to be preprocessed first. This section details the most common scenarios.
Timestamps¶
Timestamps should be converted to the NUMERICAL semantic. A popular choice is to decompose the timestamp into calendar features such as day of the week, week of the month, or hour of the day. Simply converting a timestamp into a numerical Unix time generally does not work well.
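For example, a minimal sketch (not part of the YDF API; the feature names are illustrative) of decomposing Unix timestamps into calendar features:
import datetime
import numpy as np

timestamps = np.array([1700000000, 1700086400, 1700172800])  # Unix times in seconds.
dts = [datetime.datetime.fromtimestamp(int(t), tz=datetime.timezone.utc) for t in timestamps]
dataset = {
    "hour_of_day": np.array([dt.hour for dt in dts]),
    "day_of_week": np.array([dt.weekday() for dt in dts]),
    "day_of_month": np.array([dt.day for dt in dts]),
    "month": np.array([dt.month for dt in dts]),
}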
Warning
The presence of timestamps often indicates that the dataset contains time-series data (see next section).
Time-series¶
Time series datasets require advanced feature preprocessing for good model quality. Check the special guide in the YDF documentation for more information.
Warning
Time series require careful modeling to prevent future leakage or poor model quality. For complex problems, using a preprocessing tool such as Temporian is recommended.
Repeated proto messages¶
Repeated proto messages must be "flattened" into individual features. For repeated numerical entries, creating statistical features such as the maximum, minimum, mean, median, and variance is useful. Repeated categorical entries can be transformed into CATEGORICAL_SET features.
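A minimal sketch (the field names are hypothetical) of such a flattening:
import numpy as np

# Variable-length lists, e.g. taken from repeated proto fields.
purchases = [[10.0, 25.5], [3.0], [7.0, 7.0, 90.0]]
tags = [["sale", "online"], ["store"], ["online", "refund", "online"]]

dataset = {
    # Statistical features computed from the repeated numerical entries.
    "purchases_mean": np.array([np.mean(p) for p in purchases]),
    "purchases_max": np.array([np.max(p) for p in purchases]),
    "purchases_count": np.array([len(p) for p in purchases]),
    # Repeated categorical entries, to be declared as CATEGORICAL_SET when training.
    "tags": tags,
}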
Images¶
Decision Forest models are not a state-of-the-art architecture for image processing. In many cases, other model architectures (notably neural networks) should be preferred.
Details by dataset format¶
Numpy¶
Scalar Types
The following table shows which scalar Numpy data types correspond to which YDF semantics and which casts can be performed by YDF after manually specifying a feature semantic.
Numpy \ YDF | NUMERICAL [3] | CATEGORICAL | BOOLEAN | CATEGORICAL SET [6] | DISCRETIZED NUMERICAL
---|---|---|---|---|---
int [1] | Default | Cast [4] | No support | No support | Cast
float [2] | Default | No support | No support | No support | Cast
bool | Cast | Cast [5] | Default | No support | Cast
str | No support | Default | No support | No support | No support
bytes | No support | Default | No support | No support | No support
[1]: Includes unsigned integers
[2]: float128 is not supported
[3]: YDF internally casts numerical values to float32.
[4]: Internally, the values are cast to string and sorted by frequency. Label values are ordered lexicographically (for strings) or increasing (for integers).
[5]: Internally, the values are cast to "false" and "true", with "false" coming first.
[6]: Use type object for CATEGORICAL SET support; see below.
Two-dimensional arrays (i.e. matrices)
YDF unrolls two-dimensional arrays (i.e., matrices) by column.
Object
YDF inspects features given as numpy arrays of type object and treats them differently based on their content.
If the array's first element is a scalar type, YDF attempts to cast the array to type bytes and treats it as CATEGORICAL.
If the array's first element is a Python list or Numpy array (irrespective of dtype), YDF checks that it contains only lists (arrays) and fails otherwise. If all lists have the same length, YDF attempts to cast the sub-lists (sub-arrays) to type np.bytes and treats the entire feature as a matrix of CATEGORICAL features. This matrix is then unrolled into individual features.
If the sub-lists (sub-arrays) have different sizes, YDF attempts to cast the sub-lists (sub-arrays) to type np.bytes and treats the entire feature with semantic CATEGORICAL_SET.
Any other types or combinations of types are not supported.
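A minimal sketch (illustrative only) of these cases:
import numpy as np

# Scalar objects: cast to bytes and treated as CATEGORICAL.
species = np.array(["cat", "bird", "fish"], dtype=object)

# Sub-lists of different sizes: treated as CATEGORICAL_SET.
tags = np.array([["a"], ["a", "b", "c"], ["b", "c"]], dtype=object)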
Missing values
Missing NUMERICAL values are np.nan; missing CATEGORICAL values are empty strings. Missing BOOLEAN or CATEGORICAL_SET values cannot be represented.
Python lists¶
YDF can consume Python lists as features. However, automatic semantic detection is not enabled for Python lists. Furthermore, multi-dimensional features (except CATEGORICAL_SET) cannot be fed with Python lists.
CSV¶
Automatic Semantic detection
- Columns with only 0 and 1 are recognized as BOOLEAN.
- Columns with only numeric values are recognized as NUMERICAL.
- Other columns are recognized as CATEGORICAL (see below).
- Multidimensional features are not supported.
Example code:
"""!cat mycsv.csv
num,bool,cat,catset,label
1.0,1,a,x y,1
1.5,0,b,y z,2
2.0,1,1,x y z,3"""
model = ydf.RandomForestLearner(
label="label",
# Note that CATEGORICAL_SET columns are tokenized by whitespace.
features=[("catset", ydf.Semantic.CATEGORICAL_SET)],
min_vocab_frequency=1,
include_all_columns=True,
).train("csv:mycsv.csv")
# Column "num" is NUMERICAL.
# Column "bool" is BOOLEAN
# Column "cat" is CATEGORICAL
# Column "catset" is CATEGORICAL_SET (as specified).
model.describe()
Warning
Only the first 100,000 rows are inspected to determine a column's type. Adjust max_num_scanned_rows_to_infer_semantic to change this limit.
Automatic tokenization
When reading from CSV files, YDF tokenizes features with semantic CATEGORICAL_SET by splitting along whitespace.
Some YDF surfaces may automatically infer the type of string columns containing whitespace as having semantic CATEGORICAL_SET. Note that the Python API of YDF does not automatically infer type CATEGORICAL_SET, even if reading CSV files.
Missing values
Missing values are represented by the string na or by empty strings.
Avro¶
Automatic Semantic detection
Avro columns are typed, so a column's type is used by YDF by default.
- Boolean columns are recognized as BOOLEAN
- Long, Int, Float and Double columns are recognized as NUMERICAL
- Bytes and String columns are recognized as CATEGORICAL
- Array columns are unrolled based on the type of data in the array. Arrays of Bytes or Strings may furthermore have type CATEGORICAL_SET.
- Nested arrays are currently not supported but may be supported in the future.
Advanced: How decision forests use feature semantics¶
This section is not required for using YDF, but it might offer additional context on the individual semantics and how they are used by common decision forest learning algorithms.
Recall that decision trees recursively split the dataset until a stopping condition is reached. The feature semantics both dictate which types of splits are considered, and how the algorithm finds the best split.
- NUMERICAL: The learning algorithm creates splits in the data based on thresholds (e.g., "age >= 30"). For even more powerful models, enable oblique splits (see the sketch after this list). This allows YDF to learn splits that combine multiple numerical features (e.g., "0.3 * age + 0.7 * income >= 50"). This is particularly helpful for smaller datasets but requires more training time.
- CATEGORICAL: The learning algorithm creates splits by grouping values into sets (e.g., "country in {USA, Switzerland, Luxemburg}"). This is more expressive than numerical splits but computationally more expensive.
- BOOLEAN: This semantic provides a memory- and speed-optimized way to handle boolean data compared to using the NUMERICAL or CATEGORICAL semantics.
- CATEGORICAL_SET: The learning algorithm creates splits based on set intersections (e.g., "{a, cat, on, sits, tree} intersects {cat, dog, bird}"). This semantic is computationally expensive and can be slow to train. For in-depth information about categorical sets, see Guillame-Bert et al., 2020.
Missing Values¶
YDF effectively handles missing values, which are represented differently for each semantic. During training, YDF uses global imputation:
- NUMERICAL and DISCRETIZED_NUMERICAL: Missing values are replaced with the mean of the feature.
- CATEGORICAL and BOOLEAN: Missing values are replaced with the most frequent value.
- CATEGORICAL_SET: Missing values are always routed to the negative branch of a split.
If the hyperparameter allow_na_conditions is enabled, the learning algorithm can also create splits of the form "feature is NA". Note that this is usually not necessary: global imputation replaces missing values with a "special" value, which is quickly learned by the algorithm.
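A minimal sketch (the toy dataset is illustrative) of enabling such splits:
import numpy as np
import ydf

dataset = {
    "f1": np.array([1.0, np.nan, 3.0, np.nan]),
    "label": np.array(["a", "b", "a", "b"]),
}
model = ydf.RandomForestLearner(
    label="label",
    allow_na_conditions=True,  # Allow splits of the form "f1 is NA".
).train(dataset)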
Note
Missing values are not allowed in the label column.
Label column semantic¶
The semantic of the label column is determined by the model task:
- REGRESSION: Requires NUMERICAL labels.
- CLASSIFICATION: Requires CATEGORICAL labels.
- RANKING: Requires NUMERICAL labels with integer values.
- CATEGORICAL_UPLIFT: Requires NUMERICAL labels.
- NUMERICAL_UPLIFT: Requires NUMERICAL labels.
Changing the task changes the model's loss function and output, resulting in a fundamentally different model. If the label column cannot be interpreted with the chosen task's semantic, model training fails.
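For illustration, a minimal sketch (the toy dataset is illustrative) showing how the task determines the label semantic:
import numpy as np
import ydf

dataset = {
    "f1": np.array([1.0, 2.0, 3.0, 4.0]),
    "rating": np.array([1, 2, 1, 2]),
}

# CLASSIFICATION (the default task): the integer label is interpreted as CATEGORICAL.
classifier = ydf.RandomForestLearner(label="rating").train(dataset)

# REGRESSION: the same label is interpreted as NUMERICAL.
regressor = ydf.RandomForestLearner(
    label="rating", task=ydf.Task.REGRESSION
).train(dataset)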
The data spec¶
Internally, YDF models store information about the feature semantics in the "data spec". A model's data spec is a proto message that can be displayed with model.data_spec(). The data spec also contains information about the feature values seen during training. Some of this information may be used during training and inference. In particular, the data spec stores the following information:
- For NUMERICAL and DISCRETIZED_NUMERICAL features, statistical information and, if relevant, bucketization.
- For CATEGORICAL and CATEGORICAL_SET features, the dictionary of values seen during training and their frequency. Note that the ordering is not guaranteed to be stable across versions and should not be relied upon.
- For BOOLEAN values, the frequency of true and false values seen during training.
A summary of the data spec is shown in model.describe(). The raw data spec is mainly useful for debugging issues with YDF.