GenericModel ¶
Bases: ABC
Abstract superclass for all YDF models.
analyze
abstractmethod
¶
analyze(
data: InputDataset,
sampling: float = 1.0,
num_bins: int = 50,
partial_dependence_plot: bool = True,
conditional_expectation_plot: bool = True,
permutation_variable_importance: bool = True,
shap_values: bool = True,
permutation_variable_importance_rounds: int = 1,
num_threads: Optional[int] = None,
maximum_duration: Optional[float] = 20,
features: Optional[List[str]] = None,
) -> Analysis
Analyzes the model's structure and its behavior on a dataset.
An analysis includes structural information (e.g., variable importances) and performance characteristics on the given dataset (e.g., partial dependence plots). Computing the analysis can be time-consuming on large datasets. It is generally recommended to run analysis on a test set, not the training set.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Analyze the model on a test set
test_ds = pd.read_csv("test.csv")
analysis = model.analyze(test_ds)
# Display the analysis report in a notebook
analysis
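The permutation variable importance reported by the analysis can be illustrated without YDF. The sketch below is not YDF's implementation: it assumes a hypothetical hand-written `toy_model`, a made-up dataset, and accuracy as the metric. It measures how much the accuracy drops when one feature column is shuffled, averaged over several rounds (the role played by permutation_variable_importance_rounds).

```python
import random
from typing import Callable, Dict, List

Example = Dict[str, float]


def accuracy(model: Callable[[Example], int], examples: List[Example],
             labels: List[int]) -> float:
    """Fraction of examples for which the model predicts the label."""
    correct = sum(1 for ex, y in zip(examples, labels) if model(ex) == y)
    return correct / len(examples)


def permutation_importance(model: Callable[[Example], int],
                           examples: List[Example], labels: List[int],
                           feature: str, rounds: int = 1,
                           seed: int = 0) -> float:
    """Average accuracy drop after shuffling `feature` (requires rounds >= 1)."""
    rng = random.Random(seed)
    baseline = accuracy(model, examples, labels)
    drops = []
    for _ in range(rounds):
        # Shuffle the values of one feature across examples; keep the others.
        shuffled_values = [ex[feature] for ex in examples]
        rng.shuffle(shuffled_values)
        shuffled_examples = [
            {**ex, feature: value}
            for ex, value in zip(examples, shuffled_values)
        ]
        drops.append(baseline - accuracy(model, shuffled_examples, labels))
    return sum(drops) / rounds


def toy_model(example: Example) -> int:
    """Hypothetical model that only looks at the feature "f1"."""
    return int(example["f1"] > 0.5)


data_rng = random.Random(42)
examples = [{"f1": data_rng.random(), "f2": data_rng.random()}
            for _ in range(200)]
labels = [toy_model(ex) for ex in examples]

importance_f1 = permutation_importance(toy_model, examples, labels, "f1", rounds=5)
importance_f2 = permutation_importance(toy_model, examples, labels, "f2", rounds=5)
```

Because the toy model ignores "f2", shuffling that column leaves every prediction unchanged and its importance is zero, while shuffling "f1" degrades the accuracy.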
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
InputDataset
|
The dataset for analysis. |
required |
sampling
|
float
|
The fraction of examples to use for the analysis (e.g., 0.1 for 10%). On large datasets, a smaller sample can significantly speed up computation. |
1.0
|
num_bins
|
int
|
The number of bins for accumulating statistics in plots. More bins provide higher resolution but take longer to compute. |
50
|
partial_dependence_plot
|
bool
|
If |
True
|
conditional_expectation_plot
|
bool
|
If |
True
|
permutation_variable_importance
|
bool
|
If |
True
|
shap_values
|
bool
|
If |
True
|
permutation_variable_importance_rounds
|
int
|
The number of rounds for permutation variable importance. More rounds increase accuracy but take longer. A value of 1 is often sufficient. Set to 0 to disable. |
1
|
num_threads
|
Optional[int]
|
The number of threads to use. Defaults to the number of available CPU cores. |
None
|
maximum_duration
|
Optional[float]
|
The approximate maximum duration of the analysis in seconds. The analysis may run slightly longer. |
20
|
features
|
Optional[List[str]]
|
If specified, PDP and CEP plots will be limited to these features and displayed in this order. |
None
|
Returns:
| Type | Description |
|---|---|
Analysis
|
An |
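The idea behind a partial dependence plot can be sketched in a few lines. This is an illustration under stated assumptions, not the computation YDF performs: `toy_model` and the dataset are hypothetical, and an evenly spaced grid of `num_bins` values stands in for the bins used by the analysis. For each grid value, the feature is overridden in every example and the predictions are averaged.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, float]


def partial_dependence(model: Callable[[Example], float],
                       examples: List[Example], feature: str,
                       num_bins: int = 50) -> List[Tuple[float, float]]:
    """Returns (feature value, mean prediction) pairs over a value grid."""
    values = [ex[feature] for ex in examples]
    low, high = min(values), max(values)
    if num_bins < 2 or low == high:
        grid = [low]
    else:
        step = (high - low) / (num_bins - 1)
        grid = [low + i * step for i in range(num_bins)]

    curve = []
    for value in grid:
        # Force the feature to `value` for all examples, then average.
        predictions = [model({**ex, feature: value}) for ex in examples]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


def toy_model(example: Example) -> float:
    """Hypothetical additive model used only for this illustration."""
    return 3.0 * example["f1"] + example["f2"]


examples = [{"f1": i / 10, "f2": float(i % 3)} for i in range(11)]
curve = partial_dependence(toy_model, examples, "f1", num_bins=5)
```

For this additive toy model the curve is a straight line with slope 3, which is the contribution of "f1" alone.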
analyze_prediction
abstractmethod
¶
analyze_prediction(
single_example: InputDataset,
features: Optional[List[str]] = None,
) -> PredictionAnalysis
Explains a single prediction of the model.
This method shows how each feature value contributed to the final
prediction for a specific example. For a global model analysis, use
model.analyze() instead.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Explain the prediction for the first example in the test set
test_ds = pd.read_csv("test.csv")
first_example = test_ds.iloc[:1]
explanation = model.analyze_prediction(first_example)
# Display the explanation in a notebook.
explanation
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
single_example
|
InputDataset
|
A dataset containing a single example to explain. |
required |
features
|
Optional[List[str]]
|
If specified, the analysis will be limited to these features, and they will be displayed in the specified order. |
None
|
Returns:
| Type | Description |
|---|---|
PredictionAnalysis
|
A |
benchmark
abstractmethod
¶
benchmark(
ds: InputDataset,
benchmark_duration: float = 3,
warmup_duration: float = 1,
batch_size: int = 100,
num_threads: Optional[int] = None,
) -> BenchmarkInferenceCCResult
Benchmarks the inference speed of the model on a given dataset.
This method measures the time it takes to run predictions on the dataset
using the Yggdrasil Decision Forests C++ engine. Note that inference times
may vary on different machines or with other APIs. A C++ serving template
can be generated with model.to_cpp().
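The warmup-then-measure procedure described above can be sketched in pure Python. This is only an illustration of the idea, not the C++ benchmark that YDF runs: `benchmark_predict` and `dummy_predict` are hypothetical names, and raising ValueError for non-positive durations is an assumption based on the "Must be > 0" constraints listed in the parameters.

```python
import time
from typing import Callable, List, Sequence, Tuple


def benchmark_predict(predict_fn: Callable[[Sequence], object],
                      examples: Sequence,
                      benchmark_duration: float = 3.0,
                      warmup_duration: float = 1.0,
                      batch_size: int = 100) -> float:
    """Returns the average wall time per example, in seconds."""
    if benchmark_duration <= 0 or warmup_duration <= 0:
        raise ValueError("Durations must be > 0.")
    if batch_size <= 0 or len(examples) == 0:
        raise ValueError("batch_size must be > 0 and examples non-empty.")

    batches: List[Sequence] = [
        examples[i:i + batch_size]
        for i in range(0, len(examples), batch_size)
    ]

    def run_for(duration: float) -> Tuple[int, float]:
        """Runs full passes over the dataset until `duration` has elapsed."""
        num_examples = 0
        start = time.perf_counter()
        elapsed = 0.0
        while elapsed < duration:
            for batch in batches:
                predict_fn(batch)
                num_examples += len(batch)
            elapsed = time.perf_counter() - start
        return num_examples, elapsed

    run_for(warmup_duration)  # Warm up caches; this phase is not reported.
    num_examples, elapsed = run_for(benchmark_duration)
    return elapsed / num_examples


def dummy_predict(batch: Sequence) -> list:
    """Stand-in for a model's batch prediction function."""
    return [sum(row) for row in batch]


rows = [[float(i), float(i + 1)] for i in range(1000)]
seconds_per_example = benchmark_predict(
    dummy_predict, rows, benchmark_duration=0.05, warmup_duration=0.01
)
```

As with the real benchmark, the measured duration can slightly exceed the target because a pass over the dataset is always completed before the clock is checked.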
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ds
|
InputDataset
|
The dataset to use for benchmarking. |
required |
benchmark_duration
|
float
|
The target duration of the benchmark in seconds. The actual duration may be slightly different. Must be > 0. |
3
|
warmup_duration
|
float
|
The target duration of the warmup phase in seconds. During this phase, predictions are run but not timed, to warm up caches. Must be > 0. |
1
|
batch_size
|
int
|
The number of examples to process in each batch. The impact of this parameter depends on the machine's architecture (e.g., cache sizes). |
100
|
num_threads
|
Optional[int]
|
The number of threads to use for the benchmark. If not specified, it defaults to the number of available CPU cores. |
None
|
Returns:
| Type | Description |
|---|---|
BenchmarkInferenceCCResult
|
An object containing the benchmark results. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
data_spec
abstractmethod
¶
The data specification of the dataset used to train the model.
Returns:
| Type | Description |
|---|---|
DataSpecification
|
A DataSpecification protobuf object. |
describe
abstractmethod
¶
describe(
output_format: Literal[
"auto", "text", "notebook", "html"
] = "auto",
full_details: bool = False,
) -> Union[str, HtmlNotebookDisplay]
Generates a textual or HTML description of the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
output_format
|
Literal['auto', 'text', 'notebook', 'html']
|
The format of the output. - "auto": "notebook" in an IPython notebook, "text" otherwise. - "text": A plain text description. - "html": A standalone HTML description. - "notebook": An HTML description for display in a notebook cell. |
'auto'
|
full_details
|
bool
|
If |
False
|
Returns:
| Type | Description |
|---|---|
Union[str, HtmlNotebookDisplay]
|
The model description as a string or an HTML display object. |
evaluate
abstractmethod
¶
evaluate(
data: InputDataset,
*,
weighted: Optional[bool] = None,
task: Optional[Task] = None,
label: Optional[str] = None,
group: Optional[str] = None,
bootstrapping: Union[bool, int] = False,
ndcg_truncation: int = 5,
mrr_truncation: int = 5,
map_truncation: int = 5,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> Evaluation
Evaluates the quality of a model on a dataset.
In a notebook environment, the returned Evaluation object is displayed as
a rich HTML report with plots.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Evaluate the model on a test dataset
test_ds = pd.read_csv("test.csv")
evaluation = model.evaluate(test_ds)
# Display the evaluation report in a notebook
evaluation
You can also evaluate the model on a different task than it was trained for,
by overriding the task, label, and group arguments.
# Train a regression model
model = ydf.RandomForestLearner(label="price",
task=ydf.Task.REGRESSION).train(...)
# Evaluate it as a ranking model
ranking_evaluation = model.evaluate(
test_ds, task=ydf.Task.RANKING, group="session_id"
)
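For ranking evaluations, the truncation arguments limit how many of the top-ranked items contribute to a metric. The sketch below shows the effect on NDCG and on the reciprocal rank (MRR is its mean over groups). It assumes the common definitions, with an exponential gain of 2^relevance - 1 and a log2 discount; the exact formulas used by YDF are not stated in this section, and the function names are hypothetical.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k items."""
    return sum(
        (2.0 ** rel - 1.0) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances_in_predicted_order: List[float], k: int = 5) -> float:
    """NDCG truncated at k; items are listed from best to worst prediction."""
    ideal_order = sorted(relevances_in_predicted_order, reverse=True)
    ideal = dcg_at_k(ideal_order, k)
    if ideal == 0.0:
        return 0.0
    return dcg_at_k(relevances_in_predicted_order, k) / ideal


def reciprocal_rank_at_k(relevances_in_predicted_order: List[float],
                         k: int = 5) -> float:
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(relevances_in_predicted_order[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


# One query group whose only relevant item is ranked sixth by the model.
group = [0.0, 0.0, 0.0, 0.0, 0.0, 3.0]
ndcg_top5 = ndcg_at_k(group, k=5)
ndcg_top6 = ndcg_at_k(group, k=6)
rr_top5 = reciprocal_rank_at_k(group, k=5)
rr_top6 = reciprocal_rank_at_k(group, k=6)
```

With a truncation of 5, the relevant item at rank 6 is invisible to both metrics; raising the truncation to 6 gives it partial credit.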
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
InputDataset
|
The dataset for evaluation. |
required |
weighted
|
Optional[bool]
|
If |
None
|
task
|
Optional[Task]
|
Overrides the model's task for this evaluation. Defaults to the model's original task. |
None
|
label
|
Optional[str]
|
Overrides the label column for this evaluation. Defaults to the model's original label. |
None
|
group
|
Optional[str]
|
Overrides the grouping column for this evaluation, used for ranking tasks. Defaults to the model's original group column. |
None
|
bootstrapping
|
Union[bool, int]
|
If |
False
|
ndcg_truncation
|
int
|
The truncation level for the NDCG metric. |
5
|
mrr_truncation
|
int
|
The truncation level for the MRR metric. |
5
|
map_truncation
|
int
|
The truncation level for the MAP metric. |
5
|
use_slow_engine
|
bool
|
If |
False
|
num_threads
|
Optional[int]
|
The number of threads to use. Defaults to the number of available CPU cores. |
None
|
Returns:
| Type | Description |
|---|---|
Evaluation
|
An |
feature_selection_logs
abstractmethod
¶
feature_selection_logs() -> Optional[FeatureSelectorLogs]
Retrieves the feature selection logs, if available.
Returns:
| Type | Description |
|---|---|
Optional[FeatureSelectorLogs]
|
The feature selection logs, or |
force_engine
abstractmethod
¶
Forces the model to use a specific inference engine.
By default (engine_name=None), the model automatically uses the fastest
compatible engine. This method allows you to override that behavior.
If an invalid or incompatible engine name is provided, subsequent calls to
predict(), evaluate(), etc., will fail.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
engine_name
|
Optional[str]
|
The name of a compatible engine, or |
required |
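Since an incompatible engine name only surfaces as a failure in later calls, it can help to validate the name first. The helper below is a hypothetical sketch, not part of YDF: it assumes that list_compatible_engines() and force_engine() are called as methods, and it uses a stub class with made-up engine names so that the snippet runs without a trained model.

```python
from typing import Optional


def force_engine_checked(model, engine_name: Optional[str]) -> None:
    """Forces an engine only if the model reports it as compatible."""
    if engine_name is not None:
        compatible = list(model.list_compatible_engines())
        if engine_name not in compatible:
            raise ValueError(
                f"Engine {engine_name!r} is not compatible with this model. "
                f"Compatible engines: {compatible}"
            )
    # None restores the automatic choice of the fastest compatible engine.
    model.force_engine(engine_name)


class _StubModel:
    """Stand-in exposing the two methods used above; engine names are made up."""

    def __init__(self) -> None:
        self.forced_engine: Optional[str] = None

    def list_compatible_engines(self):
        return ["FastEngine", "SlowEngine"]

    def force_engine(self, engine_name: Optional[str]) -> None:
        self.forced_engine = engine_name


stub = _StubModel()
force_engine_checked(stub, "SlowEngine")
```

With a real model, the stub would be replaced by the trained model object.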
hyperparameter_optimizer_logs
abstractmethod
¶
hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]
Returns the logs of the hyperparameter tuning process, if any.
Returns:
| Type | Description |
|---|---|
Optional[OptimizerLogs]
|
An |
Optional[OptimizerLogs]
|
model was not trained with hyperparameter tuning. |
input_feature_names ¶
Returns the names of the input features.
The feature names are sorted by their column index in the data specification.
Returns:
| Type | Description |
|---|---|
List[str]
|
A list of feature name strings. |
input_features ¶
Returns the input features of the model.
The features are sorted by their column index in the data specification.
Returns:
| Type | Description |
|---|---|
Sequence[InputFeature]
|
A list of |
input_features_col_idxs
abstractmethod
¶
Returns the column indices of the input features in the dataspec.
label ¶
Returns the name of the label column.
Returns:
| Type | Description |
|---|---|
Optional[str]
|
The label column name as a string, or |
label_classes ¶
Returns the list of possible label values for a classification model.
The order of the classes in the returned list corresponds to the order of
probabilities in the output of model.predict().
Returns:
| Type | Description |
|---|---|
List[str]
|
A list of class name strings. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the model is not a classification model. |
label_col_idx
abstractmethod
¶
Returns the index of the label column in the dataspec.
Returns:
| Type | Description |
|---|---|
int
|
The column index, or -1 if the model has no label. |
list_compatible_engines
abstractmethod
¶
Lists the inference engines compatible with the model.
The engines are sorted from likely-fastest to likely-slowest.
Returns:
| Type | Description |
|---|---|
Sequence[str]
|
A list of names of compatible inference engines. |
metadata
abstractmethod
¶
metadata() -> ModelMetadata
Metadata associated with the model.
A model's metadata contains information that does not influence its predictions, such as the creation time. When distributing a model for wide release, it may be useful to clear or modify the metadata.
Example:
Returns:
| Type | Description |
|---|---|
ModelMetadata
|
The model's metadata object. |
predict
abstractmethod
¶
predict(
data: InputDataset,
*,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> ndarray
Runs the model on a dataset and returns its predictions.
The output is a NumPy array of float32 values. The structure of this
array depends on the model's task. See the "Returns" section for details.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Get predictions on a test dataset
test_ds = pd.read_csv("test.csv")
predictions = model.predict(test_ds)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
InputDataset
|
The dataset to make predictions on. Can be a pandas DataFrame, a dictionary of NumPy arrays, a path to a file, etc. If the dataset contains the label column, it will be ignored. |
required |
use_slow_engine
|
bool
|
If |
False
|
num_threads
|
Optional[int]
|
The number of threads to use for prediction. If |
None
|
Returns:
| Type | Description |
|---|---|
ndarray
|
A NumPy array containing the predictions. The shape and content vary by task. |
predict_class ¶
predict_class(
data: InputDataset,
*,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> ndarray
Returns the most likely predicted class for a classification model.
This is a convenience method for classification tasks. It returns a NumPy
array of strings representing the predicted class for each example. In case
of a tie in probabilities, the class that appears first in
model.label_classes() is chosen.
For the full class probabilities, use model.predict().
Usage example:
import pandas as pd
import ydf
# Train a classification model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="category").train(train_ds)
# Get the predicted class for each example
test_ds = pd.read_csv("test.csv")
predicted_classes = model.predict_class(test_ds)
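The tie-breaking rule can be made concrete with a small sketch. It is an illustration rather than YDF's implementation: it assumes one probability per class for each example, ordered like the label classes, and the class names and probabilities are made up.

```python
from typing import List, Sequence


def classes_from_probabilities(probabilities: Sequence[Sequence[float]],
                               label_classes: Sequence[str]) -> List[str]:
    """Picks the most likely class; ties go to the class listed first."""
    predicted = []
    for row in probabilities:
        # max() returns the first index reaching the maximum value, so the
        # earliest class in label_classes wins when probabilities are equal.
        best_idx = max(range(len(row)), key=row.__getitem__)
        predicted.append(label_classes[best_idx])
    return predicted


label_classes = ["cat", "dog", "bird"]
probabilities = [
    [0.2, 0.7, 0.1],  # Clear winner: "dog".
    [0.4, 0.4, 0.2],  # Tie between "cat" and "dog": "cat" comes first.
    [0.1, 0.1, 0.8],  # Clear winner: "bird".
]
predicted = classes_from_probabilities(probabilities, label_classes)
```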
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
InputDataset
|
The dataset to make predictions on. |
required |
use_slow_engine
|
bool
|
If |
False
|
num_threads
|
Optional[int]
|
The number of threads to use. Defaults to the number of available CPU cores. |
None
|
Returns:
| Type | Description |
|---|---|
ndarray
|
A NumPy array of strings of shape |
ndarray
|
likely predicted class for each example. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the model is not a classification model. |
predict_shap ¶
predict_shap(
data: InputDataset, *, num_threads: Optional[int] = None
) -> Tuple[Dict[str, ndarray], ndarray]
Computes SHAP values for each example in the given dataset.
SHAP (SHapley Additive exPlanations) values explain a prediction by
attributing the outcome to each feature. The sum of an example's SHAP values
plus the model's initial prediction (initial_value) equals the model's raw
prediction (before any activation function like sigmoid).
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Compute SHAP values on the test dataset
test_ds = pd.read_csv("test.csv")
shap_values, initial_value = model.predict_shap(test_ds)
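The additivity property stated above can be checked on a case where SHAP values have a closed form. The sketch below does not use YDF's tree-based computation: it assumes a hypothetical linear model with independent features, for which the SHAP value of a feature is its weight times the deviation from the feature's mean, and the initial value is the prediction at the mean example.

```python
import math
from typing import Dict

# Hypothetical linear model: raw prediction = BIAS + sum(weight * feature).
WEIGHTS = {"f1": 2.0, "f2": -1.0}
BIAS = 0.5


def raw_prediction(example: Dict[str, float]) -> float:
    """Model output before any activation function."""
    return BIAS + sum(WEIGHTS[name] * example[name] for name in WEIGHTS)


# Made-up background data used to define the initial value.
background = [
    {"f1": 0.0, "f2": 1.0},
    {"f1": 1.0, "f2": 3.0},
    {"f1": 2.0, "f2": 2.0},
]
feature_means = {
    name: sum(ex[name] for ex in background) / len(background)
    for name in WEIGHTS
}
initial_value = raw_prediction(feature_means)


def linear_shap_values(example: Dict[str, float]) -> Dict[str, float]:
    """Closed-form SHAP values of a linear model with independent features."""
    return {
        name: WEIGHTS[name] * (example[name] - feature_means[name])
        for name in WEIGHTS
    }


example = {"f1": 3.0, "f2": 0.5}
contributions = linear_shap_values(example)

# Additivity: initial value + sum of SHAP values = raw prediction.
reconstructed = initial_value + sum(contributions.values())
# A sigmoid would then turn the raw prediction into a probability.
probability = 1.0 / (1.0 + math.exp(-reconstructed))
```

The same check applies to the output of predict_shap: summing an example's values over all features and adding the initial value should recover the raw prediction.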
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
InputDataset
|
The dataset to compute SHAP values for. If it contains the label column, it will be ignored. |
required |
num_threads
|
Optional[int]
|
The number of threads to use. Defaults to the number of available CPU cores. |
None
|
Returns:
| Type | Description |
|---|---|
Tuple[Dict[str, ndarray], ndarray]
|
A tuple |
save
abstractmethod
¶
save(
path: str,
advanced_options: ModelIOOptions = ModelIOOptions(),
*,
pure_serving: bool = False
) -> None
Saves the model to a directory.
YDF uses a proprietary format consisting of multiple files in a single directory. This directory should ideally contain only one model.
YDF models can also be exported to other formats, such as TensorFlow
SavedModel (to_tensorflow_saved_model()) or C++ code (to_cpp()).
The model may contain metadata (see model.metadata()). Before distributing
a model, consider clearing this metadata:
model.set_metadata(ydf.ModelMetadata()).
Usage example:
import pandas as pd
import ydf
# Train a Random Forest model
df = pd.read_csv("my_dataset.csv")
model = ydf.RandomForestLearner(label="my_label").train(df)
# Save the model to disk
model.save("/models/my_model")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
The path to the directory where the model will be saved. |
required |
advanced_options
|
ModelIOOptions
|
Advanced options for saving the model. |
ModelIOOptions()
|
pure_serving
|
bool
|
If |
False
|
self_evaluation ¶
Returns the model's self-evaluation, computed during training.
The method of self-evaluation depends on the model type. For example, Random Forests use out-of-bag (OOB) evaluation, while Gradient Boosted Trees use evaluation on a validation dataset. Because of this, self-evaluations are not directly comparable between different model types.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
# Get the self-evaluation
self_evaluation = model.self_evaluation()
# In a notebook, this will print a rich report.
self_evaluation
Returns:
| Type | Description |
|---|---|
Evaluation
|
An |
serialize
abstractmethod
¶
Serializes the model into a bytes object.
A serialized model is equivalent to a model saved with model.save(). It
may contain metadata related to training and interpretation. To minimize
its size, you can train with the pure_serving_model=True option in the
learner.
Usage example:
import pandas as pd
import ydf
# Create and train a model
dataset = pd.DataFrame({"feature": [0, 1], "label": [0, 1]})
learner = ydf.RandomForestLearner(label="label")
model = learner.train(dataset)
# Serialize the model to a bytes object
serialized_model = model.serialize()
# Deserialize the model
deserialized_model = ydf.deserialize_model(serialized_model)
# Make predictions with both models
predictions = model.predict(dataset)
deserialized_predictions = deserialized_model.predict(dataset)
Returns:
| Type | Description |
|---|---|
bytes
|
The serialized model as a |
set_data_spec ¶
Updates the data specification of the model.
This is an advanced feature and should be used with caution, as it can easily lead to a broken model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data_spec
|
DataSpecification
|
The new DataSpecification protobuf object. |
required |
set_feature_selection_logs
abstractmethod
¶
set_feature_selection_logs(
value: Optional[FeatureSelectorLogs],
) -> None
Sets the feature selection logs for the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
value
|
Optional[FeatureSelectorLogs]
|
The feature selection logs to set, or |
required |
set_metadata
abstractmethod
¶
set_metadata(metadata: ModelMetadata)
Updates the model's metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metadata
|
ModelMetadata
|
The new metadata object for the model. |
required |
task
abstractmethod
¶
task() -> Task
The task the model is trained to solve.
Returns:
| Type | Description |
|---|---|
Task
|
The task enum for this model. |
to_cpp
abstractmethod
¶
Generates C++ code (.h file) for running the model.
This method provides a fast and widely compatible way to deploy YDF models
in C++. For applications where binary size is critical, to_standalone_cc
is an alternative that produces much smaller binaries with zero
dependencies, but may be slower and less compatible with all model types.
How to use:
- Generate the header file: open("model.h", "w").write(model.to_cpp())
- In your Bazel/Blaze BUILD file, add the necessary dependencies:
- In your C++ code, include the header and use the model:
- The generated Predict function uses placeholder values for features. You will need to modify this function to accept your own input data and populate the examples->Set(...) calls accordingly.
- For optimal performance, pre-allocate and reuse the examples and predictions objects for each thread.
The generated file contains further documentation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
key
|
str
|
A name for the model, used to create a unique C++ namespace. |
'my_model'
|
Returns:
| Type | Description |
|---|---|
str
|
A string containing the C++ header code. |
to_jax_function
abstractmethod
¶
to_jax_function(
jit: bool = True,
apply_activation: bool = True,
leaves_as_params: bool = False,
compatibility: Union[str, Compatibility] = "XLA",
) -> JaxModel
Converts the model into a JAX function for use in JAX ecosystems.
Usage example:
import ydf
import numpy as np
import jax.numpy as jnp
# Train a model
model = ydf.GradientBoostedTreesLearner(label="l").train({
"f1": np.random.random(100),
"l": np.random.randint(2, size=100),
})
# Convert to a JAX function
jax_model = model.to_jax_function()
# Make predictions
predictions = jax_model.predict({
"f1": jnp.array([0.1, 0.5, 0.9]),
})
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
jit
|
bool
|
If |
True
|
apply_activation
|
bool
|
If |
True
|
leaves_as_params
|
bool
|
If |
False
|
compatibility
|
Union[str, Compatibility]
|
The JAX runtime compatibility. Can be "XLA" (default) or "TFL" (for TensorFlow Lite). |
'XLA'
|
Returns:
| Type | Description |
|---|---|
JaxModel
|
A dataclass containing the JAX prediction function ( |
JaxModel
|
optionally the model parameters ( |
JaxModel
|
( |
to_standalone_cc
abstractmethod
¶
to_standalone_cc(
name: str = "ydf_model",
algorithm: Literal["IF_ELSE", "ROUTING"] = "ROUTING",
classification_output: Literal[
"CLASS", "SCORE", "PROBABILITY"
] = "CLASS",
categorical_from_string: bool = False,
) -> Union[str, Dict[str, bytes]]
Generates standalone, dependency-free C++ code for model inference.
This method is ideal for size-critical applications. See to_cpp for an
alternative with better performance and model compatibility.
How to use:
- Copy the generated C++ code into a .h file.
- In your C++ code, include the header and call the prediction function:
  The function is thread-safe.
Alternatively, you can use the cc_ydf_standalone_model Bazel rule for
automated code generation (internal to Google).
- Save the model with model.save(...) in a directory in Google3.
- Create a BUILD file with a filegroup in the model directory, e.g.:
- In your library's BUILD, create a "cc_ydf_standalone_model" build rule.
- In your cc_binary or cc_library, add ":my_model" as a dependency.
- In your C++ code, include: Then call:
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
A name for the model, used to create the C++ namespace. |
'ydf_model'
|
algorithm
|
Literal['IF_ELSE', 'ROUTING']
|
The underlying algorithm for prediction. - "ROUTING" (default): Faster and produces a smaller binary. - "IF_ELSE": Generates human-readable if-else conditions. |
'ROUTING'
|
classification_output
|
Literal['CLASS', 'SCORE', 'PROBABILITY']
|
The output format for classification models. - "CLASS" (default): The predicted class index (fast). - "SCORE": The raw scores (e.g., logits) for all classes. - "PROBABILITY": The probabilities for all classes (slower, as it requires a softmax). |
'CLASS'
|
categorical_from_string
|
bool
|
If |
False
|
Returns:
| Type | Description |
|---|---|
Union[str, Dict[str, bytes]]
|
A string with the C++ source code, or a dictionary of filename to source code if multiple files are generated. |
to_standalone_java
abstractmethod
¶
to_standalone_java(
name: str = "YdfModel",
package_name: str = "com.example.ydfmodel",
classification_output: Literal[
"CLASS", "SCORE", "PROBABILITY"
] = "CLASS",
) -> Dict[str, bytes]
Generates standalone, dependency-free Java code for model inference.
This method is ideal for size-critical applications.
How to use:
- Call this function to get the generated code and data. The steps below assume the result is stored in java_files.
- The function returns a dictionary containing two items:
  - Key: {name}.java (e.g., "MyYdfModel.java"): Value is the Java source code as bytes.
  - Key: {name}Data.bin (e.g., "MyYdfModelData.bin"): Value is the binary model data as bytes.
- Save these files to your Java project:
  with open(f"{name}.java", "wb") as f:
      f.write(java_files[f"{name}.java"])
  with open(f"{name}Data.bin", "wb") as f:
      f.write(java_files[f"{name}Data.bin"])
  Place the {name}Data.bin file in the Java classpath, typically in the resources directory.
- In your Java code, import the generated class and use the static predict method:
  import com.mycompany.myproject.MyYdfModel;
  // Create an Instance with feature values.
  // Categorical features are represented by enums in the generated class.
  MyYdfModel.Instance instance = new MyYdfModel.Instance(
      5.0f, // Numerical feature
      MyYdfModel.FeatureF2.kRed // Categorical feature
  );
  // Get the prediction.
  float prediction = MyYdfModel.predict(instance);
  The predict function is thread-safe. The generated class also contains enums for all categorical features.
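Writing the returned dictionary to disk can be wrapped in a small helper. This is a sketch under stated assumptions: `write_generated_files` is a hypothetical name, and the dictionary below is a placeholder standing in for the result of to_standalone_java, so the snippet runs without a trained model.

```python
import os
import tempfile
from typing import Dict, List


def write_generated_files(files: Dict[str, bytes], output_dir: str) -> List[str]:
    """Writes each filename -> bytes entry into output_dir; returns the paths."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for filename, content in files.items():
        path = os.path.join(output_dir, filename)
        with open(path, "wb") as f:  # The values are bytes, so write in binary.
            f.write(content)
        written.append(path)
    return written


# Placeholder for the dictionary returned by to_standalone_java; the file
# contents here are made up.
java_files = {
    "MyYdfModel.java": b"// generated Java source",
    "MyYdfModelData.bin": b"\x00\x01",
}
output_dir = tempfile.mkdtemp()
written_paths = write_generated_files(java_files, output_dir)
```

In a real project, output_dir would point at the source and resources directories of the Java build rather than a temporary directory.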
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
A name for the model, used to create the Java class name. |
'YdfModel'
|
package_name
|
str
|
The Java package name for the generated class. |
'com.example.ydfmodel'
|
classification_output
|
Literal['CLASS', 'SCORE', 'PROBABILITY']
|
The output format for classification models. - "CLASS" (default): The predicted class enum. - "SCORE": The raw scores (e.g., logits) for all classes. - "PROBABILITY": The probabilities for all classes. |
'CLASS'
|
Returns:
| Type | Description |
|---|---|
Dict[str, bytes]
|
A dictionary of filename to source code. This includes the Java source file and a binary resource file containing the model data. |
to_tensorflow_function
abstractmethod
¶
to_tensorflow_function(
temp_dir: Optional[str] = None,
can_be_saved: bool = True,
squeeze_binary_classification: bool = True,
force: bool = False,
) -> Module
Converts the model into a callable TensorFlow Module (@tf.function).
This allows the YDF model to be integrated into larger TensorFlow graphs.
Requires ydf-tf (pip install ydf-tf).
Note: Export to TensorFlow is not yet available for Anomaly Detection models.
Usage example:
import ydf
import numpy as np
import tensorflow as tf
# Train a model
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(100),
"l": np.random.randint(2, size=100),
})
# Convert to a TF Module
tf_model_fn = model.to_tensorflow_function()
# Make predictions
predictions = tf_model_fn({"f1": tf.constant([0.1, 0.5, 0.9])})
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
temp_dir
|
Optional[str]
|
A temporary directory for the conversion process. |
None
|
can_be_saved
|
bool
|
If |
True
|
squeeze_binary_classification
|
bool
|
If |
True
|
force
|
bool
|
If |
False
|
Returns:
| Type | Description |
|---|---|
Module
|
A |
to_tensorflow_saved_model
abstractmethod
¶
to_tensorflow_saved_model(
path: str,
input_model_signature_fn: Any = None,
*,
mode: Literal["keras", "tf"] = "tf",
feature_dtypes: Dict[str, TFDType] = {},
servo_api: bool = False,
feed_example_proto: bool = False,
pre_processing: Optional[Callable] = None,
post_processing: Optional[Callable] = None,
temp_dir: Optional[str] = None,
tensor_specs: Optional[Dict[str, Any]] = None,
feature_specs: Optional[Dict[str, Any]] = None,
force: bool = False
) -> None
Exports the model as a TensorFlow SavedModel.
This function requires TensorFlow and the ydf-tf package to be
installed. Install them by running the command pip install
ydf-tf. The generated SavedModel relies on the
YDF Custom Inference Op. This op is available by
default in various platforms such as Servomatic, TensorFlow Serving, Vertex
AI, and TensorFlow.js.
Usage example:
!pip install ydf-tf
import ydf
import numpy as np
import tensorflow as tf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100).astype(dtype=np.float32),
"l": np.random.randint(2, size=100),
})
# Export the model to the TensorFlow SavedModel format.
# The model can be executed with Servomatic, TensorFlow Serving and
# Vertex AI.
model.to_tensorflow_saved_model(path="/tmp/my_model", mode="tf")
# The model can also be loaded in TensorFlow and executed locally.
# Load the TensorFlow Saved model.
tf_model = tf.saved_model.load("/tmp/my_model")
# Make predictions
tf_predictions = tf_model({
"f1": tf.constant(np.random.random(size=10)),
"f2": tf.constant(np.random.random(size=10), dtype=tf.float32),
})
TensorFlow SavedModels do not automatically cast feature values. For
instance, a model trained with a dtype=float32, semantic=numerical feature
requires that feature to be fed as float32 values during inference.
You can override the dtype of a feature with the feature_dtypes argument:
model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
# "f1" is fed as a tf.int64 instead of a tf.float64
feature_dtypes={"f1": tf.int64},
)
Some TensorFlow Serving or Servomatic pipelines expect examples to be fed as
serialized TensorFlow Example protos (instead of raw tensor values) and/or
wrap the raw model output (e.g. probability predictions) in a special
structure (called the Serving API). You can create models compatible with
those two conventions with feed_example_proto=True and servo_api=True
respectively:
model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=True,
servo_api=True
)
If your model requires some data preprocessing or post-processing, you can
express them as a @tf.function or a tf module and pass them to the
pre_processing and post_processing arguments respectively.
Warning: When exporting a SavedModel, YDF infers the model signature from
the dtype of the features observed during training. If the signature of the
pre_processing function differs from the signature of the model (e.g.,
the processing creates a new feature), you need to specify the tensor specs
(tensor_specs; if feed_example_proto=False) or the feature specs
(feature_specs; if feed_example_proto=True) argument:
import math  # Provides math.nan, used as the default value in feature_specs
# Define a pre-processing function
@tf.function
def pre_processing(raw_features):
features = {**raw_features}
# Create a new feature.
features["sin_f1"] = tf.sin(features["f1"])
# Remove a feature
del features["f1"]
return features
# Create Numpy dataset
raw_dataset = {
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
}
# Apply the preprocessing on the training dataset.
processed_dataset = (
tf.data.Dataset.from_tensor_slices(raw_dataset)
.batch(128) # The batch size has no impact on the model.
.map(pre_processing)
.prefetch(tf.data.AUTOTUNE)
)
# Train a model on the pre-processed dataset.
ydf_model = ydf.RandomForestLearner(
label="l",
task=ydf.Task.CLASSIFICATION,
).train(processed_dataset)
# Export the model to a raw SavedModel model with the pre-processing
ydf_model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=False,
pre_processing=pre_processing,
tensor_specs={
"f1": tf.TensorSpec(shape=[None], name="f1", dtype=tf.float64),
"f2": tf.TensorSpec(shape=[None], name="f2", dtype=tf.float64),
}
)
# Export the model to a SavedModel consuming serialized tf examples with the
# pre-processing
ydf_model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=True,
pre_processing=pre_processing,
feature_specs={
"f1": tf.io.FixedLenFeature(
shape=[], dtype=tf.float32, default_value=math.nan
),
"f2": tf.io.FixedLenFeature(
shape=[], dtype=tf.float32, default_value=math.nan
),
}
)
For more flexibility, use the method to_tensorflow_function instead of
to_tensorflow_saved_model.
Note that export to TensorFlow is not yet available for Isolation Forest models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to store the TensorFlow Decision Forests model. | required |
| input_model_signature_fn | Any | A lambda that returns the (Dense,Sparse,Ragged)TensorSpec (or structure of TensorSpec e.g. dictionary, list) corresponding to input signature of the model. If not specified, the input model signature is created by | None |
| mode | Literal['keras', 'tf'] | How the YDF model is converted into a TensorFlow SavedModel. 1) mode = "keras" (default): Turn the model into a Keras 2 model using TensorFlow Decision Forests, and then save it with | 'tf' |
| feature_dtypes | Dict[str, TFDType] | Mapping from feature name to TensorFlow dtype. Use this mapping to override feature dtypes. For instance, numerical features are encoded with tf.float32 by default. If you plan on feeding tf.float64 or tf.int32, use | {} |
| servo_api | bool | If true, adds a SavedModel signature to make the model compatible with the | False |
| feed_example_proto | bool | If false, the model expects the input features to be provided as TensorFlow values. This is the most efficient way to make predictions. If true, the model expects the input features to be provided as a binary serialized TensorFlow Example proto. This is the format expected by VertexAI and most TensorFlow Serving pipelines. | False |
| pre_processing | Optional[Callable] | Optional TensorFlow function or module to apply on the input features before applying the model. If the | None |
| post_processing | Optional[Callable] | Optional TensorFlow function or module to apply on the model predictions. Only compatible with mode="tf". | None |
| temp_dir | Optional[str] | Temporary directory used during the conversion. If None (default), uses | None |
| tensor_specs | Optional[Dict[str, Any]] | Optional dictionary of | None |
| feature_specs | Optional[Dict[str, Any]] | Optional dictionary of | None |
| force | bool | Tries to export even in currently unsupported environments. WARNING: Setting this to true may crash the Python runtime. | False |
training_logs
Returns the model's training logs.
The training logs contain performance metrics calculated periodically during model training. The content and evaluation method depend on the model type (e.g., out-of-bag for Random Forest, validation set for Gradient Boosted Trees).
Usage example:
import pandas as pd
import ydf
import matplotlib.pyplot as plt
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
# Get the training logs
logs = model.training_logs()
# Plot the accuracy over training iterations
plt.plot(
    [log.iteration for log in logs],
    [log.evaluation.accuracy for log in logs],
)
plt.xlabel("Iteration (Number of Trees)")
plt.ylabel("Validation Accuracy")
plt.show()
Returns:
| Type | Description |
|---|---|
| List[TrainingLogEntry] | A list of TrainingLogEntry objects. |
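As a small sketch of how the returned entries can be used, the snippet below picks the iteration with the highest accuracy. FakeEvaluation and FakeLogEntry are hypothetical stand-ins that only mimic the iteration and evaluation.accuracy attributes used in the usage example; with a real model, the list would come from model.training_logs().

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEvaluation:
    """Hypothetical stand-in exposing only the accuracy metric."""
    accuracy: float


@dataclass
class FakeLogEntry:
    """Hypothetical stand-in for one training log entry."""
    iteration: int
    evaluation: FakeEvaluation


def best_entry(logs: List[FakeLogEntry]) -> FakeLogEntry:
    """Returns the log entry with the highest accuracy."""
    return max(logs, key=lambda log: log.evaluation.accuracy)


# Made-up values, for illustration only.
logs = [
    FakeLogEntry(iteration=1, evaluation=FakeEvaluation(accuracy=0.81)),
    FakeLogEntry(iteration=2, evaluation=FakeEvaluation(accuracy=0.86)),
    FakeLogEntry(iteration=3, evaluation=FakeEvaluation(accuracy=0.84)),
]

best = best_entry(logs)
print(best.iteration, best.evaluation.accuracy)  # 2 0.86
```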
update_with_jax_params
abstractmethod
Updates the model's parameters with values from a JAX fine-tuning process.
This function allows you to take a model fine-tuned in JAX (after being
exported with to_jax_function(leaves_as_params=True)) and update the
original YDF model object with the new parameters.
Usage example:
import ydf
import jax
# Train a model with YDF
# dataset = ...
model = ydf.GradientBoostedTreesLearner(label="l").train(dataset)
# Convert to a JAX function with learnable parameters
jax_model = model.to_jax_function(leaves_as_params=True)
# Fine-tune the parameters in JAX
# jax_model.params = my_fine_tuning_logic(jax_model.params, ...)
# Update the YDF model with the new parameters
model.update_with_jax_params(jax_model.params)
# The YDF model now reflects the fine-tuning
# model.save("/path/to/finetuned_model")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| params | Dict[str, Any] | A dictionary of model parameters, as produced by | required |
variable_importances
abstractmethod
Returns the variable importances (VIs) of the model.
Variable importances indicate how much each feature contributes to the model's predictions. Different VI metrics have different semantics and are generally not comparable.
The available VIs depend on the learning algorithm and its hyperparameters. For example, for Random Forest, setting compute_oob_variable_importances=True enables the computation of permutation out-of-bag VIs.
Usage example:
import ydf
# Train a Random Forest and enable OOB VI computation.
# dataset = ...
learner = ydf.RandomForestLearner(
    label="species", compute_oob_variable_importances=True
)
model = learner.train(dataset)
# List available VI metrics.
print(model.variable_importances().keys())
# dict_keys(['NUM_AS_ROOT', 'SUM_SCORE', 'MEAN_DECREASE_IN_ACCURACY'])
# Get a specific VI, sorted by importance.
vi = model.variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
# Entries are (importance, feature name) tuples, matching the return type:
# [(0.0713, 'bill_length_mm'), (0.0072, 'island'), ...]
Returns:
| Type | Description |
|---|---|
| Dict[str, List[Tuple[float, str]]] | A dictionary where keys are the names of the VI metrics and values are lists of (importance, feature name) tuples sorted in decreasing order of importance. |
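The returned structure can be post-processed with plain Python. The sketch below assumes the declared return type, Dict[str, List[Tuple[float, str]]], with each entry being an (importance, feature name) tuple. The helper top_features and the numbers in example_vis are hypothetical and only illustrate the shape of the data; with a real model, the dictionary would come from model.variable_importances().

```python
from typing import Dict, List, Tuple

# Shape of the value returned by variable_importances(), per the declared type.
VariableImportances = Dict[str, List[Tuple[float, str]]]


def top_features(vis: VariableImportances, metric: str, k: int = 3) -> List[str]:
    """Returns the names of the k most important features for one metric."""
    # Sort explicitly by importance value instead of relying on the input order.
    ranked = sorted(vis[metric], key=lambda entry: entry[0], reverse=True)
    return [name for _, name in ranked[:k]]


# Made-up values, for illustration only.
example_vis: VariableImportances = {
    "NUM_AS_ROOT": [(12.0, "bill_length_mm"), (7.0, "island")],
    "MEAN_DECREASE_IN_ACCURACY": [(0.0072, "island"), (0.0713, "bill_length_mm")],
}

top = top_features(example_vis, "MEAN_DECREASE_IN_ACCURACY", k=1)
# Build a feature name -> importance lookup for one metric.
by_name = {name: value for value, name in example_vis["NUM_AS_ROOT"]}

print(top)      # ['bill_length_mm']
print(by_name)  # {'bill_length_mm': 12.0, 'island': 7.0}
```

Since different VI metrics are generally not comparable, the helper ranks features within a single metric rather than combining values across metrics.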