RandomForestModel ¶
Bases: DecisionForestModel
A Random Forest model for prediction and inspection.
add_tree ¶
add_tree(tree: Tree) -> None
Adds a single tree to the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tree | Tree | New tree. | required |
analyze ¶
analyze(
data: InputDataset,
sampling: float = 1.0,
num_bins: int = 50,
partial_dependence_plot: bool = True,
conditional_expectation_plot: bool = True,
permutation_variable_importance: bool = True,
shap_values: bool = True,
permutation_variable_importance_rounds: int = 1,
num_threads: Optional[int] = None,
maximum_duration: Optional[float] = 20,
features: Optional[List[str]] = None,
) -> Analysis
Analyzes the model's structure and its behavior on a dataset.
An analysis includes structural information (e.g., variable importances) and performance characteristics on the given dataset (e.g., partial dependence plots). Computing the analysis can be time-consuming on large datasets. It is generally recommended to run analysis on a test set, not the training set.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Analyze the model on a test set
test_ds = pd.read_csv("test.csv")
analysis = model.analyze(test_ds)
# Display the analysis report in a notebook
analysis
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | InputDataset | The dataset for analysis. | required |
| sampling | float | The fraction of examples to use for the analysis (e.g., 0.1 for 10%). On large datasets, a smaller sample can significantly speed up computation. | 1.0 |
| num_bins | int | The number of bins for accumulating statistics in plots. More bins provide higher resolution but take longer to compute. | 50 |
| partial_dependence_plot | bool | If | True |
| conditional_expectation_plot | bool | If | True |
| permutation_variable_importance | bool | If | True |
| shap_values | bool | If | True |
| permutation_variable_importance_rounds | int | The number of rounds for permutation variable importance. More rounds increase accuracy but take longer. A value of 1 is often sufficient. Set to 0 to disable. | 1 |
| num_threads | Optional[int] | The number of threads to use. Defaults to the number of available CPU cores. | None |
| maximum_duration | Optional[float] | The approximate maximum duration of the analysis in seconds. The analysis may run slightly longer. | 20 |
| features | Optional[List[str]] | If specified, PDP and CEP plots will be limited to these features and displayed in this order. | None |
Returns:
| Type | Description |
|---|---|
| Analysis | An Analysis object. |
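To make the permutation_variable_importance_rounds parameter concrete, here is a minimal, library-free sketch of permutation importance: shuffle one feature column, measure how much the error grows, and average the increase over several rounds. The toy model, the data, and the helper names (toy_model, mean_squared_error, permutation_importance) are illustrative assumptions and do not reflect how YDF computes this internally.

```python
import random

def toy_model(row):
    # Hypothetical model: the prediction depends only on feature 0.
    return 2.0 * row[0]

def mean_squared_error(rows, labels):
    return sum((toy_model(r) - y) ** 2 for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(rows, labels, feature_idx, rounds=1, seed=0):
    """Average increase in error after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(rows, labels)
    increases = []
    for _ in range(rounds):
        # Shuffle the values of a single feature across examples.
        column = [r[feature_idx] for r in rows]
        rng.shuffle(column)
        shuffled = [
            r[:feature_idx] + [v] + r[feature_idx + 1:]
            for r, v in zip(rows, column)
        ]
        increases.append(mean_squared_error(shuffled, labels) - baseline)
    return sum(increases) / rounds

rows = [[float(i), float(i % 3)] for i in range(20)]
labels = [2.0 * r[0] for r in rows]

importance_f0 = permutation_importance(rows, labels, feature_idx=0, rounds=5)
importance_f1 = permutation_importance(rows, labels, feature_idx=1, rounds=5)
print(importance_f0, importance_f1)
```

Feature 1 is ignored by the toy model, so shuffling it leaves the error unchanged. Averaging over more rounds reduces the noise introduced by any single shuffle.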
analyze_prediction ¶
analyze_prediction(
single_example: InputDataset,
features: Optional[List[str]] = None,
) -> PredictionAnalysis
Explains a single prediction of the model.
This method shows how each feature value contributed to the final
prediction for a specific example. For a global model analysis, use
model.analyze() instead.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Explain the prediction for the first example in the test set
test_ds = pd.read_csv("test.csv")
first_example = test_ds.iloc[:1]
explanation = model.analyze_prediction(first_example)
# Display the explanation in a notebook.
explanation
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| single_example | InputDataset | A dataset containing a single example to explain. | required |
| features | Optional[List[str]] | If specified, the analysis will be limited to these features, and they will be displayed in the specified order. | None |
Returns:
| Type | Description |
|---|---|
| PredictionAnalysis | A PredictionAnalysis object. |
benchmark ¶
benchmark(
ds: InputDataset,
benchmark_duration: float = 3,
warmup_duration: float = 1,
batch_size: int = 100,
num_threads: Optional[int] = None,
) -> BenchmarkInferenceCCResult
Benchmarks the inference speed of the model on a given dataset.
This method measures the time it takes to run predictions on the dataset
using the Yggdrasil Decision Forests C++ engine. Note that inference times
may vary on different machines or with other APIs. A C++ serving template
can be generated with model.to_cpp().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ds | InputDataset | The dataset to use for benchmarking. | required |
| benchmark_duration | float | The target duration of the benchmark in seconds. The actual duration may be slightly different. Must be > 0. | 3 |
| warmup_duration | float | The target duration of the warmup phase in seconds. During this phase, predictions are run but not timed, to warm up caches. Must be > 0. | 1 |
| batch_size | int | The number of examples to process in each batch. The impact of this parameter depends on the machine's architecture (e.g., cache sizes). | 100 |
| num_threads | Optional[int] | The number of threads to use for the benchmark. If not specified, it defaults to the number of available CPU cores. | None |
Returns:
| Type | Description |
|---|---|
| BenchmarkInferenceCCResult | An object containing the benchmark results. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
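The warmup and timed phases described above can be sketched without any library. This is a simplified illustration using time.perf_counter and a stand-in prediction function, not the C++ benchmark that YDF runs. The names run_benchmark and dummy_predict are hypothetical, and raising ValueError for non-positive durations is an assumption based on the "Must be > 0" notes in the parameter table.

```python
import time

def run_benchmark(predict_fn, examples, benchmark_duration=0.2,
                  warmup_duration=0.05, batch_size=100):
    """Returns the average wall time per example, in seconds."""
    if benchmark_duration <= 0 or warmup_duration <= 0:
        raise ValueError("Durations must be > 0.")
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]

    # Warmup: run predictions without timing them.
    warmup_end = time.perf_counter() + warmup_duration
    while time.perf_counter() < warmup_end:
        for batch in batches:
            predict_fn(batch)

    # Timed phase: run whole passes over the data until the target duration
    # is reached, so the actual duration can slightly exceed the target.
    num_examples = 0
    start = time.perf_counter()
    while time.perf_counter() - start < benchmark_duration:
        for batch in batches:
            predict_fn(batch)
            num_examples += len(batch)
    elapsed = time.perf_counter() - start
    return elapsed / num_examples

def dummy_predict(batch):
    # Hypothetical stand-in for a model's batch prediction.
    return [sum(example) for example in batch]

examples = [[float(i), float(i + 1)] for i in range(1000)]
seconds_per_example = run_benchmark(dummy_predict, examples)
print(f"{seconds_per_example * 1e9:.0f} ns/example")
```

Because the timed loop only checks the clock between full passes, the measured duration is a target rather than a hard limit, which matches the "may be slightly different" note on benchmark_duration.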
data_spec ¶
The data specification of the dataset used to train the model.
Returns:
| Type | Description |
|---|---|
| DataSpecification | A DataSpecification protobuf object. |
describe ¶
describe(
output_format: Literal[
"auto", "text", "notebook", "html"
] = "auto",
full_details: bool = False,
) -> Union[str, HtmlNotebookDisplay]
Generates a textual or HTML description of the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_format | Literal['auto', 'text', 'notebook', 'html'] | The format of the output. "auto": "notebook" in an IPython notebook, "text" otherwise; "text": a plain text description; "html": a standalone HTML description; "notebook": an HTML description for display in a notebook cell. | 'auto' |
| full_details | bool | If | False |
Returns:
| Type | Description |
|---|---|
| Union[str, HtmlNotebookDisplay] | The model description as a string or an HTML display object. |
distance ¶
Computes the pairwise distance between examples in "data1" and "data2".
If "data2" is not provided, computes the pairwise distance between examples in "data1".
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.
Different models are free to implement different distances with different definitions. For this reason, unless indicated by the model, distances from different models cannot be compared.
The distance is not guaranteed to satisfy the triangle inequality property of metric distances.
Not all models can compute distances. For a model that cannot, this function raises an exception.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data1 | InputDataset | Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset. | required |
| data2 | Optional[InputDataset] | Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset. | None |
Returns:
| Type | Description |
|---|---|
| ndarray | Pairwise distance |
evaluate ¶
evaluate(
data: InputDataset,
*,
weighted: Optional[bool] = None,
task: Optional[Task] = None,
label: Optional[str] = None,
group: Optional[str] = None,
bootstrapping: Union[bool, int] = False,
ndcg_truncation: int = 5,
mrr_truncation: int = 5,
map_truncation: int = 5,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> Evaluation
Evaluates the quality of a model on a dataset.
In a notebook environment, the returned Evaluation object is displayed as
a rich HTML report with plots.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Evaluate the model on a test dataset
test_ds = pd.read_csv("test.csv")
evaluation = model.evaluate(test_ds)
# Display the evaluation report in a notebook
evaluation
You can also evaluate the model on a different task than it was trained for,
by overriding the task, label, and group arguments.
# Train a regression model
model = ydf.RandomForestLearner(label="price",
task=ydf.Task.REGRESSION).train(...)
# Evaluate it as a ranking model
ranking_evaluation = model.evaluate(
test_ds, task=ydf.Task.RANKING, group="session_id"
)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | InputDataset | The dataset for evaluation. | required |
| weighted | Optional[bool] | If | None |
| task | Optional[Task] | Overrides the model's task for this evaluation. Defaults to the model's original task. | None |
| label | Optional[str] | Overrides the label column for this evaluation. Defaults to the model's original label. | None |
| group | Optional[str] | Overrides the grouping column for this evaluation, used for ranking tasks. Defaults to the model's original group column. | None |
| bootstrapping | Union[bool, int] | If | False |
| ndcg_truncation | int | The truncation level for the NDCG metric. | 5 |
| mrr_truncation | int | The truncation level for the MRR metric. | 5 |
| map_truncation | int | The truncation level for the MAP metric. | 5 |
| use_slow_engine | bool | If | False |
| num_threads | Optional[int] | The number of threads to use. Defaults to the number of available CPU cores. | None |
Returns:
| Type | Description |
|---|---|
| Evaluation | An Evaluation object. |
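To illustrate what a truncation level such as ndcg_truncation means, the sketch below computes NDCG at a cutoff k for one ranking group. It uses a common formulation (gain 2^rel - 1 with a log2 position discount), which is an assumption here. The exact formula used by YDF is not stated in this page, and the function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    # Common formulation (an assumption): gain 2^rel - 1, log2 discount.
    return sum(
        (2.0 ** rel - 1.0) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances_in_predicted_order, k=5):
    # The ideal ranking sorts the items by decreasing relevance.
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0
    return dcg_at_k(relevances_in_predicted_order, k) / ideal_dcg

# Relevance labels of one group, listed in the order the model ranked them.
ranked_relevances = [3, 2, 0, 1, 0, 2]
print(ndcg_at_k(ranked_relevances, k=5))
print(ndcg_at_k(ranked_relevances, k=2))
```

Only the first k ranked items contribute to the score, so a ranking that gets the top two items right scores perfectly at k=2 even if later positions are wrong.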
feature_selection_logs ¶
feature_selection_logs() -> Optional[FeatureSelectorLogs]
Retrieves the feature selection logs, if available.
Returns:
| Type | Description |
|---|---|
| Optional[FeatureSelectorLogs] | The feature selection logs, or None if they are not available. |
force_engine ¶
Forces the model to use a specific inference engine.
By default (engine_name=None), the model automatically uses the fastest
compatible engine. This method allows you to override that behavior.
If an invalid or incompatible engine name is provided, subsequent calls to
predict(), evaluate(), etc., will fail.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| engine_name | Optional[str] | The name of a compatible engine, or None to automatically use the fastest compatible engine. | required |
get_tree ¶
get_tree(tree_idx: int) -> Tree
Gets a single tree of the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tree_idx | int | Index of the tree. Should be in [0, num_trees()). | required |
Returns:
| Type | Description |
|---|---|
| Tree | The tree. |
hyperparameter_optimizer_logs ¶
hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]
Returns the logs of the hyperparameter tuning process, if any.
Returns:
| Type | Description |
|---|---|
| Optional[OptimizerLogs] | An OptimizerLogs object, or None if the model was not trained with hyperparameter tuning. |
input_feature_names ¶
Returns the names of the input features.
The feature names are sorted by their column index in the data specification.
Returns:
| Type | Description |
|---|---|
| List[str] | A list of feature name strings. |
input_features ¶
Returns the input features of the model.
The features are sorted by their column index in the data specification.
Returns:
| Type | Description |
|---|---|
| Sequence[InputFeature] | A list of InputFeature objects. |
input_features_col_idxs ¶
Returns the column indices of the input features in the dataspec.
label ¶
Returns the name of the label column.
Returns:
| Type | Description |
|---|---|
| Optional[str] | The label column name as a string, or None if the model has no label. |
label_classes ¶
Returns the list of possible label values for a classification model.
The order of the classes in the returned list corresponds to the order of
probabilities in the output of model.predict().
Returns:
| Type | Description |
|---|---|
| List[str] | A list of class name strings. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the model is not a classification model. |
label_col_idx ¶
Returns the index of the label column in the dataspec.
Returns:
| Type | Description |
|---|---|
| int | The column index, or -1 if the model has no label. |
list_compatible_engines ¶
Lists the inference engines compatible with the model.
The engines are sorted from likely-fastest to likely-slowest.
Returns:
| Type | Description |
|---|---|
| Sequence[str] | A list of names of compatible inference engines. |
metadata ¶
metadata() -> ModelMetadata
Metadata associated with the model.
A model's metadata contains information that does not influence its predictions, such as the creation time. When distributing a model for wide release, it may be useful to clear or modify the metadata.
Returns:
| Type | Description |
|---|---|
| ModelMetadata | The model's metadata object. |
out_of_bag_evaluations ¶
Alias for training_logs() for Random Forest models.
plot_tree ¶
plot_tree(
tree_idx: int = 0,
max_depth: Optional[int] = None,
options: Optional[PlotOptions] = None,
d3js_url: str = "https://d3js.org/d3.v6.min.js",
) -> TreePlot
Plots an interactive HTML rendering of the tree.
Usage example:
import pandas as pd
import ydf
# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tree_idx | int | Index of the tree. Should be in [0, self.num_trees()). | 0 |
| max_depth | Optional[int] | Maximum tree depth of the plot. Set to None for full depth. | None |
| options | Optional[PlotOptions] | Advanced options for plotting. Set to None for default style. | None |
| d3js_url | str | URL to load the d3.js library from. | 'https://d3js.org/d3.v6.min.js' |
Returns:
| Type | Description |
|---|---|
| TreePlot | In interactive environments, an interactive plot. The HTML source can also be exported to file. |
predict ¶
predict(
data: InputDataset,
*,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> ndarray
Runs the model on a dataset and returns its predictions.
The output is a NumPy array of float32 values. The structure of this
array depends on the model's task. See the "Returns" section for details.
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Get predictions on a test dataset
test_ds = pd.read_csv("test.csv")
predictions = model.predict(test_ds)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | InputDataset | The dataset to make predictions on. Can be a pandas DataFrame, a dictionary of NumPy arrays, a path to a file, etc. If the dataset contains the label column, it will be ignored. | required |
| use_slow_engine | bool | If | False |
| num_threads | Optional[int] | The number of threads to use for prediction. If | None |
Returns:
| Type | Description |
|---|---|
| ndarray | A NumPy array containing the predictions. The shape and content vary by task. |
predict_class ¶
predict_class(
data: InputDataset,
*,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> ndarray
Returns the most likely predicted class for a classification model.
This is a convenience method for classification tasks. It returns a NumPy
array of strings representing the predicted class for each example. In case
of a tie in probabilities, the class that appears first in
model.label_classes() is chosen.
For the full class probabilities, use model.predict().
Usage example:
import pandas as pd
import ydf
# Train a classification model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="category").train(train_ds)
# Get the predicted class for each example
test_ds = pd.read_csv("test.csv")
predicted_classes = model.predict_class(test_ds)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | InputDataset | The dataset to make predictions on. | required |
| use_slow_engine | bool | If | False |
| num_threads | Optional[int] | The number of threads to use. Defaults to the number of available CPU cores. | None |
Returns:
| Type | Description |
|---|---|
| ndarray | A NumPy array of strings containing the most likely predicted class for each example. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the model is not a classification model. |
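The tie-breaking rule for predict_class can be sketched in plain Python. This is an illustration of the documented behavior, not YDF's code. It assumes one probability per class for each example, ordered as in model.label_classes(). The class names and the helper predict_class_from_probabilities are hypothetical.

```python
def predict_class_from_probabilities(probabilities, label_classes):
    """Picks the most likely class per example; ties go to the first class."""
    predicted = []
    for row in probabilities:
        best_idx = 0
        for idx, p in enumerate(row):
            # Strict ">" keeps the earliest class when probabilities tie.
            if p > row[best_idx]:
                best_idx = idx
        predicted.append(label_classes[best_idx])
    return predicted

label_classes = ["cat", "dog", "bird"]  # Hypothetical class names.
probabilities = [
    [0.1, 0.7, 0.2],
    [0.4, 0.4, 0.2],  # Tie between "cat" and "dog".
]
print(predict_class_from_probabilities(probabilities, label_classes))
```

In the second example the tie resolves to "cat" because it appears before "dog" in the class list.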
predict_leaves ¶
Gets the index of the active leaf in each tree.
The active leaf is the leaf that receives the example during inference.
The returned value "leaves[i,j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed by depth-first exploration, with the negative child visited before the positive one.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | InputDataset | Dataset. | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Index of the active leaf for each tree in the model. |
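The leaf indexing convention can be illustrated with a small hand-built tree. This is a sketch of the ordering rule only. The Node class, the "x[feature] >= threshold" condition form, and the helper names are assumptions made for the example and are not part of the YDF API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # A leaf has no children; a non-leaf tests "x[feature] >= threshold".
    feature: int = -1
    threshold: float = 0.0
    neg_child: Optional["Node"] = None
    pos_child: Optional["Node"] = None
    leaf_idx: int = -1

def index_leaves(node, next_idx=0):
    """Assigns leaf indices depth first, negative child before positive."""
    if node.neg_child is None:
        node.leaf_idx = next_idx
        return next_idx + 1
    next_idx = index_leaves(node.neg_child, next_idx)
    return index_leaves(node.pos_child, next_idx)

def active_leaf(node, example):
    # Route the example from the root down to the leaf that receives it.
    while node.neg_child is not None:
        if example[node.feature] >= node.threshold:
            node = node.pos_child
        else:
            node = node.neg_child
    return node.leaf_idx

# The root tests x[0] >= 5; its negative child tests x[1] >= 1.
tree = Node(
    feature=0, threshold=5.0,
    neg_child=Node(feature=1, threshold=1.0, neg_child=Node(), pos_child=Node()),
    pos_child=Node(),
)
index_leaves(tree)
leaves = [active_leaf(tree, x) for x in ([0.0, 0.0], [0.0, 2.0], [9.0, 0.0])]
print(leaves)
```

The two leaves under the negative branch receive indices 0 and 1, and the leaf on the root's positive branch receives index 2.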
predict_shap ¶
predict_shap(
data: InputDataset, *, num_threads: Optional[int] = None
) -> Tuple[Dict[str, ndarray], ndarray]
Computes SHAP values for each example in the given dataset.
SHAP (SHapley Additive exPlanations) values explain a prediction by
attributing the outcome to each feature. The sum of an example's SHAP values
plus the model's initial prediction (initial_value) equals the model's raw
prediction (before any activation function like sigmoid).
Usage example:
import pandas as pd
import ydf
# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Compute SHAP values on the test dataset
test_ds = pd.read_csv("test.csv")
shap_values, initial_value = model.predict_shap(test_ds)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | InputDataset | The dataset to compute SHAP values for. If it contains the label column, it will be ignored. | required |
| num_threads | Optional[int] | The number of threads to use. Defaults to the number of available CPU cores. | None |
Returns:
| Type | Description |
|---|---|
| Tuple[Dict[str, ndarray], ndarray] | A tuple (shap_values, initial_value). |
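The additivity property (SHAP values plus the initial value equal the raw prediction) can be checked on a toy function with exact Shapley values. This sketch averages marginal contributions over all feature orderings against a single baseline example. That is a simplification chosen for illustration and is not the algorithm YDF uses for tree models. The model, the baseline, and the helper exact_shapley are hypothetical.

```python
from itertools import permutations

def toy_model(x):
    # Hypothetical raw model output with an interaction between features.
    return 3.0 * x["f1"] + x["f1"] * x["f2"]

def exact_shapley(model, example, baseline):
    """Exact Shapley values by averaging marginal contributions over all
    feature orderings. Absent features take their baseline value."""
    features = list(example)
    values = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for name in ordering:
            # Reveal one feature at a time and record the output change.
            current[name] = example[name]
            output = model(current)
            values[name] += (output - previous_output) / len(orderings)
            previous_output = output
    return values

example = {"f1": 2.0, "f2": 4.0}
baseline = {"f1": 0.0, "f2": 1.0}

shap_values = exact_shapley(toy_model, example, baseline)
initial_value = toy_model(baseline)
print(shap_values, initial_value)
```

Summing the per-feature values and adding initial_value recovers toy_model(example), which is the same identity the description above states for model.predict_shap().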
print_tree ¶
Prints a tree in the terminal.
Usage example:
import pandas as pd
import ydf
# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tree_idx | int | Index of the tree. Should be in [0, self.num_trees()). | 0 |
| max_depth | Optional[int] | Maximum tree depth of the plot. Set to None for full depth. | 6 |
| file | Any | Where to print the tree. By default, prints on the terminal standard output. | stdout |
remove_tree ¶
Removes a single tree of the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tree_idx | int | Index of the tree. Should be in [0, num_trees()). | required |
save ¶
save(
path: str,
advanced_options=ModelIOOptions(),
*,
pure_serving=False
) -> None
Saves the model to a directory.
YDF uses a proprietary format consisting of multiple files in a single directory. This directory should ideally contain only one model.
YDF models can also be exported to other formats, such as TensorFlow
SavedModel (to_tensorflow_saved_model()) or C++ code (to_cpp()).
The model may contain metadata (see model.metadata()). Before distributing
a model, consider clearing this metadata:
model.set_metadata(ydf.ModelMetadata()).
Usage example:
import pandas as pd
import ydf
# Train a Random Forest model
df = pd.read_csv("my_dataset.csv")
model = ydf.RandomForestLearner(label="my_label").train(df)
# Save the model to disk
model.save("/models/my_model")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The path to the directory where the model will be saved. | required |
| advanced_options | ModelIOOptions | Advanced options for saving the model. | ModelIOOptions() |
| pure_serving | bool | If | False |
self_evaluation ¶
Returns the model's self-evaluation.
For Random Forest models, the self-evaluation is the out-of-bag evaluation of the full model. Note that Random Forest models do not use a validation dataset. If out-of-bag evaluation is not enabled, no self-evaluation is computed.
Different models use different methods for self-evaluation. Notably, Gradient Boosted Trees use the evaluation on the validation dataset. Therefore, self-evaluations are not comparable between different model types.
Returns None if no self-evaluation has been computed.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
learner = ydf.RandomForestLearner(label="label",
compute_oob_performances=True)
model = learner.train(train_ds)
self_evaluation = model.self_evaluation()
# In an interactive Python environment, print a rich evaluation report.
self_evaluation
serialize ¶
Serializes the model into a bytes object.
A serialized model is equivalent to a model saved with model.save(). It
may contain metadata related to training and interpretation. To minimize
its size, you can train with the pure_serving_model=True option in the
learner.
Usage example:
import pandas as pd
import ydf
# Create and train a model
dataset = pd.DataFrame({"feature": [0, 1], "label": [0, 1]})
learner = ydf.RandomForestLearner(label="label")
model = learner.train(dataset)
# Serialize the model to a bytes object
serialized_model = model.serialize()
# Deserialize the model
deserialized_model = ydf.deserialize_model(serialized_model)
# Make predictions with both models
predictions = model.predict(dataset)
deserialized_predictions = deserialized_model.predict(dataset)
Returns:
| Type | Description |
|---|---|
| bytes | The serialized model as a bytes object. |
set_data_spec ¶
Updates the data specification of the model.
This is an advanced feature and should be used with caution, as it can easily lead to a broken model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_spec | DataSpecification | The new DataSpecification protobuf object. | required |
set_feature_selection_logs ¶
set_feature_selection_logs(
value: Optional[FeatureSelectorLogs],
) -> None
Sets the feature selection logs for the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | Optional[FeatureSelectorLogs] | The feature selection logs to set, or | required |
set_metadata ¶
set_metadata(metadata: ModelMetadata)
Updates the model's metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | ModelMetadata | The new metadata object for the model. | required |
set_node_format ¶
set_node_format(node_format: NodeFormat) -> None
Sets the serialization format for the nodes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| node_format | NodeFormat | Node format to use when saving the model. | required |
set_tree ¶
set_tree(tree_idx: int, tree: Tree) -> None
Overrides a single tree of the model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tree_idx | int | Index of the tree. Should be in [0, num_trees()). | required |
| tree | Tree | New tree. | required |
task ¶
task() -> Task
The task the model is trained to solve.
Returns:
| Type | Description |
|---|---|
| Task | The task enum for this model. |
to_cpp ¶
Generates C++ code (.h file) for running the model.
This method provides a fast and widely compatible way to deploy YDF models
in C++. For applications where binary size is critical, to_standalone_cc
is an alternative that produces much smaller binaries with zero
dependencies, but may be slower and less compatible with all model types.
How to use:
- Generate the header file: open("model.h", "w").write(model.to_cpp())
- In your Bazel/Blaze BUILD file, add the necessary dependencies:
- In your C++ code, include the header and use the model:
- The generated Predict function uses placeholder values for features. You will need to modify this function to accept your own input data and populate the examples->Set(...) calls accordingly.
- For optimal performance, pre-allocate and reuse the examples and predictions objects for each thread.
The generated file contains further documentation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | A name for the model, used to create a unique C++ namespace. | 'my_model' |
Returns:
| Type | Description |
|---|---|
| str | A string containing the C++ header code. |
to_docker ¶
Exports the model as a self-contained Docker endpoint for deployment.
This function creates a directory with a Dockerfile, the model, and all necessary support files to serve the model over an HTTP endpoint.
Usage example:
import ydf
import numpy as np
# Train a model
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
})
# Export the model to a Docker endpoint directory
model.to_docker(path="/tmp/my_docker_model")
# See the generated README for instructions
!cat /tmp/my_docker_model/readme.md
# Test the endpoint locally
docker build --platform linux/amd64 -t ydf_predict_image /tmp/my_docker_model
docker run --rm -p 8080:8080 -d ydf_predict_image
# Deploy the model on Google Cloud
gcloud run deploy ydf-predict --source /tmp/my_docker_model
# Check the automatically created utility scripts "test_locally.sh" and
# "deploy_in_google_cloud.sh" for more examples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The directory where the Docker endpoint files will be created. | required |
| exist_ok | bool | If | False |
to_jax_function ¶
to_jax_function(
jit: bool = True,
apply_activation: bool = True,
leaves_as_params: bool = False,
compatibility: Union[str, Compatibility] = "XLA",
) -> JaxModel
Converts the model into a JAX function for use in JAX ecosystems.
Usage example:
import ydf
import numpy as np
import jax.numpy as jnp
# Train a model
model = ydf.GradientBoostedTreesLearner(label="l").train({
"f1": np.random.random(100),
"l": np.random.randint(2, size=100),
})
# Convert to a JAX function
jax_model = model.to_jax_function()
# Make predictions
predictions = jax_model.predict({
"f1": jnp.array([0.1, 0.5, 0.9]),
})
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| jit | bool | If | True |
| apply_activation | bool | If | True |
| leaves_as_params | bool | If | False |
| compatibility | Union[str, Compatibility] | The JAX runtime compatibility. Can be "XLA" (default) or "TFL" (for TensorFlow Lite). | 'XLA' |
Returns:
| Type | Description |
|---|---|
| JaxModel | A dataclass containing the JAX prediction function and, optionally, the model parameters. |
to_standalone_cc ¶
to_standalone_cc(
name: str = "ydf_model",
algorithm: Literal["IF_ELSE", "ROUTING"] = "ROUTING",
classification_output: Literal[
"CLASS", "SCORE", "PROBABILITY"
] = "CLASS",
categorical_from_string: bool = False,
) -> Union[str, Dict[str, bytes]]
Generates standalone, dependency-free C++ code for model inference.
This method is ideal for size-critical applications. See to_cpp for an
alternative with better performance and model compatibility.
How to use:
- Copy the generated C++ code into a .h file.
- In your C++ code, include the header and call the prediction function:
The function is thread-safe.
Alternatively, you can use the cc_ydf_standalone_model Bazel rule for automated code generation (internal to Google).
- Save the model with model.save(...) in a directory in Google3.
- Create a BUILD file with a filegroup in the model directory, e.g.:
- In your library's BUILD, create a "cc_ydf_standalone_model" build rule.
- In your cc_binary or cc_library, add ":my_model" as a dependency.
- In your C++ code, include:
Then call:
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | A name for the model, used to create the C++ namespace. | 'ydf_model' |
| algorithm | Literal['IF_ELSE', 'ROUTING'] | The underlying algorithm for prediction. "ROUTING" (default): faster and produces a smaller binary; "IF_ELSE": generates human-readable if-else conditions. | 'ROUTING' |
| classification_output | Literal['CLASS', 'SCORE', 'PROBABILITY'] | The output format for classification models. "CLASS" (default): the predicted class index (fast); "SCORE": the raw scores (e.g., logits) for all classes; "PROBABILITY": the probabilities for all classes (slower, as it requires a softmax). | 'CLASS' |
| categorical_from_string | bool | If | False |
Returns:
| Type | Description |
|---|---|
| Union[str, Dict[str, bytes]] | A string with the C++ source code, or a dictionary of filename to source code if multiple files are generated. |
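The relationship between the three classification_output modes can be sketched in Python. The sketch assumes the raw scores are per-class logits, as in the "e.g., logits" note in the parameter table. Whether a given model type actually produces logits is not stated here, and the helper classification_output is hypothetical rather than part of the generated C++ code.

```python
import math

def classification_output(scores, mode="CLASS"):
    """Derives the three output formats from per-class raw scores."""
    if mode == "SCORE":
        return list(scores)
    if mode == "PROBABILITY":
        # Numerically stable softmax: subtract the max before exponentiating.
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]
    if mode == "CLASS":
        # Argmax needs no exponentials, which makes it the cheapest mode.
        return max(range(len(scores)), key=lambda i: scores[i])
    raise ValueError(f"Unknown mode: {mode}")

scores = [0.5, 2.0, -1.0]  # Hypothetical logits for three classes.
print(classification_output(scores, "CLASS"))
print(classification_output(scores, "SCORE"))
print(classification_output(scores, "PROBABILITY"))
```

Because softmax preserves the ordering of the scores, the class selected in "CLASS" mode is also the one with the highest value in "PROBABILITY" mode.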
to_standalone_java ¶
to_standalone_java(
name: str = "YdfModel",
package_name: str = "com.example.ydfmodel",
classification_output: Literal[
"CLASS", "SCORE", "PROBABILITY"
] = "CLASS",
) -> Dict[str, bytes]
Generates standalone, dependency-free Java code for model inference.
This method is ideal for size-critical applications.
How to use:
- Call this function to get the generated code and data:
- The function returns a dictionary containing two items:
  - Key: {name}.java (e.g., "MyYdfModel.java"): Value is the Java source code as bytes.
  - Key: {name}Data.bin (e.g., "MyYdfModelData.bin"): Value is the binary model data as bytes.
- Save these files to your Java project:
with open(f"{name}.java", "wb") as f:
    f.write(java_files[f"{name}.java"])
with open(f"{name}Data.bin", "wb") as f:
    f.write(java_files[f"{name}Data.bin"])
Place the {name}Data.bin file in the Java classpath, typically in the resources directory.
- In your Java code, import the generated class and use the static predict method:
import com.mycompany.myproject.MyYdfModel;
// Create an Instance with feature values.
// Categorical features are represented by enums in the generated class.
MyYdfModel.Instance instance = new MyYdfModel.Instance(
    5.0f, // Numerical feature
    MyYdfModel.FeatureF2.kRed // Categorical feature
);
// Get the prediction.
float prediction = MyYdfModel.predict(instance);
The predict function is thread-safe. The generated class also contains enums for all categorical features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | A name for the model, used to create the Java class name. | 'YdfModel' |
| package_name | str | The Java package name for the generated class. | 'com.example.ydfmodel' |
| classification_output | Literal['CLASS', 'SCORE', 'PROBABILITY'] | The output format for classification models. "CLASS" (default): the predicted class enum; "SCORE": the raw scores (e.g., logits) for all classes; "PROBABILITY": the probabilities for all classes. | 'CLASS' |
Returns:
| Type | Description |
|---|---|
| Dict[str, bytes] | A dictionary of filename to source code. This includes the Java source file and a binary resource file containing the model data. |
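The three classification_output modes can be related to one another with a small sketch. This assumes a model whose raw scores are logits (the example given in the parameter description); the softmax and argmax helpers and the score values are illustrative and are not taken from the generated Java code.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Converts raw scores (assumed to be logits) into probabilities."""
    shift = max(scores)  # Subtract the max for numerical stability.
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Returns the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for a three-class model (illustrative values only).
scores = [2.0, 0.5, -1.0]           # Analogous to "SCORE".
probabilities = softmax(scores)     # Analogous to "PROBABILITY".
predicted_class_index = argmax(probabilities)  # Analogous to "CLASS".
```

Because softmax preserves the ordering of the scores, the predicted class is the same whether it is read from the scores or from the probabilities.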
to_tensorflow_function ¶
to_tensorflow_function(
temp_dir: Optional[str] = None,
can_be_saved: bool = True,
squeeze_binary_classification: bool = True,
force: bool = False,
) -> Module
Converts the model into a callable TensorFlow Module (@tf.function).
This allows the YDF model to be integrated into larger TensorFlow graphs.
Requires ydf-tf (pip install ydf-tf).
Note: Export to TensorFlow is not yet available for Anomaly Detection models.
Usage example:
import ydf
import numpy as np
import tensorflow as tf
# Train a model
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(100),
"l": np.random.randint(2, size=100),
})
# Convert to a TF Module
tf_model_fn = model.to_tensorflow_function()
# Make predictions
predictions = tf_model_fn({"f1": tf.constant([0.1, 0.5, 0.9])})
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| temp_dir | Optional[str] | A temporary directory for the conversion process. | None |
| can_be_saved | bool | If | True |
| squeeze_binary_classification | bool | If | True |
| force | bool | If | False |
Returns:
| Type | Description |
|---|---|
| Module | A |
to_tensorflow_saved_model ¶
to_tensorflow_saved_model(
path: str,
input_model_signature_fn: Any = None,
*,
mode: Literal["keras", "tf"] = "tf",
feature_dtypes: Dict[str, TFDType] = {},
servo_api: bool = False,
feed_example_proto: bool = False,
pre_processing: Optional[Callable] = None,
post_processing: Optional[Callable] = None,
temp_dir: Optional[str] = None,
tensor_specs: Optional[Dict[str, Any]] = None,
feature_specs: Optional[Dict[str, Any]] = None,
force: bool = False
) -> None
Exports the model as a TensorFlow SavedModel.
This function requires TensorFlow and the ydf-tf package to be
installed. Install them by running the command pip install
ydf-tf. The generated SavedModel relies on the
YDF Custom Inference Op. This op is available by
default in various platforms such as Servomatic, TensorFlow Serving, Vertex
AI, and TensorFlow.js.
Usage example:
!pip install ydf-tf
import ydf
import numpy as np
import tensorflow as tf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100).astype(dtype=np.float32),
"l": np.random.randint(2, size=100),
})
# Export the model to the TensorFlow SavedModel format.
# The model can be executed with Servomatic, TensorFlow Serving and
# Vertex AI.
model.to_tensorflow_saved_model(path="/tmp/my_model", mode="tf")
# The model can also be loaded in TensorFlow and executed locally.
# Load the TensorFlow Saved model.
tf_model = tf.saved_model.load("/tmp/my_model")
# Make predictions
tf_predictions = tf_model({
"f1": tf.constant(np.random.random(size=10)),
"f2": tf.constant(np.random.random(size=10), dtype=tf.float32),
})
TensorFlow SavedModels do not automatically cast feature values. For
instance, a model trained with a dtype=float32 semantic=numerical feature
requires this feature to be fed as float32 values during inference.
You can override the dtype of a feature with the feature_dtypes argument:
model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
# "f1" is fed as a tf.int64 instead of tf.float64
feature_dtypes={"f1": tf.int64},
)
Some TensorFlow Serving or Servomatic pipelines feed examples as
serialized TensorFlow Example protos (instead of raw tensor values) and/or
wrap the raw model output (e.g., probability predictions) into a special
structure (called the Serving API). You can create models compatible with
those two conventions with feed_example_proto=True and servo_api=True,
respectively:
model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=True,
servo_api=True
)
If your model requires data pre-processing or post-processing, you can
express it as a @tf.function or a tf.Module and pass it to the
pre_processing and post_processing arguments respectively.
Warning: When exporting a SavedModel, YDF infers the model signature using
the dtype of the features observed during training. If the signature of the
pre_processing function differs from the signature of the model (e.g.,
the processing creates a new feature), you need to specify the tensor specs
(tensor_specs; if feed_example_proto=False) or feature specs
(feature_specs; if feed_example_proto=True) argument:
import math
# Define a pre-processing function
@tf.function
def pre_processing(raw_features):
    features = {**raw_features}
    # Create a new feature.
    features["sin_f1"] = tf.sin(features["f1"])
    # Remove a feature
    del features["f1"]
    return features
# Create Numpy dataset
raw_dataset = {
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
}
# Apply the preprocessing on the training dataset.
processed_dataset = (
tf.data.Dataset.from_tensor_slices(raw_dataset)
.batch(128) # The batch size has no impact on the model.
.map(pre_processing)
.prefetch(tf.data.AUTOTUNE)
)
# Train a model on the pre-processed dataset.
ydf_model = ydf.RandomForestLearner(
label="l",
task=ydf.Task.CLASSIFICATION,
).train(processed_dataset)
# Export the model to a raw SavedModel with the pre-processing
ydf_model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=False,
pre_processing=pre_processing,
tensor_specs={
"f1": tf.TensorSpec(shape=[None], name="f1", dtype=tf.float64),
"f2": tf.TensorSpec(shape=[None], name="f2", dtype=tf.float64),
}
)
# Export the model to a SavedModel consuming serialized tf examples with the
# pre-processing
ydf_model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
feed_example_proto=True,
pre_processing=pre_processing,
feature_specs={
"f1": tf.io.FixedLenFeature(
shape=[], dtype=tf.float32, default_value=math.nan
),
"f2": tf.io.FixedLenFeature(
shape=[], dtype=tf.float32, default_value=math.nan
),
}
)
For more flexibility, use the method to_tensorflow_function instead of
to_tensorflow_saved_model.
Note that export to TensorFlow is not yet available for Isolation Forest models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to store the TensorFlow Decision Forests model. | required |
| input_model_signature_fn | Any | A lambda that returns the (Dense,Sparse,Ragged)TensorSpec (or structure of TensorSpec e.g. dictionary, list) corresponding to input signature of the model. If not specified, the input model signature is created by | None |
| mode | Literal['keras', 'tf'] | How the YDF model is converted into a TensorFlow SavedModel. 1) mode = "keras": Turn the model into a Keras 2 model using TensorFlow Decision Forests, and then save it with | 'tf' |
| feature_dtypes | Dict[str, TFDType] | Mapping from feature name to TensorFlow dtype. Use this mapping to override feature dtypes. For instance, numerical features are encoded with tf.float32 by default. If you plan on feeding tf.float64 or tf.int32, use | {} |
| servo_api | bool | If true, adds a SavedModel signature to make the model compatible with the | False |
| feed_example_proto | bool | If false, the model expects the input features to be provided as TensorFlow values. This is the most efficient way to make predictions. If true, the model expects the input features to be provided as a binary serialized TensorFlow Example proto. This is the format expected by Vertex AI and most TensorFlow Serving pipelines. | False |
| pre_processing | Optional[Callable] | Optional TensorFlow function or module to apply on the input features before applying the model. If the | None |
| post_processing | Optional[Callable] | Optional TensorFlow function or module to apply on the model predictions. Only compatible with mode="tf". | None |
| temp_dir | Optional[str] | Temporary directory used during the conversion. If None (default), uses | None |
| tensor_specs | Optional[Dict[str, Any]] | Optional dictionary of | None |
| feature_specs | Optional[Dict[str, Any]] | Optional dictionary of | None |
| force | bool | Tries to export even in currently unsupported environments. WARNING: Setting this to true may crash the Python runtime. | False |
training_logs ¶
Returns the Out-of-Bag evaluation logs for the Random Forest model.
For Random Forests, the training logs contain performance metrics calculated periodically during training using the Out-of-Bag (OOB) data. Each tree in a random forest is trained on a bootstrap sample of the training data. The OOB evaluation uses each training example as a test case for the subset of trees that were not trained on it. This method provides an unbiased estimate of the model's performance without requiring a separate validation set.
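The bootstrap and out-of-bag bookkeeping described above can be sketched in plain Python. This is an illustration only, not the YDF implementation; the helper names sample_in_bag_sets and out_of_bag_trees are hypothetical.

```python
import random
from typing import Dict, List, Set


def sample_in_bag_sets(
    num_examples: int, num_trees: int, seed: int = 0
) -> List[Set[int]]:
    """Draws one bootstrap sample (with replacement) per tree."""
    rng = random.Random(seed)
    return [
        {rng.randrange(num_examples) for _ in range(num_examples)}
        for _ in range(num_trees)
    ]


def out_of_bag_trees(
    in_bag: List[Set[int]], num_examples: int
) -> Dict[int, List[int]]:
    """Maps each example to the trees that did not train on it."""
    return {
        example: [t for t, bag in enumerate(in_bag) if example not in bag]
        for example in range(num_examples)
    }


in_bag = sample_in_bag_sets(num_examples=20, num_trees=5)
oob = out_of_bag_trees(in_bag, num_examples=20)
# An OOB prediction for example i would aggregate only the trees in oob[i].
```

Since each example is scored only by trees that never saw it, the resulting metric behaves like a held-out evaluation.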
To generate these logs, the compute_oob_performances hyperparameter must
be set to True (which is the default). Please note that enabling this
can slightly slow down training.
The OOB evaluation is not computed after every single tree. Instead, the learner calculates it periodically when one of the following is true:
- The most recently trained tree is the final tree of the model.
- More than 10 seconds have passed since the last OOB evaluation.
- More than 10 trees have been trained since the last OOB evaluation.
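The three conditions can be written as a single predicate. The sketch below is a hypothetical restatement of the rule as described here, not the learner's actual code; the function names are made up, and the library's exact counting may differ.

```python
from typing import List


def should_evaluate_oob(
    trees_trained: int,
    total_trees: int,
    trees_since_last_eval: int,
    seconds_since_last_eval: float,
) -> bool:
    """Returns True if any of the three documented conditions holds."""
    is_final_tree = trees_trained == total_trees
    return (
        is_final_tree
        or seconds_since_last_eval > 10.0
        or trees_since_last_eval > 10
    )


def evaluation_iterations(total_trees: int) -> List[int]:
    """Lists the tree counts at which OOB would run, ignoring wall-clock time."""
    iterations = []
    trees_since_last = 0
    for trees_trained in range(1, total_trees + 1):
        trees_since_last += 1
        if should_evaluate_oob(trees_trained, total_trees, trees_since_last, 0.0):
            iterations.append(trees_trained)
            trees_since_last = 0
    return iterations


print(evaluation_iterations(30))  # [11, 22, 30]
```

Under this reading, a fast training run of 30 trees yields three log entries, and slow training adds further entries through the time-based condition.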
The returned list of TrainingLogEntry objects is sorted by iteration,
allowing you to easily plot the model's learning curve. The training
iteration is equal to the number of trees in the model at the time of the
evaluation. The last entry in the list represents the final OOB evaluation
for the fully trained model.
For more details, see the explanation of OOB evaluation.
Random Forest models do not return a training_evaluation.
For CART models, the training logs have a single entry, containing the evaluation on the validation dataset.
Usage example:
import matplotlib.pyplot as plt
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
# Get the training logs
logs = model.training_logs()
# Plot the accuracy.
plt.plot(
[log.iteration for log in logs],
[log.evaluation.accuracy for log in logs]
)
Returns:
| Type | Description |
|---|---|
List[TrainingLogEntry]
|
A list of |
List[TrainingLogEntry]
|
metrics and the number of trees in the model at that point in training. |
List[TrainingLogEntry]
|
Returns an empty list if logs were not generated. |
update_with_jax_params ¶
Updates the model's parameters with values from a JAX fine-tuning process.
This function allows you to take a model fine-tuned in JAX (after being
exported with to_jax_function(leaves_as_params=True)) and update the
original YDF model object with the new parameters.
Usage example:
import ydf
import jax
# Train a model with YDF
# dataset = ...
model = ydf.GradientBoostedTreesLearner(label="l").train(dataset)
# Convert to a JAX function with learnable parameters
jax_model = model.to_jax_function(leaves_as_params=True)
# Fine-tune the parameters in JAX
# jax_model.params = my_fine_tuning_logic(jax_model.params, ...)
# Update the YDF model with the new parameters
model.update_with_jax_params(jax_model.params)
# The YDF model now reflects the fine-tuning
# model.save("/path/to/finetuned_model")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| params | Dict[str, Any] | A dictionary of model parameters, as produced by | required |
variable_importances ¶
Returns the variable importances (VIs) of the model.
Variable importances indicate how much each feature contributes to the model's predictions. Different VI metrics have different semantics and are generally not comparable.
The available VIs depend on the learning algorithm and its hyperparameters.
For example, for Random Forest, setting
compute_oob_variable_importances=True
enables the computation of permutation out-of-bag VIs.
Usage example:
# Train a Random Forest and enable OOB VI computation.
learner = ydf.RandomForestLearner(
label="species", compute_oob_variable_importances=True
)
model = learner.train(dataset)
# List available VI metrics.
print(model.variable_importances().keys())
# dict_keys(['NUM_AS_ROOT', 'SUM_SCORE', 'MEAN_DECREASE_IN_ACCURACY'])
# Get a specific VI, sorted by importance.
vi = model.variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
# [(0.0713, 'bill_length_mm'), (0.0072, 'island'), ...]
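To give an intuition for what a permutation-based importance such as MEAN_DECREASE_IN_ACCURACY measures, here is a minimal sketch in plain Python. It is not YDF's implementation: it permutes features on an ordinary evaluation set rather than on out-of-bag examples, and toy_model, the data, and the helper names are all hypothetical.

```python
import random
from typing import Callable, Dict, List

Example = Dict[str, float]


def accuracy(
    predict: Callable[[Example], int], examples: List[Example], labels: List[int]
) -> float:
    """Fraction of examples for which the prediction matches the label."""
    correct = sum(int(predict(x) == y) for x, y in zip(examples, labels))
    return correct / len(examples)


def permutation_importance(
    predict: Callable[[Example], int],
    examples: List[Example],
    labels: List[int],
    feature_names: List[str],
    seed: int = 0,
) -> Dict[str, float]:
    """Accuracy drop observed when each feature's values are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, examples, labels)
    importances = {}
    for name in feature_names:
        shuffled = [x[name] for x in examples]
        rng.shuffle(shuffled)
        permuted = [{**x, name: v} for x, v in zip(examples, shuffled)]
        importances[name] = baseline - accuracy(predict, permuted, labels)
    return importances


# Toy data: the label depends only on "f1"; "f2" is pure noise.
data_rng = random.Random(42)
examples = [{"f1": data_rng.random(), "f2": data_rng.random()} for _ in range(200)]
labels = [int(x["f1"] > 0.5) for x in examples]


def toy_model(x: Example) -> int:
    """Stand-in for a trained model; it only looks at "f1"."""
    return int(x["f1"] > 0.5)


importances = permutation_importance(toy_model, examples, labels, ["f1", "f2"])
```

In this toy setup, shuffling "f2" leaves the accuracy unchanged (importance 0), while shuffling "f1" lowers it, so "f1" receives a positive importance.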
Returns:
| Type | Description |
|---|---|
Dict[str, List[Tuple[float, str]]]
|
A dictionary where keys are the names of the VI metrics and values are |
Dict[str, List[Tuple[float, str]]]
|
lists of |
Dict[str, List[Tuple[float, str]]]
|
order of importance. |
winner_takes_all ¶
Returns whether the model uses a winner-takes-all strategy for classification.
This parameter determines how to aggregate individual tree votes during
inference in a classification random forest. It is defined by the
winner_take_all Random Forest learner hyperparameter.
If true, each tree votes for a single class, which is the traditional random forest inference method. If false, each tree outputs a probability distribution across all classes.
If the model is not a classification model, the return value of this function is arbitrary and does not influence model inference.
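The difference between the two aggregation strategies can be illustrated with a small sketch. The per-tree distributions are made-up values and the helper functions are hypothetical; this is not how YDF implements inference.

```python
from typing import List, Sequence


def average_probabilities(tree_outputs: Sequence[Sequence[float]]) -> List[float]:
    """winner_takes_all=False: average the per-tree class distributions."""
    num_trees = len(tree_outputs)
    num_classes = len(tree_outputs[0])
    return [
        sum(tree[c] for tree in tree_outputs) / num_trees
        for c in range(num_classes)
    ]


def winner_takes_all_votes(tree_outputs: Sequence[Sequence[float]]) -> List[float]:
    """winner_takes_all=True: each tree casts one vote for its top class."""
    num_classes = len(tree_outputs[0])
    votes = [0] * num_classes
    for tree in tree_outputs:
        winner = max(range(num_classes), key=lambda c: tree[c])
        votes[winner] += 1
    return [v / len(tree_outputs) for v in votes]


# Illustrative per-tree class distributions for a two-class problem.
tree_outputs = [
    [0.55, 0.45],
    [0.60, 0.40],
    [0.05, 0.95],
]

voted = winner_takes_all_votes(tree_outputs)    # Class 0 gets 2 of 3 votes.
averaged = average_probabilities(tree_outputs)  # Class 1 has the higher mean.
```

In this example the two strategies disagree: two weakly confident trees outvote one strongly confident tree under winner-takes-all, whereas averaging lets the confident tree dominate.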