GradientBoostedTreesModel

GradientBoostedTreesModel(raw_model: GenericCCModel)

Bases: DecisionForestModel

A Gradient Boosted Trees model for prediction and inspection.

activation

activation() -> Activation

The model activation function.

add_tree

add_tree(tree: Tree) -> None

Adds a single tree to the model.

Parameters:

Name Type Description Default
tree Tree

New tree.

required

analyze

analyze(
    data: InputDataset,
    sampling: float = 1.0,
    num_bins: int = 50,
    partial_dependence_plot: bool = True,
    conditional_expectation_plot: bool = True,
    permutation_variable_importance: bool = True,
    shap_values: bool = True,
    permutation_variable_importance_rounds: int = 1,
    num_threads: Optional[int] = None,
    maximum_duration: Optional[float] = 20,
    features: Optional[List[str]] = None,
) -> Analysis

Analyzes the model's structure and its behavior on a dataset.

An analysis includes structural information (e.g., variable importances) and performance characteristics on the given dataset (e.g., partial dependence plots). Computing the analysis can be time-consuming on large datasets. It is generally recommended to run analysis on a test set, not the training set.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Analyze the model on a test set
test_ds = pd.read_csv("test.csv")
analysis = model.analyze(test_ds)

# Display the analysis report in a notebook
analysis

Parameters:

Name Type Description Default
data InputDataset

The dataset for analysis.

required
sampling float

The fraction of examples to use for the analysis (e.g., 0.1 for 10%). On large datasets, a smaller sample can significantly speed up computation.

1.0
num_bins int

The number of bins for accumulating statistics in plots. More bins provide higher resolution but take longer to compute.

50
partial_dependence_plot bool

If True, computes Partial Dependence Plots (PDPs), which can be computationally expensive.

True
conditional_expectation_plot bool

If True, computes Conditional Expectation Plots (CEPs), which are computationally cheap.

True
permutation_variable_importance bool

If True, computes permutation variable importance.

True
shap_values bool

If True, computes SHAP-based metrics.

True
permutation_variable_importance_rounds int

The number of rounds for permutation variable importance. More rounds increase accuracy but take longer. A value of 1 is often sufficient. Set to 0 to disable.

1
num_threads Optional[int]

The number of threads to use. Defaults to the number of available CPU cores.

None
maximum_duration Optional[float]

The approximate maximum duration of the analysis in seconds. The analysis may run slightly longer.

20
features Optional[List[str]]

If specified, PDP and CEP plots will be limited to these features and displayed in this order.

None

Returns:

Type Description
Analysis

An Analysis object containing the results.
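
As an intuition for the permutation variable importance reported by analyze(), the following pure-Python sketch shuffles one feature column at a time and measures the drop in accuracy. It is not YDF's implementation: toy_model, the dataset, and the accuracy metric are hypothetical stand-ins chosen only for illustration.

```python
import random


def toy_model(example):
    # Hypothetical stand-in for a trained model: only "x0" is used.
    return 1 if example["x0"] > 0.5 else 0


def accuracy(examples, labels):
    correct = sum(toy_model(e) == y for e, y in zip(examples, labels))
    return correct / len(labels)


def permutation_importance(examples, labels, feature, rng):
    # Shuffle one feature column and measure the drop in accuracy.
    shuffled = [e[feature] for e in examples]
    rng.shuffle(shuffled)
    permuted = [dict(e, **{feature: v}) for e, v in zip(examples, shuffled)]
    return accuracy(examples, labels) - accuracy(permuted, labels)


rng = random.Random(0)
examples = [{"x0": i / 20, "x1": rng.random()} for i in range(20)]
labels = [toy_model(e) for e in examples]

importances = {
    name: permutation_importance(examples, labels, name, rng)
    for name in ("x0", "x1")
}
print(importances)
```

The feature the toy model ignores ("x1") gets an importance of exactly zero, while shuffling "x0" degrades accuracy. More rounds (see permutation_variable_importance_rounds) would average several such shuffles.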

analyze_prediction

analyze_prediction(
    single_example: InputDataset,
    features: Optional[List[str]] = None,
) -> PredictionAnalysis

Explains a single prediction of the model.

This method shows how each feature value contributed to the final prediction for a specific example. For a global model analysis, use model.analyze() instead.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Explain the prediction for the first example in the test set
test_ds = pd.read_csv("test.csv")
first_example = test_ds.iloc[:1]
explanation = model.analyze_prediction(first_example)

# Display the explanation in a notebook.
explanation

Parameters:

Name Type Description Default
single_example InputDataset

A dataset containing a single example to explain.

required
features Optional[List[str]]

If specified, the analysis will be limited to these features, and they will be displayed in the specified order.

None

Returns:

Type Description
PredictionAnalysis

A PredictionAnalysis object containing the explanation.

benchmark

benchmark(
    ds: InputDataset,
    benchmark_duration: float = 3,
    warmup_duration: float = 1,
    batch_size: int = 100,
    num_threads: Optional[int] = None,
) -> BenchmarkInferenceCCResult

Benchmarks the inference speed of the model on a given dataset.

This method measures the time it takes to run predictions on the dataset using the Yggdrasil Decision Forests C++ engine. Note that inference times may vary on different machines or with other APIs. A C++ serving template can be generated with model.to_cpp().

Parameters:

Name Type Description Default
ds InputDataset

The dataset to use for benchmarking.

required
benchmark_duration float

The target duration of the benchmark in seconds. The actual duration may be slightly different. Must be > 0.

3
warmup_duration float

The target duration of the warmup phase in seconds. During this phase, predictions are run but not timed, to warm up caches. Must be > 0.

1
batch_size int

The number of examples to process in each batch. The impact of this parameter depends on the machine's architecture (e.g., cache sizes).

100
num_threads Optional[int]

The number of threads to use for the benchmark. If not specified, it defaults to the number of available CPU cores.

None

Returns:

Type Description
BenchmarkInferenceCCResult

An object containing the benchmark results.

Raises:

Type Description
ValueError

If benchmark_duration, warmup_duration, or batch_size are not positive.
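
The warmup-then-measure pattern described above can be sketched in plain Python. This is only an illustration of the idea, not the code YDF runs: fake_predict is a hypothetical stand-in for model inference, and the durations are kept very short so the sketch finishes quickly.

```python
import time


def fake_predict(batch):
    # Hypothetical stand-in for running the model on one batch.
    return [sum(example) for example in batch]


def seconds_per_example(examples, benchmark_duration, warmup_duration, batch_size):
    if benchmark_duration <= 0 or warmup_duration <= 0 or batch_size <= 0:
        raise ValueError(
            "benchmark_duration, warmup_duration and batch_size must be > 0"
        )
    batches = [
        examples[i:i + batch_size] for i in range(0, len(examples), batch_size)
    ]

    def run_for(duration):
        # Loop over all batches until the target duration has elapsed.
        processed = 0
        start = time.perf_counter()
        while time.perf_counter() - start < duration:
            for batch in batches:
                fake_predict(batch)
                processed += len(batch)
        return processed, time.perf_counter() - start

    run_for(warmup_duration)  # Warmup: the measurements are discarded.
    processed, elapsed = run_for(benchmark_duration)
    return elapsed / processed


examples = [[float(i), float(i + 1)] for i in range(1000)]
result = seconds_per_example(
    examples, benchmark_duration=0.05, warmup_duration=0.02, batch_size=100
)
print(f"{result * 1e9:.0f} ns per example")
```

Because the loop always finishes a full pass over the batches, the measured duration can slightly exceed the target, which mirrors the note that the actual duration may differ.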

data_spec

data_spec() -> DataSpecification

The data specification of the dataset used to train the model.

Returns:

Type Description
DataSpecification

A DataSpecification protobuf object.

describe

describe(
    output_format: Literal[
        "auto", "text", "notebook", "html"
    ] = "auto",
    full_details: bool = False,
) -> Union[str, HtmlNotebookDisplay]

Generates a textual or HTML description of the model.

Parameters:

Name Type Description Default
output_format Literal['auto', 'text', 'notebook', 'html']

The format of the output.
  • "auto": "notebook" in an IPython notebook, "text" otherwise.
  • "text": A plain text description.
  • "html": A standalone HTML description.
  • "notebook": An HTML description for display in a notebook cell.

'auto'
full_details bool

If True, the full model structure is included, which can be very large.

False

Returns:

Type Description
Union[str, HtmlNotebookDisplay]

The model description as a string or an HTML display object.

distance

distance(
    data1: InputDataset,
    data2: Optional[InputDataset] = None,
) -> ndarray

Computes the pairwise distance between examples in "data1" and "data2".

If "data2" is not provided, computes the pairwise distance between examples in "data1".

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.

Different models are free to implement different distances with different definitions. For this reason, unless indicated by the model, distances from different models cannot be compared.

The distance is not guaranteed to satisfy the triangle inequality required of metric distances.

Not all models can compute distances. Models that cannot will raise an exception when this function is called.

Parameters:

Name Type Description Default
data1 InputDataset

Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset.

required
data2 Optional[InputDataset]

Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset. If None, distances are computed between the examples of data1.

None

Returns:

Type Description
ndarray

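The pairwise distances. Element [i, j] is the distance between the i-th example of data1 and the j-th example of data2 (or of data1 if data2 is not provided).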

evaluate

evaluate(
    data: InputDataset,
    *,
    weighted: Optional[bool] = None,
    task: Optional[Task] = None,
    label: Optional[str] = None,
    group: Optional[str] = None,
    bootstrapping: Union[bool, int] = False,
    ndcg_truncation: int = 5,
    mrr_truncation: int = 5,
    map_truncation: int = 5,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> Evaluation

Evaluates the quality of a model on a dataset.

In a notebook environment, the returned Evaluation object is displayed as a rich HTML report with plots.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Evaluate the model on a test dataset
test_ds = pd.read_csv("test.csv")
evaluation = model.evaluate(test_ds)

# Display the evaluation report in a notebook
evaluation

You can also evaluate the model on a different task than it was trained for, by overriding the task, label, and group arguments.

# Train a regression model
model = ydf.RandomForestLearner(
    label="price", task=ydf.Task.REGRESSION
).train(...)

# Evaluate it as a ranking model
ranking_evaluation = model.evaluate(
    test_ds, task=ydf.Task.RANKING, group="session_id"
)

Parameters:

Name Type Description Default
data InputDataset

The dataset for evaluation.

required
weighted Optional[bool]

If True, the evaluation is weighted using the training weights. If False, it is unweighted. If None (default), the evaluation is unweighted, and a warning is issued if the model was trained with weights. The default value will change to True in a future version.

None
task Optional[Task]

Overrides the model's task for this evaluation. Defaults to the model's original task.

None
label Optional[str]

Overrides the label column for this evaluation. Defaults to the model's original label.

None
group Optional[str]

Overrides the grouping column for this evaluation, used for ranking tasks. Defaults to the model's original group column.

None
bootstrapping Union[bool, int]

If True, enables bootstrapping with 2000 samples to compute confidence intervals and statistical tests. If an integer (>= 100) is provided, it specifies the number of samples.

False
ndcg_truncation int

The truncation level for the NDCG metric.

5
mrr_truncation int

The truncation level for the MRR metric.

5
map_truncation int

The truncation level for the MAP metric.

5
use_slow_engine bool

If True, uses a slower, more robust inference engine. See predict() for details.

False
num_threads Optional[int]

The number of threads to use. Defaults to the number of available CPU cores.

None

Returns:

Type Description
Evaluation

An Evaluation object containing the model's performance metrics.
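
To illustrate what the bootstrapping option computes conceptually, here is a minimal percentile-bootstrap sketch for an accuracy confidence interval. It is written under assumptions: the labels and predictions are made up, and YDF's exact resampling procedure and interval construction may differ.

```python
import random


def bootstrap_accuracy_interval(labels, predictions, num_samples=2000, seed=0):
    rng = random.Random(seed)
    n = len(labels)
    correct = [int(y == p) for y, p in zip(labels, predictions)]
    accuracies = []
    for _ in range(num_samples):
        # Resample the examples with replacement and recompute the metric.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    # 95% percentile interval over the resampled accuracies.
    lower = accuracies[int(0.025 * num_samples)]
    upper = accuracies[int(0.975 * num_samples) - 1]
    return sum(correct) / n, lower, upper


# Hypothetical labels and predictions: 30 of 40 examples are correct.
labels = [1] * 20 + [0] * 20
predictions = [1] * 15 + [0] * 5 + [0] * 15 + [1] * 5
accuracy, lower, upper = bootstrap_accuracy_interval(labels, predictions)
print(f"accuracy={accuracy:.3f} CI95=[{lower:.3f}, {upper:.3f}]")
```

More samples give a more stable interval at a higher computational cost, which is the trade-off controlled by passing an integer to bootstrapping.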

feature_selection_logs

feature_selection_logs() -> Optional[FeatureSelectorLogs]

Retrieves the feature selection logs, if available.

Returns:

Type Description
Optional[FeatureSelectorLogs]

The feature selection logs, or None if they are not available.

force_engine

force_engine(engine_name: Optional[str]) -> None

Forces the model to use a specific inference engine.

By default (engine_name=None), the model automatically uses the fastest compatible engine. This method allows you to override that behavior.

If an invalid or incompatible engine name is provided, subsequent calls to predict(), evaluate(), etc., will fail.

Parameters:

Name Type Description Default
engine_name Optional[str]

The name of a compatible engine, or None to restore automatic selection.

required

get_all_trees

get_all_trees() -> Sequence[Tree]

Returns all the trees in the model.

get_tree

get_tree(tree_idx: int) -> Tree

Gets a single tree of the model.

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, num_trees()).

required

Returns:

Type Description
Tree

The tree.

hyperparameter_optimizer_logs

hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]

Returns the logs of the hyperparameter tuning process, if any.

Returns:

Type Description
Optional[OptimizerLogs]

An OptimizerLogs object containing the tuning trials, or None if the model was not trained with hyperparameter tuning.

initial_predictions

initial_predictions() -> NDArray[float]

Returns the model's initial predictions (i.e. the model bias).
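
The role of the initial predictions can be sketched with a toy additive model: the raw output of a gradient boosted model is the bias plus the sum of the values of the leaves reached in each tree. The tree encoding, thresholds, and leaf values below are hypothetical and do not reflect YDF's internal representation.

```python
def route(tree, example):
    # Follow the conditions down to a leaf and return the leaf value.
    node = tree
    while "value" not in node:
        is_positive = example[node["feature"]] >= node["threshold"]
        node = node["pos"] if is_positive else node["neg"]
    return node["value"]


# Hypothetical bias and trees.
initial_prediction = -0.5
trees = [
    {"feature": "x", "threshold": 1.0,
     "neg": {"value": -0.25}, "pos": {"value": 0.75}},
    {"feature": "x", "threshold": 2.0,
     "neg": {"value": -0.125}, "pos": {"value": 0.5}},
]


def raw_prediction(example):
    # Bias plus the contribution of every tree (before any activation).
    return initial_prediction + sum(route(tree, example) for tree in trees)


print(raw_prediction({"x": 1.5}))
```

With no trees at all, the model would output the initial prediction for every example, which is why it is also called the model bias.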

input_feature_names

input_feature_names() -> List[str]

Returns the names of the input features.

The feature names are sorted by their column index in the data specification.

Returns:

Type Description
List[str]

A list of feature name strings.

input_features

input_features() -> Sequence[InputFeature]

Returns the input features of the model.

The features are sorted by their column index in the data specification.

Returns:

Type Description
Sequence[InputFeature]

A list of InputFeature objects.

input_features_col_idxs

input_features_col_idxs() -> Sequence[int]

Returns the column indices of the input features in the dataspec.

iter_trees

iter_trees() -> Iterator[Tree]

Returns an iterator over all the trees in the model.

label

label() -> Optional[str]

Returns the name of the label column.

Returns:

Type Description
Optional[str]

The label column name as a string, or None if the model has no label.

label_classes

label_classes() -> List[str]

Returns the list of possible label values for a classification model.

The order of the classes in the returned list corresponds to the order of probabilities in the output of model.predict().

Returns:

Type Description
List[str]

A list of class name strings.

Raises:

Type Description
ValueError

If the model is not a classification model.

label_col_idx

label_col_idx() -> int

Returns the index of the label column in the dataspec.

Returns:

Type Description
int

The column index, or -1 if the model has no label.

list_compatible_engines

list_compatible_engines() -> Sequence[str]

Lists the inference engines compatible with the model.

The engines are sorted from likely-fastest to likely-slowest.

Returns:

Type Description
Sequence[str]

A list of names of compatible inference engines.

metadata

metadata() -> ModelMetadata

Metadata associated with the model.

A model's metadata contains information that does not influence its predictions, such as the creation time. When distributing a model for wide release, it may be useful to clear or modify the metadata.

Example:

# Clear the metadata
model.set_metadata(ydf.ModelMetadata())

Returns:

Type Description
ModelMetadata

The model's metadata object.

name

name() -> str

Returns the name of the model type (e.g., "RANDOM_FOREST").

num_nodes

num_nodes() -> int

Returns the number of nodes in the decision forest.

num_trees

num_trees() -> int

Returns the number of trees in the decision forest.

num_trees_per_iteration

num_trees_per_iteration() -> int

The number of trees trained per gradient boosting iteration.
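
As a rough illustration, the number of boosting iterations can be derived from num_trees() and num_trees_per_iteration(). The sketch below additionally assumes that trees are stored iteration by iteration, so that tree index t belongs to iteration t // k and output dimension t % k. That layout is an assumption made for illustration; verify it against your model before relying on it.

```python
def tree_layout(num_trees, num_trees_per_iteration):
    # Maps each tree index to (iteration, output dimension), assuming the
    # trees are stored iteration by iteration.
    if num_trees % num_trees_per_iteration != 0:
        raise ValueError("num_trees must be a multiple of num_trees_per_iteration")
    return [divmod(t, num_trees_per_iteration) for t in range(num_trees)]


# Hypothetical values, e.g. a 3-class model after 2 boosting iterations.
layout = tree_layout(num_trees=6, num_trees_per_iteration=3)
num_iterations = 6 // 3
print(num_iterations, layout)
```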

output_logits

output_logits() -> bool

If true, the model outputs logits instead of probabilities.

Only for classification models. If false, the model outputs probabilities. This is False by default. Note that model probabilities are (by default) not calibrated.

The value of output_logits is serialized with the model and persists if the model is saved and loaded.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(
    label="label", task=ydf.Task.CLASSIFICATION
).train(train_ds)

# Check default value
print(f"Outputs logits: {model.output_logits()}")

# By default, predictions are probabilities
print("Probabilities:", model.predict(train_ds))

model.set_output_logits(True)
print(f"Outputs logits: {model.output_logits()}")
# Now, predictions are logits
print("Logits:", model.predict(train_ds))

Returns:

Type Description
bool

Whether the model outputs logits.
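
The relationship between the two output modes can be sketched as follows, assuming a binary classification model whose activation is the sigmoid function (a multi-class model would typically use a softmax instead). The logit values are made up for illustration.

```python
import math


def sigmoid(logit):
    # Maps a logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


def logit(probability):
    # Inverse of the sigmoid.
    return math.log(probability / (1.0 - probability))


# Hypothetical logits, as predict() might return with output_logits=True.
logits = [-2.0, 0.0, 1.5]
probabilities = [sigmoid(z) for z in logits]
print(probabilities)
```

Under this assumption, a logit of 0 corresponds to a probability of 0.5, and applying logit() to the probabilities recovers the original values up to floating-point error.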

plot_tree

plot_tree(
    tree_idx: int = 0,
    max_depth: Optional[int] = None,
    options: Optional[PlotOptions] = None,
    d3js_url: str = "https://d3js.org/d3.v6.min.js",
) -> TreePlot

Plots an interactive HTML rendering of the tree.

Usage example:

import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
    "c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
    "label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, self.num_trees()).

0
max_depth Optional[int]

Maximum tree depth of the plot. Set to None for full depth.

None
options Optional[PlotOptions]

Advanced options for plotting. Set to None for default style.

None
d3js_url str

URL to load the d3.js library from.

'https://d3js.org/d3.v6.min.js'

Returns:

Type Description
TreePlot

In interactive environments, an interactive plot. The HTML source can also be exported to file.

predict

predict(
    data: InputDataset,
    *,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> ndarray

Runs the model on a dataset and returns its predictions.

The output is a NumPy array of float32 values. The structure of this array depends on the model's task. See the "Returns" section for details.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Get predictions on a test dataset
test_ds = pd.read_csv("test.csv")
predictions = model.predict(test_ds)

Parameters:

Name Type Description Default
data InputDataset

The dataset to make predictions on. Can be a pandas DataFrame, a dictionary of NumPy arrays, a path to a file, etc. If the dataset contains the label column, it will be ignored.

required
use_slow_engine bool

If True, uses a slower, more robust inference engine. This is a fallback for rare edge cases where the default engines might fail (e.g., models with a very large number of categorical conditions). If you encounter such a case, please report it to the YDF developers.

False
num_threads Optional[int]

The number of threads to use for prediction. If None, it defaults to the number of available CPU cores.

None

Returns:

Type Description
ndarray

A NumPy array containing the predictions. The shape and content vary by task:
  • Task.CLASSIFICATION:
    • Binary Classification (2 classes): An array of shape [num_examples]. Each value is the probability of the positive class (at model.label_classes()[1]). The probability of the negative class is 1 - prediction.
    • Multi-class Classification (>2 classes): An array of shape [num_examples, num_classes]. Each row contains the probabilities for each class, in the order of model.label_classes().
  • Task.REGRESSION: An array of shape [num_examples], where each value is the predicted numerical outcome.
  • Task.RANKING: An array of shape [num_examples], where each value is the predicted score for the item. Higher scores indicate higher rank.
  • Task.CATEGORICAL_UPLIFT and Task.NUMERICAL_UPLIFT: An array of shape [num_examples]. Each value is the predicted uplift, representing the incremental effect of the treatment.
  • Task.ANOMALY_DETECTION: An array of shape [num_examples], where each value is the anomaly score (0 for most normal, 1 for most anomalous).
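
Because binary and multi-class classification return differently shaped outputs, downstream code often normalizes them. The sketch below shows one way to do so, using plain Python lists as stand-ins for the NumPy arrays returned by predict(). The helper to_class_probabilities and the class names are hypothetical.

```python
def to_class_probabilities(predictions, label_classes):
    rows = []
    for p in predictions:
        if isinstance(p, (int, float)):
            # Binary case: p is the probability of label_classes[1].
            row = [1.0 - p, p]
        else:
            # Multi-class case: one probability per class, in class order.
            row = list(p)
        rows.append(dict(zip(label_classes, row)))
    return rows


# Hypothetical outputs of predict() for a binary and a 3-class model.
binary = to_class_probabilities([0.25, 0.5], ["no", "yes"])
multi = to_class_probabilities([[0.2, 0.3, 0.5]], ["a", "b", "c"])
print(binary)
print(multi)
```

With real NumPy outputs, checking the number of array dimensions would be a more robust way to distinguish the two cases than inspecting element types.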

predict_class

predict_class(
    data: InputDataset,
    *,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> ndarray

Returns the most likely predicted class for a classification model.

This is a convenience method for classification tasks. It returns a NumPy array of strings representing the predicted class for each example. In case of a tie in probabilities, the class that appears first in model.label_classes() is chosen.

For the full class probabilities, use model.predict().

Usage example:

import pandas as pd
import ydf

# Train a classification model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="category").train(train_ds)

# Get the predicted class for each example
test_ds = pd.read_csv("test.csv")
predicted_classes = model.predict_class(test_ds)

Parameters:

Name Type Description Default
data InputDataset

The dataset to make predictions on.

required
use_slow_engine bool

If True, uses a slower, more robust inference engine. See predict() for details.

False
num_threads Optional[int]

The number of threads to use. Defaults to the number of available CPU cores.

None

Returns:

Type Description
ndarray

A NumPy array of strings of shape [num_examples], containing the most likely predicted class for each example.

Raises:

Type Description
ValueError

If the model is not a classification model.
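
The tie-breaking rule described above (the class that appears first in model.label_classes() wins) can be illustrated with a small sketch. The helper name and the probabilities are hypothetical, and this is not YDF's implementation.

```python
def predict_class_from_probabilities(probabilities, label_classes):
    # max() returns the first maximal index, so ties go to the class that
    # appears first in label_classes.
    best = max(range(len(label_classes)), key=lambda i: probabilities[i])
    return label_classes[best]


classes = ["a", "b", "c"]
tie = predict_class_from_probabilities([0.4, 0.4, 0.2], classes)
clear = predict_class_from_probabilities([0.1, 0.2, 0.7], classes)
print(tie, clear)
```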

predict_leaves

predict_leaves(data: InputDataset) -> ndarray

Gets the index of the active leaf in each tree.

The active leaf is the leaf that receives the example during inference.

The returned value "leaves[i,j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed by depth-first exploration, with the negative child visited before the positive one.

Parameters:

Name Type Description Default
data InputDataset

Dataset.

required

Returns:

Type Description
ndarray

Index of the active leaf for each tree in the model.
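
The leaf numbering can be sketched on a toy tree. The dictionary encoding, feature names, and thresholds are hypothetical, and the sketch assumes that a condition of the form "feature >= threshold" routes to the positive child when true.

```python
# Hypothetical tree with three leaves named A, B and C.
tree = {
    "feature": "x", "threshold": 2.0,
    "neg": {
        "feature": "y", "threshold": 1.0,
        "neg": {"leaf": "A"},
        "pos": {"leaf": "B"},
    },
    "pos": {"leaf": "C"},
}


def leaves_in_order(node):
    # Depth-first exploration, negative child before positive child.
    if "leaf" in node:
        return [node["leaf"]]
    return leaves_in_order(node["neg"]) + leaves_in_order(node["pos"])


def active_leaf_index(root, example):
    order = leaves_in_order(root)
    node = root
    while "leaf" not in node:
        is_positive = example[node["feature"]] >= node["threshold"]
        node = node["pos"] if is_positive else node["neg"]
    return order.index(node["leaf"])


print(leaves_in_order(tree))
print(active_leaf_index(tree, {"x": 1.0, "y": 5.0}))
```

Here the leaves are numbered A=0, B=1, C=2, so an example that goes to the negative child at the root and then to the positive child reaches leaf index 1.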

predict_shap

predict_shap(
    data: InputDataset, *, num_threads: Optional[int] = None
) -> Tuple[Dict[str, ndarray], ndarray]

Computes SHAP values for each example in the given dataset.

SHAP (SHapley Additive exPlanations) values explain a prediction by attributing the outcome to each feature. The sum of an example's SHAP values plus the model's initial prediction (initial_value) equals the model's raw prediction (before any activation function like sigmoid).

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Compute SHAP values on the test dataset
test_ds = pd.read_csv("test.csv")
shap_values, initial_value = model.predict_shap(test_ds)

Parameters:

Name Type Description Default
data InputDataset

The dataset to compute SHAP values for. If it contains the label column, it will be ignored.

required
num_threads Optional[int]

The number of threads to use. Defaults to the number of available CPU cores.

None

Returns:

Type Description
Tuple[Dict[str, ndarray], ndarray]

A tuple (shap_values, initial_value) where:
  • shap_values: A dictionary mapping feature names to NumPy arrays. Each array has a shape of [num_examples] or [num_examples, num_outputs], containing the SHAP values for that feature.
  • initial_value: A NumPy array of shape [] or [num_outputs] representing the model's initial prediction (i.e., offset).
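
The additivity property stated above can be checked by hand on the returned structure. In the sketch below, plain Python lists stand in for NumPy arrays, and the feature names and SHAP values are invented for illustration (single-output case).

```python
# Hypothetical result of predict_shap() for two examples and two features.
shap_values = {"age": [0.5, -0.25], "income": [0.25, 0.125]}
initial_value = -0.5
num_examples = 2

# Raw prediction (before activation) = initial value + sum of SHAP values.
raw_predictions = [
    initial_value + sum(values[i] for values in shap_values.values())
    for i in range(num_examples)
]

# A common global summary: mean absolute SHAP value per feature.
mean_abs_shap = {
    name: sum(abs(v) for v in values) / num_examples
    for name, values in shap_values.items()
}
ranking = sorted(mean_abs_shap, key=mean_abs_shap.get, reverse=True)
print(raw_predictions, ranking)
```

For a classification model, the reconstructed raw predictions are logits, so an activation function would still need to be applied to obtain probabilities.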

print_tree

print_tree(
    tree_idx: int = 0,
    max_depth: Optional[int] = 6,
    file: Any = stdout,
) -> None

Prints a tree in the terminal.

Usage example:

import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
    "c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
    "label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, self.num_trees()).

0
max_depth Optional[int]

Maximum depth of the printed tree. Set to None for full depth.

6
file Any

Where to print the tree. By default, prints on the terminal standard output.

stdout

remove_tree

remove_tree(tree_idx: int) -> None

Removes a single tree from the model.

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, num_trees()).

required

save

save(
    path: str,
    advanced_options=ModelIOOptions(),
    *,
    pure_serving=False
) -> None

Saves the model to a directory.

YDF uses a proprietary format consisting of multiple files in a single directory. This directory should ideally contain only one model.

YDF models can also be exported to other formats, such as TensorFlow SavedModel (to_tensorflow_saved_model()) or C++ code (to_cpp()).

The model may contain metadata (see model.metadata()). Before distributing a model, consider clearing this metadata: model.set_metadata(ydf.ModelMetadata()).

Usage example:

import pandas as pd
import ydf

# Train a Random Forest model
df = pd.read_csv("my_dataset.csv")
model = ydf.RandomForestLearner(label="my_label").train(df)

# Save the model to disk
model.save("/models/my_model")

Parameters:

Name Type Description Default
path str

The path to the directory where the model will be saved.

required
advanced_options ModelIOOptions

Advanced options for saving the model.

ModelIOOptions()
pure_serving bool

If True, saves a smaller version of the model suitable for serving by removing training-specific metadata and debug information. This might require more memory during the saving process, but the resulting model on disk will be smaller.

False

self_evaluation

self_evaluation() -> Optional[Evaluation]

Returns the model's self-evaluation.

For Gradient Boosted Trees models, the self-evaluation is the evaluation on the validation dataset. Note that the validation dataset is extracted automatically if not explicitly given. If the validation dataset is deactivated, no self-evaluation is computed.

Different models use different methods for self-evaluation. Notably, Random Forests use the last Out-Of-Bag evaluation. Therefore, self-evaluations are not comparable between different model types.

Returns None if no self-evaluation has been computed.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

self_evaluation = model.self_evaluation()
# In an interactive Python environment, print a rich evaluation report.
self_evaluation

serialize

serialize() -> bytes

Serializes the model into a bytes object.

A serialized model is equivalent to a model saved with model.save(). It may contain metadata related to training and interpretation. To minimize its size, you can train with the pure_serving_model=True option in the learner.

Usage example:

import pandas as pd
import ydf

# Create and train a model
dataset = pd.DataFrame({"feature": [0, 1], "label": [0, 1]})
learner = ydf.RandomForestLearner(label="label")
model = learner.train(dataset)

# Serialize the model to a bytes object
serialized_model = model.serialize()

# Deserialize the model
deserialized_model = ydf.deserialize_model(serialized_model)

# Make predictions with both models
predictions = model.predict(dataset)
deserialized_predictions = deserialized_model.predict(dataset)

Returns:

Type Description
bytes

The serialized model as a bytes object.

set_data_spec

set_data_spec(data_spec: DataSpecification) -> None

Updates the data specification of the model.

This is an advanced feature and should be used with caution, as it can easily lead to a broken model.

Parameters:

Name Type Description Default
data_spec DataSpecification

The new DataSpecification protobuf object.

required

set_feature_selection_logs

set_feature_selection_logs(
    value: Optional[FeatureSelectorLogs],
) -> None

Sets the feature selection logs for the model.

Parameters:

Name Type Description Default
value Optional[FeatureSelectorLogs]

The feature selection logs to set, or None to clear them.

required

set_initial_predictions

set_initial_predictions(
    initial_predictions: Sequence[float],
)

Sets the model's initial predictions (i.e. the model bias).

set_metadata

set_metadata(metadata: ModelMetadata)

Updates the model's metadata.

Parameters:

Name Type Description Default
metadata ModelMetadata

The new metadata object for the model.

required

set_node_format

set_node_format(node_format: NodeFormat) -> None

Sets the serialization format for the nodes.

Parameters:

Name Type Description Default
node_format NodeFormat

Node format to use when saving the model.

required

set_output_logits

set_output_logits(output_logits: bool) -> None

Sets whether the model outputs logits or probabilities.

Only for classification models. If false, the model outputs probabilities. If true, the model outputs logits. This is False by default. Note that model probabilities are (by default) not calibrated.

The value of output_logits is serialized with the model and persists if the model is saved and loaded.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(
    label="label", task=ydf.Task.CLASSIFICATION
).train(train_ds)

# Check default value
print(f"Outputs logits: {model.output_logits()}")

# By default, predictions are probabilities
print("Probabilities:", model.predict(train_ds))

model.set_output_logits(True)
print(f"Outputs logits: {model.output_logits()}")
# Now, predictions are logits
print("Logits:", model.predict(train_ds))

Parameters:

Name Type Description Default
output_logits bool

Whether to output logits instead of probabilities.

required

set_tree

set_tree(tree_idx: int, tree: Tree) -> None

Overrides a single tree of the model.

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, num_trees()).

required
tree Tree

New tree.

required

task

task() -> Task

The task the model is trained to solve.

Returns:

Type Description
Task

The task enum for this model.

to_cpp

to_cpp(key: str = 'my_model') -> str

Generates C++ code (.h file) for running the model.

This method provides a fast and widely compatible way to deploy YDF models in C++. For applications where binary size is critical, to_standalone_cc is an alternative that produces much smaller binaries with zero dependencies, but may be slower and less compatible with all model types.

How to use:

  1. Generate the header file: open("model.h", "w").write(model.to_cpp())
  2. In your Bazel/Blaze BUILD file, add the necessary dependencies:
    //third_party/absl/status:statusor
    //third_party/absl/strings
    //external/ydf_cc/yggdrasil_decision_forests/api:serving
    
  3. In your C++ code, include the header and use the model:
    #include "path/to/model.h"
    #include "yggdrasil_decision_forests/api/serving.h"
    
    namespace ydf = yggdrasil_decision_forests;
    // Load the model once.
    const auto model = ydf::exported_model_123::LoadModel("<path to model dir>");
    // Run predictions.
    predictions = model.Predict(...);
    ...
    
  4. The generated Predict function uses placeholder values for features. You will need to modify this function to accept your own input data and populate the examples->Set(...) calls accordingly.
  5. For optimal performance, pre-allocate and reuse the examples and predictions objects for each thread.

The generated file contains further documentation.

Parameters:

Name Type Description Default
key str

A name for the model, used to create a unique C++ namespace.

'my_model'

Returns:

Type Description
str

A string containing the C++ header code.

to_docker

to_docker(path: str, exist_ok: bool = False) -> None

Exports the model as a self-contained Docker endpoint for deployment.

This function creates a directory with a Dockerfile, the model, and all necessary support files to serve the model over an HTTP endpoint.

Usage example:

import ydf
import numpy as np

# Train a model
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100),
    "l": np.random.randint(2, size=100),
})

# Export the model to a Docker endpoint directory
model.to_docker(path="/tmp/my_docker_model")

# See the generated README for instructions
!cat /tmp/my_docker_model/readme.md

# Test the endpoint locally (shell commands, run here from a notebook)
!docker build --platform linux/amd64 -t ydf_predict_image /tmp/my_docker_model
!docker run --rm -p 8080:8080 -d ydf_predict_image

# Deploy the model on Google Cloud
!gcloud run deploy ydf-predict --source /tmp/my_docker_model

# Check the automatically created utility scripts "test_locally.sh" and
# "deploy_in_google_cloud.sh" for more examples.

Parameters:

Name Type Description Default
path str

The directory where the Docker endpoint files will be created.

required
exist_ok bool

If False (default), raises an error if the path directory already exists. If True, overwrites the content of the directory if it exists.

False
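The exist_ok behavior described above can be pictured with the following sketch. It is not YDF's implementation: prepare_export_dir is a hypothetical helper, the exception type is an assumption, and "overwrites" is modeled here as clearing the directory before re-creating it.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_export_dir(path: str, exist_ok: bool = False) -> Path:
    """Hypothetical helper mirroring the documented exist_ok semantics."""
    target = Path(path)
    if target.exists():
        if not exist_ok:
            # Assumption: the real method may raise a different error type.
            raise FileExistsError(f"{target} already exists; pass exist_ok=True")
        # Assumption: "overwrite" is modeled as removing the old content.
        shutil.rmtree(target)
    target.mkdir(parents=True)
    return target


# Demo in a throwaway location.
base = Path(tempfile.mkdtemp())
export_dir = prepare_export_dir(str(base / "my_docker_model"))
(export_dir / "stale.txt").write_text("old export")

try:
    prepare_export_dir(str(export_dir))  # exist_ok=False: refuses to overwrite
    raised = False
except FileExistsError:
    raised = True

prepare_export_dir(str(export_dir), exist_ok=True)  # replaces the old content
```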

to_jax_function

to_jax_function(
    jit: bool = True,
    apply_activation: bool = True,
    leaves_as_params: bool = False,
    compatibility: Union[str, Compatibility] = "XLA",
) -> JaxModel

Converts the model into a JAX function for use in JAX ecosystems.

Usage example:

import ydf
import numpy as np
import jax.numpy as jnp

# Train a model
model = ydf.GradientBoostedTreesLearner(label="l").train({
    "f1": np.random.random(100),
    "l": np.random.randint(2, size=100),
})

# Convert to a JAX function
jax_model = model.to_jax_function()

# Make predictions
predictions = jax_model.predict({
    "f1": jnp.array([0.1, 0.5, 0.9]),
})

Parameters:

Name Type Description Default
jit bool

If True, the returned function will be just-in-time compiled with @jax.jit.

True
apply_activation bool

If True, the model's activation function (e.g., sigmoid) will be applied to the output.

True
leaves_as_params bool

If True, the model's leaf values are exported as learnable parameters. The returned object will contain a params attribute, which must be passed to the predict function. This is useful for fine-tuning.

False
compatibility Union[str, Compatibility]

The JAX runtime compatibility. Can be "XLA" (default) or "TFL" (for TensorFlow Lite).

'XLA'
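To make the effect of apply_activation concrete, here is a minimal sketch in plain Python. It assumes a binary classification model whose activation is a sigmoid (the example named above); the raw scores are illustrative stand-ins, not real model output, and no JAX or YDF call is made.

```python
import math


def sigmoid(logit: float) -> float:
    """Maps a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Stand-in raw scores, as a binary classifier might return them with
# apply_activation=False (illustrative values only).
raw_scores = [-2.0, 0.0, 2.0]

# With apply_activation=True, the activation is applied to each raw score.
probabilities = [sigmoid(score) for score in raw_scores]
```

Keeping the raw scores can be convenient when a later step, such as a custom loss during fine-tuning, expects logits rather than probabilities.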

Returns:

Type Description
JaxModel

A dataclass containing the JAX prediction function (predict), and optionally the model parameters (params) and a feature encoder (encoder).

to_standalone_cc

to_standalone_cc(
    name: str = "ydf_model",
    algorithm: Literal["IF_ELSE", "ROUTING"] = "ROUTING",
    classification_output: Literal[
        "CLASS", "SCORE", "PROBABILITY"
    ] = "CLASS",
    categorical_from_string: bool = False,
) -> Union[str, Dict[str, bytes]]

Generates standalone, dependency-free C++ code for model inference.

This method is ideal for size-critical applications. See to_cpp for an alternative with better performance and model compatibility.

How to use:

  1. Copy the generated C++ code into a .h file.
  2. In your C++ code, include the header and call the prediction function:
    #include "path/to/generated_model.h"
    using namespace <name>;
    const auto pred = Prediction(Instance{.f1=5.0, .f2=F2::kRed});
    
    The function is thread-safe.

Alternatively, you can use the cc_ydf_standalone_model Bazel rule for automated code generation (internal to Google).

  1. Save the model with model.save(...) in a directory in Google3.
  2. Create a BUILD file with a filegroup in the model directory e.g.:
    filegroup(
      name = "model",
      srcs = glob(["**"]),
    )
    
  3. In your library's BUILD, create a "cc_ydf_standalone_model" build rule.
    load("//external/ydf_cc/yggdrasil_decision_forests/serving/embed:embed.bzl",
      "cc_ydf_standalone_model")
    cc_ydf_standalone_model(
      name = "my_model",
      classification_output = "SCORE",
      data = "<path to filegroup>",
    )
    
  4. In your cc_binary or cc_library, add ":my_model" as a dependency.
  5. In your C++ code, include:
    #include "<path to BUILD>/my_model.h"
    
    Then call:
    using namespace <name>;
    const auto pred = Prediction(Instance{.f1=5, .f2=F2::kRed});
    

Parameters:

Name Type Description Default
name str

A name for the model, used to create the C++ namespace.

'ydf_model'
algorithm Literal['IF_ELSE', 'ROUTING']

The underlying algorithm for prediction. - "ROUTING" (default): Faster and produces a smaller binary. - "IF_ELSE": Generates human-readable if-else conditions.

'ROUTING'
classification_output Literal['CLASS', 'SCORE', 'PROBABILITY']

The output format for classification models. - "CLASS" (default): The predicted class index (fast). - "SCORE": The raw scores (e.g., logits) for all classes. - "PROBABILITY": The probabilities for all classes (slower, as it requires a softmax).

'CLASS'
categorical_from_string bool

If True, generates helper functions to convert strings to categorical feature enum values.

False
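The relationship between the three classification_output values can be sketched in plain Python. This is an illustration only, assuming a multi-class model whose raw scores are logits; the scores are made up, and the generated C++ code may compute these quantities differently.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Converts raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # Subtract the max for numerical stability.
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    """Returns the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


scores = [1.0, 3.0, 0.5]          # "SCORE": raw scores, one per class
probabilities = softmax(scores)   # "PROBABILITY": needs the extra softmax
predicted_class = argmax(scores)  # "CLASS": index of the best class
```

Because softmax preserves the ordering of the scores, the predicted class can be read directly from the raw scores, which is why "CLASS" avoids the extra cost.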

Returns:

Type Description
Union[str, Dict[str, bytes]]

A string with the C++ source code, or a dictionary of filename to source code if multiple files are generated.
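Since the return value is either a single string or a dictionary of files, calling code has to handle both shapes. The sketch below shows one way to do so; write_generated_sources and the file name "ydf_model.h" are hypothetical, and the inputs are stand-ins for what to_standalone_cc would return.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Union


def write_generated_sources(
    generated: Union[str, Dict[str, bytes]],
    out_dir: str,
    single_file_name: str = "ydf_model.h",  # Hypothetical default name.
) -> List[Path]:
    """Writes generated code to out_dir, whichever shape it comes in."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    if isinstance(generated, str):
        # Single-file case: wrap it so both cases share one code path.
        files = {single_file_name: generated.encode("utf-8")}
    else:
        files = generated
    written = []
    for filename, content in files.items():
        target = out / filename
        target.write_bytes(content)
        written.append(target)
    return written


# Stand-in values; in practice `generated` would come from
# model.to_standalone_cc(...).
out_dir = tempfile.mkdtemp()
single = write_generated_sources("// generated code", out_dir)
multiple = write_generated_sources(
    {"a.h": b"// header", "a.cc": b"// source"}, out_dir
)
```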

to_standalone_java

to_standalone_java(
    name: str = "YdfModel",
    package_name: str = "com.example.ydfmodel",
    classification_output: Literal[
        "CLASS", "SCORE", "PROBABILITY"
    ] = "CLASS",
) -> Dict[str, bytes]

Generates standalone, dependency-free Java code for model inference.

This method is ideal for size-critical applications.

How to use:

  1. Call this function to get the generated code and data:

    model = ydf.load_model(...)
    java_files = model.to_standalone_java(
        name="MyYdfModel",
        package_name="com.mycompany.myproject"
    )
    

  2. The function returns a dictionary containing two items:

    • Key: {name}.java (e.g., "MyYdfModel.java"): Value is the Java source code as bytes.
    • Key: {name}Data.bin (e.g., "MyYdfModelData.bin"): Value is the binary model data as bytes.
  3. Save these files to your Java project:

    name = "MyYdfModel"  # The name passed to to_standalone_java above.
    with open(f"{name}.java", "wb") as f:
        f.write(java_files[f"{name}.java"])
    with open(f"{name}Data.bin", "wb") as f:
        f.write(java_files[f"{name}Data.bin"])
    
    Place the {name}Data.bin file in the Java classpath, typically in the resources directory.

  4. In your Java code, import the generated class and use the static predict method:

    import com.mycompany.myproject.MyYdfModel;
    
    // Create an Instance with feature values.
    // Categorical features are represented by enums in the generated class.
    MyYdfModel.Instance instance = new MyYdfModel.Instance(
        5.0f, // Numerical feature
        MyYdfModel.FeatureF2.kRed // Categorical feature
    );
    
    // Get the prediction.
    float prediction = MyYdfModel.predict(instance);
    
    The predict function is thread-safe. The generated class also contains enums for all categorical features.

Parameters:

Name Type Description Default
name str

A name for the model, used to create the Java class name.

'YdfModel'
package_name str

The Java package name for the generated class.

'com.example.ydfmodel'
classification_output Literal['CLASS', 'SCORE', 'PROBABILITY']

The output format for classification models. - "CLASS" (default): The predicted class enum. - "SCORE": The raw scores (e.g., logits) for all classes. - "PROBABILITY": The probabilities for all classes.

'CLASS'

Returns:

Type Description
Dict[str, bytes]

A dictionary of filename to source code. This includes the Java source file and a binary resource file containing the model data.

to_tensorflow_function

to_tensorflow_function(
    temp_dir: Optional[str] = None,
    can_be_saved: bool = True,
    squeeze_binary_classification: bool = True,
    force: bool = False,
) -> Module

Converts the model into a callable TensorFlow Module (@tf.function).

This allows the YDF model to be integrated into larger TensorFlow graphs. Requires ydf-tf (pip install ydf-tf).

Note: Export to TensorFlow is not yet available for Anomaly Detection models.

Usage example:

import ydf
import numpy as np
import tensorflow as tf

# Train a model
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(100),
    "l": np.random.randint(2, size=100),
})

# Convert to a TF Module
tf_model_fn = model.to_tensorflow_function()

# Make predictions
predictions = tf_model_fn({"f1": tf.constant([0.1, 0.5, 0.9])})

Parameters:

Name Type Description Default
temp_dir Optional[str]

A temporary directory for the conversion process.

None
can_be_saved bool

If True (default), the returned module can be saved with tf.saved_model.save, and temporary files are preserved. If False, temporary files are deleted, and the module cannot be saved.

True
squeeze_binary_classification bool

If True (default), binary classification models will output a tensor of shape [num_examples] with the probability of the positive class. If False, the output is shape [num_examples, 2].

True
force bool

If True, attempts to export even in unsupported environments.

False
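The two output shapes controlled by squeeze_binary_classification carry the same information. The sketch below converts between them with plain Python lists; it assumes the two-column form is ordered [negative class, positive class], which is not stated above, and the helper names are hypothetical.

```python
from typing import List


def unsqueeze_binary(positive_probs: List[float]) -> List[List[float]]:
    """Shape [num_examples] -> shape [num_examples, 2]."""
    # Assumption: column 0 is the negative class, column 1 the positive class.
    return [[1.0 - p, p] for p in positive_probs]


def squeeze_binary(two_column_probs: List[List[float]]) -> List[float]:
    """Shape [num_examples, 2] -> shape [num_examples]."""
    return [row[1] for row in two_column_probs]


squeezed = [0.25, 0.5, 1.0]  # Illustrative positive-class probabilities.
full = unsqueeze_binary(squeezed)
```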

Returns:

Type Description
Module

A tf.Module containing the model.

to_tensorflow_saved_model

to_tensorflow_saved_model(
    path: str,
    input_model_signature_fn: Any = None,
    *,
    mode: Literal["keras", "tf"] = "tf",
    feature_dtypes: Dict[str, TFDType] = {},
    servo_api: bool = False,
    feed_example_proto: bool = False,
    pre_processing: Optional[Callable] = None,
    post_processing: Optional[Callable] = None,
    temp_dir: Optional[str] = None,
    tensor_specs: Optional[Dict[str, Any]] = None,
    feature_specs: Optional[Dict[str, Any]] = None,
    force: bool = False
) -> None

Exports the model as a TensorFlow SavedModel.

This function requires TensorFlow and the ydf-tf package to be installed. Install them by running the command pip install ydf-tf. The generated SavedModel relies on the YDF Custom Inference Op. This op is available by default in various platforms such as Servomatic, TensorFlow Serving, Vertex AI, and TensorFlow.js.

Usage example:

!pip install ydf-tf

import ydf
import numpy as np
import tensorflow as tf

# Train a model.
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100).astype(dtype=np.float32),
    "l": np.random.randint(2, size=100),
})

# Export the model to the TensorFlow SavedModel format.
# The model can be executed with Servomatic, TensorFlow Serving and
# Vertex AI.
model.to_tensorflow_saved_model(path="/tmp/my_model", mode="tf")

# The model can also be loaded in TensorFlow and executed locally.

# Load the TensorFlow Saved model.
tf_model = tf.saved_model.load("/tmp/my_model")

# Make predictions
tf_predictions = tf_model({
    "f1": tf.constant(np.random.random(size=10)),
    "f2": tf.constant(np.random.random(size=10), dtype=tf.float32),
})

TensorFlow SavedModels do not automatically cast feature values. For instance, a model trained on a numerical feature with dtype=float32 requires this feature to be fed as float32 values during inference. You can override the dtype of a feature with the feature_dtypes argument:

model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    # "f1" is fed as a tf.int64 instead of a tf.float64
    feature_dtypes={"f1": tf.int64},
)

Some TensorFlow Serving or Servomatic pipelines feed examples as serialized TensorFlow Example protos (instead of raw tensor values) and/or wrap the raw model output (e.g., probability predictions) into a special structure (called the Serving API). You can create models compatible with these two conventions with feed_example_proto=True and servo_api=True, respectively:

model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    feed_example_proto=True,
    servo_api=True
)

If your model requires data pre-processing or post-processing, you can express it as a @tf.function or a tf.Module and pass it to the pre_processing or post_processing argument, respectively.

Warning: When exporting a SavedModel, YDF infers the model signature using the dtype of the features observed during training. If the signature of the pre_processing function is different than the signature of the model (e.g., the processing creates a new feature), you need to specify the tensor specs (tensor_specs; if feed_example_proto=False) or feature spec (feature_specs; if feed_example_proto=True) argument:

import math  # Used for math.nan in the feature specs below.

# Define a pre-processing function
@tf.function
def pre_processing(raw_features):
  features = {**raw_features}
  # Create a new feature.
  features["sin_f1"] = tf.sin(features["f1"])
  # Remove a feature
  del features["f1"]
  return features

# Create Numpy dataset
raw_dataset = {
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100),
    "l": np.random.randint(2, size=100),
}

# Apply the preprocessing on the training dataset.
processed_dataset = (
    tf.data.Dataset.from_tensor_slices(raw_dataset)
    .batch(128)  # The batch size has no impact on the model.
    .map(pre_processing)
    .prefetch(tf.data.AUTOTUNE)
)

# Train a model on the pre-processed dataset.
model = ydf.RandomForestLearner(
    label="l",
    task=ydf.Task.CLASSIFICATION,
).train(processed_dataset)

# Export the model to a raw SavedModel model with the pre-processing
model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    feed_example_proto=False,
    pre_processing=pre_processing,
    tensor_specs={
        "f1": tf.TensorSpec(shape=[None], name="f1", dtype=tf.float64),
        "f2": tf.TensorSpec(shape=[None], name="f2", dtype=tf.float64),
    }
)

# Export the model to a SavedModel consuming serialized tf examples with the
# pre-processing
model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    feed_example_proto=True,
    pre_processing=pre_processing,
    feature_specs={
        "f1": tf.io.FixedLenFeature(
            shape=[], dtype=tf.float32, default_value=math.nan
        ),
        "f2": tf.io.FixedLenFeature(
            shape=[], dtype=tf.float32, default_value=math.nan
        ),
    }
)

For more flexibility, use the method to_tensorflow_function instead of to_tensorflow_saved_model.

Note that export to TensorFlow is not yet available for Isolation Forest models.

Parameters:

Name Type Description Default
path str

Path to store the TensorFlow Decision Forests model.

required
input_model_signature_fn Any

A lambda that returns the (Dense,Sparse,Ragged)TensorSpec (or structure of TensorSpec e.g. dictionary, list) corresponding to input signature of the model. If not specified, the input model signature is created by tfdf.keras.build_default_input_model_signature. For example, specify input_model_signature_fn if a numerical input feature (which is consumed as DenseTensorSpec(float32) by default) will be fed differently (e.g. RaggedTensor(int64)). Only compatible with mode="keras".

None
mode Literal['keras', 'tf']

How the YDF model is converted into a TensorFlow SavedModel. 1) mode = "keras": Turn the model into a Keras 2 model using TensorFlow Decision Forests, and then save it with tf_keras.models.save_model. 2) mode = "tf" (default; recommended): Turn the model into a TensorFlow Module, and save it with tf.saved_model.save.

'tf'
feature_dtypes Dict[str, TFDType]

Mapping from feature name to TensorFlow dtype. Use this mapping to override feature dtypes. For instance, numerical features are encoded with tf.float32 by default. If you plan on feeding tf.float64 or tf.int32, use feature_dtypes to specify it. feature_dtypes is ignored if tensor_specs is set. If set, disables the automatic signature extraction on pre_processing (if pre_processing is also set). Only compatible with mode="tf".

{}
servo_api bool

If true, adds a SavedModel signature to make the model compatible with the Classify or Regress servo APIs. Only compatible with mode="tf". If false, outputs the raw model predictions.

False
feed_example_proto bool

If false, the model expects the input features to be provided as TensorFlow values. This is the most efficient way to make predictions. If true, the model expects the input features to be provided as a binary serialized TensorFlow Example proto. This is the format expected by Vertex AI and most TensorFlow Serving pipelines.

False
pre_processing Optional[Callable]

Optional TensorFlow function or module to apply on the input features before applying the model. If the pre_processing function has been traced (i.e., the function has been called once with actual data and contains a concrete instance in its cache), this signature is extracted and used as the signature of the SavedModel. Only compatible with mode="tf".

None
post_processing Optional[Callable]

Optional TensorFlow function or module to apply on the model predictions. Only compatible with mode="tf".

None
temp_dir Optional[str]

Temporary directory used during the conversion. If None (default), uses tempfile.mkdtemp default temporary directory.

None
tensor_specs Optional[Dict[str, Any]]

Optional dictionary of tf.TensorSpec that define the input features of the model to export. If not provided, the TensorSpecs are automatically generated based on the model features seen during training. This means that "tensor_specs" is only necessary when using a "pre_processing" argument that expects different features than what the model was trained with. This argument is ignored when exporting model with feed_example_proto=True as in this case, the TensorSpecs are defined by the tf.io.parse_example parsing feature specs. Only compatible with mode="tf".

None
feature_specs Optional[Dict[str, Any]]

Optional dictionary of tf.io.parse_example parsing feature specs e.g. tf.io.FixedLenFeature or tf.io.RaggedFeature. If not provided, the parsing feature specs are automatically generated based on the model features seen during training. This means that "feature_specs" is only necessary when using a "pre_processing" argument that expects different features than what the model was trained with. This argument is ignored when exporting model with feed_example_proto=False. Only compatible with mode="tf".

None
force bool

Tries to export even in currently unsupported environments. WARNING: Setting this to true may crash the Python runtime.

False

training_logs

training_logs() -> List[TrainingLogEntry]

Returns the validation evaluation logs for the GBT model.

For Gradient Boosted Trees models, the training logs contain performance metrics calculated periodically during training on the validation dataset.

The frequency of the validation evaluation is controlled by the hyperparameter validation_interval_in_trees. By default, the evaluation is computed after every tree (or after every num_classes trees for multi-class classification).

The returned list of TrainingLogEntry objects is sorted by iteration, allowing you to easily plot the model's learning curve. The last entry in the list represents the final validation evaluation for the fully trained model.

For multi-class classification models, an iteration trains num_classes trees. For other models, each iteration trains a single tree. If early stopping was triggered, there might be more iterations than trees in the model.

Usage example:

import pandas as pd
import ydf
from matplotlib import pyplot as plt

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

# Get the training logs
logs = model.training_logs()

# Plot the validation loss and the training loss.
plt.plot(
    [log.iteration for log in logs],
    [log.evaluation.loss for log in logs],
    label="Validation loss"
)
plt.plot(
    [log.iteration for log in logs],
    [log.training_evaluation.loss for log in logs],
    label="Training loss"
)
plt.legend()

Returns:

Type Description
List[TrainingLogEntry]

A list of TrainingLogEntry objects, each containing the validation evaluation metrics, training evaluation metrics and the iteration at that point in training. Returns an empty list if logs were not generated.
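Beyond plotting, the logs can be scanned for the iteration with the lowest validation loss. The sketch below uses stand-in classes (FakeEvaluation, FakeLogEntry) that mimic only the iteration and evaluation.loss attributes used in the usage example; they are not YDF types, and the loss values are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeEvaluation:
    """Stand-in exposing only the `loss` attribute."""
    loss: float


@dataclass
class FakeLogEntry:
    """Stand-in exposing `iteration` and `evaluation`."""
    iteration: int
    evaluation: FakeEvaluation


def best_entry(logs: List[FakeLogEntry]) -> Optional[FakeLogEntry]:
    """Returns the entry with the lowest validation loss, or None if empty."""
    if not logs:
        return None  # Logs may be empty if they were not generated.
    return min(logs, key=lambda log: log.evaluation.loss)


# Illustrative losses: the validation loss improves, then degrades.
losses = [0.9, 0.6, 0.5, 0.55, 0.7]
logs = [
    FakeLogEntry(iteration=i, evaluation=FakeEvaluation(loss=loss))
    for i, loss in enumerate(losses, start=1)
]
best = best_entry(logs)
```

With real logs, the same function should work on the list returned by model.training_logs(), provided the entries expose these attributes as in the usage example.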

update_with_jax_params

update_with_jax_params(params: Dict[str, Any])

Updates the model's parameters with values from a JAX fine-tuning process.

This function allows you to take a model fine-tuned in JAX (after being exported with to_jax_function(leaves_as_params=True)) and update the original YDF model object with the new parameters.

Usage example:

import ydf
import jax

# Train a model with YDF
# dataset = ...
model = ydf.GradientBoostedTreesLearner(label="l").train(dataset)

# Convert to a JAX function with learnable parameters
jax_model = model.to_jax_function(leaves_as_params=True)

# Fine-tune the parameters in JAX
# jax_model.params = my_fine_tuning_logic(jax_model.params, ...)

# Update the YDF model with the new parameters
model.update_with_jax_params(jax_model.params)

# The YDF model now reflects the fine-tuning
# model.save("/path/to/finetuned_model")

Parameters:

Name Type Description Default
params Dict[str, Any]

A dictionary of model parameters, as produced by to_jax_function.

required

validation_evaluation

validation_evaluation() -> Optional[Evaluation]

Returns the validation evaluation of the model, if available.

Gradient Boosted Trees use a validation dataset for early stopping.

Returns None if no validation evaluation has been computed or it has been removed from the model.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

validation_evaluation = model.validation_evaluation()
# In an interactive Python environment, print a rich evaluation report.
validation_evaluation

validation_loss

validation_loss() -> Optional[float]

Returns loss on the validation dataset if available.

variable_importances

variable_importances() -> (
    Dict[str, List[Tuple[float, str]]]
)

Returns the variable importances (VIs) of the model.

Variable importances indicate how much each feature contributes to the model's predictions. Different VI metrics have different semantics and are generally not comparable.

The available VIs depend on the learning algorithm and its hyperparameters. For example, for Random Forest, setting compute_oob_variable_importances=True enables the computation of permutation out-of-bag VIs.

Usage example:

# Train a Random Forest and enable OOB VI computation.
learner = ydf.RandomForestLearner(
    label="species", compute_oob_variable_importances=True
)
model = learner.train(dataset)

# List available VI metrics.
print(model.variable_importances().keys())
# dict_keys(['NUM_AS_ROOT', 'SUM_SCORE', 'MEAN_DECREASE_IN_ACCURACY'])

# Get a specific VI, sorted by importance.
vi = model.variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
# [(0.0713, 'bill_length_mm'), (0.0072, 'island'), ...]

Returns:

Type Description
Dict[str, List[Tuple[float, str]]]

A dictionary where keys are the names of the VI metrics and values are lists of (importance_value, feature_name) tuples, sorted in descending order of importance.
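As a small illustration of working with this return type, the sketch below extracts the top features for one metric and builds a name-to-value lookup. The dictionary is a hand-written stand-in with illustrative values, and top_features is a hypothetical helper.

```python
from typing import Dict, List, Tuple

VariableImportances = Dict[str, List[Tuple[float, str]]]


def top_features(vis: VariableImportances, metric: str, k: int = 3) -> List[str]:
    """Returns the names of the k most important features for a metric."""
    # The lists are documented as already sorted by descending importance.
    return [name for _, name in vis[metric][:k]]


# Stand-in for model.variable_importances(); values are illustrative.
vis: VariableImportances = {
    "MEAN_DECREASE_IN_ACCURACY": [
        (0.0713, "bill_length_mm"),
        (0.0072, "island"),
    ],
}

best_feature = top_features(vis, "MEAN_DECREASE_IN_ACCURACY", k=1)

# Name -> importance lookup for a single metric.
by_name = {name: value for value, name in vis["MEAN_DECREASE_IN_ACCURACY"]}
```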