IsolationForestModel

IsolationForestModel ¶

IsolationForestModel(raw_model: GenericCCModel)

Bases: DecisionForestModel

An Isolation Forest model for prediction and inspection.

add_tree ¶

add_tree(tree: Tree) -> None

Adds a single tree to the model.

Parameters:

Name	Type	Description	Default
`tree`	`Tree`	New tree.	required

analyze ¶

analyze(
    data: InputDataset,
    sampling: float = 1.0,
    num_bins: int = 50,
    partial_dependence_plot: bool = True,
    conditional_expectation_plot: bool = True,
    permutation_variable_importance: bool = True,
    shap_values: bool = True,
    permutation_variable_importance_rounds: int = 1,
    num_threads: Optional[int] = None,
    maximum_duration: Optional[float] = 20,
    features: Optional[List[str]] = None,
) -> Analysis

Analyzes the model's structure and its behavior on a dataset.

An analysis includes structural information (e.g., variable importances) and performance characteristics on the given dataset (e.g., partial dependence plots). Computing the analysis can be time-consuming on large datasets. It is generally recommended to run analysis on a test set, not the training set.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Analyze the model on a test set
test_ds = pd.read_csv("test.csv")
analysis = model.analyze(test_ds)

# Display the analysis report in a notebook
analysis

Parameters:

Name	Type	Description	Default
`data`	`InputDataset`	The dataset for analysis.	required
`sampling`	`float`	The fraction of examples to use for the analysis (e.g., 0.1 for 10%). On large datasets, a smaller sample can significantly speed up computation.	`1.0`
`num_bins`	`int`	The number of bins for accumulating statistics in plots. More bins provide higher resolution but take longer to compute.	`50`
`partial_dependence_plot`	`bool`	If `True`, computes Partial Dependence Plots (PDPs), which can be computationally expensive.	`True`
`conditional_expectation_plot`	`bool`	If `True`, computes Conditional Expectation Plots (CEPs), which are computationally cheap.	`True`
`permutation_variable_importance`	`bool`	If `True`, computes permutation variable importance.	`True`
`shap_values`	`bool`	If `True`, computes SHAP-based metrics.	`True`
`permutation_variable_importance_rounds`	`int`	The number of rounds for permutation variable importance. More rounds increase accuracy but take longer. A value of 1 is often sufficient. Set to 0 to disable.	`1`
`num_threads`	`Optional[int]`	The number of threads to use. Defaults to the number of available CPU cores.	`None`
`maximum_duration`	`Optional[float]`	The approximate maximum duration of the analysis in seconds. The analysis may run slightly longer.	`20`
`features`	`Optional[List[str]]`	If specified, PDP and CEP plots will be limited to these features and displayed in this order.	`None`

Returns:

Type	Description
`Analysis`	An `Analysis` object containing the results.
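The idea behind the permutation variable importance controlled by `permutation_variable_importance` and `permutation_variable_importance_rounds` can be illustrated with a small standalone sketch. This is not YDF's implementation: `toy_model`, the dataset, and the `permutation_importance` helper are hypothetical and rely only on the Python standard library. The sketch shuffles one feature column at a time and measures how much the accuracy drops; averaging over several rounds reduces the noise introduced by the shuffle.

```python
import random

def toy_model(row):
    # Hypothetical model: predicts 1 when "f1" exceeds 0.5 and ignores "f2".
    return 1 if row["f1"] > 0.5 else 0

def accuracy(rows, labels):
    correct = sum(toy_model(r) == y for r, y in zip(rows, labels))
    return correct / len(rows)

def permutation_importance(rows, labels, feature, rounds=1, seed=0):
    """Average accuracy drop when the values of `feature` are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    drops = []
    for _ in range(rounds):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the dataset with the shuffled column in place of the original.
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted, labels))
    return sum(drops) / rounds

# Hypothetical dataset: the label depends on "f1" only.
data_rng = random.Random(42)
rows = [{"f1": data_rng.random(), "f2": data_rng.random()} for _ in range(200)]
labels = [1 if r["f1"] > 0.5 else 0 for r in rows]

importance_f1 = permutation_importance(rows, labels, "f1", rounds=5)
importance_f2 = permutation_importance(rows, labels, "f2", rounds=5)
print(importance_f1, importance_f2)
```

A feature the model never uses, such as `f2` here, gets an importance of zero, while shuffling `f1` visibly degrades the accuracy.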

analyze_prediction ¶

analyze_prediction(
    single_example: InputDataset,
    features: Optional[List[str]] = None,
) -> PredictionAnalysis

Explains a single prediction of the model.

This method shows how each feature value contributed to the final prediction for a specific example. For a global model analysis, use model.analyze() instead.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Explain the prediction for the first example in the test set
test_ds = pd.read_csv("test.csv")
first_example = test_ds.iloc[:1]
explanation = model.analyze_prediction(first_example)

# Display the explanation in a notebook.
explanation

Parameters:

Name	Type	Description	Default
`single_example`	`InputDataset`	A dataset containing a single example to explain.	required
`features`	`Optional[List[str]]`	If specified, the analysis will be limited to these features, and they will be displayed in the specified order.	`None`

Returns:

Type	Description
`PredictionAnalysis`	A `PredictionAnalysis` object containing the explanation.

benchmark ¶

benchmark(
    ds: InputDataset,
    benchmark_duration: float = 3,
    warmup_duration: float = 1,
    batch_size: int = 100,
    num_threads: Optional[int] = None,
) -> BenchmarkInferenceCCResult

Benchmarks the inference speed of the model on a given dataset.

This method measures the time it takes to run predictions on the dataset using the Yggdrasil Decision Forests C++ engine. Note that inference times may vary on different machines or with other APIs. A C++ serving template can be generated with model.to_cpp().

Parameters:

Name	Type	Description	Default
`ds`	`InputDataset`	The dataset to use for benchmarking.	required
`benchmark_duration`	`float`	The target duration of the benchmark in seconds. The actual duration may be slightly different. Must be > 0.	`3`
`warmup_duration`	`float`	The target duration of the warmup phase in seconds. During this phase, predictions are run but not timed, to warm up caches. Must be > 0.	`1`
`batch_size`	`int`	The number of examples to process in each batch. The impact of this parameter depends on the machine's architecture (e.g., cache sizes).	`100`
`num_threads`	`Optional[int]`	The number of threads to use for the benchmark. If not specified, it defaults to the number of available CPU cores.	`None`

Returns:

Type	Description
`BenchmarkInferenceCCResult`	An object containing the benchmark results.

Raises:

Type	Description
`ValueError`	If `benchmark_duration`, `warmup_duration`, or `batch_size` is not positive.
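To make the roles of `benchmark_duration`, `warmup_duration`, and `batch_size` concrete, here is a minimal pure-Python sketch of a warmup-then-measure loop. It is an illustration only and does not reflect how the C++ engine measures time; `run_benchmark` and `fake_predict` are hypothetical names standing in for the benchmark and for a model's prediction function.

```python
import time

def run_benchmark(predict_fn, examples, benchmark_duration=3.0,
                  warmup_duration=1.0, batch_size=100):
    """Times `predict_fn` on batches of `examples` after an untimed warmup."""
    if benchmark_duration <= 0 or warmup_duration <= 0 or batch_size <= 0:
        raise ValueError("durations and batch_size must be positive")
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]

    def run_for(duration):
        # Run whole passes over the batches until `duration` has elapsed,
        # so the actual duration can slightly exceed the target.
        start = time.perf_counter()
        num_examples = 0
        while time.perf_counter() - start < duration:
            for batch in batches:
                predict_fn(batch)
                num_examples += len(batch)
        return time.perf_counter() - start, num_examples

    run_for(warmup_duration)  # Warm up caches; the result is discarded.
    elapsed, num_examples = run_for(benchmark_duration)
    return elapsed / num_examples  # Seconds per example.

def fake_predict(batch):
    # Hypothetical stand-in for a model's prediction function.
    return [sum(x) for x in batch]

examples = [[float(i), float(i + 1)] for i in range(1000)]
seconds_per_example = run_benchmark(
    fake_predict, examples, benchmark_duration=0.05, warmup_duration=0.01
)
print(f"{seconds_per_example * 1e9:.0f} ns/example")
```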

data_spec ¶

data_spec() -> DataSpecification

The data specification of the dataset used to train the model.

Returns:

Type	Description
`DataSpecification`	A DataSpecification protobuf object.

describe ¶

describe(
    output_format: Literal[
        "auto", "text", "notebook", "html"
    ] = "auto",
    full_details: bool = False,
) -> Union[str, HtmlNotebookDisplay]

Generates a textual or HTML description of the model.

Parameters:

Name	Type	Description	Default
`output_format`	`Literal['auto', 'text', 'notebook', 'html']`	The format of the output. - "auto": "notebook" in an IPython notebook, "text" otherwise. - "text": A plain text description. - "html": A standalone HTML description. - "notebook": An HTML description for display in a notebook cell.	`'auto'`
`full_details`	`bool`	If `True`, the full model structure is included, which can be very large.	`False`

Returns:

Type	Description
`Union[str, HtmlNotebookDisplay]`	The model description as a string or an HTML display object.

distance ¶

distance(
    data1: InputDataset,
    data2: Optional[InputDataset] = None,
) -> ndarray

Computes the pairwise distance between examples in "data1" and "data2".

If "data2" is not provided, computes the pairwise distance between examples in "data1".

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.

Different models are free to implement different distances with different definitions. For this reason, unless indicated by the model, distances from different models cannot be compared.

The distance is not guaranteed to satisfy the triangular inequality property of metric distances.

Not all models can compute distances. For models that cannot, this function raises an exception.

Parameters:

Name	Type	Description	Default
`data1`	`InputDataset`	Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset.	required
`data2`	`Optional[InputDataset]`	Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset.	`None`

Returns:

Type	Description
`ndarray`	Pairwise distance
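The shape of the returned matrix can be illustrated with a small sketch. The distance below is one common definition for tree ensembles (one minus the fraction of trees in which two examples reach the same leaf); whether this model uses exactly this definition is an assumption, and the `leaf_distance` helper and the leaf indices are hypothetical.

```python
def leaf_distance(leaves1, leaves2):
    """distances[i][j] = 1 - fraction of trees in which example i of
    `leaves1` and example j of `leaves2` fall in the same leaf."""
    return [
        [1.0 - sum(a == b for a, b in zip(row1, row2)) / len(row1)
         for row2 in leaves2]
        for row1 in leaves1
    ]

# Hypothetical leaf indices: one row per example, one column per tree.
test_leaves = [[0, 2, 1], [3, 2, 0]]
train_leaves = [[0, 2, 1], [3, 1, 1], [0, 0, 0]]

distances = leaf_distance(test_leaves, train_leaves)
# One row per test example, one column per train example.
print(len(distances), len(distances[0]))
```

With such a definition, two examples that share every leaf are at distance 0, and nothing guarantees the triangle inequality.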

evaluate ¶

evaluate(
    data: InputDataset,
    *,
    weighted: Optional[bool] = None,
    task: Optional[Task] = None,
    label: Optional[str] = None,
    group: Optional[str] = None,
    bootstrapping: Union[bool, int] = False,
    ndcg_truncation: int = 5,
    mrr_truncation: int = 5,
    map_truncation: int = 5,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> Evaluation

Evaluates the quality of a model on a dataset.

In a notebook environment, the returned Evaluation object is displayed as a rich HTML report with plots.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Evaluate the model on a test dataset
test_ds = pd.read_csv("test.csv")
evaluation = model.evaluate(test_ds)

# Display the evaluation report in a notebook
evaluation

You can also evaluate the model on a different task than it was trained for, by overriding the task, label, and group arguments.

# Train a regression model
model = ydf.RandomForestLearner(
    label="price", task=ydf.Task.REGRESSION
).train(...)

# Evaluate it as a ranking model
ranking_evaluation = model.evaluate(
    test_ds, task=ydf.Task.RANKING, group="session_id"
)

Parameters:

Name	Type	Description	Default
`data`	`InputDataset`	The dataset for evaluation.	required
`weighted`	`Optional[bool]`	If `True`, the evaluation is weighted using the training weights. If `False`, it is unweighted. If `None` (default), it defaults to `False` with a warning if the model was trained with weights. The default value will change to `True` in a future version.	`None`
`task`	`Optional[Task]`	Overrides the model's task for this evaluation. Defaults to the model's original task.	`None`
`label`	`Optional[str]`	Overrides the label column for this evaluation. Defaults to the model's original label.	`None`
`group`	`Optional[str]`	Overrides the grouping column for this evaluation, used for ranking tasks. Defaults to the model's original group column.	`None`
`bootstrapping`	`Union[bool, int]`	If `True`, enables bootstrapping with 2000 samples to compute confidence intervals and statistical tests. If an integer (>= 100) is provided, it specifies the number of samples.	`False`
`ndcg_truncation`	`int`	The truncation level for the NDCG metric.	`5`
`mrr_truncation`	`int`	The truncation level for the MRR metric.	`5`
`map_truncation`	`int`	The truncation level for the MAP metric.	`5`
`use_slow_engine`	`bool`	If `True`, uses a slower, more robust inference engine. See `predict()` for details.	`False`
`num_threads`	`Optional[int]`	The number of threads to use. Defaults to the number of available CPU cores.	`None`

Returns:

Type	Description
`Evaluation`	An `Evaluation` object containing the model's performance metrics.
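As an illustration of what the `bootstrapping` option computes, the following standalone sketch derives a percentile bootstrap confidence interval for an accuracy. It is not YDF's implementation: the `bootstrap_ci` helper and the per-example correctness values are hypothetical, and only the default of 2000 samples is taken from the parameter description above.

```python
import random

def bootstrap_ci(correct, num_samples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval of the mean of `correct`."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the examples with replacement and recompute the metric each time.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(num_samples)
    )
    lower_idx = int((1 - confidence) / 2 * num_samples)
    upper_idx = int((1 + confidence) / 2 * num_samples) - 1
    return means[lower_idx], means[upper_idx]

# Hypothetical per-example correctness (1 = correct prediction): 80% accuracy.
correct = [1] * 80 + [0] * 20
accuracy = sum(correct) / len(correct)
lower, upper = bootstrap_ci(correct)
print(f"accuracy={accuracy:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

More samples give a more stable interval at the cost of a longer evaluation.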

feature_selection_logs ¶

feature_selection_logs() -> Optional[FeatureSelectorLogs]

Retrieves the feature selection logs, if available.

Returns:

Type	Description
`Optional[FeatureSelectorLogs]`	The feature selection logs, or `None` if they are not available.

force_engine ¶

force_engine(engine_name: Optional[str]) -> None

Forces the model to use a specific inference engine.

By default (engine_name=None), the model automatically uses the fastest compatible engine. This method allows you to override that behavior.

If an invalid or incompatible engine name is provided, subsequent calls to predict(), evaluate(), etc., will fail.

Parameters:

Name	Type	Description	Default
`engine_name`	`Optional[str]`	The name of a compatible engine, or `None` to restore automatic selection.	required

get_all_trees ¶

get_all_trees() -> Sequence[Tree]

Returns all the trees in the model.

get_tree ¶

get_tree(tree_idx: int) -> Tree

Gets a single tree of the model.

Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, num_trees()).	required

Returns:

Type	Description
`Tree`	The tree.

hyperparameter_optimizer_logs ¶

hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]

Returns the logs of the hyperparameter tuning process, if any.

Returns:

Type	Description
`Optional[OptimizerLogs]`	An `OptimizerLogs` object containing the tuning trials, or `None` if the model was not trained with hyperparameter tuning.

input_feature_names ¶

input_feature_names() -> List[str]

Returns the names of the input features.

The feature names are sorted by their column index in the data specification.

Returns:

Type	Description
`List[str]`	A list of feature name strings.

input_features ¶

input_features() -> Sequence[InputFeature]

Returns the input features of the model.

The features are sorted by their column index in the data specification.

Returns:

Type	Description
`Sequence[InputFeature]`	A list of `InputFeature` objects.

input_features_col_idxs ¶

input_features_col_idxs() -> Sequence[int]

Returns the column indices of the input features in the dataspec.

iter_trees ¶

iter_trees() -> Iterator[Tree]

Returns an iterator over all the trees in the model.

label ¶

label() -> Optional[str]

Returns the name of the label column.

Returns:

Type	Description
`Optional[str]`	The label column name as a string, or `None` if the model has no label.

label_classes ¶

label_classes() -> List[str]

Returns the list of possible label values for a classification model.

The order of the classes in the returned list corresponds to the order of probabilities in the output of model.predict().

Returns:

Type	Description
`List[str]`	A list of class name strings.

Raises:

Type	Description
`ValueError`	If the model is not a classification model.

label_col_idx ¶

label_col_idx() -> int

Returns the index of the label column in the dataspec.

Returns:

Type	Description
`int`	The column index, or -1 if the model has no label.

list_compatible_engines ¶

list_compatible_engines() -> Sequence[str]

Lists the inference engines compatible with the model.

The engines are sorted from likely-fastest to likely-slowest.

Returns:

Type	Description
`Sequence[str]`	A list of names of compatible inference engines.

metadata ¶

metadata() -> ModelMetadata

Metadata associated with the model.

A model's metadata contains information that does not influence its predictions, such as the creation time. When distributing a model for wide release, it may be useful to clear or modify the metadata.

Example:

# Clear the metadata
model.set_metadata(ydf.ModelMetadata())

Returns:

Type	Description
`ModelMetadata`	The model's metadata object.

name ¶

name() -> str

Returns the name of the model type (e.g., "RANDOM_FOREST").

num_examples_per_tree ¶

num_examples_per_tree() -> int

Returns the number of examples used to grow each tree.

num_nodes ¶

num_nodes() -> int

Returns the number of nodes in the decision forest.

num_trees ¶

num_trees() -> int

Returns the number of trees in the decision forest.

plot_tree ¶

plot_tree(
    tree_idx: int = 0,
    max_depth: Optional[int] = None,
    options: Optional[PlotOptions] = None,
    d3js_url: str = "https://d3js.org/d3.v6.min.js",
) -> TreePlot

Plots an interactive HTML rendering of the tree.

Usage example:

import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
    "c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
    "label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()

Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, self.num_trees()).	`0`
`max_depth`	`Optional[int]`	Maximum tree depth of the plot. Set to None for full depth.	`None`
`options`	`Optional[PlotOptions]`	Advanced options for plotting. Set to None for default style.	`None`
`d3js_url`	`str`	URL to load the d3.js library from.	`'https://d3js.org/d3.v6.min.js'`

Returns:

Type	Description
`TreePlot`	In interactive environments, an interactive plot. The HTML source can also be exported to file.

predict ¶

predict(
    data: InputDataset,
    *,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> ndarray

Runs the model on a dataset and returns its predictions.

The output is a NumPy array of float32 values. The structure of this array depends on the model's task. See the "Returns" section for details.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Get predictions on a test dataset
test_ds = pd.read_csv("test.csv")
predictions = model.predict(test_ds)

Parameters:

Name	Type	Description	Default
`data`	`InputDataset`	The dataset to make predictions on. Can be a pandas DataFrame, a dictionary of NumPy arrays, a path to a file, etc. If the dataset contains the label column, it will be ignored.	required
`use_slow_engine`	`bool`	If `True`, uses a slower, more robust inference engine. This is a fallback for rare edge cases where the default engines might fail (e.g., models with a very large number of categorical conditions). If you encounter such a case, please report it to the YDF developers.	`False`
`num_threads`	`Optional[int]`	The number of threads to use for prediction. If `None`, it defaults to the number of available CPU cores.	`None`

Returns:

Type	Description
`ndarray`	A NumPy array containing the predictions. The shape and content vary by task: - `Task.CLASSIFICATION`: Binary classification (2 classes): an array of shape `[num_examples]`. Each value is the probability of the positive class (at `model.label_classes()[1]`). The probability of the negative class is `1 - prediction`. Multi-class classification (>2 classes): an array of shape `[num_examples, num_classes]`. Each row contains the probabilities for each class, in the order of `model.label_classes()`. - `Task.REGRESSION`: An array of shape `[num_examples]`, where each value is the predicted numerical outcome. - `Task.RANKING`: An array of shape `[num_examples]`, where each value is the predicted score for the item. Higher scores indicate higher rank. - `Task.CATEGORICAL_UPLIFT` and `Task.NUMERICAL_UPLIFT`: An array of shape `[num_examples]`. Each value is the predicted uplift, representing the incremental effect of the treatment. - `Task.ANOMALY_DETECTION`: An array of shape `[num_examples]`, where each value is the anomaly score (0 for most normal, 1 for most anomalous).
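For anomaly detection, the score range described above (0 for most normal, 1 for most anomalous) matches the classic isolation forest formulation, in which the score is `2 ** (-E[h(x)] / c(n))`, with `E[h(x)]` the mean depth at which an example is isolated and `c(n)` a normalization that depends on the number of examples per tree. The sketch below implements that textbook formula with the standard library only; that YDF computes its score in exactly this way is an assumption, and the function names and the value of `n` are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful search among n items."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # Approximation of H(n - 1).
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, num_examples_per_tree):
    """Classic isolation forest score: 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_depth / average_path_length(num_examples_per_tree))

n = 256  # Hypothetical number of examples per tree.
c = average_path_length(n)
shallow = anomaly_score(2.0, n)    # Isolated quickly: likely anomalous.
typical = anomaly_score(c, n)      # Average depth: score of 0.5.
deep = anomaly_score(2.0 * c, n)   # Hard to isolate: likely normal.
print(shallow, typical, deep)
```

Examples isolated after only a few splits get scores close to 1, while examples that need many splits get scores close to 0.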

predict_class ¶

predict_class(
    data: InputDataset,
    *,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> ndarray

Returns the most likely predicted class for a classification model.

This is a convenience method for classification tasks. It returns a NumPy array of strings representing the predicted class for each example. In case of a tie in probabilities, the class that appears first in model.label_classes() is chosen.

For the full class probabilities, use model.predict().

Usage example:

import pandas as pd
import ydf

# Train a classification model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="category").train(train_ds)

# Get the predicted class for each example
test_ds = pd.read_csv("test.csv")
predicted_classes = model.predict_class(test_ds)

Parameters:

Name	Type	Description	Default
`data`	`InputDataset`	The dataset to make predictions on.	required
`use_slow_engine`	`bool`	If `True`, uses a slower, more robust inference engine. See `predict()` for details.	`False`
`num_threads`	`Optional[int]`	The number of threads to use. Defaults to the number of available CPU cores.	`None`

Returns:

Type	Description
`ndarray`	A NumPy array of strings of shape `[num_examples]`, containing the most likely predicted class for each example.

Raises:

Type	Description
`ValueError`	If the model is not a classification model.
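The tie-breaking rule described above can be sketched in a few lines of plain Python. The helpers and class names below are hypothetical and only mirror the documented behavior (the class that appears first in `model.label_classes()` wins a tie); they are not the library's code.

```python
def most_likely_class(probabilities, label_classes):
    """Returns the class with the highest probability; first class wins ties."""
    # max() returns the first maximal element, which provides the tie-break.
    best_idx = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return label_classes[best_idx]

def binary_to_class(positive_probability, label_classes):
    """Expands a binary prediction into [P(negative), P(positive)] first."""
    p = positive_probability
    return most_likely_class([1.0 - p, p], label_classes)

label_classes = ["cat", "dog", "fish"]  # Hypothetical model.label_classes().
print(most_likely_class([0.2, 0.5, 0.3], label_classes))  # dog
print(most_likely_class([0.4, 0.4, 0.2], label_classes))  # cat (tie)
print(binary_to_class(0.5, ["no", "yes"]))                # no (tie)
```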

predict_leaves ¶

predict_leaves(data: InputDataset) -> ndarray

Gets the index of the active leaf in each tree.

The active leaf is the leaf that receives the example during inference.

The returned value "leaves[i,j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed by depth-first exploration, with the negative child visited before the positive one.

Parameters:

Name	Type	Description	Default
`data`	`InputDataset`	Dataset.	required

Returns:

Type	Description
`ndarray`	Index of the active leaf for each tree in the model.
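The leaf numbering (depth-first, negative child before positive child) can be made concrete with a toy sketch. The nested-dictionary tree format, the "feature >= threshold" convention for the positive branch, and the helper names are all hypothetical; they only illustrate the indexing order, not YDF's internal representation.

```python
def count_leaves(node):
    """Number of leaves in the subtree rooted at `node`."""
    if "value" in node:
        return 1
    return count_leaves(node["neg"]) + count_leaves(node["pos"])

def leaf_index(tree, example):
    """Index of the active leaf in depth-first, negative-first order."""
    index = 0
    node = tree
    while "value" not in node:
        if example[node["feature"]] >= node["threshold"]:
            # The positive branch comes after every leaf of the negative one.
            index += count_leaves(node["neg"])
            node = node["pos"]
        else:
            node = node["neg"]
    return index

# Hypothetical forest of two small trees.
tree_a = {
    "feature": "f1", "threshold": 0.5,
    "neg": {"value": "A"},                      # leaf 0
    "pos": {"feature": "f2", "threshold": 2.0,
            "neg": {"value": "B"},              # leaf 1
            "pos": {"value": "C"}},             # leaf 2
}
tree_b = {
    "feature": "f2", "threshold": 1.5,
    "neg": {"value": "D"},                      # leaf 0
    "pos": {"value": "E"},                      # leaf 1
}
forest = [tree_a, tree_b]
examples = [{"f1": 0.1, "f2": 0.0}, {"f1": 0.9, "f2": 3.0}]

# leaves[i][j]: active leaf of the i-th example in the j-th tree.
leaves = [[leaf_index(tree, ex) for tree in forest] for ex in examples]
print(leaves)  # [[0, 0], [2, 1]]
```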

predict_shap ¶

predict_shap(
    data: InputDataset, *, num_threads: Optional[int] = None
) -> Tuple[Dict[str, ndarray], ndarray]

Computes SHAP values for each example in the given dataset.

SHAP (SHapley Additive exPlanations) values explain a prediction by attributing the outcome to each feature. The sum of an example's SHAP values plus the model's initial prediction (initial_value) equals the model's raw prediction (before any activation function like sigmoid).

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Compute SHAP values on the test dataset
test_ds = pd.read_csv("test.csv")
shap_values, initial_value = model.predict_shap(test_ds)

Parameters:

Name	Type	Description	Default
`data`	`InputDataset`	The dataset to compute SHAP values for. If it contains the label column, it will be ignored.	required
`num_threads`	`Optional[int]`	The number of threads to use. Defaults to the number of available CPU cores.	`None`

Returns:

Type	Description
`Tuple[Dict[str, ndarray], ndarray]`	A tuple `(shap_values, initial_value)` where: - `shap_values`: A dictionary mapping feature names to NumPy arrays. Each array has a shape of `[num_examples]` or `[num_examples, num_outputs]`, containing the SHAP values for that feature. - `initial_value`: A NumPy array of shape `[]` or `[num_outputs]` representing the model's initial prediction (i.e., offset).
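The additivity property stated above (SHAP values plus the initial value equal the raw prediction) can be checked by hand. The sketch below uses plain Python lists and made-up numbers in place of the NumPy arrays returned by `predict_shap`; the feature names and values are hypothetical, and a single output is assumed.

```python
# Hypothetical `predict_shap` output for 3 examples and a single output.
shap_values = {
    "f1": [0.10, -0.20, 0.05],
    "f2": [-0.05, 0.30, 0.10],
}
initial_value = 0.40

num_examples = len(next(iter(shap_values.values())))

# Raw prediction of each example: initial value plus its SHAP values.
raw_predictions = [
    initial_value + sum(values[i] for values in shap_values.values())
    for i in range(num_examples)
]

# The feature with the largest mean absolute SHAP value contributes most.
mean_abs = {
    name: sum(abs(v) for v in values) / num_examples
    for name, values in shap_values.items()
}
top_feature = max(mean_abs, key=mean_abs.get)
print(raw_predictions, top_feature)
```

Comparing such reconstructed values with the model's raw predictions is a quick sanity check when post-processing SHAP values.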

print_tree ¶

print_tree(
    tree_idx: int = 0,
    max_depth: Optional[int] = 6,
    file: Any = stdout,
) -> None

Prints a tree in the terminal.

Usage example:

import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
    "c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
    "label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()

Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, self.num_trees()).	`0`
`max_depth`	`Optional[int]`	Maximum depth of the printed tree. Set to None for full depth.	`6`
`file`	`Any`	Where to print the tree. By default, prints on the terminal standard output.	`stdout`

remove_tree ¶

remove_tree(tree_idx: int) -> None

Removes a single tree of the model.

Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, num_trees()).	required

save ¶

save(
    path: str,
    advanced_options=ModelIOOptions(),
    *,
    pure_serving=False
) -> None

Saves the model to a directory.

YDF uses a proprietary format consisting of multiple files in a single directory. This directory should ideally contain only one model.

YDF models can also be exported to other formats, such as TensorFlow SavedModel (to_tensorflow_saved_model()) or C++ code (to_cpp()).

The model may contain metadata (see model.metadata()). Before distributing a model, consider clearing this metadata: model.set_metadata(ydf.ModelMetadata()).

Usage example:

import pandas as pd
import ydf

# Train a Random Forest model
df = pd.read_csv("my_dataset.csv")
model = ydf.RandomForestLearner(label="my_label").train(df)

# Save the model to disk
model.save("/models/my_model")

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to the directory where the model will be saved.	required
`advanced_options`	`ModelIOOptions`	Advanced options for saving the model.	`ModelIOOptions()`
`pure_serving`	`bool`	If `True`, saves a smaller version of the model suitable for serving by removing training-specific metadata and debug information. This might require more memory during the saving process, but the resulting model on disk will be smaller.	`False`

self_evaluation ¶

self_evaluation() -> Evaluation

Returns the model's self-evaluation, computed during training.

The method of self-evaluation depends on the model type. For example, Random Forests use out-of-bag (OOB) evaluation, while Gradient Boosted Trees use evaluation on a validation dataset. Because of this, self-evaluations are not directly comparable between different model types.

Usage example:

import pandas as pd
import ydf

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

# Get the self-evaluation
self_evaluation = model.self_evaluation()

# In a notebook, this will print a rich report.
self_evaluation

Returns:

Type	Description
`Evaluation`	An `Evaluation` object with the metrics.

serialize ¶

serialize() -> bytes

Serializes the model into a bytes object.

A serialized model is equivalent to a model saved with model.save(). It may contain metadata related to training and interpretation. To minimize its size, you can train with the pure_serving_model=True option in the learner.

Usage example:

import pandas as pd
import ydf

# Create and train a model
dataset = pd.DataFrame({"feature": [0, 1], "label": [0, 1]})
learner = ydf.RandomForestLearner(label="label")
model = learner.train(dataset)

# Serialize the model to a bytes object
serialized_model = model.serialize()

# Deserialize the model
deserialized_model = ydf.deserialize_model(serialized_model)

# Make predictions with both models
predictions = model.predict(dataset)
deserialized_predictions = deserialized_model.predict(dataset)

Returns:

Type	Description
`bytes`	The serialized model as a `bytes` object.

set_data_spec ¶

set_data_spec(data_spec: DataSpecification) -> None

Updates the data specification of the model.

This is an advanced feature and should be used with caution, as it can easily lead to a broken model.

Parameters:

Name	Type	Description	Default
`data_spec`	`DataSpecification`	The new DataSpecification protobuf object.	required

set_feature_selection_logs ¶

set_feature_selection_logs(
    value: Optional[FeatureSelectorLogs],
) -> None

Sets the feature selection logs for the model.

Parameters:

Name	Type	Description	Default
`value`	`Optional[FeatureSelectorLogs]`	The feature selection logs to set, or `None` to clear them.	required

set_metadata ¶

set_metadata(metadata: ModelMetadata)

Updates the model's metadata.

Parameters:

Name	Type	Description	Default
`metadata`	`ModelMetadata`	The new metadata object for the model.	required

set_node_format ¶

set_node_format(node_format: NodeFormat) -> None

Sets the serialization format for the nodes.

Parameters:

Name	Type	Description	Default
`node_format`	`NodeFormat`	Node format to use when saving the model.	required

set_tree ¶

set_tree(tree_idx: int, tree: Tree) -> None

Overrides a single tree of the model.

Parameters:

Name	Type	Description	Default
`tree_idx`	`int`	Index of the tree. Should be in [0, num_trees()).	required
`tree`	`Tree`	New tree.	required

task ¶

task() -> Task

The task the model is trained to solve.

Returns:

Type	Description
`Task`	The task enum for this model.

to_cpp ¶

to_cpp(key: str = 'my_model') -> str

Generates C++ code (.h file) for running the model.

This method provides a fast and widely compatible way to deploy YDF models in C++. For applications where binary size is critical, to_standalone_cc is an alternative that produces much smaller binaries with zero dependencies, but may be slower and less compatible with all model types.

How to use:

Generate the header file: open("model.h", "w").write(model.to_cpp())

In your Bazel/Blaze BUILD file, add the necessary dependencies:

//third_party/absl/status:statusor
//third_party/absl/strings
//external/ydf_cc/yggdrasil_decision_forests/api:serving

In your C++ code, include the header and use the model:

#include "path/to/model.h"
#include "yggdrasil_decision_forests/api/serving.h"

namespace ydf = yggdrasil_decision_forests;
// Load the model once.
const auto model = ydf::exported_model_123::LoadModel("<path to model dir>");
// Run predictions.
predictions = model.Predict(...);
...

The generated Predict function uses placeholder values for features. You will need to modify this function to accept your own input data and populate the examples->Set(...) calls accordingly.
For optimal performance, pre-allocate and reuse the examples and predictions objects for each thread.

The generated file contains further documentation.

Parameters:

Name	Type	Description	Default
`key`	`str`	A name for the model, used to create a unique C++ namespace.	`'my_model'`

Returns:

Type	Description
`str`	A string containing the C++ header code.

to_docker ¶

to_docker(path: str, exist_ok: bool = False) -> None

Exports the model as a self-contained Docker endpoint for deployment.

This function creates a directory with a Dockerfile, the model, and all necessary support files to serve the model over an HTTP endpoint.

Usage example:

import ydf
import numpy as np

# Train a model
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100),
    "l": np.random.randint(2, size=100),
})

# Export the model to a Docker endpoint directory
model.to_docker(path="/tmp/my_docker_model")

# See the generated README for instructions
!cat /tmp/my_docker_model/readme.md

# Test the endpoint locally (shell commands)
docker build --platform linux/amd64 -t ydf_predict_image /tmp/my_docker_model
docker run --rm -p 8080:8080 -d ydf_predict_image

# Deploy the model on Google Cloud (shell command)
gcloud run deploy ydf-predict --source /tmp/my_docker_model

# Check the automatically created utility scripts "test_locally.sh" and
# "deploy_in_google_cloud.sh" for more examples.

Parameters:

Name	Type	Description	Default
`path`	`str`	The directory where the Docker endpoint files will be created.	required
`exist_ok`	`bool`	If `False` (default), raises an error if the `path` directory already exists. If `True`, overwrites the content of the directory if it exists.	`False`

to_jax_function ¶

to_jax_function(
    jit: bool = True,
    apply_activation: bool = True,
    leaves_as_params: bool = False,
    compatibility: Union[str, Compatibility] = "XLA",
) -> JaxModel

Converts the model into a JAX function for use in JAX ecosystems.

Usage example:

import ydf
import numpy as np
import jax.numpy as jnp

# Train a model
model = ydf.GradientBoostedTreesLearner(label="l").train({
    "f1": np.random.random(100),
    # Binary labels: 100 integers drawn from {0, 1}.
    "l": np.random.randint(2, size=100),
})

# Convert to a JAX function
jax_model = model.to_jax_function()

# Make predictions
predictions = jax_model.predict({
    "f1": jnp.array([0.1, 0.5, 0.9]),
})

Parameters:

Name	Type	Description	Default
`jit`	`bool`	If `True`, the returned function will be just-in-time compiled with `@jax.jit`.	`True`
`apply_activation`	`bool`	If `True`, the model's activation function (e.g., sigmoid) will be applied to the output.	`True`
`leaves_as_params`	`bool`	If `True`, the model's leaf values are exported as learnable parameters. The returned object will contain a `params` attribute, which must be passed to the `predict` function. This is useful for fine-tuning.	`False`
`compatibility`	`Union[str, Compatibility]`	The JAX runtime compatibility. Can be "XLA" (default) or "TFL" (for TensorFlow Lite).	`'XLA'`

Returns:

Type	Description
`JaxModel`	A dataclass containing the JAX prediction function (`predict`), and optionally the model parameters (`params`) and a feature encoder (`encoder`).

to_standalone_cc ¶

to_standalone_cc(
    name: str = "ydf_model",
    algorithm: Literal["IF_ELSE", "ROUTING"] = "ROUTING",
    classification_output: Literal[
        "CLASS", "SCORE", "PROBABILITY"
    ] = "CLASS",
    categorical_from_string: bool = False,
) -> Union[str, Dict[str, bytes]]

Generates standalone, dependency-free C++ code for model inference.

This method is ideal for size-critical applications. See to_cpp for an alternative with better performance and model compatibility.

How to use:

Copy the generated C++ code into a .h file.

In your C++ code, include the header and call the prediction function:

#include "path/to/generated_model.h"
using namespace <name>;
const auto pred = Prediction(Instance{.f1=5.0, .f2=F2::kRed});

The function is thread-safe.

Alternatively, you can use the cc_ydf_standalone_model Bazel rule for automated code generation (internal to Google).

Save the model with model.save(...) in a directory in Google3.
Create a BUILD file with a filegroup in the model directory e.g.:
```
filegroup(
  name = "model",
  srcs = glob(["**"]),
)
```

In your library's BUILD, create a "cc_ydf_standalone_model" build rule.

load("//external/ydf_cc/yggdrasil_decision_forests/serving/embed:embed.bzl",
  "cc_ydf_standalone_model")
cc_ydf_standalone_model(
  name = "my_model",
  classification_output = "SCORE",
  data = "<path to filegroup>",
)

In your cc_binary or cc_library, add ":my_model" as a dependency.

In your C++ code, include:

#include "<path to BUILD>/my_model.h"

Then call:

using namespace <name>;
const auto pred = Prediction(Instance{.f1=5, .f2=F2::kRed});

Parameters:

Name	Type	Description	Default
`name`	`str`	A name for the model, used to create the C++ namespace.	`'ydf_model'`
`algorithm`	`Literal['IF_ELSE', 'ROUTING']`	The underlying algorithm for prediction. - "ROUTING" (default): Faster and produces a smaller binary. - "IF_ELSE": Generates human-readable if-else conditions.	`'ROUTING'`
`classification_output`	`Literal['CLASS', 'SCORE', 'PROBABILITY']`	The output format for classification models. - "CLASS" (default): The predicted class index (fast). - "SCORE": The raw scores (e.g., logits) for all classes. - "PROBABILITY": The probabilities for all classes (slower, as it requires a softmax).	`'CLASS'`
`categorical_from_string`	`bool`	If `True`, generates helper functions to convert strings to categorical feature enum values.	`False`
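
The relationship between the three `classification_output` formats can be sketched in plain Python. This is only an illustration, not the generated C++ code: it assumes that the raw scores are logits and that the predicted class is the index of the largest score. The helper names `softmax` and `predicted_class` and the example scores are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Converts raw scores (assumed to be logits) into probabilities."""
    # Subtract the largest score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_class(scores: List[float]) -> int:
    """Returns the index of the highest score."""
    # Softmax preserves the ordering of the scores, so no softmax is needed.
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical raw scores for a three-class model ("SCORE" output).
scores = [0.5, 2.0, -1.0]
probabilities = softmax(scores)        # Analogous to the "PROBABILITY" output.
class_index = predicted_class(scores)  # Analogous to the "CLASS" output.
```

Because the class index can be read directly from the scores, "CLASS" avoids the exponentials that "PROBABILITY" needs, which is consistent with the cost difference described in the table.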

Returns:

Type	Description
`Union[str, Dict[str, bytes]]`	A string with the C++ source code, or a dictionary of filename to source code if multiple files are generated.
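
Since the return value is either a single string or a dictionary of files, code that saves it has to handle both cases. The following is a minimal sketch using only the standard library. The helper `write_generated_code` and the default file name `ydf_model.h` are assumptions made for illustration, and a placeholder string stands in for the value returned by `to_standalone_cc`.

```python
import os
import tempfile
from typing import Dict, List, Union


def write_generated_code(
    code: Union[str, Dict[str, bytes]],
    output_dir: str,
    default_filename: str = "ydf_model.h",
) -> List[str]:
    """Writes the generated source to `output_dir` and returns the file paths."""
    os.makedirs(output_dir, exist_ok=True)
    if isinstance(code, str):
        # Single-file case: the whole header is one string.
        files = {default_filename: code.encode("utf-8")}
    else:
        # Multi-file case: filename -> file content as bytes.
        files = code
    paths = []
    for filename, content in sorted(files.items()):
        path = os.path.join(output_dir, filename)
        with open(path, "wb") as f:
            f.write(content)
        paths.append(path)
    return paths


# Placeholder value; in practice this would be `model.to_standalone_cc()`.
example_code = "// generated model code\n"
written = write_generated_code(example_code, tempfile.mkdtemp())
```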

to_standalone_java ¶

to_standalone_java(
    name: str = "YdfModel",
    package_name: str = "com.example.ydfmodel",
    classification_output: Literal[
        "CLASS", "SCORE", "PROBABILITY"
    ] = "CLASS",
) -> Dict[str, bytes]

Generates standalone, dependency-free Java code for model inference.

This method is ideal for size-critical applications.

How to use:

Call this function to get the generated code and data:

model = ydf.load_model(...)
java_files = model.to_standalone_java(
    name="MyYdfModel",
    package_name="com.mycompany.myproject"
)

The function returns a dictionary containing two items:
- Key: {name}.java (e.g., "MyYdfModel.java"): Value is the Java source code as bytes.
- Key: {name}Data.bin (e.g., "MyYdfModelData.bin"): Value is the binary model data as bytes.

Save these files to your Java project:

name = "MyYdfModel"  # Same value as the `name` argument passed above.
with open(f"{name}.java", "wb") as f:
    f.write(java_files[f"{name}.java"])
with open(f"{name}Data.bin", "wb") as f:
    f.write(java_files[f"{name}Data.bin"])

Place the {name}Data.bin file in the Java classpath, typically in the resources directory.

In your Java code, import the generated class and use the static predict method:

import com.mycompany.myproject.MyYdfModel;

// Create an Instance with feature values.
// Categorical features are represented by enums in the generated class.
MyYdfModel.Instance instance = new MyYdfModel.Instance(
    5.0f, // Numerical feature
    MyYdfModel.FeatureF2.kRed // Categorical feature
);

// Get the prediction.
float prediction = MyYdfModel.predict(instance);

The predict function is thread-safe. The generated class also contains enums for all categorical features.

Parameters:

Name	Type	Description	Default
`name`	`str`	A name for the model, used to create the Java class name.	`'YdfModel'`
`package_name`	`str`	The Java package name for the generated class.	`'com.example.ydfmodel'`
`classification_output`	`Literal['CLASS', 'SCORE', 'PROBABILITY']`	The output format for classification models. - "CLASS" (default): The predicted class enum. - "SCORE": The raw scores (e.g., logits) for all classes. - "PROBABILITY": The probabilities for all classes.	`'CLASS'`

Returns:

Type	Description
`Dict[str, bytes]`	A dictionary of filename to file content. This includes the Java source file and a binary resource file containing the model data.

to_tensorflow_function ¶

to_tensorflow_function(
    temp_dir: Optional[str] = None,
    can_be_saved: bool = True,
    squeeze_binary_classification: bool = True,
    force: bool = False,
) -> Module

Converts the model into a callable TensorFlow Module (@tf.function).

This allows the YDF model to be integrated into larger TensorFlow graphs. Requires ydf-tf (pip install ydf-tf).

Note: Export to TensorFlow is not yet available for Anomaly Detection models.

Usage example:

import ydf
import numpy as np
import tensorflow as tf

# Train a model
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(100),
    "l": np.random.randint(2, size=100),
})

# Convert to a TF Module
tf_model_fn = model.to_tensorflow_function()

# Make predictions
predictions = tf_model_fn({"f1": tf.constant([0.1, 0.5, 0.9])})

Parameters:

Name	Type	Description	Default
`temp_dir`	`Optional[str]`	A temporary directory for the conversion process.	`None`
`can_be_saved`	`bool`	If `True` (default), the returned module can be saved with `tf.saved_model.save`, and temporary files are preserved. If `False`, temporary files are deleted, and the module cannot be saved.	`True`
`squeeze_binary_classification`	`bool`	If `True` (default), binary classification models will output a tensor of shape `[num_examples]` with the probability of the positive class. If `False`, the output is shape `[num_examples, 2]`.	`True`
`force`	`bool`	If `True`, attempts to export even in unsupported environments.	`False`

Returns:

Type	Description
`Module`	A `tf.Module` containing the model.
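
The two output layouts controlled by `squeeze_binary_classification` can be illustrated without TensorFlow. This sketch uses plain Python lists and assumes that, in the `[num_examples, 2]` layout, column 0 holds the probability of the negative class and column 1 the probability of the positive class. The helper names are hypothetical.

```python
from typing import List


def squeeze_binary(probabilities: List[List[float]]) -> List[float]:
    """Keeps only the positive-class column of a [num_examples, 2] output."""
    # Assumption: column 1 holds the probability of the positive class.
    return [row[1] for row in probabilities]


def unsqueeze_binary(positive: List[float]) -> List[List[float]]:
    """Rebuilds the [num_examples, 2] layout from positive-class probabilities."""
    return [[1.0 - p, p] for p in positive]


# Made-up predictions for three examples.
full = [[0.9, 0.1], [0.25, 0.75], [0.5, 0.5]]  # Shape [3, 2].
squeezed = squeeze_binary(full)                 # Shape [3].
```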

to_tensorflow_saved_model ¶

to_tensorflow_saved_model(
    path: str,
    input_model_signature_fn: Any = None,
    *,
    mode: Literal["keras", "tf"] = "tf",
    feature_dtypes: Dict[str, TFDType] = {},
    servo_api: bool = False,
    feed_example_proto: bool = False,
    pre_processing: Optional[Callable] = None,
    post_processing: Optional[Callable] = None,
    temp_dir: Optional[str] = None,
    tensor_specs: Optional[Dict[str, Any]] = None,
    feature_specs: Optional[Dict[str, Any]] = None,
    force: bool = False
) -> None

Exports the model as a TensorFlow SavedModel.

This function requires TensorFlow and the ydf-tf package to be installed. Install them by running the command pip install ydf-tf. The generated SavedModel relies on the YDF Custom Inference Op. This op is available by default in various platforms such as Servomatic, TensorFlow Serving, Vertex AI, and TensorFlow.js.

Usage example:

!pip install ydf-tf

import ydf
import numpy as np
import tensorflow as tf

# Train a model.
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100).astype(dtype=np.float32),
    "l": np.random.randint(2, size=100),
})

# Export the model to the TensorFlow SavedModel format.
# The model can be executed with Servomatic, TensorFlow Serving and
# Vertex AI.
model.to_tensorflow_saved_model(path="/tmp/my_model", mode="tf")

# The model can also be loaded in TensorFlow and executed locally.

# Load the TensorFlow SavedModel.
tf_model = tf.saved_model.load("/tmp/my_model")

# Make predictions
tf_predictions = tf_model({
    "f1": tf.constant(np.random.random(size=10)),
    "f2": tf.constant(np.random.random(size=10), dtype=tf.float32),
})

TensorFlow SavedModels do not automatically cast feature values. For instance, a model trained with a numerical feature of dtype float32 requires this feature to be fed as float32 values during inference. You can override the dtype of a feature with the feature_dtypes argument:

model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    # "f1" is fed as a tf.int64 instead of tf.float64
    feature_dtypes={"f1": tf.int64},
)

Some TensorFlow Serving or Servomatic pipelines feed examples as serialized TensorFlow Example protos (instead of raw tensor values) and/or wrap the raw model output (e.g., probability predictions) into a special structure (called the Serving API). You can create models compatible with these two conventions with feed_example_proto=True and servo_api=True, respectively:

model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    feed_example_proto=True,
    servo_api=True
)

If your model requires data pre-processing or post-processing, express it as a @tf.function or a tf.Module and pass it to the pre_processing or post_processing argument, respectively.

Warning: When exporting a SavedModel, YDF infers the model signature using the dtype of the features observed during training. If the signature of the pre_processing function differs from the signature of the model (e.g., the pre-processing creates a new feature), you need to specify the tensor specs (tensor_specs; if feed_example_proto=False) or the feature specs (feature_specs; if feed_example_proto=True):

import math  # Provides math.nan, used in the feature specs below.

# Define a pre-processing function
@tf.function
def pre_processing(raw_features):
  features = {**raw_features}
  # Create a new feature.
  features["sin_f1"] = tf.sin(features["f1"])
  # Remove a feature
  del features["f1"]
  return features

# Create Numpy dataset
raw_dataset = {
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100),
    "l": np.random.randint(2, size=100),
}

# Apply the preprocessing on the training dataset.
processed_dataset = (
    tf.data.Dataset.from_tensor_slices(raw_dataset)
    .batch(128)  # The batch size has no impact on the model.
    .map(pre_processing)
    .prefetch(tf.data.AUTOTUNE)
)

# Train a model on the pre-processed dataset.
ydf_model = ydf.RandomForestLearner(
    label="l",
    task=ydf.Task.CLASSIFICATION,
).train(processed_dataset)

# Export the model to a SavedModel consuming raw tensors, with the pre-processing
ydf_model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    feed_example_proto=False,
    pre_processing=pre_processing,
    tensor_specs={
        "f1": tf.TensorSpec(shape=[None], name="f1", dtype=tf.float64),
        "f2": tf.TensorSpec(shape=[None], name="f2", dtype=tf.float64),
    }
)

# Export the model to a SavedModel consuming serialized tf examples with the
# pre-processing
ydf_model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    feed_example_proto=True,
    pre_processing=pre_processing,
    feature_specs={
        "f1": tf.io.FixedLenFeature(
            shape=[], dtype=tf.float32, default_value=math.nan
        ),
        "f2": tf.io.FixedLenFeature(
            shape=[], dtype=tf.float32, default_value=math.nan
        ),
    }
)

For more flexibility, use the method to_tensorflow_function instead of to_tensorflow_saved_model.

Note that export to TensorFlow is not yet available for Isolation Forest models.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to store the TensorFlow Decision Forests model.	required
`input_model_signature_fn`	`Any`	A lambda that returns the (Dense,Sparse,Ragged)TensorSpec (or structure of TensorSpec e.g. dictionary, list) corresponding to input signature of the model. If not specified, the input model signature is created by `tfdf.keras.build_default_input_model_signature`. For example, specify `input_model_signature_fn` if a numerical input feature (which is consumed as DenseTensorSpec(float32) by default) will be fed differently (e.g. RaggedTensor(int64)). Only compatible with mode="keras".	`None`
`mode`	`Literal['keras', 'tf']`	How the YDF model is converted into a TensorFlow SavedModel. 1) mode = "keras": Turn the model into a Keras 2 model using TensorFlow Decision Forests, and then save it with `tf_keras.models.save_model`. 2) mode = "tf" (default; recommended): Turn the model into a TensorFlow Module, and save it with `tf.saved_model.save`.	`'tf'`
`feature_dtypes`	`Dict[str, TFDType]`	Mapping from feature name to TensorFlow dtype. Use this mapping to override feature dtypes. For instance, numerical features are encoded with tf.float32 by default. If you plan on feeding tf.float64 or tf.int32, use `feature_dtypes` to specify it. `feature_dtypes` is ignored if `tensor_specs` is set. If set, disables the automatic signature extraction on `pre_processing` (if `pre_processing` is also set). Only compatible with mode="tf".	`{}`
`servo_api`	`bool`	If true, adds a SavedModel signature to make the model compatible with the `Classify` or `Regress` servo APIs. Only compatible with mode="tf". If false, outputs the raw model predictions.	`False`
`feed_example_proto`	`bool`	If false, the model expects the input features to be provided as TensorFlow values. This is the most efficient way to make predictions. If true, the model expects the input features to be provided as a binary serialized TensorFlow Example proto. This is the format expected by Vertex AI and most TensorFlow Serving pipelines.	`False`
`pre_processing`	`Optional[Callable]`	Optional TensorFlow function or module to apply on the input features before applying the model. If the `pre_processing` function has been traced (i.e., the function has been called once with actual data and contains a concrete instance in its cache), this signature is extracted and used as the signature of the SavedModel. Only compatible with mode="tf".	`None`
`post_processing`	`Optional[Callable]`	Optional TensorFlow function or module to apply on the model predictions. Only compatible with mode="tf".	`None`
`temp_dir`	`Optional[str]`	Temporary directory used during the conversion. If None (default), uses `tempfile.mkdtemp` default temporary directory.	`None`
`tensor_specs`	`Optional[Dict[str, Any]]`	Optional dictionary of `tf.TensorSpec` that define the input features of the model to export. If not provided, the `TensorSpec`s are automatically generated based on the model features seen during training. This means that `tensor_specs` is only necessary when using a `pre_processing` argument that expects different features than what the model was trained with. This argument is ignored when exporting a model with `feed_example_proto=True`, as in this case the TensorSpecs are defined by the `tf.io.parse_example` parsing feature specs. Only compatible with mode="tf".	`None`
`feature_specs`	`Optional[Dict[str, Any]]`	Optional dictionary of `tf.io.parse_example` parsing feature specs, e.g., `tf.io.FixedLenFeature` or `tf.io.RaggedFeature`. If not provided, the parsing feature specs are automatically generated based on the model features seen during training. This means that `feature_specs` is only necessary when using a `pre_processing` argument that expects different features than what the model was trained with. This argument is ignored when exporting a model with `feed_example_proto=False`. Only compatible with mode="tf".	`None`
`force`	`bool`	Tries to export even in currently unsupported environments. WARNING: Setting this to true may crash the Python runtime.	`False`

training_logs ¶

training_logs() -> List[TrainingLogEntry]

Returns the model's training logs.

The training logs contain performance metrics calculated periodically during model training. The content and evaluation method depend on the model type (e.g., out-of-bag for Random Forest, validation set for Gradient Boosted Trees).

Usage example:

import pandas as pd
import ydf
import matplotlib.pyplot as plt

# Train a model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

# Get the training logs
logs = model.training_logs()

# Plot the accuracy over training iterations
plt.plot(
    [log.iteration for log in logs],
    [log.evaluation.accuracy for log in logs]
)
plt.xlabel("Iteration (Number of Trees)")
plt.ylabel("Validation Accuracy")
plt.show()

Returns:

Type	Description
`List[TrainingLogEntry]`	A list of `TrainingLogEntry` objects.
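
Beyond plotting, the logs can be scanned for the iteration with the best metric. The sketch below shows the idea using stand-in classes that mimic only the two attributes used in the example above (`iteration` and `evaluation.accuracy`). `FakeLogEntry`, `FakeEvaluation`, `best_iteration`, and the numbers are hypothetical and not part of YDF. With a real model, the list returned by `model.training_logs()` would take the place of `logs`, assuming the evaluation exposes an accuracy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEvaluation:
    """Stand-in for the evaluation attached to a log entry (hypothetical)."""
    accuracy: float


@dataclass
class FakeLogEntry:
    """Stand-in for `TrainingLogEntry`, with only the fields used here."""
    iteration: int
    evaluation: FakeEvaluation


def best_iteration(logs: List[FakeLogEntry]) -> FakeLogEntry:
    """Returns the entry with the highest accuracy (the earliest one on ties)."""
    return max(logs, key=lambda log: log.evaluation.accuracy)


# Made-up values, for illustration only.
logs = [
    FakeLogEntry(1, FakeEvaluation(0.81)),
    FakeLogEntry(2, FakeEvaluation(0.88)),
    FakeLogEntry(3, FakeEvaluation(0.86)),
]
best = best_iteration(logs)
```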

update_with_jax_params ¶

update_with_jax_params(params: Dict[str, Any])

Updates the model's parameters with values from a JAX fine-tuning process.

This function allows you to take a model fine-tuned in JAX (after being exported with to_jax_function(leaves_as_params=True)) and update the original YDF model object with the new parameters.

Usage example:

import ydf
import jax

# Train a model with YDF
# dataset = ...
model = ydf.GradientBoostedTreesLearner(label="l").train(dataset)

# Convert to a JAX function with learnable parameters
jax_model = model.to_jax_function(leaves_as_params=True)

# Fine-tune the parameters in JAX
# jax_model.params = my_fine_tuning_logic(jax_model.params, ...)

# Update the YDF model with the new parameters
model.update_with_jax_params(jax_model.params)

# The YDF model now reflects the fine-tuning
# model.save("/path/to/finetuned_model")

Parameters:

Name	Type	Description	Default
`params`	`Dict[str, Any]`	A dictionary of model parameters, as produced by `to_jax_function`.	required

variable_importances ¶

variable_importances() -> (
    Dict[str, List[Tuple[float, str]]]
)

Returns the variable importances (VIs) of the model.

Variable importances indicate how much each feature contributes to the model's predictions. Different VI metrics have different semantics and are generally not comparable.

The available VIs depend on the learning algorithm and its hyperparameters. For example, for Random Forest, setting compute_oob_variable_importances=True enables the computation of permutation out-of-bag VIs.

Usage example:

# Train a Random Forest and enable OOB VI computation.
learner = ydf.RandomForestLearner(
    label="species", compute_oob_variable_importances=True
)
model = learner.train(dataset)

# List available VI metrics.
print(model.variable_importances().keys())
# dict_keys(['NUM_AS_ROOT', 'SUM_SCORE', 'MEAN_DECREASE_IN_ACCURACY'])

# Get a specific VI, sorted by importance.
vi = model.variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
# [(0.0713, 'bill_length_mm'), (0.0072, 'island'), ...]

Returns:

Type	Description
`Dict[str, List[Tuple[float, str]]]`	A dictionary where keys are the names of the VI metrics and values are lists of `(importance_value, feature_name)` tuples, sorted in descending order of importance.
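
A common follow-up is to extract the names of the most important features for one metric. The sketch below assumes the documented return structure, that is, lists of `(importance_value, feature_name)` tuples already sorted by decreasing importance. The helper `top_features` is hypothetical, and the dictionary is a stand-in for `model.variable_importances()`; its third entry is invented for illustration.

```python
from typing import Dict, List, Tuple

VariableImportances = Dict[str, List[Tuple[float, str]]]


def top_features(vis: VariableImportances, metric: str, k: int) -> List[str]:
    """Returns the names of the `k` most important features for `metric`."""
    if metric not in vis:
        raise KeyError(f"{metric!r} not available; choose from {sorted(vis)}")
    # Entries are assumed to be sorted by decreasing importance already.
    return [name for _, name in vis[metric][:k]]


# Stand-in for `model.variable_importances()`; values are illustrative.
example_vis = {
    "MEAN_DECREASE_IN_ACCURACY": [
        (0.0713, "bill_length_mm"),
        (0.0072, "island"),
        (0.0031, "body_mass_g"),
    ],
}
top2 = top_features(example_vis, "MEAN_DECREASE_IN_ACCURACY", 2)
```

Checking that the metric exists before indexing matters because, as noted above, the set of available metrics depends on the learner and its hyperparameters.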