GenericModel

GenericModel(raw_model: GenericCCModel)

Abstract superclass for all YDF models.

analyze

analyze(data: InputDataset, sampling: float = 1.0, num_bins: int = 50, partial_depepence_plot: bool = True, conditional_expectation_plot: bool = True, permutation_variable_importance_rounds: int = 1, num_threads: int = 6) -> Analysis

Analyzes a model on a test dataset.

An analysis contains structural information about the model (e.g., variable importances) and information about the application of the model to the given dataset (e.g., partial dependence plots).

For a large dataset (many examples and / or features), computing the analysis can take significant time.

While some of the information might still be valid, it is generally not recommended to analyze a model on its training dataset.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
analysis = model.analyze(test_ds)

# Display the analysis in a notebook.
analysis

Parameters:

Name Type Description Default
data InputDataset

Dataset. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset.

required
sampling float

Ratio of examples to use for the analysis. The analysis can be expensive to compute. On large datasets, use a small sampling value e.g. 0.01.

1.0
num_bins int

Number of bins used to accumulate statistics. A large value increases the resolution of the plots but takes more time to compute.

50
partial_depepence_plot bool

Compute partial dependence plots, a.k.a. PDPs. Expensive to compute.

True
conditional_expectation_plot bool

Compute the conditional expectation plots a.k.a. CEP. Cheap to compute.

True
permutation_variable_importance_rounds int

If >= 1, computes permutation variable importances using "permutation_variable_importance_rounds" rounds. The more rounds, the more accurate the results. Using a single round is often acceptable, i.e. permutation_variable_importance_rounds=1. If permutation_variable_importance_rounds=0, disables the computation of permutation variable importances.

1
num_threads int

Number of threads to use to compute the analysis.

6

Returns:

Type Description
Analysis

Model analysis.
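
The idea behind permutation variable importances and the role of the number of rounds can be sketched in plain Python. This is a minimal illustration, not YDF's implementation: the helper permutation_importance, the toy_predict stand-in model, and the use of accuracy as the metric are all assumptions made for the example.

```python
import random


def permutation_importance(predict, rows, labels, feature, rounds=1, seed=0):
    """Mean drop in accuracy when `feature` is shuffled across examples."""
    if rounds <= 0:
        return None  # Mirrors "0 disables the computation".
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    drops = []
    for _ in range(rounds):
        # Shuffle one column while leaving the other features untouched.
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted))
    # Averaging over several rounds reduces the noise of a single shuffle.
    return sum(drops) / rounds


# Hypothetical stand-in for a trained model: it only looks at feature "a".
def toy_predict(row):
    return int(row["a"] > 0.5)


rows = [{"a": i / 10, "b": (i * 7 % 10) / 10} for i in range(10)]
labels = [toy_predict(r) for r in rows]

importance_a = permutation_importance(toy_predict, rows, labels, "a", rounds=5)
importance_b = permutation_importance(toy_predict, rows, labels, "b", rounds=5)
print(importance_a, importance_b)
```

Because the toy model ignores "b", shuffling that column never changes a prediction and its importance is exactly zero, while the importance of "a" depends on the shuffles drawn.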

analyze_prediction

analyze_prediction(single_example: InputDataset) -> PredictionAnalysis

Explains a single prediction of the model.

Note: To explain the model as a whole, use model.analyze instead.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")

# We want to explain the model prediction on the first test example.
selected_example = test_ds.iloc[:1]

analysis = model.analyze_prediction(selected_example)

# Display the analysis in a notebook.
analysis

Parameters:

Name Type Description Default
single_example InputDataset

Example to explain. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset.

required

Returns:

Type Description
PredictionAnalysis

Prediction explanation.

benchmark

benchmark(ds: InputDataset, benchmark_duration: float = 3, warmup_duration: float = 1, batch_size: int = 100) -> BenchmarkInferenceCCResult

Benchmark the inference speed of the model on the given dataset.

This benchmark creates batched predictions on the given dataset using the C++ API of Yggdrasil Decision Forests. Note that inference times using other APIs or on different machines will be different. A serving template for the C++ API can be generated with model.to_cpp().

Parameters:

Name Type Description Default
ds InputDataset

Dataset to perform the benchmark on.

required
benchmark_duration float

Total duration of the benchmark in seconds. Note that this number is only indicative and the actual duration of the benchmark may be shorter or longer. This parameter must be > 0.

3
warmup_duration float

Total duration of the warmup runs before the benchmark in seconds. During the warmup phase, the benchmark is run without being timed. This allows warming up caches. The benchmark will always run at least one batch for warmup. This parameter must be > 0.

1
batch_size int

Size of batches when feeding examples to the inference engines. The impact of this parameter on the results depends on the architecture running the benchmark (notably, cache sizes).

100

Returns:

Type Description
BenchmarkInferenceCCResult

Benchmark results.
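
How the warmup, benchmark duration and batch size interact can be sketched with a plain Python timing loop. This is only an illustration of the described behavior under simplifying assumptions: the real benchmark runs in the C++ API, and the helper name benchmark, the predict_batch callable and the returned dictionary are hypothetical.

```python
import time


def benchmark(predict_batch, examples, benchmark_duration=3.0,
              warmup_duration=1.0, batch_size=100):
    if benchmark_duration <= 0 or warmup_duration <= 0:
        raise ValueError("durations must be > 0")
    if not examples:
        raise ValueError("the dataset must not be empty")
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]

    def run_for(duration):
        """Runs whole passes over the batches until `duration` has elapsed."""
        num_examples = 0
        start = time.perf_counter()
        while True:
            for batch in batches:
                predict_batch(batch)
                num_examples += len(batch)
            elapsed = time.perf_counter() - start
            if elapsed >= duration:
                return num_examples, elapsed

    run_for(warmup_duration)  # Warmup: always at least one pass, not reported.
    num_examples, elapsed = run_for(benchmark_duration)
    return {"seconds_per_example": elapsed / num_examples,
            "num_examples": num_examples}


# Hypothetical stand-in for model inference on a batch.
result = benchmark(lambda batch: [x * 2 for x in batch], list(range(1000)),
                   benchmark_duration=0.05, warmup_duration=0.01)
print(result)
```

Since a pass is never interrupted, the measured run can last longer than the requested duration, which matches the note that the duration is only indicative.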

data_spec

data_spec() -> DataSpecification

Returns the data spec used to train the model.

describe

describe(output_format: Literal['auto', 'text', 'notebook', 'html'] = 'auto', full_details: bool = False) -> Union[str, HtmlNotebookDisplay]

Description of the model.

Parameters:

Name Type Description Default
output_format Literal['auto', 'text', 'notebook', 'html']

Format of the display:
  - auto: Use the "notebook" format if executed in an IPython notebook / Colab. Otherwise, use the "text" format.
  - text: Text description of the model.
  - html: HTML description of the model.
  - notebook: HTML description of the model displayed in a notebook cell.

'auto'
full_details bool

Whether to print the full model. The output can be large.

False

Returns:

Type Description
Union[str, HtmlNotebookDisplay]

The model description.

evaluate

evaluate(data: InputDataset, bootstrapping: Union[bool, int] = False) -> Evaluation

Evaluates the quality of a model on a dataset.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
evaluation = model.evaluate(test_ds)

In a notebook, if a cell returns an evaluation object, the evaluation is displayed as rich HTML with plots:

evaluation = model.evaluate(test_ds)
evaluation

Parameters:

Name Type Description Default
data InputDataset

Dataset. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset.

required
bootstrapping Union[bool, int]

Controls whether bootstrapping is used to evaluate the confidence intervals and statistical tests (i.e., all the metrics ending with "[B]"). If set to false, bootstrapping is disabled. If set to true, bootstrapping is enabled and 2000 bootstrapping samples are used. If set to an integer, it specifies the number of bootstrapping samples to use. In this case, if the number is less than 100, an error is raised as bootstrapping will not yield useful results.

False

Returns:

Type Description
Evaluation

Model evaluation.
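
To make the bootstrapping option more concrete, here is a minimal sketch of a percentile bootstrap confidence interval for accuracy. It is an illustration of the general technique only: the helper bootstrap_accuracy_ci, the 95% confidence level and the percentile method are assumptions and may differ from what YDF computes for the "[B]" metrics.

```python
import random


def bootstrap_accuracy_ci(labels, predictions, num_samples=2000,
                          confidence=0.95, seed=0):
    if num_samples < 100:
        # Mirrors the documented error for too few bootstrapping samples.
        raise ValueError("at least 100 bootstrapping samples are needed")
    rng = random.Random(seed)
    correct = [int(y == p) for y, p in zip(labels, predictions)]
    n = len(correct)
    # Each bootstrap sample redraws n examples with replacement.
    accuracies = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(num_samples))
    tail = (1.0 - confidence) / 2.0
    lower = accuracies[int(tail * num_samples)]
    upper = accuracies[int((1.0 - tail) * num_samples) - 1]
    return lower, upper


# Toy data: 100 examples, 20 of which are mispredicted (accuracy 0.8).
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 10
predictions = list(labels)
for i in range(20):
    predictions[i] = 1 - predictions[i]

lower, upper = bootstrap_accuracy_ci(labels, predictions)
print(lower, upper)
```

With more bootstrap samples the interval bounds become more stable, at the cost of more computation.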

force_engine

force_engine(engine_name: Optional[str]) -> None

Forces the engines used by the model.

If not specified (i.e., None; default value), the fastest compatible engine (i.e., the first value returned from "list_compatible_engines") is used for all model inferences (e.g., model.predict, model.evaluate).

If passing a non-existing or non-compatible engine, the next model inference (e.g., model.predict, model.evaluate) will fail.

Parameters:

Name Type Description Default
engine_name Optional[str]

Name of a compatible engine or None to automatically select the fastest engine.

required

hyperparameter_optimizer_logs

hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]

Returns the logs of the hyper-parameter tuning.

If the model is not trained with hyper-parameter tuning, returns None.

input_feature_names

input_feature_names() -> List[str]

Returns the names of the input features.

The features are sorted in increasing order of column_idx.

input_features

input_features() -> Sequence[InputFeature]

Returns the input features of the model.

The features are sorted in increasing order of column_idx.

label

label() -> str

Name of the label column.

label_classes

label_classes() -> List[str]

Returns the label classes for classification tasks, None otherwise.

list_compatible_engines

list_compatible_engines() -> Sequence[str]

Lists the inference engines compatible with the model.

The engines are sorted from likely-fastest to likely-slowest.

Returns:

Type Description
Sequence[str]

List of compatible engines.

metadata

metadata() -> ModelMetadata

Metadata associated with the model.

A model's metadata contains information stored with the model that does not influence the model's predictions (e.g. the creation date). When distributing a model for wide release, it may be useful to clear / modify the model metadata with model.set_metadata(ydf.ModelMetadata()).

Returns:

Type Description
ModelMetadata

The model's metadata.

name

name() -> str

Returns the name of the model type.

predict

predict(data: InputDataset) -> ndarray

Returns the predictions of the model on the given dataset.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
predictions = model.predict(test_ds)

Parameters:

Name Type Description Default
data InputDataset

Dataset. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset. If the dataset contains the label column, that column is ignored.

required

save

save(path, advanced_options=ModelIOOptions()) -> None

Save the model to disk.

YDF uses a proprietary model format for saving models. A model consists of multiple files located in the same directory. A directory should only contain a single YDF model. See advanced_options for more information.

YDF models can also be exported to other formats, see to_tensorflow_saved_model() and to_cpp() for details.

YDF saves some metadata inside the model, see model.metadata() for details. Before distributing a model to the world, consider removing metadata with model.set_metadata(ydf.ModelMetadata()).

Usage example:

import pandas as pd
import ydf

# Train a Random Forest model
df = pd.read_csv("my_dataset.csv")
model = ydf.RandomForestLearner(label="label").train(df)

# Save the model to disk
model.save("/models/my_model")

Parameters:

Name Type Description Default
path

Path to directory to store the model in.

required
advanced_options

Advanced options for saving models.

ModelIOOptions()

self_evaluation

self_evaluation() -> Evaluation

Returns the model's self-evaluation.

Different models use different methods for self-evaluation. Notably, Random Forests use OOB evaluation and Gradient Boosted Trees use evaluation on the validation dataset. Therefore, self-evaluations are not comparable between different model types.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

self_evaluation = model.self_evaluation()
# In an interactive Python environment, print a rich evaluation report.
self_evaluation

set_metadata

set_metadata(metadata: ModelMetadata)

Sets the model metadata.

task

task() -> Task

Task solved by the model.

to_cpp

to_cpp(key: str = 'my_model') -> str

Generates the code of a .h file to run the model in C++.

How to use this function:

  1. Copy the output of this function into a new .h file:
       open("model.h", "w").write(model.to_cpp())
  2. If you use Bazel/Blaze, create a rule with the dependencies:
       //third_party/absl/status:statusor
       //third_party/absl/strings
       //external/ydf_cc/yggdrasil_decision_forests/api:serving
  3. In your C++ code, include the .h file and call the model with:
       // Load the model (to do only once).
       namespace ydf = yggdrasil_decision_forests;
       const auto model = ydf::exported_model_123::Load();
       // Run the model
       predictions = model.Predict();
  4. The generated "Predict" function takes no inputs. Instead, it fills the input features with placeholder values. Therefore, you will want to add your input as arguments to the "Predict" function, and use it to populate the "examples->Set..." section accordingly.
  5. (Bonus) You can further optimize the inference speed by pre-allocating and re-using the examples and predictions for each thread running the model.

This documentation is also available in the header of the generated content for more details.

Parameters:

Name Type Description Default
key str

Name of the model. Used to define the C++ namespace of the model.

'my_model'

Returns:

Type Description
str

String containing an example header for running the model in C++.

to_tensorflow_function

to_tensorflow_function(temp_dir: Optional[str] = None, can_be_saved: bool = True, squeeze_binary_classification: bool = True) -> Module

Converts the YDF model into a @tf.function callable TensorFlow Module.

The output module can be composed with other TensorFlow operations, including other models serialized with to_tensorflow_function.

This function requires TensorFlow and TensorFlow Decision Forests to be installed. You can install them using the command pip install tensorflow_decision_forests. The generated SavedModel model relies on the TensorFlow Decision Forests Custom Inference Op. This Op is available by default in various platforms such as Servomatic, TensorFlow Serving, Vertex AI, and TensorFlow.js.

Usage example:

!pip install tensorflow_decision_forests

import ydf
import numpy as np
import tensorflow as tf

# Train a model.
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100),
    "l": np.random.randint(2, size=100),
})

# Convert model to a TF module.
tf_model = model.to_tensorflow_function()

# Make predictions with the TF module.
tf_predictions = tf_model({
    "f1": tf.constant([0, 0.5, 1]),
    "f2": tf.constant([1, 0, 0.5]),
})

Parameters:

Name Type Description Default
temp_dir Optional[str]

Temporary directory used during the conversion. If None (default), uses tempfile.mkdtemp default temporary directory.

None
can_be_saved bool

If can_be_saved = True (default), the returned module can be saved using tf.saved_model.save. In this case, files created in the temporary directory during the conversion are not removed when to_tensorflow_function exits, and those files should still be present when calling tf.saved_model.save. If can_be_saved = False, the files created in the temporary directory during conversion are immediately removed, and the returned object cannot be serialized with tf.saved_model.save.

True
squeeze_binary_classification bool

If true (default), in case of binary classification, outputs a tensor of shape [num examples] containing the probability of the positive class. If false, in case of binary classification, outputs a tensor of shape [num examples, 2] containing the probability of both the negative and positive classes. Has no effect on non-binary classification models.

True

Returns:

Type Description
Module

A TensorFlow @tf.function.
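
The two output shapes selected by squeeze_binary_classification can be related with a small plain-Python sketch. It assumes that the two columns of the unsqueezed output are ordered (negative, positive) and sum to one; the helper unsqueeze_binary is hypothetical and only illustrates the shapes.

```python
def unsqueeze_binary(positive_probabilities):
    """Expands shape [num examples] into shape [num examples, 2]."""
    return [[1.0 - p, p] for p in positive_probabilities]


# Squeezed output: one positive-class probability per example.
squeezed = [0.25, 0.5, 0.75]

# Unsqueezed output: probabilities of the negative and positive classes.
full = unsqueeze_binary(squeezed)
print(full)  # [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]
```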

to_tensorflow_saved_model

to_tensorflow_saved_model(path: str, input_model_signature_fn: Any = None, *, mode: Literal['keras', 'tf'] = 'keras', feature_dtypes: Dict[str, TFDType] = {}, servo_api: bool = False, feed_example_proto: bool = False, pre_processing: Optional[Callable] = None, post_processing: Optional[Callable] = None, temp_dir: Optional[str] = None) -> None

Exports the model as a TensorFlow SavedModel.

This function requires TensorFlow and TensorFlow Decision Forests to be installed. Install them by running the command pip install tensorflow_decision_forests. The generated SavedModel model relies on the TensorFlow Decision Forests Custom Inference Op. This Op is available by default in various platforms such as Servomatic, TensorFlow Serving, Vertex AI, and TensorFlow.js.

Usage example:

!pip install tensorflow_decision_forests

import ydf
import numpy as np
import tensorflow as tf

# Train a model.
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100).astype(dtype=np.float32),
    "l": np.random.randint(2, size=100),
})

# Export the model to the TensorFlow SavedModel format.
# The model can be executed with Servomatic, TensorFlow Serving and
# Vertex AI.
model.to_tensorflow_saved_model(path="/tmp/my_model", mode="tf")

# The model can also be loaded in TensorFlow and executed locally.

# Load the TensorFlow Saved model.
tf_model = tf.saved_model.load("/tmp/my_model")

# Make predictions
tf_predictions = tf_model({
    "f1": tf.constant(np.random.random(size=10)),
    "f2": tf.constant(np.random.random(size=10), dtype=tf.float32),
})

TensorFlow SavedModels do not automatically cast feature values. For instance, a model trained with a dtype=float32 semantic=numerical feature requires this feature to be fed as float32 numbers during inference. You can override the dtype of a feature with the feature_dtypes argument:

model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    # "f1" is fed as a tf.int64 instead of a tf.float64
    feature_dtypes={"f1": tf.int64},
)

The SavedModel format allows for custom preprocessing and postprocessing computation in addition to the model inference. Such computation can be specified with the pre_processing and post_processing arguments:

def pre_processing(features):
  features = features.copy()
  features["f1"] = features["f1"] * 2
  return features

model.to_tensorflow_saved_model(
    path="/tmp/my_model",
    mode="tf",
    pre_processing=pre_processing,
)

For more complex combinations, such as composing multiple models, use the method to_tensorflow_function instead of to_tensorflow_saved_model.

Parameters:

Name Type Description Default
path str

Path to store the TensorFlow Decision Forests model.

required
input_model_signature_fn Any

A lambda that returns the (Dense,Sparse,Ragged)TensorSpec (or structure of TensorSpec e.g. dictionary, list) corresponding to the input signature of the model. If not specified, the input model signature is created by tfdf.keras.build_default_input_model_signature. For example, specify input_model_signature_fn if a numerical input feature (which is consumed as DenseTensorSpec(float32) by default) will be fed differently (e.g. RaggedTensor(int64)). Only compatible with mode="keras".

None
mode Literal['keras', 'tf']

How the YDF model is converted into a TensorFlow SavedModel:
  1) mode = "keras" (default): Turn the model into a Keras 2 model using TensorFlow Decision Forests, and then save it with tf_keras.models.save_model.
  2) mode = "tf" (recommended; will become default): Turn the model into a TensorFlow Module, and save it with tf.saved_model.save.

'keras'
feature_dtypes Dict[str, TFDType]

Mapping from feature name to TensorFlow dtype. Use this mapping to override the dtype of features. For instance, numerical features are encoded with tf.float32 by default. If you plan on feeding tf.float64 or tf.int32, use feature_dtypes to specify it. Only compatible with mode="tf".

{}
servo_api bool

If true, adds a SavedModel signature to make the model compatible with the Classify or Regress servo APIs. Only compatible with mode="tf". If false, outputs the raw model predictions.

False
feed_example_proto bool

If false, the model expects the input features to be provided as TensorFlow values. This is the most efficient way to make predictions. If true, the model expects the input features to be provided as a binary serialized TensorFlow Example proto. This is the format expected by VertexAI and most TensorFlow Serving pipelines.

False
pre_processing Optional[Callable]

Optional TensorFlow function or module to apply on the input features before applying the model. Only compatible with mode="tf".

None
post_processing Optional[Callable]

Optional TensorFlow function or module to apply on the model predictions. Only compatible with mode="tf".

None
temp_dir Optional[str]

Temporary directory used during the conversion. If None (default), uses tempfile.mkdtemp default temporary directory.

None

variable_importances

variable_importances() -> Dict[str, List[Tuple[float, str]]]

Variable importances to measure the impact of features on the model.

Variable importances generally indicate how much a variable (feature) contributes to the model predictions or quality. Different variable importances have different semantics and are generally not comparable.

The variable importances returned by variable_importances() depend on the learning algorithm and its hyper-parameters. For example, the hyper-parameter compute_oob_variable_importances=True of the Random Forest learner enables the computation of permutation out-of-bag variable importances.

Features are sorted by decreasing importance.

Usage example:

# Train a Random Forest. Enable the computation of OOB (out-of-bag) variable
# importances.
model = ydf.RandomForestLearner(compute_oob_variable_importances=True,
                                label=...).train(ds)
# List the available variable importances.
print(model.variable_importances().keys())

# Show a specific variable importance.
model.variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
>> [(0.0713061951754389, "bill_length_mm"),
    (0.007298519736842035, "island"),
    (0.004505893640351366, "flipper_length_mm"),
...

Returns:

Type Description
Dict[str, List[Tuple[float, str]]]

Variable importances.
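
Working with the returned structure can be sketched as follows. The example follows the declared return type Dict[str, List[Tuple[float, str]]], i.e. (importance, feature name) pairs, and reuses the values shown in the usage example; the helper top_features is hypothetical and not part of YDF.

```python
from typing import Dict, List, Tuple

VariableImportances = Dict[str, List[Tuple[float, str]]]


def top_features(importances: VariableImportances, kind: str,
                 k: int = 2) -> List[str]:
    """Names of the `k` most important features for one importance kind."""
    if kind not in importances:
        raise KeyError(f"{kind!r} not available; got {sorted(importances)}")
    # Sort defensively by decreasing importance before truncating.
    ranked = sorted(importances[kind], reverse=True)
    return [name for _, name in ranked[:k]]


# Hand-written stand-in for the output of model.variable_importances().
example: VariableImportances = {
    "MEAN_DECREASE_IN_ACCURACY": [
        (0.0713061951754389, "bill_length_mm"),
        (0.007298519736842035, "island"),
        (0.004505893640351366, "flipper_length_mm"),
    ],
}

best = top_features(example, "MEAN_DECREASE_IN_ACCURACY")
print(best)  # ['bill_length_mm', 'island']
```

Checking that the requested importance kind is present matters because the set of available importances depends on the learner and its hyper-parameters.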