Migrating to YDF¶

YDF and TensorFlow Decision Forests (TF-DF) are both front-ends to the same high-performance C++ implementation of Decision Forest algorithms. Both libraries are developed by the same team and use the same training code, so models trained by either library are generally equivalent. The few exceptions are listed in "Model training - the sharp bits" below.

YDF is the successor of TF-DF. It is significantly more feature-rich, more efficient, and easier to use than TF-DF.

Benefits at a glance¶

	YDF	TensorFlow Decision Forests
Model description	`model.describe()` produces a rich model description as an HTML or text report.	`model.summary()` produces a less complete text report and does not work on a model loaded from disk.
Model evaluation	`model.evaluate(ds)` evaluates a model and returns a rich model evaluation report. Metrics can also be accessed programmatically.	Each evaluation metric needs to be configured and run manually with `model.compile()` and `model.evaluate()`. No evaluation report. No confidence intervals. No metrics for ranking and uplifting models.
Model analysis	`model.analyze(ds)` produces a rich model analysis HTML report.	Not available
Model benchmarking	`model.benchmark(ds)` measures and reports the model inference speed.	Not available
Custom losses	Available for training Gradient Boosted Trees.	Not available
Cross-validation	`learner.cross_validation(ds)` performs a cross-validation and returns a rich model evaluation report.	Not available
Python model serving	`model.predict(ds)` makes predictions.	`model.predict(ds)` works sometimes. However, because of limitations in the TensorFlow SavedModel format, calling `model.predict(ds)` on a model loaded from disk might require signature engineering.
Other model serving	Models are directly available in C++, Python, CLI, Go and JavaScript. You can also use utilities to generate serving code: for example, call `model.to_cpp()` to generate C++ serving code. Models can be exported to a TensorFlow SavedModel with `model.to_tensorflow_saved_model(path)`.	Call `model.save(path, signature)` to generate a TensorFlow SavedModel, and use the TensorFlow C++ API to run the model in C++. Alternatively, export the model to YDF.
Training and inference speed	On small datasets, training is up to 5x faster than with TensorFlow Decision Forests. On all dataset sizes, model inference is up to 1000x faster than with TensorFlow Decision Forests.	On small datasets, most of the time is spent in TensorFlow dataset reading.
Library size	The YDF library is smaller than 10MB.	The TF-DF library is small, but it requires TensorFlow which is ~600MB.
Error messages	Short, high-level and actionable error messages.	Long, hard-to-understand error messages, often about tensor shapes.

Do I have to migrate?¶

TensorFlow Decision Forests will continue to be supported and users are not required to migrate their pipelines! If TF-DF and Keras work well for you, feel free to stay with TF-DF. Our team will continue to release new versions and support users through our various support channels.

For more information, check out the FAQ.

Outline¶

This guide has three parts:

Migrating your TF-DF training, inference and evaluation pipeline.
Importing and exporting your existing TF-DF models.
Advanced topics: Inspection, Building, Tuning and Distributed Training

This guide does not cover every configuration detail of YDF. See https://ydf.readthedocs.org for the full documentation.

Setup¶

To use YDF, install the corresponding Python package from PyPI.

!pip install ydf

import ydf
import pandas as pd
import numpy as np

# Check the version of the packages
print("Found YDF v" + ydf.__version__)

Migrating a training pipeline¶

This section goes through a simple training / evaluation pipeline in YDF.

Model training¶

YDF and TF-DF have the same hyperparameters and the same default values, so most training pipelines can be migrated easily.

Summary of changes¶

The comparison below shows the differences between the two training pipelines side-by-side.

TF-DF

# Install TF-DF
!pip install tensorflow tensorflow_decision_forests

import tensorflow_decision_forests as tfdf
import tensorflow as tf
import pandas as pd

# Load a dataset with Pandas
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Convert the dataset to a TensorFlow Dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="my_label")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="my_label")

# Train a model
model = tfdf.keras.RandomForestModel(num_trees=500)
model.fit(train_ds)

# Evaluate the model. The metrics must be passed by keyword: the first
# positional argument of compile() is the optimizer.
model.compile(metrics=[tf.keras.metrics.SparseCategoricalAccuracy(), tf.keras.metrics.AUC()])
model.evaluate(test_ds)

# Save the model
model.save("project/model")

YDF

# Install YDF
!pip install ydf

import ydf
import pandas as pd

# Load a dataset with Pandas
train_ds = pd.read_csv("train.csv")
test_ds = pd.read_csv("test.csv")

# Train a model
model = ydf.RandomForestLearner(label="my_label", num_trees=500).train(train_ds)

# Evaluate a model (e.g. roc, accuracy, confusion matrix, confidence intervals)
model.evaluate(test_ds)

# Save the model
model.save("/tmp/my_model")

	YDF	TF-DF
Dataset support	Pandas DataFrame, tf.data.Dataset, NumPy arrays, CSV files	TensorFlow Datasets, DataFrame via `tfdf.keras.pd_dataframe_to_tf_dataset()`
Model training	`ydf.RandomForestLearner(label=label).train(train_ds_pd)`	`model = tfdf.keras.RandomForestModel()` `model.fit(train_ds)`
Output verbosity	Global setting `ydf.verbose(2)`	Per-model setting `verbose=2` in the model constructor.
Model compilation	Not necessary	`model.compile()` needed for additional metrics.
Hyperparameters	Set on the learner. Same names and defaults as in TF-DF.	Set on the model.
Label column	Argument `label=` on the learner	Second "channel" of the input dataset
Example weights	Argument `weights=` on the learner	Third "channel" of the input dataset
Model task	Argument `task=ydf.Task.REGRESSION` on the learner	Argument `task=tfdf.keras.Task.REGRESSION` on the model

Next, we run the YDF training code in a real example.

ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"

# Download and load the dataset as Pandas DataFrames
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Name of the label column.
label = "income"

# Show extended logs.
ydf.verbose(2)

# Train a Random Forest model with a simple hyperparameter
model = ydf.RandomForestLearner(label=label, num_trees=50).train(train_ds)

# Make predictions with the model
predictions = model.predict(test_ds)

# Show a summary of the model
model.describe()

Model training - the sharp bits¶

YDF does not automatically tokenize string columns for use with column type CATEGORICAL_SET. Users need to provide their own tokenization if this column type should be used.
TF-DF often transforms categorical columns to integers, while YDF does not. The models trained by TF-DF and YDF may therefore differ, even if trained with the same hyperparameters on the same datasets.
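
To illustrate the first point, here is a minimal sketch of a user-provided tokenization step done with Pandas before training. The `genres` column, the toy data and the `tokenize` helper are hypothetical, and splitting on whitespace is only one possible choice. How the resulting column is then declared as CATEGORICAL_SET on the learner is not shown here; check the YDF documentation for the exact argument.

```python
from typing import List

import pandas as pd

# Hypothetical toy dataset: "genres" is a free-text column.
df = pd.DataFrame({
    "genres": ["Action Comedy", "drama", "", "Comedy  Romance"],
    "label": [1, 0, 0, 1],
})


def tokenize(text: str) -> List[str]:
    """Lowercases the text and splits it on runs of whitespace."""
    return text.lower().split()


# Replace each string by its list of tokens. An empty string becomes an empty list.
df["genres"] = df["genres"].apply(tokenize)

print(df["genres"].tolist())
```

Each value of the column is now a list of tokens rather than a single string, which is the shape a set-valued feature needs.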

Model evaluation, analysis and storage¶

YDF offers more advanced model evaluation and analysis functionality.

Summary of changes¶

	YDF	TF-DF
Evaluation	`model.evaluate()` shows rich plots and many metrics	`model.evaluate()` shows few metrics, no plots
Self-Evaluation	`model.self_evaluation()`	`model.make_inspector().evaluation()`
Model format	YDF format. Export to SavedModel is possible	TensorFlow SavedModel
Model loading	`ydf.load_model()`. Loaded models are equivalent to the original model.	`tf_keras.models.load_model()`. Loaded models are inference-only.
Variable Importances	`model.variable_importances()`	`model.make_inspector().variable_importances()`
Model analysis	`model.analyze(test_ds)`	Not available
Serving with TF Serving	Available with `model.to_tensorflow_saved_model()`	Available by default
Model inspector	Not required (functionality is on the model)	Required for many tasks

Model Evaluation and Self-Evaluation¶

A model can be evaluated on a test dataset.

As a quicker but less reliable alternative, YDF models also provide a self-evaluation. The exact logic of the self-evaluation depends on the model. For example, a Random Forest uses out-of-bag evaluation, while Gradient Boosted Trees use an internal train/validation split.
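
For intuition about why out-of-bag evaluation needs no separate test set, the following sketch simulates bootstrap sampling with NumPy. It is a conceptual illustration of bagging in general, not YDF's internal code, and the dataset size, number of trees and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
num_examples = 1000  # assumed dataset size
num_trees = 50       # assumed number of trees

# oob_counts[i] is the number of trees that did NOT see example i during training.
oob_counts = np.zeros(num_examples, dtype=int)
for _ in range(num_trees):
    # Each tree trains on a bootstrap sample: num_examples draws with replacement.
    sampled = rng.integers(0, num_examples, size=num_examples)
    in_bag = np.zeros(num_examples, dtype=bool)
    in_bag[sampled] = True
    oob_counts += ~in_bag

# Average share of the dataset that is out-of-bag for a single tree.
oob_fraction = oob_counts.mean() / num_trees
# Examples that can be scored by at least one tree that never saw them.
examples_with_oob_tree = int((oob_counts > 0).sum())

print(f"Average out-of-bag fraction per tree: {oob_fraction:.3f}")
print(f"Examples with at least one out-of-bag tree: {examples_with_oob_tree}/{num_examples}")
```

Roughly a third of the examples are left out of each bootstrap sample, so every example can be predicted by the trees that were not trained on it. Only a subset of the trees votes on each example, which is one reason the estimate is less reliable than an evaluation on held-out test data.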

# In Colab, this prints a rich evaluation object.
model.evaluate(test_ds)

# Self-evaluation is often good, though it tends to be lower quality than evaluation on test data
model.self_evaluation()

Saving and Loading¶

The model can be saved to the YDF format for later re-use. For compatibility with TF Serving and other parts of the TensorFlow ecosystem, see the section "Import from / Export to TensorFlow" below.

model.save("/tmp/my_ydf_model")

If you reload the model, it is functionally equivalent to the original model.

model_reloaded = ydf.load_model("/tmp/my_ydf_model")
model_reloaded.describe()

Import from / Export to TensorFlow¶

YDF models can be exported to TensorFlow, e.g. for serving with TF Serving. See the TF Serving tutorial for more details on exporting to TensorFlow.

# Exporting requires TF-DF installed.
# !pip install tensorflow_decision_forests
model.to_tensorflow_saved_model("/tmp/my_tensorflow_saved_model")

TF-DF models can be imported to YDF. The imported model is generally equivalent to the original model and should return the same predictions. As the main difference, categorical columns in the imported model must be provided as strings instead of integers.

Note that only TF-DF models containing a single Decision Forest (e.g. a Random Forest or a Gradient Boosted Trees model) can be imported into YDF. Other parts of the model graph (e.g. neural networks) cannot be imported.

# Import the TF-DF model. Provide its top-level directory containing the saved_model.pb file.
model_from_tfdf = ydf.from_tensorflow_decision_forests("/tmp/my_tensorflow_saved_model")
model_from_tfdf.describe()
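
Because an imported model expects categorical columns as strings, data that was fed to TF-DF with integer-coded categories needs a small conversion before it is passed to the imported model. The sketch below shows one way to do this with Pandas. The column names, the toy values and the `categorical_columns` list are hypothetical, and you need to know which columns the original model treated as categorical.

```python
import pandas as pd

# Hypothetical inference data. Assume "education_num" was treated as an
# integer-coded categorical feature by the original TF-DF model.
df = pd.DataFrame({"age": [39, 50], "education_num": [13, 9]})

# Hypothetical list of the columns the TF-DF model considered categorical.
categorical_columns = ["education_num"]

# Work on a copy so the original DataFrame is left untouched.
df_for_ydf = df.copy()
for column in categorical_columns:
    # Convert the integer codes to their string representation, e.g. 13 -> "13".
    df_for_ydf[column] = df_for_ydf[column].astype(str)

print(df_for_ydf["education_num"].tolist())
```

Numerical columns such as `age` keep their original type. Only the listed categorical columns are converted.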

Model Analysis¶

YDF can compute a detailed model analysis report on a test dataset, including more advanced variable importances.

# Create a rich analysis report
model.analyze(test_ds)

Advanced topics: Inspection, Building, Tuning and Distributed Training¶

YDF and TF-DF support a number of advanced features. This guide only outlines the most important changes when transitioning from TF-DF to YDF. For more information, please refer to the tutorials on https://ydf.readthedocs.org.

Model inspector and builder¶

YDF gives users more powerful methods to inspect and modify models than TF-DF. These methods are now located directly on the model and are much faster than the ones exposed in TF-DF. The inspector and builder components from TF-DF are no longer necessary.

# Print a tree
model.print_tree(tree_idx=0)

# Structural variable importances are available programmatically.
model.variable_importances()

# Access a tree directly
tree = model.get_tree(tree_idx=0)

tree

# Modify the tree and add it to the model
if isinstance(tree.root.condition, ydf.tree.CategoricalIsInCondition):
  tree.root.condition.mask = [1]
if isinstance(tree.root.condition, ydf.tree.NumericalHigherThanCondition):
  tree.root.condition.threshold = 18.22
print(tree)
model.add_tree(tree)
# Print the tree that was just appended (the last tree of the model).
model.print_tree(tree_idx=model.num_trees()-1)

Hyperparameter tuning¶

Hyperparameter tuning with YDF is very similar to hyperparameter tuning with TF-DF. Simply change tfdf.tuner.RandomSearch() to ydf.RandomSearchTuner() and pass the tuner to the learner through its tuner argument. YDF then runs the same tuning algorithm with the same parameters.

The Keras Tuner is not supported by YDF.

# Decrease verbosity to avoid long logs
ydf.verbose(1)

# Define the tuner with some options.
tuner = ydf.RandomSearchTuner(num_trials=20)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])
tuner.choice("max_depth", [3, 4, 5, 6])

# Train a model with the tuner
model_tuned = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100, # Used for all the trials.
    tuner=tuner,
).train(train_ds)

# See the "Tuning" tab in the description for details.
model_tuned.describe()
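
For intuition about what a random search over such choices does, here is a plain-Python sketch of the general technique. It is not YDF's implementation: the `toy_score` function is a hypothetical stand-in for "train a model with these hyperparameters and measure its validation quality", and the seed is arbitrary.

```python
import random

# Same discrete search space as the tuner options above.
search_space = {
    "shrinkage": [0.2, 0.1, 0.05],
    "subsample": [1.0, 0.9, 0.8],
    "max_depth": [3, 4, 5, 6],
}


def toy_score(params):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (
        -abs(params["shrinkage"] - 0.1)
        - 0.01 * abs(params["max_depth"] - 5)
        + 0.01 * params["subsample"]
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
num_trials = 20

trials = []
for _ in range(num_trials):
    # Draw one value per hyperparameter, independently and uniformly at random.
    params = {name: rng.choice(values) for name, values in search_space.items()}
    trials.append((toy_score(params), params))

# Keep the candidate with the highest score.
best_score, best_params = max(trials, key=lambda trial: trial[0])
print("Best hyperparameters found:", best_params)
```

With 3 x 3 x 4 = 36 possible combinations and 20 trials, the search only explores part of the space, which is the usual trade-off of random search: the cost is bounded by the number of trials rather than by the size of the space.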

Distributed Training / Tuning¶

Distributed training in YDF requires the dataset to be provided as a sequence of paths to dataset files, which the individual workers open. See the YDF distributed training tutorial for details. Distributed training from a finite TensorFlow distributed dataset is not supported in YDF.
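
As a minimal sketch of preparing such a sequence of paths, the code below splits a DataFrame into several CSV shards with Pandas. The toy data, the number of shards and the file naming scheme are assumptions made for illustration. How the paths are passed to the learner, and any format prefix they may need, is described in the YDF distributed training tutorial.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for a large training set.
df = pd.DataFrame({"feature": np.arange(10), "label": np.arange(10) % 2})

num_shards = 3  # assumed number of shards
output_dir = tempfile.mkdtemp()

shard_paths = []
for shard_idx in range(num_shards):
    # Round-robin split: shard k receives rows k, k + num_shards, k + 2 * num_shards, ...
    shard = df.iloc[shard_idx::num_shards]
    path = os.path.join(output_dir, f"train-{shard_idx:05d}-of-{num_shards:05d}.csv")
    shard.to_csv(path, index=False)
    shard_paths.append(path)

print(shard_paths)
```

Each row ends up in exactly one shard, so workers that each open a different file never read the same example twice.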

Closing remarks¶

The Google Decision Forests team wants to make the migration from TF-DF to YDF as easy as possible. If you have any questions, suggestions, issues or success stories, please contact us at decision-forests-contact@google.com.