Migrating to YDF
YDF and TensorFlow Decision Forests (TF-DF) are both front-ends to the same high-performance C++ implementation of Decision Forests algorithms. Both libraries are developed by the same team and use the same training code, which means that models trained by either library will be identical.
YDF is the successor of TF-DF. It is significantly more feature-rich, more efficient, and easier to use than TF-DF.
Benefits at a glance
Feature | YDF | TensorFlow Decision Forests |
---|---|---|
Model description | model.describe() produces a rich model description as an HTML or text report. | model.summary() produces a less complete text report, and does not work on a model loaded from disk. |
Model evaluation | model.evaluate(ds) evaluates a model and returns a rich model evaluation report. Metrics can also be accessed programmatically. | Each evaluation metric needs to be configured and run manually with model.compile() and model.evaluate(). No evaluation report. No confidence intervals. No metrics for ranking and uplifting models. |
Model analysis | model.analyze(ds) produces a rich model analysis HTML report. | Not available |
Model benchmarking | model.benchmark(ds) measures and reports the model inference speed. | Not available |
Custom losses | Available for training Gradient Boosted Trees. | Not available |
Cross-validation | learner.cross_validation(ds) performs a cross-validation and returns a rich model evaluation report. | Not available |
Python model serving | model.predict(ds) makes predictions. | model.predict(ds) works sometimes. However, because of limitations in the TensorFlow SavedModel format, calling model.predict(ds) on a model loaded from disk might require signature engineering. |
Other model serving | Models are directly available in C++, Python, CLI, Go and JavaScript. You can also use utilities to generate serving code: for example, call model.to_cpp() to generate C++ serving code. Models can be exported to a TensorFlow SavedModel with model.to_tensorflow_saved_model(path). | Call model.save(path, signature) to generate a TensorFlow SavedModel, and use the TensorFlow C++ API to run the model in C++. Alternatively, export the model to YDF. |
Training speed | On a small dataset, training is up to 5x faster than TensorFlow Decision Forests. On all dataset sizes, model inference is up to 1000x faster than TensorFlow Decision Forests. | On a small dataset, most of the time is spent in TensorFlow dataset reading. |
Library size | The YDF library is smaller than 10MB. | The TF-DF library is small, but it requires TensorFlow which is ~600MB. |
Error messages | Short, high-level and actionable error messages. | Long and hard-to-understand error messages, often about Tensor shapes. |
Do I have to migrate?
TensorFlow Decision Forests will continue to be supported and users are not required to migrate their pipelines! If TF-DF and Keras work well for you, feel free to stay with TF-DF. Our team will continue to release new versions and support users through our various support channels.
For more information, check out the FAQ.
Outline
This guide has three parts:
- Migrating your TF-DF training, inference and evaluation pipeline.
- Importing and exporting your existing TF-DF models.
- Advanced topics: Inspection, Building, Tuning and Distributed Training.
This guide does not cover every configuration detail of YDF. See https://ydf.readthedocs.org for the full documentation.
!pip install ydf
import ydf
import pandas as pd
import numpy as np
# Check the version of the packages
print("Found YDF v" + ydf.__version__)
Migrating a training pipeline
This section goes through a simple training / evaluation pipeline in YDF.
Model training
YDF and TF-DF have the same hyperparameters and the same default values, so most training pipelines can be migrated easily.
Summary of changes
The comparison below shows the differences between the two training pipelines side-by-side.
TF-DF | YDF |
---|---|
# Install TF-DF
!pip install tensorflow tensorflow_decision_forests
import tensorflow_decision_forests as tfdf
import tensorflow as tf
import pandas as pd
# Load a dataset with Pandas
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Convert the dataset to a TensorFlow Dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="my_label")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="my_label")
# Train a model
model = tfdf.keras.RandomForestModel(num_trees=500)
model.fit(train_ds)
# Evaluate model.
model.compile([tf.keras.metrics.SparseCategoricalAccuracy(), tf.keras.metrics.AUC()])
model.evaluate(test_ds)
# Save the model
model.save("project/model")
|
# Install YDF
!pip install ydf
import ydf
import pandas as pd
# Load a dataset with Pandas
train_ds = pd.read_csv("train.csv")
test_ds = pd.read_csv("test.csv")
# Train a model
model = ydf.RandomForestLearner(label="my_label", num_trees=500).train(train_ds)
# Evaluate a model (e.g. roc, accuracy, confusion matrix, confidence intervals)
model.evaluate(test_ds)
# Save the model
model.save("/tmp/my_model")
|
Aspect | YDF | TF-DF |
---|---|---|
Dataset support | Pandas DataFrame, tf.data.Dataset, NumPy arrays, CSV files | TensorFlow Datasets, DataFrame via tfdf.keras.pd_dataframe_to_tf_dataset() |
Model training | ydf.RandomForestLearner(label=label).train(train_ds_pd) | model = tfdf.keras.RandomForestModel() followed by model.fit(train_ds) |
Output verbosity | Global setting ydf.verbose(2) | Per-model setting verbose=2 in the model constructor. |
Model compilation | Not necessary | model.compile() needed for additional metrics. |
Hyperparameters | Set on the learner. Same names and defaults as in TF-DF. | Set on the model. |
Label column | Argument label= on the learner | Second "channel" of the input dataset |
Example weights | Argument weights= on the learner | Third "channel" of the input dataset |
Model task | Argument task=ydf.Task.REGRESSION on the learner | Argument task=tfdf.keras.Task.REGRESSION on the model |
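The label and weight rows above imply a small data reshaping step: TF-DF receives features, labels and weights as separate "channels", while YDF expects them as ordinary columns that are named on the learner. The following is a minimal sketch of that reshaping using plain Python containers. The helper `channels_to_columns` and the column names `my_label` and `my_weight` are hypothetical and are not part of either library.

```python
def channels_to_columns(features, labels, weights=None,
                        label_name="my_label", weight_name="my_weight"):
    """Merges TF-DF style (features, labels, weights) channels into one column dict."""
    columns = dict(features)  # shallow copy, the input dict is left untouched
    if label_name in columns or (weights is not None and weight_name in columns):
        raise ValueError("label or weight name collides with a feature name")
    columns[label_name] = list(labels)
    if weights is not None:
        columns[weight_name] = list(weights)
    return columns


features = {"age": [39, 50, 38], "education": ["Bachelors", "HS-grad", "Masters"]}
columns = channels_to_columns(
    features,
    labels=["<=50K", ">50K", "<=50K"],
    weights=[1.0, 2.0, 1.0],
)
# The label and weight columns are now regular columns. Their names are what
# the label= and weights= arguments of the learner refer to.
print(sorted(columns))
```

With a Pandas DataFrame the same idea applies: the label and weights simply stay in the DataFrame as columns.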
Next, we run the YDF training code in a real example.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
# Download and load the dataset as Pandas DataFrames
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Name of the label column.
label = "income"
# Show extended logs.
ydf.verbose(2)
# Train a Random Forest model with a simple hyperparameter
model = ydf.RandomForestLearner(label=label, num_trees=50).train(train_ds)
# Make predictions with the model
predictions = model.predict(test_ds)
# Show a summary of the model
model.describe()
Model training - the sharp bits
- YDF does not automatically tokenize string columns for use with column type CATEGORICAL_SET. Users need to provide their own tokenization if this column type should be used.
- TF-DF often transforms categorical columns to integers, while YDF does not. The models trained by TF-DF and YDF may therefore differ, even if trained with the same hyperparameters on the same datasets.
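Because tokenization is left to the user, a text column has to be split into tokens before it can be used as CATEGORICAL_SET. The following is a minimal sketch, assuming that lowercasing and splitting on non-alphanumeric characters is good enough for the data at hand. The helpers `tokenize` and `tokenize_column` are hypothetical names, not part of YDF, and mapping missing values to empty lists is an assumption made for this example.

```python
import re

# Hypothetical helpers (not part of YDF): split free text into tokens so the
# column can be consumed as a set of categorical values.
_TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercases the text and returns its alphanumeric tokens in order."""
    return _TOKEN_PATTERN.findall(text.lower())


def tokenize_column(values):
    """Tokenizes every string of a column; non-string values become empty lists."""
    return [tokenize(v) if isinstance(v, str) else [] for v in values]


raw_column = ["Married, with children", "Never-married", None]
tokenized = tokenize_column(raw_column)
print(tokenized)
```

A real pipeline may need a more careful tokenizer (e.g. for punctuation that carries meaning), so treat the regular expression as a starting point only.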
Model evaluation, analysis and storage
YDF offers more advanced model evaluation and analysis functionality.
Summary of changes
Aspect | YDF | TF-DF |
---|---|---|
Evaluation | model.evaluate() shows rich plots and many metrics | model.evaluate() shows few metrics, no plots |
Self-Evaluation | model.self_evaluation() | model.make_inspector().evaluation() |
Model format | YDF format. Export to SavedModel is possible | TensorFlow SavedModel |
Model loading | ydf.load_model(). Loaded models are equivalent | tf_keras.models.load_model(). Loaded models are inference-only |
Variable Importances | model.variable_importances() | model.make_inspector().variable_importances() |
Model analysis | model.analyze(test_ds) | Not available |
Serving with TF Serving | Available with model.to_tensorflow_saved_model() | Available by default |
Model inspector | Not required (functionality is on the model) | Required for many tasks |
Model Evaluation and Self-Evaluation
A model can be evaluated on a test dataset.
As a quick, low-quality alternative, YDF models also provide a self-evaluation. The exact logic of the self-evaluation depends on the model. For example, Random Forest will use Out-of-bag evaluation while Gradient Boosted Trees will use internal train-validation.
# In Colab, this prints a rich evaluation object.
model.evaluate(test_ds)
# Self-evaluation is often good, though it tends to be lower quality than evaluation on test data
model.self_evaluation()
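To make the idea behind out-of-bag evaluation more concrete, the sketch below shows which examples a single tree can be evaluated on when it is trained on a bootstrap sample. This is an illustration of the general technique only, written with the standard library; it is not YDF's internal implementation, and `bootstrap_and_oob` is a hypothetical helper.

```python
import random


def bootstrap_and_oob(num_examples, rng):
    """Draws one bootstrap sample and returns (in-bag indices, out-of-bag indices)."""
    # Sampling with replacement: some examples repeat, others are never drawn.
    in_bag = [rng.randrange(num_examples) for _ in range(num_examples)]
    out_of_bag = sorted(set(range(num_examples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)
num_examples = 10
for tree_idx in range(3):
    in_bag, out_of_bag = bootstrap_and_oob(num_examples, rng)
    # A tree is only evaluated on the examples it never saw during training.
    print(f"tree {tree_idx}: trained on {sorted(set(in_bag))}, evaluated on {out_of_bag}")
```

Since each example is out-of-bag for only a subset of the trees, its out-of-bag prediction uses fewer trees than the full model, which is one reason a self-evaluation is a rougher estimate than an evaluation on held-out test data.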
Saving and Loading
The model can be saved to the YDF format for later re-use. For compatibility with TF Serving and other parts of the TensorFlow ecosystem, see the section Import from / Export to TensorFlow below.
model.save("/tmp/my_ydf_model")
If you reload the model, it is functionally equivalent to the original model.
model_reloaded = ydf.load_model("/tmp/my_ydf_model")
model_reloaded.describe()
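One simple way to check the equivalence claim is to compare the predictions of the original and the reloaded model on the same dataset. The sketch below shows such a comparison for one-dimensional predictions (e.g. binary classification probabilities), using stand-in values so that it runs without a trained model. The helper `predictions_match` and the tolerance are assumptions made for this example.

```python
import math


def predictions_match(expected, actual, abs_tol=1e-6):
    """Returns True if both prediction sequences agree element-wise."""
    if len(expected) != len(actual):
        return False
    return all(math.isclose(e, a, abs_tol=abs_tol) for e, a in zip(expected, actual))


# Stand-in values. In practice these would come from model.predict(test_ds)
# and model_reloaded.predict(test_ds).
original_predictions = [0.12, 0.87, 0.45]
reloaded_predictions = [0.12, 0.87, 0.45]
print(predictions_match(original_predictions, reloaded_predictions))
```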
Import from / Export to TensorFlow
YDF models can be exported to TensorFlow, e.g. for serving with TF Serving. See the TF Serving tutorial for more details on exporting to TensorFlow.
# Exporting requires TF-DF installed.
# !pip install tensorflow_decision_forests
model.to_tensorflow_saved_model("/tmp/my_tensorflow_saved_model")
TF-DF models can be imported to YDF. The imported model is generally equivalent to the original model and should return the same predictions. As the main difference, categorical columns in the imported model must be provided as strings instead of integers.
Note that only TF-DF models containing a single Decision Forest (e.g. a Random Forest or a Gradient Boosted Trees model) can be imported into YDF. Other parts of the model graph (e.g. neural networks) cannot be imported.
# Import the TF-DF model. Provide its top-level directory containing the saved_model.pb file.
model_from_tfdf = ydf.from_tensorflow_decision_forests("/tmp/my_tensorflow_saved_model")
model_from_tfdf.describe()
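Since the imported model expects categorical columns as strings, a dataset that stored them as integers for TF-DF needs a small conversion before prediction. The sketch below performs that conversion on a dictionary of columns. The helper `stringify_columns` and the column `zip_code` are hypothetical, and passing a dictionary of columns to `predict` is an assumption; with a Pandas DataFrame the equivalent step is a cast of the affected columns to strings.

```python
def stringify_columns(columns, categorical_names):
    """Returns a copy of `columns` where the named columns hold strings."""
    converted = dict(columns)  # shallow copy, the input dict is left untouched
    for name in categorical_names:
        converted[name] = [str(value) for value in columns[name]]
    return converted


# Hypothetical dataset where "zip_code" was fed to TF-DF as integers.
dataset = {"age": [39, 50], "zip_code": [94043, 10011]}
dataset_for_ydf = stringify_columns(dataset, ["zip_code"])
print(dataset_for_ydf)
# The converted dataset is what would be given to model_from_tfdf.predict().
```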
Model Analysis
YDF can compute a detailed model analysis report on a test dataset, including more advanced variable importances.
# Create a rich analysis report
model.analyze(test_ds)
Advanced topics: Inspection, Building, Tuning and Distributed Training
YDF and TF-DF support a number of advanced features. This guide only outlines the most important changes when transitioning from TF-DF to YDF. For more information, please refer to the tutorials on https://ydf.readthedocs.org.
Model inspector and builder
YDF gives users more powerful methods to inspect and modify models than TF-DF. These methods are located directly on the model and are much faster than the ones exposed in TF-DF. The inspector and builder components from TF-DF are no longer necessary.
# Print a tree
model.print_tree(tree_idx=0)
# Structural variable importances are available programmatically.
model.variable_importances()
# Access a tree directly
tree = model.get_tree(tree_idx=0)
tree
# Modify the tree and add it to the model
if isinstance(tree.root.condition, ydf.tree.CategoricalIsInCondition):
    tree.root.condition.mask = [1]
if isinstance(tree.root.condition, ydf.tree.NumericalHigherThanCondition):
    tree.root.condition.threshold = 18.22
print(tree)
model.add_tree(tree)
# Print the newly added tree, which is the last tree of the model
model.print_tree(tree_idx=model.num_trees() - 1)
Hyperparameter tuning
Hyperparameter tuning with YDF is very similar to hyperparameter tuning with TF-DF. Simply change tfdf.tuner.RandomSearch() to ydf.RandomSearchTuner() and pass the tuner to the learner with the tuner= argument. YDF then runs the same tuning algorithm with the same parameters.
The Keras Tuner is not supported by YDF.
# Decrease verbosity to avoid long logs
ydf.verbose(1)
# Define the tuner with some options.
tuner = ydf.RandomSearchTuner(num_trials=20)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])
tuner.choice("max_depth", [3, 4, 5, 6])
# Train a model with the tuner
model_tuned = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,  # Used for all the trials.
    tuner=tuner,
).train(train_ds)
# See the "Tuning" tab in the description for details.
model_tuned.describe()
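For readers unfamiliar with random search, the sketch below shows the basic idea behind the tuner configured above: each trial draws one value per hyperparameter from the declared choices. It is a conceptual illustration written with the standard library, not YDF's tuner implementation, which additionally trains and scores a model for each trial and keeps the best one. The helper `sample_trials` is hypothetical.

```python
import random


def sample_trials(search_space, num_trials, seed=0):
    """Draws `num_trials` random hyperparameter combinations from the space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in search_space.items()}
        for _ in range(num_trials)
    ]


# Same search space as declared with tuner.choice() above.
search_space = {
    "shrinkage": [0.2, 0.1, 0.05],
    "subsample": [1.0, 0.9, 0.8],
    "max_depth": [3, 4, 5, 6],
}
trials = sample_trials(search_space, num_trials=20)
for trial in trials[:3]:
    print(trial)
```

The space above contains 3 x 3 x 4 = 36 combinations, so 20 trials cover only part of it; increasing num_trials trades training time for coverage.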
Distributed Training / Tuning
Distributed training in YDF requires the dataset to be provided as a sequence of paths to dataset files, which the individual workers open. See the YDF distributed training tutorial for details. Distributed training from a finite TensorFlow distributed dataset is not supported in YDF.
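If the training data currently lives in a single table, it first has to be split into several files. The sketch below writes rows into a fixed number of CSV shards and returns the list of paths. It uses only the standard library; the helper `write_csv_shards`, the file naming scheme and the round-robin assignment are assumptions made for this example. How the resulting paths are handed to a learner is described in the distributed training tutorial.

```python
import csv
import os
import tempfile


def write_csv_shards(header, rows, output_dir, num_shards):
    """Writes rows round-robin into `num_shards` CSV files and returns their paths."""
    paths = [
        os.path.join(output_dir, f"train-{i:05d}-of-{num_shards:05d}.csv")
        for i in range(num_shards)
    ]
    files = [open(path, "w", newline="") for path in paths]
    try:
        writers = [csv.writer(f) for f in files]
        for writer in writers:
            writer.writerow(header)  # every shard carries its own header
        for row_idx, row in enumerate(rows):
            writers[row_idx % num_shards].writerow(row)
    finally:
        for f in files:
            f.close()
    return paths


output_dir = tempfile.mkdtemp()
shard_paths = write_csv_shards(
    header=["age", "income"],
    rows=[[39, "<=50K"], [50, ">50K"], [38, "<=50K"], [53, ">50K"]],
    output_dir=output_dir,
    num_shards=2,
)
print(shard_paths)
```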
Closing remarks
The Google Decision Forests team wants to make the migration from TF-DF to YDF as easy as possible. If you have any questions, suggestions, issues or success stories, please contact us at decision-forests-contact@google.com.