GradientBoostedTreesModel
- GradientBoostedTreesModel
- add_tree
- analyze
- analyze_prediction
- benchmark
- data_spec
- describe
- distance
- evaluate
- force_engine
- get_all_trees
- get_tree
- hyperparameter_optimizer_logs
- initial_predictions
- input_feature_names
- input_features
- iter_trees
- label
- label_classes
- list_compatible_engines
- metadata
- name
- num_trees
- plot_tree
- predict
- predict_leaves
- print_tree
- remove_tree
- save
- self_evaluation
- set_metadata
- set_node_format
- set_tree
- task
- to_cpp
- to_tensorflow_function
- to_tensorflow_saved_model
- validation_evaluation
- validation_loss
- variable_importances
GradientBoostedTreesModel
Bases: DecisionForestModel
A Gradient Boosted Trees model for prediction and inspection.
add_tree
add_tree(tree: Tree) -> None
Adds a single tree to the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree |
Tree
|
New tree. |
required |
analyze
analyze(data: InputDataset, sampling: float = 1.0, num_bins: int = 50, partial_depepence_plot: bool = True, conditional_expectation_plot: bool = True, permutation_variable_importance_rounds: int = 1, num_threads: int = 6) -> Analysis
Analyzes a model on a test dataset.
An analysis contains structural information about the model (e.g., variable importances) and information about the application of the model to the given dataset (e.g., partial dependence plots).
For a large dataset (many examples and / or features), computing the analysis can take significant time.
While some information might be valid, it is generally not recommended to analyze a model on its training dataset.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
analysis = model.analyze(test_ds)
# Display the analysis in a notebook.
analysis
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data |
InputDataset
|
Dataset. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset. |
required |
sampling |
float
|
Ratio of examples to use for the analysis. The analysis can be expensive to compute. On large datasets, use a small sampling value e.g. 0.01. |
1.0
|
num_bins |
int
|
Number of bins used to accumulate statistics. A large value increases the resolution of the plots but takes more time to compute. |
50
|
partial_depepence_plot |
bool
|
Compute partial dependence plots, a.k.a. PDPs. Expensive to compute. |
True
|
conditional_expectation_plot |
bool
|
Compute the conditional expectation plots a.k.a. CEP. Cheap to compute. |
True
|
permutation_variable_importance_rounds |
int
|
If >=1, computes permutation variable importances using "permutation_variable_importance_rounds" rounds. The more rounds, the more accurate the results. Using a single round is often acceptable, i.e. permutation_variable_importance_rounds=1. If permutation_variable_importance_rounds=0, disables the computation of permutation variable importances. |
1
|
num_threads |
int
|
Number of threads to use to compute the analysis. |
6
|
Returns:
Type | Description |
---|---|
Analysis
|
Model analysis. |
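To make the role of permutation_variable_importance_rounds concrete, here is a minimal pure-Python sketch of permutation variable importance. It is not YDF's implementation: the toy_predict function, the synthetic dataset, and the helper names are hypothetical and only serve the illustration. Each round shuffles one feature column and records the drop in accuracy; averaging over more rounds reduces the noise of the estimate.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, feature, rounds=1, seed=0):
    """Mean drop in accuracy when the values of `feature` are shuffled.

    Assumes rounds >= 1 (this sketch does not handle rounds=0).
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(rounds):
        # Shuffle one column while leaving the other features untouched.
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        permuted = [{**row, feature: v} for row, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / rounds


# Hypothetical toy model: only "f1" matters, "f2" is ignored.
def toy_predict(row):
    return int(row["f1"] > 0.5)


data_rng = random.Random(42)
rows = [{"f1": data_rng.random(), "f2": data_rng.random()} for _ in range(200)]
labels = [toy_predict(row) for row in rows]

importance_f1 = permutation_importance(toy_predict, rows, labels, "f1", rounds=5)
importance_f2 = permutation_importance(toy_predict, rows, labels, "f2", rounds=5)
```

In this toy setup, shuffling the ignored feature "f2" leaves the accuracy unchanged, while shuffling "f1" degrades it.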
analyze_prediction
Understands a single prediction of the model.
Note: To explain the model as a whole, use model.analyze instead.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
# We want to explain the model prediction on the first test example.
selected_example = test_ds.iloc[:1]
analysis = model.analyze_prediction(selected_example)
# Display the analysis in a notebook.
analysis
Parameters:
Name | Type | Description | Default |
---|---|---|---|
single_example |
InputDataset
|
Example to explain. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset. |
required |
Returns:
Type | Description |
---|---|
PredictionAnalysis
|
Prediction explanation. |
benchmark
benchmark(ds: InputDataset, benchmark_duration: float = 3, warmup_duration: float = 1, batch_size: int = 100) -> BenchmarkInferenceCCResult
Benchmark the inference speed of the model on the given dataset.
This benchmark creates batched predictions on the given dataset using the
C++ API of Yggdrasil Decision Forests. Note that inference times using other
APIs or on different machines will be different. A serving template for the
C++ API can be generated with model.to_cpp().
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds |
InputDataset
|
Dataset to perform the benchmark on. |
required |
benchmark_duration |
float
|
Total duration of the benchmark in seconds. Note that this number is only indicative and the actual duration of the benchmark may be shorter or longer. This parameter must be > 0. |
3
|
warmup_duration |
float
|
Total duration of the warmup runs before the benchmark in seconds. During the warmup phase, the benchmark is run without being timed. This allows warming up caches. The benchmark will always run at least one batch for warmup. This parameter must be > 0. |
1
|
batch_size |
int
|
Size of batches when feeding examples to the inference engines. The impact of this parameter on the results depends on the architecture running the benchmark (notably, cache sizes). |
100
|
Returns:
Type | Description |
---|---|
BenchmarkInferenceCCResult
|
Benchmark results. |
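The interplay of warmup_duration, benchmark_duration, and batch_size can be sketched in pure Python as follows. This is only an illustration under simplifying assumptions, not the C++ benchmark that YDF runs: the benchmark helper and toy_predict_batch are hypothetical, and the timing logic is a plain wall-clock loop.

```python
import time


def benchmark(predict_batch, examples, benchmark_duration=3.0,
              warmup_duration=1.0, batch_size=100):
    """Returns the average wall time per example, in seconds."""
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]

    def run_all_batches():
        for batch in batches:
            predict_batch(batch)

    # Warmup: untimed runs, with at least one pass over the data.
    warmup_end = time.perf_counter() + warmup_duration
    run_all_batches()
    while time.perf_counter() < warmup_end:
        run_all_batches()

    # Timed runs: repeat full passes until the requested duration has elapsed.
    num_runs = 0
    start = time.perf_counter()
    while True:
        run_all_batches()
        num_runs += 1
        elapsed = time.perf_counter() - start
        if elapsed >= benchmark_duration:
            break
    return elapsed / (num_runs * len(examples))


# Hypothetical stand-in for a model's batched inference.
def toy_predict_batch(batch):
    return [2.0 * x + 1.0 for x in batch]


seconds_per_example = benchmark(
    toy_predict_batch, list(range(1000)),
    benchmark_duration=0.05, warmup_duration=0.01, batch_size=100)
```

Because the loop only stops after a full pass over the data, the actual duration can exceed the requested one, which is why the duration parameters are indicative.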
describe
describe(output_format: Literal['auto', 'text', 'notebook', 'html'] = 'auto', full_details: bool = False) -> Union[str, HtmlNotebookDisplay]
Description of the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_format |
Literal['auto', 'text', 'notebook', 'html']
|
Format of the display: - auto: Use the "notebook" format if executed in an IPython notebook / Colab. Otherwise, use the "text" format. - text: Text description of the model. - html: HTML description of the model. - notebook: HTML description of the model displayed in a notebook cell. |
'auto'
|
full_details |
bool
|
Whether to print the full model. This can be large. |
False
|
Returns:
Type | Description |
---|---|
Union[str, HtmlNotebookDisplay]
|
The model description. |
distance
distance(data1: InputDataset, data2: Optional[InputDataset] = None) -> ndarray
Computes the pairwise distance between examples in "data1" and "data2".
If "data2" is not provided, computes the pairwise distance between examples in "data1".
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.
Different models are free to implement different distances with different definitions. For this reason, unless indicated by the model, distances from different models cannot be compared.
The distance is not guaranteed to satisfy the triangle inequality property of metric distances.
Not all models can compute distances. In this case, this function will raise an Exception.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data1 |
InputDataset
|
Dataset. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset. |
required |
data2 |
Optional[InputDataset]
|
Dataset. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset. |
None
|
Returns:
Type | Description |
---|---|
ndarray
|
Pairwise distance |
evaluate
Evaluates the quality of a model on a dataset.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
evaluation = model.evaluate(test_ds)
In a notebook, if a cell returns an evaluation object, this evaluation will be displayed as rich HTML with plots.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data |
InputDataset
|
Dataset. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset. |
required |
bootstrapping |
Union[bool, int]
|
Controls whether bootstrapping is used to evaluate the confidence intervals and statistical tests (i.e., all the metrics ending with "[B]"). If set to false, bootstrapping is disabled. If set to true, bootstrapping is enabled and 2000 bootstrapping samples are used. If set to an integer, it specifies the number of bootstrapping samples to use. In this case, if the number is less than 100, an error is raised as bootstrapping will not yield useful results. |
False
|
Returns:
Type | Description |
---|---|
Evaluation
|
Model evaluation. |
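The effect of the bootstrapping argument can be illustrated with a percentile bootstrap of the accuracy. This is a minimal sketch under stated assumptions, not the procedure YDF uses internally: the bootstrap_accuracy_interval helper, the percentile method, and the toy labels and predictions are all hypothetical.

```python
import random


def bootstrap_accuracy_interval(labels, predictions, num_samples=2000,
                                confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval of the accuracy."""
    rng = random.Random(seed)
    hits = [int(y == p) for y, p in zip(labels, predictions)]
    n = len(hits)
    accuracies = []
    for _ in range(num_samples):
        # Resample n examples with replacement and recompute the metric.
        resample = rng.choices(hits, k=n)
        accuracies.append(sum(resample) / n)
    accuracies.sort()
    lower_idx = int((1.0 - confidence) / 2.0 * num_samples)
    upper_idx = min(num_samples - 1,
                    int((1.0 + confidence) / 2.0 * num_samples))
    return accuracies[lower_idx], accuracies[upper_idx]


# Hypothetical labels and predictions: 80 correct out of 100.
labels = [1] * 100
predictions = [1] * 80 + [0] * 20
lower, upper = bootstrap_accuracy_interval(labels, predictions)
```

With few bootstrap samples the percentile estimates become unstable, which is consistent with rejecting very small sample counts.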
force_engine
Forces the engines used by the model.
If not specified (i.e., None; default value), the fastest compatible engine (i.e., the first value returned from "list_compatible_engines") is used for all model inferences (e.g., model.predict, model.evaluate).
If passing a non-existing or non-compatible engine, the next model inference (e.g., model.predict, model.evaluate) will fail.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine_name |
Optional[str]
|
Name of a compatible engine or None to automatically select the fastest engine. |
required |
get_tree
hyperparameter_optimizer_logs
hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]
Returns the logs of the hyper-parameter tuning.
If the model is not trained with hyper-parameter tuning, returns None.
initial_predictions
initial_predictions() -> NDArray[float]
Returns the model's initial predictions (i.e. the model bias).
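As a rough illustration of what the bias represents, the sketch below assumes a single-output regression model whose leaf values already include the shrinkage: the prediction is then the initial prediction plus the sum of the tree outputs. The trees, feature names, and the gbt_regression_predict helper are hypothetical and do not use the YDF API.

```python
def gbt_regression_predict(initial_prediction, trees, example):
    """Initial prediction (bias) plus the sum of the tree outputs."""
    return initial_prediction + sum(tree(example) for tree in trees)


# Hypothetical trees, written as plain functions returning a leaf value.
def tree_0(example):
    return 0.5 if example["f1"] >= 2.0 else -0.5


def tree_1(example):
    return 0.25 if example["f2"] >= 1.0 else -0.25


prediction = gbt_regression_predict(
    10.0, [tree_0, tree_1], {"f1": 3.0, "f2": 0.0})
# 10.0 + 0.5 - 0.25 = 10.25
```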
input_feature_names
Returns the names of the input features.
The features are sorted in increasing order of column_idx.
input_features
input_features() -> Sequence[InputFeature]
Returns the input features of the model.
The features are sorted in increasing order of column_idx.
label_classes
Returns the label classes for classification tasks, None otherwise.
list_compatible_engines
metadata
metadata() -> ModelMetadata
Metadata associated with the model.
A model's metadata contains information stored with the model that does not
influence the model's predictions (e.g. date created). When distributing a
model for wide release, it may be useful to clear / modify the model
metadata with model.set_metadata(ydf.ModelMetadata()).
Returns:
Type | Description |
---|---|
ModelMetadata
|
The model's metadata. |
plot_tree
plot_tree(tree_idx: int = 0, max_depth: Optional[int] = None, options: Optional[PlotOptions] = None, d3js_url: str = 'https://d3js.org/d3.v6.min.js') -> TreePlot
Plots an interactive HTML rendering of the tree.
Usage example:
import pandas as pd
import ydf
# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx |
int
|
Index of the tree. Should be in [0, self.num_trees()). |
0
|
max_depth |
Optional[int]
|
Maximum tree depth of the plot. Set to None for full depth. |
None
|
options |
Optional[PlotOptions]
|
Advanced options for plotting. Set to None for default style. |
None
|
d3js_url |
str
|
URL to load the d3.js library from. |
'https://d3js.org/d3.v6.min.js'
|
Returns:
Type | Description |
---|---|
TreePlot
|
In interactive environments, an interactive plot. The HTML source can also be exported to file. |
predict
Returns the predictions of the model on the given dataset.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
predictions = model.predict(test_ds)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data |
InputDataset
|
Dataset. Can be a dictionary of lists or numpy arrays of values, Pandas DataFrame, or a VerticalDataset. If the dataset contains the label column, that column is ignored. |
required |
predict_leaves
Gets the index of the active leaf in each tree.
The active leaf is the leaf that receives the example during inference.
The returned value "leaves[i,j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed by depth first exploration with the negative child visited before the positive one.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data |
InputDataset
|
Dataset. |
required |
Returns:
Type | Description |
---|---|
ndarray
|
Index of the active leaf for each tree in the model. |
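The leaf indexing convention described above (depth-first exploration, negative child before positive child) can be sketched in pure Python. The Node class, the toy tree, and active_leaf_index are hypothetical and only illustrate the ordering; they are not part of the YDF API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    # Internal nodes have a condition and two children; leaves have neither.
    condition: Optional[Callable[[dict], bool]] = None
    neg_child: Optional["Node"] = None
    pos_child: Optional["Node"] = None


def active_leaf_index(root, example):
    """Index of the leaf reached by `example`, with leaves numbered in
    depth-first order and the negative child visited first."""
    # Route the example down the tree to find its active leaf.
    active = root
    while active.condition is not None:
        if active.condition(example):
            active = active.pos_child
        else:
            active = active.neg_child

    # Depth-first traversal counting leaves until the active leaf is found.
    index = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.condition is None:
            if node is active:
                return index
            index += 1
        else:
            # Push the positive child first so the negative one is popped first.
            stack.append(node.pos_child)
            stack.append(node.neg_child)
    raise ValueError("active leaf not found")


# Toy tree: the root tests f1 >= 2; its positive child tests f2 >= 1.
leaf_a, leaf_b, leaf_c = Node(), Node(), Node()
inner = Node(lambda ex: ex["f2"] >= 1, neg_child=leaf_b, pos_child=leaf_c)
root = Node(lambda ex: ex["f1"] >= 2, neg_child=leaf_a, pos_child=inner)

examples = [{"f1": 0, "f2": 0}, {"f1": 3, "f2": 0}, {"f1": 3, "f2": 5}]
leaves = [active_leaf_index(root, ex) for ex in examples]
```

Here leaf_a, leaf_b, and leaf_c receive the indices 0, 1, and 2 respectively, so the three examples map to leaves 0, 1, and 2.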
print_tree
print_tree(tree_idx: int = 0, file=sys.stdout) -> None
Prints a tree in the terminal.
Usage example:
import pandas as pd
import ydf
# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx |
int
|
Index of the tree. Should be in [0, self.num_trees()). |
0
|
file |
Where to print the tree. By default, prints on the terminal standard output. |
sys.stdout
|
remove_tree
remove_tree(tree_idx: int) -> None
Removes a single tree from the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx |
int
|
Index of the tree. Should be in [0, num_trees()). |
required |
save
Save the model to disk.
YDF uses a proprietary model format for saving models. A model consists of
multiple files located in the same directory.
A directory should only contain a single YDF model. See advanced_options
for more information.
YDF models can also be exported to other formats, see
to_tensorflow_saved_model() and to_cpp() for details.
YDF saves some metadata inside the model, see model.metadata() for
details. Before distributing a model to the world, consider removing
metadata with model.set_metadata(ydf.ModelMetadata()).
Usage example:
import pandas as pd
import ydf
# Train a Random Forest model
df = pd.read_csv("my_dataset.csv")
model = ydf.RandomForestLearner(label="label").train(df)
# Save the model to disk
model.save("/models/my_model")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path |
Path to directory to store the model in. |
required | |
advanced_options |
Advanced options for saving models. |
ModelIOOptions()
|
self_evaluation
self_evaluation() -> Optional[Evaluation]
Returns the model's self-evaluation.
For Gradient Boosted Trees models, the self-evaluation is the evaluation on the validation dataset. Note that the validation dataset is extracted automatically if not explicitly given. If the validation dataset is deactivated, no self-evaluation is computed.
Different models use different methods for self-evaluation. Notably, Random Forests use the last Out-Of-Bag evaluation. Therefore, self-evaluations are not comparable between different model types.
Returns None if no self-evaluation has been computed.
set_node_format
Set the serialization format for the nodes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
node_format |
NodeFormat
|
Node format to use when saving the model. |
required |
set_tree
to_cpp
Generates the code of a .h file to run the model in C++.
How to use this function:
- Copy the output of this function into a new .h file: open("model.h", "w").write(model.to_cpp())
- If you use Bazel/Blaze, create a rule with the dependencies: //third_party/absl/status:statusor, //third_party/absl/strings, //external/ydf_cc/yggdrasil_decision_forests/api:serving
- In your C++ code, include the .h file and call the model with:
// Load the model (to do only once).
namespace ydf = yggdrasil_decision_forests;
const auto model = ydf::exported_model_123::Load(
);
// Run the model
predictions = model.Predict();
- The generated "Predict" function takes no inputs. Instead, it fills the input features with placeholder values. Therefore, you will want to add your input as arguments to the "Predict" function, and use it to populate the "examples->Set..." section accordingly.
- (Bonus) You can further optimize the inference speed by pre-allocating and re-using the examples and predictions for each thread running the model.
This documentation is also available in the header of the generated content for more details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key |
str
|
Name of the model. Used to define the C++ namespace of the model. |
'my_model'
|
Returns:
Type | Description |
---|---|
str
|
String containing an example header for running the model in C++. |
to_tensorflow_function
to_tensorflow_function(temp_dir: Optional[str] = None, can_be_saved: bool = True, squeeze_binary_classification: bool = True) -> Module
Converts the YDF model into a @tf.function callable TensorFlow Module.
The output module can be composed with other TensorFlow operations,
including other models serialized with to_tensorflow_function.
This function requires TensorFlow and TensorFlow Decision Forests to be
installed. You can install them using the command pip install
tensorflow_decision_forests. The generated SavedModel relies on the
TensorFlow Decision Forests Custom Inference Op. This Op is available by
default in various platforms such as Servomatic, TensorFlow Serving, Vertex
AI, and TensorFlow.js.
Usage example:
!pip install tensorflow_decision_forests
import ydf
import numpy as np
import tensorflow as tf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
})
# Convert model to a TF module.
tf_model = model.to_tensorflow_function()
# Make predictions with the TF module.
tf_predictions = tf_model({
"f1": tf.constant([0, 0.5, 1]),
"f2": tf.constant([1, 0, 0.5]),
})
Parameters:
Name | Type | Description | Default |
---|---|---|---|
temp_dir |
Optional[str]
|
Temporary directory used during the conversion. If None
(default), uses |
None
|
can_be_saved |
bool
|
If can_be_saved = True (default), the returned module can be
saved using |
True
|
squeeze_binary_classification |
bool
|
If true (default), in case of binary classification, outputs a tensor of shape [num examples] containing the probability of the positive class. If false, in case of binary classification, outputs a tensor of shape [num examples, 2] containing the probability of both the negative and positive classes. Has no effect on non-binary classification models. |
True
|
Returns:
Type | Description |
---|---|
Module
|
A TensorFlow @tf.function. |
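The two output shapes controlled by squeeze_binary_classification can be illustrated without TensorFlow. The sketch below uses plain Python lists in place of tensors; format_binary_output and the probability values are hypothetical and only show the shape difference.

```python
def format_binary_output(probabilities, squeeze_binary_classification=True):
    """`probabilities` has shape [num examples, 2]: (negative, positive)."""
    if squeeze_binary_classification:
        # Shape [num examples]: probability of the positive class only.
        return [positive for _, positive in probabilities]
    # Shape [num examples, 2]: probabilities of both classes.
    return [list(pair) for pair in probabilities]


raw = [(0.9, 0.1), (0.25, 0.75), (0.5, 0.5)]
squeezed = format_binary_output(raw, squeeze_binary_classification=True)
full = format_binary_output(raw, squeeze_binary_classification=False)
```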
to_tensorflow_saved_model
to_tensorflow_saved_model(path: str, input_model_signature_fn: Any = None, *, mode: Literal['keras', 'tf'] = 'keras', feature_dtypes: Dict[str, TFDType] = {}, servo_api: bool = False, feed_example_proto: bool = False, pre_processing: Optional[Callable] = None, post_processing: Optional[Callable] = None, temp_dir: Optional[str] = None) -> None
Exports the model as a TensorFlow Saved model.
This function requires TensorFlow and TensorFlow Decision Forests to be
installed. Install them by running the command pip install
tensorflow_decision_forests. The generated SavedModel relies on the
TensorFlow Decision Forests Custom Inference Op. This Op is available by
default in various platforms such as Servomatic, TensorFlow Serving, Vertex
AI, and TensorFlow.js.
Usage example:
!pip install tensorflow_decision_forests
import ydf
import numpy as np
import tensorflow as tf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100).astype(dtype=np.float32),
"l": np.random.randint(2, size=100),
})
# Export the model to the TensorFlow SavedModel format.
# The model can be executed with Servomatic, TensorFlow Serving and
# Vertex AI.
model.to_tensorflow_saved_model(path="/tmp/my_model", mode="tf")
# The model can also be loaded in TensorFlow and executed locally.
# Load the TensorFlow Saved model.
tf_model = tf.saved_model.load("/tmp/my_model")
# Make predictions
tf_predictions = tf_model({
"f1": tf.constant(np.random.random(size=10)),
"f2": tf.constant(np.random.random(size=10), dtype=tf.float32),
})
TensorFlow SavedModels do not automatically cast feature values. For
instance, a model trained with a dtype=float32 semantic=numerical feature
requires this feature to be fed as float32 numbers during inference.
You can override the dtype of a feature with the feature_dtypes argument:
model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
# "f1" is fed as a tf.int64 instead of tf.float64
feature_dtypes={"f1": tf.int64},
)
The SavedModel format allows for custom preprocessing and postprocessing
computation in addition to the model inference. Such computation can be
specified with the pre_processing and post_processing arguments:
def pre_processing(features):
features = features.copy()
features["f1"] = features["f1"] * 2
return features
model.to_tensorflow_saved_model(
path="/tmp/my_model",
mode="tf",
pre_processing=pre_processing,
)
For more complex combinations, such as composing multiple models, use the
method to_tensorflow_function instead of to_tensorflow_saved_model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path |
str
|
Path to store the TensorFlow Decision Forests model. |
required |
input_model_signature_fn |
Any
|
A lambda that returns the
(Dense,Sparse,Ragged)TensorSpec (or structure of TensorSpec e.g.
dictionary, list) corresponding to input signature of the model. If not
specified, the input model signature is created by
|
None
|
mode |
Literal['keras', 'tf']
|
How is the YDF converted into a TensorFlow SavedModel. 1) mode =
"keras" (default): Turn the model into a Keras 2 model using TensorFlow
Decision Forests, and then save it with |
'keras'
|
feature_dtypes |
Dict[str, TFDType]
|
Mapping from feature name to TensorFlow dtype. Use this
mapping to feature dtype. For instance, numerical features are encoded
with tf.float32 by default. If you plan on feeding tf.float64 or
tf.int32, use |
{}
|
servo_api |
bool
|
If true, adds a SavedModel signature to make the model
compatible with the |
False
|
feed_example_proto |
bool
|
If false, the model expects the input features to be provided as TensorFlow values. This is the most efficient way to make predictions. If true, the model expects the input features to be provided as a binary serialized TensorFlow Example proto. This is the format expected by VertexAI and most TensorFlow Serving pipelines. |
False
|
pre_processing |
Optional[Callable]
|
Optional TensorFlow function or module to apply on the input features before applying the model. Only compatible with mode="tf". |
None
|
post_processing |
Optional[Callable]
|
Optional TensorFlow function or module to apply on the model predictions. Only compatible with mode="tf". |
None
|
temp_dir |
Optional[str]
|
Temporary directory used during the conversion. If None
(default), uses |
None
|
validation_evaluation
validation_evaluation() -> Optional[Evaluation]
Returns the validation evaluation of the model, if available.
Gradient Boosted Trees use a validation dataset for early stopping.
Returns None if no validation evaluation has been computed or it has been removed from the model.
validation_loss
Returns loss on the validation dataset if available.
variable_importances
Variable importances to measure the impact of features on the model.
Variable importances generally indicate how much a variable (feature) contributes to the model predictions or quality. Different variable importances have different semantics and are generally not comparable.
The variable importances returned by variable_importances() depend on the
learning algorithm and its hyper-parameters. For example, the hyperparameter
compute_oob_variable_importances=True of the Random Forest learner enables
the computation of permutation out-of-bag variable importances.
Features are sorted by decreasing importance.
Usage example:
# Train a Random Forest. Enable the computation of OOB (out-of-bag) variable
# importances.
model = ydf.RandomForestLearner(compute_oob_variable_importances=True,
label=...).train(ds)
# List the available variable importances.
print(model.variable_importances().keys())
# Show a specific variable importance.
model.variable_importances()["MEAN_DECREASE_IN_ACCURACY"]
>> [(0.0713061951754389, "bill_length_mm"),
(0.007298519736842035, "island"),
(0.004505893640351366, "flipper_length_mm"),
...
Returns:
Type | Description |
---|---|
Dict[str, List[Tuple[float, str]]]
|
Variable importances. |
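Given the documented return type, each entry maps an importance name to a list of (importance, feature name) pairs. The sketch below works on a hand-written dictionary of that shape rather than on a trained model; the top_features helper is hypothetical, and the values are copied from the example output above.

```python
# Hypothetical return value with the documented structure
# Dict[str, List[Tuple[float, str]]], i.e. (importance, feature name) pairs.
variable_importances = {
    "MEAN_DECREASE_IN_ACCURACY": [
        (0.0713061951754389, "bill_length_mm"),
        (0.007298519736842035, "island"),
        (0.004505893640351366, "flipper_length_mm"),
    ],
}


def top_features(importances, key, k=2):
    """Names of the k most important features for the given importance."""
    # Sort defensively by decreasing importance before truncating.
    ranked = sorted(importances[key], reverse=True)
    return [name for _, name in ranked[:k]]


top = top_features(variable_importances, "MEAN_DECREASE_IN_ACCURACY")
```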