GradientBoostedTreesModel ¶
Bases: DecisionForestModel
A Gradient Boosted Trees model for prediction and inspection.
add_tree ¶
add_tree(tree: Tree) -> None
Adds a single tree to the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree | Tree | New tree. | required |
analyze ¶
analyze(
data: InputDataset,
sampling: float = 1.0,
num_bins: int = 50,
partial_dependence_plot: bool = True,
conditional_expectation_plot: bool = True,
permutation_variable_importance_rounds: int = 1,
num_threads: Optional[int] = None,
maximum_duration: Optional[float] = 20,
) -> Analysis
benchmark ¶
benchmark(
ds: InputDataset,
benchmark_duration: float = 3,
warmup_duration: float = 1,
batch_size: int = 100,
num_threads: Optional[int] = None,
) -> BenchmarkInferenceCCResult
describe ¶
describe(
output_format: Literal[
"auto", "text", "notebook", "html"
] = "auto",
full_details: bool = False,
) -> Union[str, HtmlNotebookDisplay]
distance ¶
distance(
data1: InputDataset,
data2: Optional[InputDataset] = None,
) -> ndarray
Computes the pairwise distance between examples in "data1" and "data2".
If "data2" is not provided, computes the pairwise distance between examples in "data1".
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.
Different models are free to implement different distances with different definitions. For this reason, unless indicated by the model, distances from different models cannot be compared.
The distance is not guaranteed to satisfy the triangle inequality property of metric distances.
Not all models can compute distances. A model that cannot compute distances raises an exception when this function is called.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data1 | InputDataset | Dataset. Can be a dictionary of list or numpy array of values, Pandas DataFrame, or a VerticalDataset. | required |
data2 | Optional[InputDataset] | Dataset. Can be a dictionary of list or numpy array of values, Pandas DataFrame, or a VerticalDataset. | None |
Returns:
Type | Description |
---|---|
ndarray | Pairwise distance |
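To make the layout of the returned matrix concrete, here is a minimal sketch of one common distance for tree ensembles: the fraction of trees in which two examples reach different leaves. Whether a given YDF model uses this definition is an assumption; the function name `leaf_cooccurrence_distance` and the toy leaf indices are hypothetical and not part of the library.

```python
from typing import List, Sequence


def leaf_cooccurrence_distance(
    leaves1: Sequence[Sequence[int]],
    leaves2: Sequence[Sequence[int]],
) -> List[List[float]]:
    """Pairwise distance computed from active-leaf indices.

    leaves1[i][t] is the active leaf of example i in tree t. The distance
    between two examples is the fraction of trees in which they fall into
    different leaves (an illustrative definition, not necessarily YDF's).
    """
    distances = []
    for row1 in leaves1:
        out_row = []
        for row2 in leaves2:
            if len(row1) != len(row2):
                raise ValueError("Both inputs must use the same number of trees")
            different = sum(a != b for a, b in zip(row1, row2))
            out_row.append(different / len(row1))
        distances.append(out_row)
    return distances


# Toy leaf indices for a hypothetical 4-tree model.
test_leaves = [[0, 2, 1, 3], [1, 2, 0, 3]]
train_leaves = [[0, 2, 1, 3], [1, 0, 0, 1], [0, 2, 0, 3]]

# distances[i][j] is the distance between the i-th test example and the
# j-th train example, mirroring the layout described above.
distances = leaf_cooccurrence_distance(test_leaves, train_leaves)
```

The result has one row per example of the first input and one column per example of the second, which matches the "distances[i,j]" convention of the usage example.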
evaluate ¶
evaluate(
data: InputDataset,
*,
weighted: Optional[bool] = None,
task: Optional[Task] = None,
label: Optional[str] = None,
group: Optional[str] = None,
bootstrapping: Union[bool, int] = False,
ndcg_truncation: int = 5,
mrr_truncation: int = 5,
evaluation_task: Optional[Task] = None,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> Evaluation
get_tree ¶
initial_predictions ¶
initial_predictions() -> NDArray[float]
Returns the model's initial predictions (i.e. the model bias).
input_feature_names ¶
Returns the names of the input features.
The features are sorted in increasing order of column_idx.
input_features ¶
input_features() -> Sequence[InputFeature]
Returns the input features of the model.
The features are sorted in increasing order of column_idx.
label_classes ¶
Returns the label classes for a classification model; fails otherwise.
num_trees_per_iteration ¶
num_trees_per_iteration() -> int
The number of trees trained per gradient boosting iteration.
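The sketch below illustrates how the initial predictions and the number of trees per iteration could combine into raw scores. It assumes that trees are stored iteration by iteration, so that tree `i` contributes to output dimension `i % num_trees_per_iteration`, and it ignores the activation function. The helper `accumulate_scores` and the numbers are hypothetical.

```python
from typing import List


def accumulate_scores(
    initial_predictions: List[float],
    tree_outputs: List[float],
    num_trees_per_iteration: int,
) -> List[float]:
    """Sums the bias and the tree outputs into one raw score per output dimension.

    Assumption: tree `i` contributes to dimension `i % num_trees_per_iteration`.
    """
    if len(initial_predictions) != num_trees_per_iteration:
        raise ValueError("Expected one initial prediction per output dimension")
    scores = list(initial_predictions)  # start from the model bias
    for tree_idx, value in enumerate(tree_outputs):
        scores[tree_idx % num_trees_per_iteration] += value
    return scores


# Hypothetical 3-class model after 2 boosting iterations (6 trees in total).
scores = accumulate_scores(
    initial_predictions=[0.0, 0.5, -0.5],
    tree_outputs=[1.0, 0.25, -0.25, 0.5, 0.25, -0.25],
    num_trees_per_iteration=3,
)
```

Under this assumption, a model with 6 trees and 3 trees per iteration has been trained for 2 iterations.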
plot_tree ¶
plot_tree(
tree_idx: int = 0,
max_depth: Optional[int] = None,
options: Optional[PlotOptions] = None,
d3js_url: str = "https://d3js.org/d3.v6.min.js",
) -> TreePlot
Plots an interactive HTML rendering of the tree.
Usage example:
import pandas as pd
import ydf
# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx | int | Index of the tree. Should be in [0, self.num_trees()). | 0 |
max_depth | Optional[int] | Maximum tree depth of the plot. Set to None for full depth. | None |
options | Optional[PlotOptions] | Advanced options for plotting. Set to None for default style. | None |
d3js_url | str | URL to load the d3.js library from. | 'https://d3js.org/d3.v6.min.js' |
Returns:
Type | Description |
---|---|
TreePlot | In interactive environments, an interactive plot. The HTML source can also be exported to file. |
predict ¶
predict(
data: InputDataset,
*,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> ndarray
predict_class ¶
predict_class(
data: InputDataset,
*,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> ndarray
Returns the most likely predicted class for a classification model.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
predictions = model.predict_class(test_ds)
This method returns a numpy array of strings of shape [num_examples]. Each value represents the most likely class for the corresponding example. This method can only be used for classification models.
In case of ties, the first class in model.label_classes() is returned.
See model.predict to generate the full prediction probabilities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | InputDataset | Dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or lists. If the dataset contains the label column, that column is ignored. | required |
use_slow_engine | bool | If true, uses the slow engine for making predictions. The slow engine of YDF is an order of magnitude slower than the other prediction engines. There exist very rare edge cases where predictions with the regular engines fail, e.g., models with a very large number of categorical conditions. It is only in these cases that users should use the slow engine and report the issue to the YDF developers. | False |
num_threads | Optional[int] | Number of threads used to run the model. | None |
Returns:
Type | Description |
---|---|
ndarray | The most likely predicted class for each example. |
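The tie-breaking rule described above can be sketched in plain Python. This is an illustration of the documented behavior, not the library's implementation; the function `most_likely_class` and the probability values are hypothetical, and the class list stands in for the output of model.label_classes().

```python
from typing import List, Sequence


def most_likely_class(
    probabilities: Sequence[Sequence[float]],
    label_classes: Sequence[str],
) -> List[str]:
    """Returns the class with the highest probability for each example.

    On ties, the class that appears first in `label_classes` wins.
    """
    predictions = []
    for row in probabilities:
        best_idx = 0
        for idx, p in enumerate(row):
            if p > row[best_idx]:  # strict ">" keeps the earliest class on ties
                best_idx = idx
        predictions.append(label_classes[best_idx])
    return predictions


classes = ["a", "b", "c"]
# One row per example, one column per class; the second row has a tie.
probabilities = [[0.2, 0.7, 0.1], [0.4, 0.4, 0.2], [0.1, 0.3, 0.6]]
predicted = most_likely_class(probabilities, classes)
```

The second example is tied between "a" and "b", so the sketch returns "a" because it comes first in the class list.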
predict_leaves ¶
Gets the index of the active leaf in each tree.
The active leaf is the leaf that receives the example during inference.
The returned value "leaves[i,j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed by depth-first exploration, with the negative child visited before the positive one.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | InputDataset | Dataset. | required |
Returns:
Type | Description |
---|---|
ndarray | Index of the active leaf for each tree in the model. |
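The leaf numbering described above (depth-first, negative child before positive child) can be illustrated on a toy tree. The `Node` class and `index_leaves` helper below are hypothetical stand-ins written for this sketch; they are not YDF types.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical binary tree node; a node without children is a leaf."""

    name: str
    neg: Optional["Node"] = None
    pos: Optional["Node"] = None


def index_leaves(root: Node) -> Dict[str, int]:
    """Numbers leaves in depth-first order, negative child before positive."""
    indices: Dict[str, int] = {}

    def visit(node: Node) -> None:
        if node.neg is None and node.pos is None:
            indices[node.name] = len(indices)  # next free leaf index
            return
        if node.neg is not None:
            visit(node.neg)
        if node.pos is not None:
            visit(node.pos)

    visit(root)
    return indices


# The root's negative branch holds two leaves; its positive branch is a leaf.
tree = Node(
    "root",
    neg=Node("inner", neg=Node("leaf_A"), pos=Node("leaf_B")),
    pos=Node("leaf_C"),
)
leaf_indices = index_leaves(tree)
```

Both leaves under the negative branch are numbered before the leaf on the positive branch, regardless of their depth.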
print_tree ¶
Prints a tree in the terminal.
Usage example:
import pandas as pd
import ydf
# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx | int | Index of the tree. Should be in [0, self.num_trees()). | 0 |
max_depth | Optional[int] | Maximum tree depth of the plot. Set to None for full depth. | 6 |
file | Any | Where to print the tree. By default, prints on the terminal standard output. | stdout |
remove_tree ¶
remove_tree(tree_idx: int) -> None
Removes a single tree from the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_idx | int | Index of the tree. Should be in [0, num_trees()). | required |
self_evaluation ¶
self_evaluation() -> Optional[Evaluation]
Returns the model's self-evaluation.
For Gradient Boosted Trees models, the self-evaluation is the evaluation on the validation dataset. Note that the validation dataset is extracted automatically if not explicitly given. If the validation dataset is deactivated, no self-evaluation is computed.
Different models use different methods for self-evaluation. Notably, Random Forests use the last Out-Of-Bag evaluation. Therefore, self-evaluations are not comparable between different model types.
Returns None if no self-evaluation has been computed.
Usage example:
set_feature_selection_logs ¶
set_feature_selection_logs(
value: Optional[FeatureSelectorLogs],
) -> None
set_initial_predictions ¶
Sets the model's initial predictions (i.e. the model bias).
set_node_format ¶
set_node_format(node_format: NodeFormat) -> None
Sets the serialization format for the nodes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
node_format | NodeFormat | Node format to use when saving the model. | required |
set_tree ¶
to_docker ¶
Exports the model to a Docker endpoint deployable on Cloud.
This function creates a directory containing a Dockerfile, the model and support files.
Usage example:
import numpy as np
import ydf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
})
# Export the model to a Docker endpoint.
model.to_docker(path="/tmp/my_model")
# Print instructions on how to use the model (notebook syntax)
!cat /tmp/my_model/readme.md
# The commands below run in a shell, not in Python.
# Test the endpoint locally
docker build --platform linux/amd64 -t ydf_predict_image /tmp/my_model
docker run --rm -p 8080:8080 -d ydf_predict_image
# Deploy the model on Google Cloud
gcloud run deploy ydf-predict --source /tmp/my_model
# Check the automatically created utility scripts "test_locally.sh" and
# "deploy_in_google_cloud.sh" for more examples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Directory in which to create the Docker endpoint. | required |
exist_ok | bool | If false (default), fails if the directory already exists. If true, overwrites the directory content, if any. | False |
to_jax_function ¶
to_jax_function(
jit: bool = True,
apply_activation: bool = True,
leaves_as_params: bool = False,
compatibility: Union[str, Compatibility] = "XLA",
) -> JaxModel
to_tensorflow_function ¶
to_tensorflow_function(
temp_dir: Optional[str] = None,
can_be_saved: bool = True,
squeeze_binary_classification: bool = True,
force: bool = False,
) -> Module
to_tensorflow_saved_model ¶
to_tensorflow_saved_model(
path: str,
input_model_signature_fn: Any = None,
*,
mode: Literal["keras", "tf"] = "keras",
feature_dtypes: Dict[str, TFDType] = {},
servo_api: bool = False,
feed_example_proto: bool = False,
pre_processing: Optional[Callable] = None,
post_processing: Optional[Callable] = None,
temp_dir: Optional[str] = None,
tensor_specs: Optional[Dict[str, Any]] = None,
feature_specs: Optional[Dict[str, Any]] = None,
force: bool = False
) -> None
validation_evaluation ¶
validation_evaluation() -> Optional[Evaluation]
Returns the validation evaluation of the model, if available.
Gradient Boosted Trees use a validation dataset for early stopping.
Returns None if no validation evaluation has been computed or if it has been removed from the model.
Usage example:
validation_loss ¶
Returns the loss on the validation dataset, if available.
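Since Gradient Boosted Trees use a validation dataset for early stopping, the following sketch shows how a sequence of per-iteration validation losses could determine where training stops. It is a generic illustration under stated assumptions: the `patience` rule, the `best_iteration` helper and the loss values are hypothetical, and YDF's actual stopping criterion may differ.

```python
from typing import List, Tuple


def best_iteration(
    validation_losses: List[float], patience: int
) -> Tuple[int, float]:
    """Returns (iteration, loss) of the lowest validation loss seen before stopping.

    Training stops once the loss has not improved for `patience` iterations.
    """
    best_idx, best_loss = 0, float("inf")
    for idx, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_idx, best_loss = idx, loss
        elif idx - best_idx >= patience:
            break  # no improvement for `patience` iterations
    return best_idx, best_loss


# Hypothetical validation loss after each boosting iteration.
losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.61, 0.5]
stopped_at = best_iteration(losses, patience=3)
```

With a short patience the sketch stops at iteration 2 and never reaches the lower loss at iteration 6, which is the usual trade-off between training time and the risk of stopping too early.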