RandomForestModel
- RandomForestModel
- add_tree
- analyze
- analyze_prediction
- benchmark
- data_spec
- describe
- distance
- evaluate
- feature_selection_logs
- force_engine
- get_all_trees
- get_tree
- hyperparameter_optimizer_logs
- input_feature_names
- input_features
- input_features_col_idxs
- iter_trees
- label
- label_classes
- label_col_idx
- list_compatible_engines
- metadata
- name
- num_trees
- out_of_bag_evaluations
- plot_tree
- predict
- predict_class
- predict_leaves
- print_tree
- remove_tree
- save
- self_evaluation
- serialize
- set_data_spec
- set_feature_selection_logs
- set_metadata
- set_node_format
- set_tree
- task
- to_cpp
- to_docker
- to_jax_function
- to_tensorflow_function
- to_tensorflow_saved_model
- update_with_jax_params
- variable_importances
- winner_takes_all
RandomForestModel ¶
Bases: DecisionForestModel
A Random Forest model for prediction and inspection.
add_tree ¶
add_tree(tree: Tree) -> None
Adds a single tree to the model.
Parameters:

Name | Type | Description | Default
---|---|---|---
tree | Tree | New tree. | required
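Usage example (a sketch: "other_model" is a hypothetical second model, and get_tree, listed below, is assumed to take a tree index and return a Tree):
import ydf

# Copy the first tree of a second model into this model,
# e.g., to manually merge two forests.
tree = other_model.get_tree(0)  # "other_model" is hypothetical
model.add_tree(tree)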
analyze ¶
analyze(
data: InputDataset,
sampling: float = 1.0,
num_bins: int = 50,
partial_dependence_plot: bool = True,
conditional_expectation_plot: bool = True,
permutation_variable_importance_rounds: int = 1,
num_threads: Optional[int] = None,
maximum_duration: Optional[float] = 20,
) -> Analysis
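Usage example (a minimal sketch; the CSV paths are placeholders):
import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Analyze the model on the test dataset, sampling half of the
# examples to speed up the computation.
test_ds = pd.read_csv("test.csv")
analysis = model.analyze(test_ds, sampling=0.5)

# In an interactive Python environment, display a rich analysis report.
analysis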
benchmark ¶
benchmark(
ds: InputDataset,
benchmark_duration: float = 3,
warmup_duration: float = 1,
batch_size: int = 100,
num_threads: Optional[int] = None,
) -> BenchmarkInferenceCCResult
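Usage example (a sketch; assumes the returned BenchmarkInferenceCCResult prints a readable summary):
import pandas as pd
import ydf

train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

# Measure inference speed on the training dataset for 5 seconds,
# after a 1 second warmup.
result = model.benchmark(train_ds, benchmark_duration=5)
print(result)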
describe ¶
describe(
output_format: Literal[
"auto", "text", "notebook", "html"
] = "auto",
full_details: bool = False,
) -> Union[str, HtmlNotebookDisplay]
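Usage example (a sketch, given a trained model as in the examples below):
# Print a plain-text description of the model to the terminal.
print(model.describe("text"))

# In a notebook, display an interactive HTML description with all details.
model.describe(full_details=True)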
distance ¶
distance(
data1: InputDataset,
data2: Optional[InputDataset] = None,
) -> ndarray
Computes the pairwise distance between examples in "data1" and "data2".
If "data2" is not provided, computes the pairwise distance between examples in "data1".
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.
Different models are free to implement different distances with different definitions. For this reason, unless indicated by the model, distances from different models cannot be compared.
The distance is not guaranteed to satisfy the triangle inequality property of metric distances.
Not all models can compute distances; in that case, this function raises an exception.
Parameters:

Name | Type | Description | Default
---|---|---|---
data1 | InputDataset | Dataset. Can be a dictionary of list or numpy array of values, Pandas DataFrame, or a VerticalDataset. | required
data2 | Optional[InputDataset] | Dataset. Can be a dictionary of list or numpy array of values, Pandas DataFrame, or a VerticalDataset. | None

Returns:

Type | Description
---|---
ndarray | Pairwise distances between the examples.
evaluate ¶
evaluate(
data: InputDataset,
*,
weighted: Optional[bool] = None,
task: Optional[Task] = None,
label: Optional[str] = None,
group: Optional[str] = None,
bootstrapping: Union[bool, int] = False,
ndcg_truncation: int = 5,
mrr_truncation: int = 5,
evaluation_task: Optional[Task] = None,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> Evaluation
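Usage example (a minimal sketch; the CSV paths are placeholders):
import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
evaluation = model.evaluate(test_ds)

# In an interactive Python environment, print a rich evaluation report.
evaluation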
get_tree ¶
input_feature_names ¶
Returns the names of the input features.
The features are sorted in increasing order of column_idx.
input_features ¶
input_features() -> Sequence[InputFeature]
Returns the input features of the model.
The features are sorted in increasing order of column_idx.
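Usage example (a sketch, reusing a trained model from the examples above):
# Names of the input features, sorted by column index.
print(model.input_feature_names())

# Name, semantic and column index of each input feature.
for feature in model.input_features():
    print(feature)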
label_classes ¶
Returns the label classes for a classification model; fails otherwise.
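Usage example (a sketch; only valid for a classification model):
# Possible values of the label, in the order used by the model.
classes = model.label_classes()
print(classes)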
out_of_bag_evaluations ¶
out_of_bag_evaluations() -> Sequence[OutOfBagEvaluation]
Returns the Out-Of-Bag evaluations of the model, if available.
Each tree in a random forest is only trained on a fraction of the training examples. Out-of-bag (OOB) evaluations evaluate each training example on the trees that have not seen it in training. This creates a self-evaluation method that does not require a training dataset. See https://developers.google.com/machine-learning/decision-forests/out-of-bag for details.
Computing OOB metrics slows down training and requires the hyperparameter compute_oob_performances to be set. The learner then computes the OOB evaluation at regular intervals during the training. The returned list of evaluations is sorted by the number of trees, and its last element is the OOB evaluation of the full model.
If no OOB evaluations have been computed, an empty list is returned.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
learner = ydf.RandomForestLearner(label="label",
compute_oob_performances=True)
model = learner.train(train_ds)
oob_evaluations = model.out_of_bag_evaluations()
# In an interactive Python environment, print a rich evaluation report.
oob_evaluations[-1].evaluation
plot_tree ¶
plot_tree(
tree_idx: int = 0,
max_depth: Optional[int] = None,
options: Optional[PlotOptions] = None,
d3js_url: str = "https://d3js.org/d3.v6.min.js",
) -> TreePlot
Plots an interactive HTML rendering of the tree.
Usage example:
import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()
Parameters:

Name | Type | Description | Default
---|---|---|---
tree_idx | int | Index of the tree. Should be in [0, self.num_trees()). | 0
max_depth | Optional[int] | Maximum tree depth of the plot. Set to None for full depth. | None
options | Optional[PlotOptions] | Advanced options for plotting. Set to None for default style. | None
d3js_url | str | URL to load the d3.js library from. | 'https://d3js.org/d3.v6.min.js'

Returns:

Type | Description
---|---
TreePlot | In interactive environments, an interactive plot. The HTML source can also be exported to file.
predict ¶
predict(
data: InputDataset,
*,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> ndarray
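Usage example (a minimal sketch; for a binary classification model, the returned array of shape [num_examples] contains the probability of the positive class):
import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
predictions = model.predict(test_ds)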
predict_class ¶
predict_class(
data: InputDataset,
*,
use_slow_engine: bool = False,
num_threads: Optional[int] = None
) -> ndarray
Returns the most likely predicted class for a classification model.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)
test_ds = pd.read_csv("test.csv")
predictions = model.predict_class(test_ds)
This method returns a numpy array of strings of shape [num_examples]. Each value represents the most likely class for the corresponding example. This method can only be used for classification models.
In case of ties, the first class in model.label_classes() is returned.
See model.predict to generate the full prediction probabilities.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | InputDataset | Dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or lists. If the dataset contains the label column, that column is ignored. | required
use_slow_engine | bool | If true, uses the slow engine for making predictions. The slow engine of YDF is an order of magnitude slower than the other prediction engines. There exist very rare edge cases where predictions with the regular engines fail, e.g., models with a very large number of categorical conditions. It is only in these cases that users should use the slow engine and report the issue to the YDF developers. | False
num_threads | Optional[int] | Number of threads used to run the model. | None

Returns:

Type | Description
---|---
ndarray | The most likely predicted class for each example.
predict_leaves ¶
Gets the index of the active leaf in each tree.
The active leaf is the leaf that receives the example during inference.
The returned value "leaves[i,j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed by depth-first exploration, with the negative child visited before the positive one.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | InputDataset | Dataset. | required

Returns:

Type | Description
---|---
ndarray | Index of the active leaf for each tree in the model.
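Usage example (a sketch, reusing the trained model and test dataset from the examples above):
leaves = model.predict_leaves(test_ds)

# "leaves[i, j]" is the active leaf of the j-th tree for the i-th
# example, so the array has shape [num_examples, num_trees].
print(leaves.shape)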
print_tree ¶
Prints a tree in the terminal.
Usage example:
import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
"c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
"label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()
Parameters:

Name | Type | Description | Default
---|---|---|---
tree_idx | int | Index of the tree. Should be in [0, self.num_trees()). | 0
max_depth | Optional[int] | Maximum tree depth to print. Set to None for full depth. | 6
file | Any | Where to print the tree. By default, prints on the terminal standard output. | stdout
remove_tree ¶
remove_tree(tree_idx: int) -> None
Removes a single tree from the model.
Parameters:

Name | Type | Description | Default
---|---|---|---
tree_idx | int | Index of the tree. Should be in [0, num_trees()). | required
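Usage example (a sketch, using num_trees() documented in this page):
# Prune the model down to its first 10 trees by repeatedly
# removing the last tree.
while model.num_trees() > 10:
    model.remove_tree(model.num_trees() - 1)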
self_evaluation ¶
self_evaluation() -> Optional[Evaluation]
Returns the model's self-evaluation.
For Random Forest models, the self-evaluation is the out-of-bag evaluation of the full model. Note that Random Forest models do not use a validation dataset. If out-of-bag evaluation is not enabled, no self-evaluation is computed.
Different models use different methods for self-evaluation. Notably, Gradient Boosted Trees use the evaluation on the validation dataset. Therefore, self-evaluations are not comparable between different model types.
Returns None if no self-evaluation has been computed.
Usage example:
import pandas as pd
import ydf
# Train model
train_ds = pd.read_csv("train.csv")
learner = ydf.RandomForestLearner(label="label",
compute_oob_performances=True)
model = learner.train(train_ds)
self_evaluation = model.self_evaluation()
# In an interactive Python environment, print a rich evaluation report.
self_evaluation
set_feature_selection_logs ¶
set_feature_selection_logs(
value: Optional[FeatureSelectorLogs],
) -> None
set_node_format ¶
set_node_format(node_format: NodeFormat) -> None
Set the serialization format for the nodes.
Parameters:

Name | Type | Description | Default
---|---|---|---
node_format | NodeFormat | Node format to use when saving the model. | required
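Usage example (a sketch; that ydf exposes a NodeFormat enum with a BLOB_SEQUENCE value is an assumption to verify against the NodeFormat documentation):
# Assumption: NodeFormat.BLOB_SEQUENCE is an available node format.
model.set_node_format(ydf.NodeFormat.BLOB_SEQUENCE)
model.save("/tmp/my_model")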
set_tree ¶
to_docker ¶
Exports the model to a Docker endpoint deployable in the cloud.
This function creates a directory containing a Dockerfile, the model and support files.
Usage example:
import numpy as np
import ydf
# Train a model.
model = ydf.RandomForestLearner(label="l").train({
"f1": np.random.random(size=100),
"f2": np.random.random(size=100),
"l": np.random.randint(2, size=100),
})
# Export the model to a Docker endpoint.
model.to_docker(path="/tmp/my_model")
# Print instructions on how to use the model
!cat /tmp/my_model/readme.md
# Test the end-point locally
docker build --platform linux/amd64 -t ydf_predict_image /tmp/my_model
docker run --rm -p 8080:8080 -d ydf_predict_image
# Deploy the model on Google Cloud
gcloud run deploy ydf-predict --source /tmp/my_model
# Check the automatically created utility scripts "test_locally.sh" and
# "deploy_in_google_cloud.sh" for more examples.
Parameters:

Name | Type | Description | Default
---|---|---|---
path | str | Directory where to create the Docker endpoint. | required
exist_ok | bool | If false (default), fails if the directory already exists. If true, overrides the directory content, if any. | False
to_jax_function ¶
to_jax_function(
jit: bool = True,
apply_activation: bool = True,
leaves_as_params: bool = False,
compatibility: Union[str, Compatibility] = "XLA",
) -> JaxModel
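Usage example (a sketch, reusing the two-feature model from the to_docker example above; that the returned JaxModel exposes a predict callable is an assumption to verify against the JaxModel documentation):
import jax.numpy as jnp

# Convert the model to a JIT-compiled JAX function.
jax_model = model.to_jax_function()

# Assumption: JaxModel.predict consumes a dict of feature arrays.
predictions = jax_model.predict({
    "f1": jnp.array([0.1, 0.5]),
    "f2": jnp.array([0.9, 0.2]),
})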
to_tensorflow_function ¶
to_tensorflow_function(
temp_dir: Optional[str] = None,
can_be_saved: bool = True,
squeeze_binary_classification: bool = True,
force: bool = False,
) -> Module
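Usage example (a sketch, reusing the two-feature model from the to_docker example above; calling the returned Module directly on a dict of tensors is an assumption to verify):
import tensorflow as tf

# Convert the model to a TensorFlow module.
tf_model = model.to_tensorflow_function()

# Assumption: the module is callable on a dict of feature tensors.
predictions = tf_model({
    "f1": tf.constant([0.1, 0.5]),
    "f2": tf.constant([0.9, 0.2]),
})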
to_tensorflow_saved_model ¶
to_tensorflow_saved_model(
path: str,
input_model_signature_fn: Any = None,
*,
mode: Literal["keras", "tf"] = "keras",
feature_dtypes: Dict[str, TFDType] = {},
servo_api: bool = False,
feed_example_proto: bool = False,
pre_processing: Optional[Callable] = None,
post_processing: Optional[Callable] = None,
temp_dir: Optional[str] = None,
tensor_specs: Optional[Dict[str, Any]] = None,
feature_specs: Optional[Dict[str, Any]] = None,
force: bool = False
) -> None
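Usage example (a sketch using the documented "tf" mode; the export path is a placeholder):
# Export the model as a TensorFlow SavedModel.
model.to_tensorflow_saved_model("/tmp/my_saved_model", mode="tf")

# The exported model can be reloaded with standard TensorFlow tooling.
import tensorflow as tf
reloaded = tf.saved_model.load("/tmp/my_saved_model")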
winner_takes_all ¶
winner_takes_all() -> bool
Returns whether the model uses a winner-takes-all strategy for classification.
This parameter determines how to aggregate individual tree votes during inference in a classification random forest. It is defined by the winner_take_all Random Forest learner hyperparameter.
If true, each tree votes for a single class, which is the traditional random forest inference method. If false, each tree outputs a probability distribution across all classes.
If the model is not a classification model, the return value of this function is arbitrary and does not influence model inference.
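Usage example (a sketch; the strategy is fixed at training time through the winner_take_all learner hyperparameter named above):
import pandas as pd
import ydf

train_ds = pd.read_csv("train.csv")
learner = ydf.RandomForestLearner(label="label", winner_take_all=False)
model = learner.train(train_ds)

# Each tree now contributes a probability distribution instead of a vote.
assert not model.winner_takes_all()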