RandomForestModel

RandomForestModel(raw_model: GenericCCModel)

Bases: DecisionForestModel

A Random Forest model for prediction and inspection.

add_tree

add_tree(tree: Tree) -> None

Adds a single tree of the model.

Parameters:

Name Type Description Default
tree Tree

New tree.

required

analyze

analyze(
    data: InputDataset,
    sampling: float = 1.0,
    num_bins: int = 50,
    partial_dependence_plot: bool = True,
    conditional_expectation_plot: bool = True,
    permutation_variable_importance_rounds: int = 1,
    num_threads: Optional[int] = None,
    maximum_duration: Optional[float] = 20,
) -> Analysis

analyze_prediction

analyze_prediction(
    single_example: InputDataset,
) -> PredictionAnalysis

benchmark

benchmark(
    ds: InputDataset,
    benchmark_duration: float = 3,
    warmup_duration: float = 1,
    batch_size: int = 100,
    num_threads: Optional[int] = None,
) -> BenchmarkInferenceCCResult

data_spec

data_spec() -> DataSpecification

describe

describe(
    output_format: Literal[
        "auto", "text", "notebook", "html"
    ] = "auto",
    full_details: bool = False,
) -> Union[str, HtmlNotebookDisplay]

distance

distance(
    data1: InputDataset,
    data2: Optional[InputDataset] = None,
) -> ndarray

Computes the pairwise distance between examples in "data1" and "data2".

If "data2" is not provided, computes the pairwise distance between examples in "data1".

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
distances = model.distance(test_ds, train_ds)
# "distances[i,j]" is the distance between the i-th test example and the
# j-th train example.

Different models are free to implement different distances with different definitions. For this reason, unless indicated by the model, distances from different models cannot be compared.

The distance is not guaranteed to satisfy the triangle inequality property of metric distances.

Not all models can compute distances. For models that cannot, this function raises an exception.

Parameters:

Name Type Description Default
data1 InputDataset

Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset.

required
data2 Optional[InputDataset]

Dataset. Can be a dictionary of lists or NumPy arrays of values, a Pandas DataFrame, or a VerticalDataset.

None

Returns:

Type Description
ndarray

Pairwise distance
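
As an illustration of what such a distance can look like, the sketch below computes a leaf co-occurrence distance: one minus the fraction of trees in which two examples reach the same leaf. This is a common proximity definition for random forests; that this model uses exactly this formula is an assumption. The helper leaf_distance and the hard-coded leaf indices are hypothetical. The inputs use the same layout as the output of predict_leaves (one row per example, one column per tree).

```python
from typing import List


def leaf_distance(
    leaves1: List[List[int]], leaves2: List[List[int]]
) -> List[List[float]]:
    """Pairwise distance from active-leaf indices (hypothetical helper).

    leaves1[i][t] is the active leaf of example i in tree t. The distance is
    1 - (fraction of trees where both examples fall in the same leaf).
    """
    distances = []
    for row1 in leaves1:
        out_row = []
        for row2 in leaves2:
            same = sum(1 for a, b in zip(row1, row2) if a == b)
            out_row.append(1.0 - same / len(row1))
        distances.append(out_row)
    return distances


# Made-up leaf indices: 2 test examples and 3 train examples in 4 trees.
test_leaves = [[0, 2, 1, 3], [1, 2, 0, 3]]
train_leaves = [[0, 2, 1, 3], [1, 0, 0, 3], [2, 1, 2, 0]]

distances = leaf_distance(test_leaves, train_leaves)
# distances[i][j] is the distance between the i-th test example and the
# j-th train example.
print(distances)
```

A distance of 0 means the two examples share a leaf in every tree; a distance of 1 means they never do.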

evaluate

evaluate(
    data: InputDataset,
    *,
    weighted: Optional[bool] = None,
    task: Optional[Task] = None,
    label: Optional[str] = None,
    group: Optional[str] = None,
    bootstrapping: Union[bool, int] = False,
    ndcg_truncation: int = 5,
    mrr_truncation: int = 5,
    evaluation_task: Optional[Task] = None,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> Evaluation

feature_selection_logs

feature_selection_logs() -> Optional[FeatureSelectorLogs]

force_engine

force_engine(engine_name: Optional[str]) -> None

get_all_trees

get_all_trees() -> Sequence[Tree]

Returns all the trees in the model.

get_tree

get_tree(tree_idx: int) -> Tree

Gets a single tree of the model.

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, num_trees()).

required

Returns:

Type Description
Tree

The tree.

hyperparameter_optimizer_logs

hyperparameter_optimizer_logs() -> Optional[OptimizerLogs]

input_feature_names

input_feature_names() -> List[str]

Returns the names of the input features.

The features are sorted in increasing order of column_idx.

input_features

input_features() -> Sequence[InputFeature]

Returns the input features of the model.

The features are sorted in increasing order of column_idx.

input_features_col_idxs

input_features_col_idxs() -> Sequence[int]

iter_trees

iter_trees() -> Iterator[Tree]

Returns an iterator over all the trees in the model.

label

label() -> str

Name of the label column.

label_classes

label_classes() -> List[str]

Returns the label classes for a classification model; fails otherwise.

label_col_idx

label_col_idx() -> int

list_compatible_engines

list_compatible_engines() -> Sequence[str]

metadata

metadata() -> ModelMetadata

name

name() -> str

num_trees

num_trees()

Returns the number of trees in the decision forest.

out_of_bag_evaluations

out_of_bag_evaluations() -> Sequence[OutOfBagEvaluation]

Returns the Out-Of-Bag evaluations of the model, if available.

Each tree in a random forest is only trained on a fraction of the training examples. Out-of-bag (OOB) evaluations evaluate each training example on the trees that have not seen it in training. This creates a self-evaluation method that does not require a separate validation dataset. See https://developers.google.com/machine-learning/decision-forests/out-of-bag for details.

Computing OOB metrics slows down training and requires hyperparameter compute_oob_performances to be set. The learner then computes the OOB evaluation at regular intervals during the training. The returned list of evaluations is sorted by the number of trees and its last element is the OOB evaluation of the full model.

If no OOB evaluations have been computed, an empty list is returned.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
learner = ydf.RandomForestLearner(label="label",
                                  compute_oob_performances=True)
model = learner.train(train_ds)

oob_evaluations = model.out_of_bag_evaluations()
# In an interactive Python environment, print a rich evaluation report.
oob_evaluations[-1].evaluation
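
To make the out-of-bag idea concrete, here is a minimal, library-free sketch. It is not YDF's implementation: the "trees" are stand-ins that always predict the majority label of their bootstrap sample, and the names train_stub_forest and oob_accuracy are hypothetical. What it shows is the bookkeeping: each example is scored only by the trees whose bootstrap sample did not contain it.

```python
import random
from collections import Counter


def train_stub_forest(labels, num_trees, seed=0):
    """Returns one (prediction, in_bag_indices) pair per stand-in tree."""
    rng = random.Random(seed)
    n = len(labels)
    forest = []
    for _ in range(num_trees):
        # Bootstrap: sample n example indices with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
        # Stand-in "tree": always predicts the majority label of its sample.
        majority = Counter(labels[i] for i in in_bag).most_common(1)[0][0]
        forest.append((majority, set(in_bag)))
    return forest


def oob_accuracy(labels, forest):
    """Accuracy where each example is judged only by trees that never saw it."""
    correct = 0
    evaluated = 0
    for i, label in enumerate(labels):
        votes = Counter(pred for pred, in_bag in forest if i not in in_bag)
        if not votes:
            continue  # The example was in every bootstrap sample: no OOB vote.
        evaluated += 1
        if votes.most_common(1)[0][0] == label:
            correct += 1
    accuracy = correct / evaluated if evaluated else float("nan")
    return accuracy, evaluated


labels = ["a"] * 14 + ["b"] * 6
forest = train_stub_forest(labels, num_trees=25)
accuracy, num_evaluated = oob_accuracy(labels, forest)
print(f"OOB accuracy: {accuracy:.2f} on {num_evaluated} examples")
```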

plot_tree

plot_tree(
    tree_idx: int = 0,
    max_depth: Optional[int] = None,
    options: Optional[PlotOptions] = None,
    d3js_url: str = "https://d3js.org/d3.v6.min.js",
) -> TreePlot

Plots an interactive HTML rendering of the tree.

Usage example:

import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
    "c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
    "label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Plot the tree in Colab
model.plot_tree()

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, self.num_trees()).

0
max_depth Optional[int]

Maximum tree depth of the plot. Set to None for full depth.

None
options Optional[PlotOptions]

Advanced options for plotting. Set to None for default style.

None
d3js_url str

URL to load the d3.js library from.

'https://d3js.org/d3.v6.min.js'

Returns:

Type Description
TreePlot

In interactive environments, an interactive plot. The HTML source can also be exported to file.

predict

predict(
    data: InputDataset,
    *,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> ndarray

predict_class

predict_class(
    data: InputDataset,
    *,
    use_slow_engine: bool = False,
    num_threads: Optional[int] = None
) -> ndarray

Returns the most likely predicted class for a classification model.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
model = ydf.RandomForestLearner(label="label").train(train_ds)

test_ds = pd.read_csv("test.csv")
predictions = model.predict_class(test_ds)

This method returns a NumPy array of strings of shape [num_examples]. Each value represents the most likely class for the corresponding example. This method can only be used for classification models.

In case of ties, the first class in model.label_classes() is returned.

See model.predict to generate the full prediction probabilities.

Parameters:

Name Type Description Default
data InputDataset

Dataset. Supported formats: VerticalDataset, (typed) path, list of (typed) paths, Pandas DataFrame, Xarray Dataset, TensorFlow Dataset, PyGrain DataLoader and Dataset (experimental, Linux only), dictionary of string to NumPy array or lists. If the dataset contains the label column, that column is ignored.

required
use_slow_engine bool

If true, uses the slow engine for making predictions. The slow engine of YDF is an order of magnitude slower than the other prediction engines. There exist very rare edge cases where predictions with the regular engines fail, e.g., models with a very large number of categorical conditions. It is only in these cases that users should use the slow engine and report the issue to the YDF developers.

False
num_threads Optional[int]

Number of threads used to run the model.

None

Returns:

Type Description
ndarray

The most likely predicted class for each example.
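
The relationship between class probabilities and the returned class, including the tie rule, can be sketched without the library. The probability rows and the most_likely_class helper below are hypothetical; the sketch assumes one probability per class, ordered like model.label_classes().

```python
from typing import Sequence


def most_likely_class(
    probabilities: Sequence[float], classes: Sequence[str]
) -> str:
    """Picks the class with the highest probability; first class wins ties."""
    best = max(probabilities)
    # list.index returns the first position of the maximum, which implements
    # the "first class wins" tie rule.
    return classes[list(probabilities).index(best)]


classes = ["cat", "dog", "bird"]  # Stand-in for model.label_classes().
probabilities = [
    [0.2, 0.7, 0.1],
    [0.4, 0.4, 0.2],  # Tie between "cat" and "dog".
    [0.1, 0.3, 0.6],
]
predictions = [most_likely_class(row, classes) for row in probabilities]
print(predictions)
```

In the tied second row, "cat" is returned because it comes first in the class list.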

predict_leaves

predict_leaves(data: InputDataset) -> ndarray

Gets the index of the active leaf in each tree.

The active leaf is the leaf that receives the example during inference.

The returned value "leaves[i,j]" is the index of the active leaf for the i-th example and the j-th tree. Leaves are indexed by depth first exploration with the negative child visited before the positive one.

Parameters:

Name Type Description Default
data InputDataset

Dataset.

required

Returns:

Type Description
ndarray

Index of the active leaf for each tree in the model.
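
The leaf numbering can be illustrated on a toy tree. The Leaf and Split classes below are hypothetical stand-ins, not YDF's tree classes; the sketch only shows the depth-first order with the negative child visited before the positive one.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    name: str


@dataclass
class Split:
    neg_child: "Node"  # Branch taken when the condition is false.
    pos_child: "Node"  # Branch taken when the condition is true.


Node = Union[Leaf, Split]


def index_leaves(root: Node) -> Dict[str, int]:
    """Assigns leaf indices by depth-first traversal, negative child first."""
    indices: Dict[str, int] = {}

    def visit(node: Node) -> None:
        if isinstance(node, Leaf):
            indices[node.name] = len(indices)
        else:
            visit(node.neg_child)
            visit(node.pos_child)

    visit(root)
    return indices


tree = Split(
    neg_child=Split(neg_child=Leaf("A"), pos_child=Leaf("B")),
    pos_child=Split(
        neg_child=Leaf("C"),
        pos_child=Split(neg_child=Leaf("D"), pos_child=Leaf("E")),
    ),
)
leaf_indices = index_leaves(tree)
print(leaf_indices)
```

With this convention, an entry "leaves[i,j]" equal to 2 would refer to leaf "C" if the j-th tree had this shape.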

print_tree

print_tree(
    tree_idx: int = 0,
    max_depth: Optional[int] = 6,
    file: Any = stdout,
) -> None

Prints a tree in the terminal.

Usage example:

import pandas as pd
import ydf

# Create a dataset
train_ds = pd.DataFrame({
    "c1": [1.0, 1.1, 2.0, 3.5, 4.2] + list(range(10)),
    "label": ["a", "b", "b", "a", "a"] * 3,
})
# Train a CART model
model = ydf.CartLearner(label="label").train(train_ds)
# Make sure the model is a CART
assert isinstance(model, ydf.CARTModel)
# Print the tree
model.print_tree()

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, self.num_trees()).

0
max_depth Optional[int]

Maximum depth of the printed tree. Set to None for full depth.

6
file Any

Where to print the tree. By default, prints on the terminal standard output.

stdout

remove_tree

remove_tree(tree_idx: int) -> None

Removes a single tree of the model.

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, num_trees()).

required

save

save(
    path: str,
    advanced_options=ModelIOOptions(),
    *,
    pure_serving=False
) -> None

self_evaluation

self_evaluation() -> Optional[Evaluation]

Returns the model's self-evaluation.

For Random Forest models, the self-evaluation is the out-of-bag evaluation of the full model. Note that Random Forest models do not use a validation dataset. If out-of-bag evaluation is not enabled, no self-evaluation is computed.

Different models use different methods for self-evaluation. Notably, Gradient Boosted Trees use the evaluation on the validation dataset. Therefore, self-evaluations are not comparable between different model types.

Returns None if no self-evaluation has been computed.

Usage example:

import pandas as pd
import ydf

# Train model
train_ds = pd.read_csv("train.csv")
learner = ydf.RandomForestLearner(label="label",
                                  compute_oob_performances=True)
model = learner.train(train_ds)

self_evaluation = model.self_evaluation()
# In an interactive Python environment, print a rich evaluation report.
self_evaluation

serialize

serialize() -> bytes

set_data_spec

set_data_spec(data_spec: DataSpecification) -> None

set_feature_selection_logs

set_feature_selection_logs(
    value: Optional[FeatureSelectorLogs],
) -> None

set_metadata

set_metadata(metadata: ModelMetadata)

set_node_format

set_node_format(node_format: NodeFormat) -> None

Set the serialization format for the nodes.

Parameters:

Name Type Description Default
node_format NodeFormat

Node format to use when saving the model.

required

set_tree

set_tree(tree_idx: int, tree: Tree) -> None

Overrides a single tree of the model.

Parameters:

Name Type Description Default
tree_idx int

Index of the tree. Should be in [0, num_trees()).

required
tree Tree

New tree.

required

task

task() -> Task

to_cpp

to_cpp(key: str = 'my_model') -> str

to_docker

to_docker(path: str, exist_ok: bool = False) -> None

Exports the model to a Docker endpoint deployable on Cloud.

This function creates a directory containing a Dockerfile, the model and support files.

Usage example:

import numpy as np
import ydf

# Train a model.
model = ydf.RandomForestLearner(label="l").train({
    "f1": np.random.random(size=100),
    "f2": np.random.random(size=100),
    "l": np.random.randint(2, size=100),
})

# Export the model to a Docker endpoint.
model.to_docker(path="/tmp/my_model")

# Print instructions on how to use the model
!cat /tmp/my_model/readme.md

# Test the end-point locally (shell commands, run them in a terminal)
docker build --platform linux/amd64 -t ydf_predict_image /tmp/my_model
docker run --rm -p 8080:8080 -d ydf_predict_image

# Deploy the model on Google Cloud (shell command)
gcloud run deploy ydf-predict --source /tmp/my_model

# Check the automatically created utility scripts "test_locally.sh" and
# "deploy_in_google_cloud.sh" for more examples.

Parameters:

Name Type Description Default
path str

Directory where to create the Docker endpoint

required
exist_ok bool

If false (default), fails if the directory already exists. If true, overrides the directory content, if any.

False

to_jax_function

to_jax_function(
    jit: bool = True,
    apply_activation: bool = True,
    leaves_as_params: bool = False,
    compatibility: Union[str, Compatibility] = "XLA",
) -> JaxModel

to_tensorflow_function

to_tensorflow_function(
    temp_dir: Optional[str] = None,
    can_be_saved: bool = True,
    squeeze_binary_classification: bool = True,
    force: bool = False,
) -> Module

to_tensorflow_saved_model

to_tensorflow_saved_model(
    path: str,
    input_model_signature_fn: Any = None,
    *,
    mode: Literal["keras", "tf"] = "keras",
    feature_dtypes: Dict[str, TFDType] = {},
    servo_api: bool = False,
    feed_example_proto: bool = False,
    pre_processing: Optional[Callable] = None,
    post_processing: Optional[Callable] = None,
    temp_dir: Optional[str] = None,
    tensor_specs: Optional[Dict[str, Any]] = None,
    feature_specs: Optional[Dict[str, Any]] = None,
    force: bool = False
) -> None

update_with_jax_params

update_with_jax_params(params: Dict[str, Any])

variable_importances

variable_importances() -> (
    Dict[str, List[Tuple[float, str]]]
)

winner_takes_all

winner_takes_all() -> bool

Returns whether the model uses a winner-takes-all strategy for classification.

This parameter determines how to aggregate individual tree votes during inference in a classification random forest. It is defined by the winner_take_all Random Forest learner hyper-parameter.

If true, each tree votes for a single class, which is the traditional random forest inference method. If false, each tree outputs a probability distribution across all classes.

If the model is not a classification model, the return value of this function is arbitrary and does not influence model inference.
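
The two aggregation modes can be compared on made-up per-tree outputs. This is a simplified sketch, not YDF's inference code; the aggregate helper and the per-tree distributions are hypothetical.

```python
from typing import List


def aggregate(
    tree_distributions: List[List[float]], winner_takes_all: bool
) -> List[float]:
    """Combines per-tree class distributions into one forest distribution."""
    num_classes = len(tree_distributions[0])
    totals = [0.0] * num_classes
    for dist in tree_distributions:
        if winner_takes_all:
            # Each tree casts a single vote for its most likely class.
            totals[dist.index(max(dist))] += 1.0
        else:
            # Each tree contributes its full probability distribution.
            for class_idx, probability in enumerate(dist):
                totals[class_idx] += probability
    return [total / len(tree_distributions) for total in totals]


# Per-tree distributions over two classes for a single example.
tree_outputs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]

votes = aggregate(tree_outputs, winner_takes_all=True)
averaged = aggregate(tree_outputs, winner_takes_all=False)
print("winner-takes-all:", votes)
print("averaged distributions:", averaged)
```

In this example the two modes disagree: two of the three trees vote for the second class, while the averaged probabilities favor the first class because one tree is very confident.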