Custom lossĀ¶
In gradient boosted trees, the loss is a function that takes a label value and a prediction, and returns the "amount of error" of this prediction. The model is trained to minimize the average loss over all the training examples. YDF implements various common losses. You can configure them with the "loss" parameter. You can see the list of available losses here. If you don't specify the loss, it is selected automatically according to the model task. For instance, if the task is regression, the loss is set to mean-squared error by default.
If YDF does not support a loss you needs, you can define it manually. This is called a "custom loss".
In this introduction tutorial, we will create a custom Regression Loss called Mean Squared Logarithmic Error.
What is a custom loss?Ā¶
In YDF, a custom loss consists of four parts:
- Initial prediction: The initial prediction of the model, e.g. the average of the labels.
- Gradient and Hessian: A function that computes the gradient and the diagonal of the hessian of the loss given the label and the prediction of the model before the activation function (a.k.a. linkage function).
- Loss: A function that measures the quality of the current solution. While theory might dictate that the gradient and hessian are actually the gradient and hessian of the loss function, approximations do very well in practice.
- Activation: A function applied to the predictions to transform them to the correct space (e.g. probabilities for classification problems)
Training Gradient Boosted Trees with custom lossĀ¶
We start by setting up a regression dataset.
# Load libraries
import ydf # Yggdrasil Decision Forests
import pandas as pd # We use Pandas to load small datasets
import numpy as np # We use numpy for numerical operation
import numpy.typing as npty
from typing import Tuple
# Download a regression dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
all_ds = pd.read_csv(f"{ds_path}/abalone.csv")
# Randomly split the dataset into a training (70%) and testing (30%) dataset
all_ds = all_ds.sample(frac=1)
split_idx = len(all_ds) * 7 // 10
train_ds = all_ds.iloc[:split_idx]
test_ds = all_ds.iloc[split_idx:]
# Print the first 5 training examples
train_ds.head(5)
Mean Squared Logarithmic ErrorĀ¶
We use Mean Squared Logarithmic Error (MSLE) loss for this tutorial. The MSLE is calculated as
MSLE = $\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2$,
where $n$ is the total number of observations, $p_i$ and $a_i$ are the prediction and label of example $i$, respectively, and $\log$ denotes the natural logarithm.
The gradient of the MSLE loss with respect to the prediction $p_i$ is
$\frac{1}{n} \cdot \frac{2(\log(p_i + 1) - \log(a_i+1))}{p_i + 1}$
The hessian of the MSLE loss is a matrix. For simplicity and performance reasons, YDF only uses the diagonal of the hessian. The $i$th element of the diagonal is
$\frac{1}{n} \cdot \frac{2(1 - \log(p_i + 1) + \log(a_i+1))}{(p_i + 1)^2}$
# If predictions are close to -1, numerical instabilities will distort the
# results. The predictions are therefore capped slightly above -1.
PREDICTION_MINIMUM = -1 + 1e-6
def loss_msle(
labels: npty.NDArray[np.float32],
predictions: npty.NDArray[np.float32],
weights: npty.NDArray[np.float32],
) -> np.float32:
clipped_pred = np.maximum(PREDICTION_MINIMUM, predictions)
return np.sum((np.log1p(clipped_pred) - np.log1p(labels))**2) / len(labels)
def initial_predictions_msle(
labels: npty.NDArray[np.float32], _: npty.NDArray[np.float32]
) -> npty.NDArray[np.float32]:
return np.exp(np.mean(np.log1p(labels))) - 1
def grad_msle(
labels: npty.NDArray[np.float32], predictions: npty.NDArray[np.float32]
) -> npty.NDArray[np.float32]:
gradient = (2/ len(labels))*(np.log1p(predictions) - np.log1p(labels)) / (predictions + 1)
return gradient
def hessian_msle(
labels: npty.NDArray[np.float32], predictions: npty.NDArray[np.float32]
) -> npty.NDArray[np.float32]:
hessian = (2/ len(labels))*(1 - np.log1p(predictions) + np.log1p(labels)) / (predictions + 1)**2
return hessian
def gradient_and_hessian_msle(
labels: npty.NDArray[np.float32], predictions: npty.NDArray[np.float32]
) -> Tuple[npty.NDArray[np.float32], npty.NDArray[np.float32]]:
clipped_pred = np.maximum(PREDICTION_MINIMUM, predictions)
return [grad_msle(labels, clipped_pred), hessian_msle(labels, clipped_pred)]
# Construct the loss object.
msle_custom_loss = ydf.RegressionLoss(
initial_predictions=initial_predictions_msle,
gradient_and_hessian=gradient_and_hessian_msle,
loss=loss_msle,
activation=ydf.Activation.IDENTITY,
)
The model is trained as usual with the loss object as a hyperparameter.
model = ydf.GradientBoostedTreesLearner(label="Rings", task=ydf.Task.REGRESSION, loss=msle_custom_loss).train(train_ds)
The model description shows the evolution of training loss and validation loss.
model.describe()
We can compare this model to a model trained with RMSE loss.
model.evaluate(test_ds)
# A model trained with default regression loss (i.e. RMSE loss)
model_rmse_loss = ydf.GradientBoostedTreesLearner(label="Rings", task=ydf.Task.REGRESSION).train(train_ds)
model_rmse_loss.evaluate(test_ds)
Other custom lossesĀ¶
Binary ClassificationĀ¶
For binary classification problems, the labels are integers (1 for the positive class and 2 for the negative class). The model is expected to return the probability of the positive class. YDF supports the Sigmoid activation function for losses that do not operate in the probability space.
For demonstration purposes, the code below re-implements the
Binomial Log Likelihood Loss as a custom loss.
Note that this loss is also available directly through the
loss=BINOMIAL_LOG_LIKELIHOOD
hyperparameter.
def binomial_initial_predictions(
labels: npty.NDArray[np.int32], weights: npty.NDArray[np.float32]
) -> np.float32:
sum_weights = np.sum(weights)
sum_weights_positive = np.sum((labels == 2) * weights)
ratio_positive = sum_weights_positive / sum_weights
if ratio_positive == 0.0:
return -np.iinfo(np.float32).max
elif ratio_positive == 1.0:
return np.iinfo(np.float32).max
return np.log(ratio_positive / (1 - ratio_positive))
def binomial_gradient_and_hessian(
labels: npty.NDArray[np.int32], predictions: npty.NDArray[np.float32]
) -> Tuple[npty.NDArray[np.float32], npty.NDArray[np.float32]]:
pred_probability = 1.0 / (1.0 + np.exp(-predictions))
binary_labels = labels == 2
return (
pred_probability - binary_labels,
pred_probability * (pred_probability - 1),
)
def binomial_loss(
labels: npty.NDArray[np.int32],
predictions: npty.NDArray[np.float32],
weights: npty.NDArray[np.float32],
) -> np.float32:
binary_labels = labels == 2
return (-2.0 * np.sum(
binary_labels * predictions- np.log(1.0 + np.exp(predictions))
) / len(labels)
)
binomial_custom_loss = ydf.BinaryClassificationLoss(
initial_predictions=binomial_initial_predictions,
gradient_and_hessian=binomial_gradient_and_hessian,
loss=binomial_loss,
activation=ydf.Activation.SIGMOID,
)
Multi-class classificationĀ¶
For multi-class classification problems, the labels are integers starting with 1. The loss function must provide a gradient and hessian for each label class. The gradient and hessian must return d-by-n matrices, where n is the number of examples and d is the number of label classes. Similarly, the model must provide an initial prediction for each label class as as a vector of d elements.
YDF supports the Softmax activation function for losses that do not operate in the probability space.
For demonstration purposes, the code below re-implements the
Multinomial Log Likelihood Loss as a custom loss. Note that this loss is
also available directly through the loss=MULTINOMIAL_LOG_LIKELIHOOD
hyperparameter.
def multinomial_initial_predictions(
labels: npty.NDArray[np.int32], _: npty.NDArray[np.float32]
) -> npty.NDArray[np.float32]:
dimension = np.max(labels)
return np.zeros(dimension, dtype=np.float32)
def multinomial_gradient(
labels: npty.NDArray[np.int32], predictions: npty.NDArray[np.float32]
) -> Tuple[npty.NDArray[np.float32], npty.NDArray[np.float32]]:
dimension = np.max(labels)
normalization = 1.0 / np.sum(np.exp(predictions), axis=1)
normalized_predictions = np.exp(predictions) * normalization[:, None]
label_indicator = (
(labels - 1)[:, np.newaxis] == np.arange(dimension)
).astype(int)
gradient = normalized_predictions - label_indicator
hessian = np.abs(gradient) * (np.abs(gradient) - 1)
return (np.transpose(gradient), np.transpose(hessian))
def multinomial_loss(
labels: npty.NDArray[np.int32],
predictions: npty.NDArray[np.float32],
weights: npty.NDArray[np.float32],
) -> np.float32:
dimension = np.max(labels)
sum_exp_pred = np.sum(np.exp(predictions), axis=1)
indicator_matrix = (
(labels - 1)[:, np.newaxis] == np.arange(dimension)
).astype(int)
label_exp_pred = np.exp(np.sum(predictions * indicator_matrix, axis=1))
return (
-np.sum(np.log(label_exp_pred / sum_exp_pred)) / len(labels)
)
multinomial_custom_loss = ydf.MultiClassificationLoss(
initial_predictions=multinomial_initial_predictions,
gradient_and_hessian=multinomial_gradient,
loss=multinomial_loss,
activation=ydf.Activation.SOFTMAX,
)
Custom losses with JAXĀ¶
JAX allows defining losses with auto-differentiation. In this example, we define the Huber loss for Regression.
import jax
import jax.numpy as jnp
@jax.jit
def huber_loss(labels, pred, delta=1.0):
abs_diff = jnp.abs(labels - pred)
return jnp.average(jnp.where(abs_diff > delta,delta * (abs_diff - .5 * delta), 0.5 * abs_diff ** 2))
huber_grad = jax.jit(jax.grad(huber_loss, argnums=1))
huber_hessian = jax.jit(jax.jacfwd(jax.jacrev(huber_loss, argnums=1)))
huber_init = jax.jit(lambda labels, weights: jnp.average(labels))
huber = ydf.RegressionLoss(
initial_predictions=jax.block_until_ready(huber_init),
gradient_and_hessian=lambda label, pred: (
huber_grad(label, pred).block_until_ready(),
jnp.diagonal(huber_hessian(label, pred)).block_until_ready()
),
loss=lambda label, pred, weight: huber_loss(label, pred).block_until_ready(),
activation=ydf.Activation.IDENTITY,
)
model = ydf.GradientBoostedTreesLearner(label="Rings", task=ydf.Task.REGRESSION, loss=huber).train(train_ds)
Additional details and tipsĀ¶
- For simplicity of exposition, the examples above assume unit weights.
- Loss functions should not create references to the labels, predictions and weights arrays. These arrays are backed by C++ memory and might be deleted on the C++ side at any time.
- When using custom losses, YDF may trigger the GC to catch illegal memory accesses. Set
may_trigger_gc=False
on the loss object to avoid this, but be aware that YDF may not warn about illegal memory accesses then. - The arrays returned by the custom loss functions may be modified by YDF.
- Training with custom losses is often ~10% slower than training built-in losses.
- Custom losses are not fully supported for model inspection and analysis - it is not yet possible to compute the model's custom loss on a test set in YDF.