tf.data.Dataset¶

Setup¶

In [ ]:

Copied!

pip install ydf -U
pip install ydf -U

What is a tf.data.Dataset ?¶

tf.data.Dataset is a runtime dataset format for the TensorFlow and JAX machine learning libraries. It makes it easy to load datasets from many different formats and apply transformations to them. Yggdrasil Decision Forests (YDF) can natively consume tf.data.Datasets.

A tf.data.Dataset should not be confused with tf.Dataset, which is a collection of datasets for ML practitioners. Note that some of the datasets in tf.Dataset are also available as tf.data.Dataset.

When using tf.data.Dataset with YDF:

Make sure that the dataset is finite i.e., it does not repeat infinitely. Do not shuffle the dataset.
Unlike neural networks, the batch size of the dataset does not affect YDF models. However, small batch sizes can be slow for TensorFlow. Therefore, it is recommended to use a large batch size. For example, 1000 is a good rule of thumb value.

Create a tf.data.Dataset¶

There are several ways to create tf.data.Datasets. Here, we use tf.data.Dataset.from_tensor_slices to convert a python list array into a tf.data.Dataset. This is for the sake of example only, as it is more efficient to feed a NumPy array directly to YDF.

In [1]:

Copied!

import ydf
import numpy as np
import tensorflow as tf
import ydf
import numpy as np
import tensorflow as tf

2023-11-19 18:08:44.092683: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 18:08:44.143396: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 18:08:44.144583: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-19 18:08:45.101126: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Let's download a dataset stored in the TFRecord format. TFRecord is a container format commonly used to store serialized TensorFlow Example protos. TFRecord files are typically compressed using gzip compression. When opening a compressed TFRecord file, you must specify the `compression_type`` to avoid encountering an invalid file error.

In [3]:

Copied!

!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_train.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_test.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_train.recordio.gz -q
!wget https://github.com/google/yggdrasil-decision-forests/raw/main/yggdrasil_decision_forests/test_data/dataset/adult_test.recordio.gz -q

Unlike `pandas.read_csv``, when reading a TFRecord with tf.data.Dataset, you must specify the feature you are loading.

In [18]:

Copied!





def create_tf_data_dataset(path):
    serialized_examples = tf.data.TFRecordDataset(filenames=[path], compression_type="GZIP")

    def parse_tf_example(serialized_example):
        """Parse a binary serialized tf.Example."""
        return tf.io.parse_single_example(
            serialized_example,
            {
                "age": tf.io.FixedLenFeature([], dtype=tf.int64),
                "capital_gain": tf.io.FixedLenFeature([], dtype=tf.int64),
                "hours_per_week": tf.io.FixedLenFeature([], dtype=tf.int64),
                "workclass": tf.io.FixedLenFeature([], dtype=tf.string, default_value=""),
                "education": tf.io.FixedLenFeature([], dtype=tf.string),
                "income": tf.io.FixedLenFeature([], dtype=tf.string),
                # Those are just a few features available in the dataset.
            }
        )

    return serialized_examples.map(parse_tf_example)

non_batched_train_ds = create_tf_data_dataset("adult_train.recordio.gz")
non_batched_test_ds = create_tf_data_dataset("adult_train.recordio.gz")
def create_tf_data_dataset(path):
    serialized_examples = tf.data.TFRecordDataset(filenames=[path], compression_type="GZIP")

    def parse_tf_example(serialized_example):
        """Parse a binary serialized tf.Example."""
        return tf.io.parse_single_example(
            serialized_example,
            {
                "age": tf.io.FixedLenFeature([], dtype=tf.int64),
                "capital_gain": tf.io.FixedLenFeature([], dtype=tf.int64),
                "hours_per_week": tf.io.FixedLenFeature([], dtype=tf.int64),
                "workclass": tf.io.FixedLenFeature([], dtype=tf.string, default_value=""),
                "education": tf.io.FixedLenFeature([], dtype=tf.string),
                "income": tf.io.FixedLenFeature([], dtype=tf.string),
                # Those are just a few features available in the dataset.
            }
        )

    return serialized_examples.map(parse_tf_example)

non_batched_train_ds = create_tf_data_dataset("adult_train.recordio.gz")
non_batched_test_ds = create_tf_data_dataset("adult_train.recordio.gz")

It is easier to inspect the loaded examples before applying the batch operator.

In [13]:

Copied!

for example in non_batched_train_ds.take(5):
    print(example)
for example in non_batched_train_ds.take(5):
    print(example)

{'age': <tf.Tensor: shape=(), dtype=int64, numpy=44>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'7th-8th'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=20>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=40>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=37>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'Some-college'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=50>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'<=50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Private'>}
{'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'capital_gain': <tf.Tensor: shape=(), dtype=int64, numpy=20051>, 'education': <tf.Tensor: shape=(), dtype=string, numpy=b'HS-grad'>, 'hours_per_week': <tf.Tensor: shape=(), dtype=int64, numpy=30>, 'income': <tf.Tensor: shape=(), dtype=string, numpy=b'>50K'>, 'workclass': <tf.Tensor: shape=(), dtype=string, numpy=b'Self-emp-inc'>}

As mentioned before, the batch size does not impact the model. 1000 is a good default value.

In [19]:

Copied!

train_ds = non_batched_train_ds.batch(1000)
test_ds = non_batched_test_ds.batch(1000)
train_ds = non_batched_train_ds.batch(1000)
test_ds = non_batched_test_ds.batch(1000)

Train a model¶

All YDF methods (e.g., training, evaluation, analyze) natively consume tf.data.Dataset.

In [20]:

Copied!

learner = ydf.GradientBoostedTreesLearner(label="income")
model = learner.train(train_ds)
learner = ydf.GradientBoostedTreesLearner(label="income")
model = learner.train(train_ds)

Warning: Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.

WARNING:absl:Column 'age' with NUMERICAL semantic has dtype int64. Casting value to float32.

Train model on 22792 examples
Model trained in 0:00:05.323891

We can then evaluate the model.

In [21]:

Copied!

evaluation = model.evaluate(test_ds)
evaluation
evaluation = model.evaluate(test_ds)
evaluation

Out[21]:

accuracy:

0.839286

AUC: '>50K' vs others:

0.878923

PR-AUC: '>50K' vs others:

0.744216

loss:

0.34535

num examples:

22792

num examples (weighted):

22792

Confusion matrix

Label \ Pred	<=50K	>50K
<=50K	16526	782
>50K	2881	2603