
Supported dataset formats

This page lists the dataset formats supported by YDF, along with their respective advantages and limitations. YDF offers two primary methods for inputting datasets:

  • Python Dataset Objects: These include familiar structures like Pandas DataFrames or dictionaries of NumPy arrays.

  • Typed-Paths as a string: Examples include "csv:/dataset/train.csv" for CSV files or "tfrecord:/dataset/train@*" for TensorFlow Records.

Python Dataset Objects

Python dataset objects offer significant flexibility, allowing you to easily apply various preprocessing operations directly to your data. However, they can be less efficient for very large datasets because the entire dataset must be loaded into memory. As a general guideline, Python dataset objects are best suited for datasets containing fewer than 100 million examples.
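As a rough way to check whether a dataset will fit in memory, the sketch below sums the byte size of every column in a dictionary of NumPy arrays. The helper name `dataset_nbytes` and the column names are hypothetical and not part of YDF. The count only covers the array buffers, so columns stored with `dtype=object` (for example Python strings) are underestimated.

```python
import numpy as np


def dataset_nbytes(dataset):
    """Returns the total bytes held by the arrays of a dict-of-arrays dataset.

    Hypothetical helper, not part of YDF.
    """
    return sum(np.asarray(column).nbytes for column in dataset.values())


example = {
    "age": np.zeros(1_000, dtype=np.float32),  # 4 bytes per value
    "label": np.zeros(1_000, dtype=np.int64),  # 8 bytes per value
}

print(dataset_nbytes(example))  # 12000
```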

We recommend using a dictionary of NumPy arrays for Python datasets, though Pandas DataFrames are also well-supported.
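A minimal sketch of the dictionary-of-NumPy-arrays input, assuming synthetic data: the feature names `f1` and `f2`, the `label` column, and the helpers `make_dataset` and `train_model` are made up for illustration. The `ydf` import is kept inside `train_model` so that the data-building part runs even where YDF is not installed.

```python
import numpy as np


def make_dataset(num_examples=1000, seed=0):
    """Builds a synthetic dataset as a dict mapping column name to array."""
    rng = np.random.default_rng(seed)
    f1 = rng.normal(size=num_examples)
    f2 = rng.integers(0, 3, size=num_examples)
    # Synthetic binary label derived from the two features.
    label = (f1 + f2 > 1.0).astype(np.int64)
    return {"f1": f1, "f2": f2, "label": label}


def train_model(dataset):
    """Trains a model on an in-memory dataset (requires the ydf package)."""
    import ydf  # Imported lazily; assumed to be installed.

    learner = ydf.GradientBoostedTreesLearner(label="label")
    return learner.train(dataset)


train_ds = make_dataset()
# model = train_model(train_ds)  # Uncomment when ydf is available.
```

Every array in the dictionary has one entry per example, so all columns must have the same length.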

Note for internal users: Loading large datasets directly into a Pandas DataFrame using the %%f1 magic query can be unacceptably slow for anything more than a few thousand examples. Instead, we recommend exporting your data first (e.g., using PLX, optionally with pre-filtering or preprocessing). Once exported, load your data efficiently using the capacitor: format.

Typed-Paths

In contrast, typed-paths are far more memory-efficient and are required for distributed training. Their main drawback is reduced flexibility: any preprocessing must be applied beforehand and its output materialized (written to files) before training.

Currently, Avro files are the recommended format for typed-path datasets. TensorFlow Records are also well-supported.
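To illustrate the materialization step, the sketch below applies a small preprocessing step, writes the result to disk, and builds the typed path that would be passed to a learner. It uses CSV only because CSV can be written with the Python standard library; the column names and the helpers `materialize_csv` and `train_model` are hypothetical. The `ydf` import is kept inside `train_model` so that the file-writing part runs without YDF installed.

```python
import csv
import os
import tempfile


def materialize_csv(rows, fieldnames, path):
    """Writes preprocessed rows to a CSV file and returns its typed path."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    # A typed path is the format prefix followed by the file path.
    return "csv:" + path


def train_model(typed_path):
    """Trains a model from a typed path (requires the ydf package)."""
    import ydf  # Imported lazily; assumed to be installed.

    return ydf.GradientBoostedTreesLearner(label="label").train(typed_path)


raw = [
    {"height_cm": 170, "label": "a"},
    {"height_cm": 182, "label": "b"},
]
# Preprocessing happens before training: convert centimeters to meters.
processed = [
    {"height_m": r["height_cm"] / 100, "label": r["label"]} for r in raw
]

out_path = os.path.join(tempfile.mkdtemp(), "train.csv")
train_path = materialize_csv(processed, ["height_m", "label"], out_path)
# model = train_model(train_path)  # Uncomment when ydf is available.
```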

Available formats

| Format | Availability | As Python object | Typed-path prefix | Remarks |
|---|---|---|---|---|
| Dict of NumPy arrays | Public | Native | | Efficient; recommended for small datasets |
| Pandas DataFrame | Public | Native | | Popular format |
| CSV | Public | | csv: | Popular format; no support for multi-dimensional features |
| Avro | Public | | avro: | Efficient; recommended for large datasets |
| TensorFlow Records (gzip) | Public | With ydf.util.read_tf_recordio | tfrecord: | Somewhat efficient |
| TensorFlow Records (non-compressed) | Public | With ydf.util.read_tf_recordio | tfrecordv2: | Inefficient; avoid |
| TensorFlow Tensor | Public | Native | | Inefficient; avoid |
| TensorFlow Dataset | Public | Native | | Inefficient; avoid |
| Xarray | Public | Native | | |
| SSTable of TF Examples | Internal | | sstable+tfe: | |
| RecordIO of TF Examples | Internal | | recordio+tfe: | Efficient; recommended for large datasets |
| RecordIO of YDF Examples | Internal | | recordio+ygge: | Very efficient; for advanced users |
| Capacitor | Internal | | capacitor: | Very efficient; no support for multi-dimensional features |