Supported dataset formats

This list details the supported dataset formats within YDF and their respective advantages and limitations. YDF offers two primary methods for inputting datasets:

Python Dataset Objects: These include familiar structures like Pandas DataFrames or dictionaries of NumPy arrays.
Typed-Paths: Strings made of a format prefix followed by a file path, such as "csv:/dataset/train.csv" for CSV files or "tfrecord:/dataset/train@*" for TensorFlow Records.

Python Dataset Objects

Python dataset objects offer significant flexibility, allowing you to easily apply various preprocessing operations directly to your data. However, they can be less efficient for very large datasets because the entire dataset must be loaded into memory. As a general guideline, Python dataset objects are best suited for datasets containing fewer than 100 million examples.

We recommend using a dictionary of NumPy arrays for Python datasets, though Pandas DataFrames are also well-supported.
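As a minimal sketch of the recommended in-memory format, the snippet below builds a dictionary of NumPy arrays and applies a preprocessing step directly in Python. The feature names, the toy data, and the `train_model` helper are assumptions made for illustration; the helper imports `ydf` lazily so that the dataset code runs even where YDF is not installed.

```python
import numpy as np

# Hypothetical toy dataset: two numerical features, generated at random.
rng = np.random.default_rng(seed=0)
num_examples = 1000
dataset = {
    "feature_a": rng.normal(size=num_examples),
    "feature_b": rng.uniform(size=num_examples),
}

# Preprocessing is plain NumPy and happens entirely in memory.
dataset["feature_ratio"] = dataset["feature_a"] / (dataset["feature_b"] + 1.0)
dataset["label"] = (dataset["feature_a"] + dataset["feature_b"] > 0.5).astype(np.int64)


def train_model(ds):
    """Trains a model on an in-memory dataset (requires the ydf package)."""
    import ydf  # Imported lazily; only needed when training is requested.

    return ydf.GradientBoostedTreesLearner(label="label").train(ds)
```

Calling `train_model(dataset)` would train on the dictionary. A Pandas DataFrame with the same columns could be passed in the same way.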

Note for internal users: Loading a dataset directly into a Pandas DataFrame with the %%f1 magic query can be unacceptably slow for anything more than a few thousand examples. Instead, we recommend exporting your data first (e.g., using PLX, optionally with pre-filtering or preprocessing). Once exported, load the data efficiently using the capacitor: format.

Typed-Paths

In contrast, typed-paths are far more memory-efficient and are required for distributed training. Their main drawback is reduced flexibility: any preprocessing requires materializing its output, that is, writing the preprocessed dataset to new files before training on them.

Currently, Avro files are the recommended format for typed-path datasets. TensorFlow Records are also well-supported.
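The sketch below illustrates what materializing a preprocessing step can look like for a typed-path dataset. It uses CSV and the standard library only to stay self-contained; Avro remains the recommended format. The column names, the `age_squared` feature, and both helper functions are hypothetical.

```python
import csv
import os
import tempfile


def materialize_preprocessed_csv(src_path, dst_path):
    """Writes a preprocessed copy of a CSV file and returns its typed path."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["age_squared"])
        writer.writeheader()
        for row in reader:
            # The derived feature has to be written to disk; it cannot be
            # computed on the fly when training from a typed path.
            row["age_squared"] = float(row["age"]) ** 2
            writer.writerow(row)
    # A typed path is the format prefix followed by the file path.
    return "csv:" + dst_path


def train_model(typed_path):
    """Trains a model from a typed path (requires the ydf package)."""
    import ydf  # Imported lazily; only needed when training is requested.

    return ydf.GradientBoostedTreesLearner(label="label").train(typed_path)


# Toy raw dataset written to a temporary directory.
work_dir = tempfile.mkdtemp()
raw_path = os.path.join(work_dir, "raw.csv")
with open(raw_path, "w", newline="") as f:
    csv.writer(f).writerows([["age", "label"], [30, "yes"], [45, "no"]])

typed_path = materialize_preprocessed_csv(raw_path, os.path.join(work_dir, "train.csv"))
```

Here `typed_path` is a string starting with `csv:` that points at the preprocessed file, and `train_model(typed_path)` would read it from disk rather than from a Python object.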

Available formats

Format	Availability	As Python object	Typed-path prefix	Remarks
Dictionary of NumPy arrays	Public	Native		Efficient; Recommended for small datasets
Pandas DataFrame	Public	Native		Popular format
csv	Public		csv:	Popular format; No support for multi-dimensional features.
Avro	Public		avro:	Efficient; Recommended for large datasets
TensorFlow Records (gzip)	Public	with ydf.util.read_tf_recordio	tfrecord:	Somewhat efficient
TensorFlow Records (non-compressed)	Public	with ydf.util.read_tf_recordio	tfrecordv2:	Inefficient; Avoid
TensorFlow Tensor	Public	Native		Inefficient; Avoid
TensorFlow Dataset	Public	Native		Inefficient; Avoid
Xarray	Public	Native
SSTable of TF Examples	Internal		sstable+tfe:
RecordIO of TF Examples	Internal		recordio+tfe:	Efficient; Recommended for large datasets
RecordIO of YDF Examples	Internal		recordio+ygge:	Very efficient; For advanced users
Capacitor	Internal		capacitor:	Very efficient; No support for multi-dimensional features.
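
To make the prefix convention concrete, here is a small illustrative helper that splits a typed path into its prefix and file path, then looks the prefix up among the public formats in the table above. Both functions are hypothetical and are not part of the YDF API; YDF parses typed paths itself.

```python
# Public typed-path prefixes, copied from the table above.
PUBLIC_PREFIXES = {
    "csv": "csv",
    "avro": "Avro",
    "tfrecord": "TensorFlow Records (gzip)",
    "tfrecordv2": "TensorFlow Records (non-compressed)",
}


def split_typed_path(typed_path):
    """Splits "prefix:path" at the first colon and returns (prefix, path)."""
    prefix, separator, path = typed_path.partition(":")
    if not separator or not prefix or not path:
        raise ValueError(f"Not a typed path: {typed_path!r}")
    return prefix, path


def describe_format(typed_path):
    """Returns the public format name for a typed path, if it is listed."""
    prefix, _ = split_typed_path(typed_path)
    return PUBLIC_PREFIXES.get(prefix, "unknown or internal format")
```

For example, `split_typed_path("tfrecord:/dataset/train@*")` separates the `tfrecord` prefix from the sharded path `/dataset/train@*`. Because the split is at the first colon, a path that itself begins with a drive letter and colon would be misread by this sketch.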