EROFS Being Extended To Handle Massive Amounts Of Data For AI Model Training

The EROFS open-source, read-only Linux file-system is set to be extended in the upcoming Linux 6.15 kernel cycle to handle massive amounts of data for AI model training.

Ahead of the Linux 6.15 merge window, which opens following the v6.14 kernel release, EROFS is gaining 48-bit block addressing support to handle larger file-systems, with an emphasis on making the file-system more suitable for AI training purposes and other large-scale data archival needs.

Alibaba engineer Gao Xiang explained in the patch series adding the 48-bit layout support:

“The current 32-bit block addressing limits EROFS to a 16TiB maximum volume size with 4KiB blocks. However, several new use cases now require larger capacity support:

- Massive datasets for model training to boost random sampling performance for each epoch;
- Object storage clients using EROFS direct passthrough.

This extends core on-disk structures to support 48-bit block addressing, such as inodes, device slots, and inode chunks.

In addition, it introduces an mtime field to 32-byte compact inodes for basic timestamp support, as well as expands the superblock root NID to an 8-byte rootnid_8b for out-of-place update incremental builds.

In order to support larger images beyond 32-bit block addressing and efficient indexing of large compression units for compressed data, and to better support popular compression algorithms (mainly Zstd) that still lack native fixed-sized output compression support, introduce byte-oriented encoded extents, so that these compressors are allowed to retain their current methods.

Therefore, it speeds up Zstd image building a lot by using:

Processor: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz * 96
Dataset: enwik9
Build time   Size        Type   Command Line
3m52.339s    266653696   FO     -C524288 -zzstd,22
3m48.549s    266174464   FO     -E48bit -C524288 -zzstd,22
0m12.821s    272134144   FI     -E48bit -C1048576 --max-extent-bytes=1048576 -zzstd,22

It has been stress-tested on my local setup for a while without any strange happens.”
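To put the quoted benchmark in perspective, the short Python sketch below works out the build-time ratio and the size difference between the rows. The timings and sizes are copied from the table above; the helper names (parse_time, speedup, size_growth) are illustrative only and are not part of any EROFS tooling.

```python
# Rows copied from the quoted enwik9 benchmark: (build time, image size).
# The dictionary keys are just labels for this sketch.
RUNS = {
    "FO": ("3m52.339s", 266653696),
    "FO -E48bit": ("3m48.549s", 266174464),
    "FI -E48bit": ("0m12.821s", 272134144),
}


def parse_time(text):
    """Convert a string such as '3m52.339s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def speedup(slow, fast):
    """How many times faster the `fast` run built than the `slow` run."""
    return parse_time(RUNS[slow][0]) / parse_time(RUNS[fast][0])


def size_growth(before, after):
    """Relative size change of `after` versus `before` (0.01 == 1% larger)."""
    return RUNS[after][1] / RUNS[before][1] - 1


if __name__ == "__main__":
    print(f"build speedup: {speedup('FO -E48bit', 'FI -E48bit'):.1f}x")
    print(f"size change:   {size_growth('FO -E48bit', 'FI -E48bit'):+.1%}")
```

On these numbers the third row builds roughly 18 times faster than the second, at the cost of an image about 2% larger.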

This 48-bit support is the headline feature of the EROFS updates for 6.15-rc1. The updates also add encoded extents to reduce metadata on large pclusters, enable unaligned compressed data for improved Zstd compression speed, and restore 16-byte volume names.
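The 16TiB limit quoted earlier follows directly from the addressing width. As a rough sketch, assuming every address bit is usable for block numbers and a 4KiB block size (the real on-disk format may impose other limits), the addressable capacity can be computed as follows. The function name max_volume_bytes is made up for this illustration.

```python
TIB = 1 << 40  # bytes in one tebibyte
EIB = 1 << 60  # bytes in one exbibyte


def max_volume_bytes(address_bits, block_size=4096):
    """Upper bound on volume size: number of addressable blocks times block size."""
    return (1 << address_bits) * block_size


if __name__ == "__main__":
    print(max_volume_bytes(32) // TIB, "TiB with 32-bit block addresses")
    print(max_volume_bytes(48) // EIB, "EiB with 48-bit block addresses")
```

With 4KiB blocks, 32-bit addressing works out to the 16TiB mentioned in the patch series, while 48-bit addressing raises the theoretical ceiling to 1EiB.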