18-19/06/2020 · Adam Moody

Scalable Management of HPC Datasets with mpiFileUtils

High-performance computing users generate large datasets by executing parallel applications running many processes, up to millions in some cases. Those datasets vary in structure from one extreme of large directory trees with many small files to the other extreme of just a single large file. However, users often must resort to single-process tools like cp, mv, and rm to manage those massive datasets. This mismatch in scale makes even basic tasks like copying, moving, and deleting datasets painfully slow.

mpiFileUtils provides a library called libmfu and a suite of MPI-based tools for managing large datasets. The tools handle typical tasks such as copying, removing, and comparing files, and achieve speedups of more than 100x over the traditional single-process tools. In addition, libmfu makes it easy to build new tools by consolidating common functionality, data structures, and file formats into a single shared library, which can also be called directly from HPC applications if desired. Because mpiFileUtils runs on the same scalable HPC resources as the application, basic data management tasks that used to require hours can now be completed in minutes.
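To give a feel for what calling libmfu from an application looks like, here is a minimal sketch that initializes MPI and libmfu, walks a directory tree in parallel to build a distributed file list, and reports the global item count. It assumes the mfu_flist API (mfu_init, mfu_flist_new, mfu_flist_walk_path, mfu_flist_global_size, mfu_flist_free, mfu_finalize); exact function signatures vary between mpiFileUtils releases, so treat this as an approximation to be checked against the installed mfu.h rather than a drop-in example.

/* Sketch only: walk a directory tree in parallel with libmfu.
 * Function signatures are approximate; verify against the mfu.h
 * header of the installed mpiFileUtils release. */
#include <stdio.h>
#include <stdint.h>
#include <mpi.h>
#include "mfu.h"

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    mfu_init();                      /* initialize libmfu after MPI */

    const char* path = (argc > 1) ? argv[1] : ".";

    /* Create an empty distributed file list and fill it by
     * walking the directory tree with all MPI ranks. */
    mfu_flist flist = mfu_flist_new();
    mfu_flist_walk_path(path, 1 /* stat items */, 0 /* dir perms */, flist);

    /* Every rank can query the global item count. */
    uint64_t total = mfu_flist_global_size(flist);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("%s contains %llu items\n", path, (unsigned long long) total);
    }

    mfu_flist_free(&flist);
    mfu_finalize();
    MPI_Finalize();
    return 0;
}

For routine data management, most users will not write code at all but simply launch the prebuilt tools, for example dcp for copy, drm for remove, and dcmp for compare, under mpirun or srun with as many ranks as the job allows.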


Visit our forum

One of the main goals of this project is to motivate new initiatives and collaborations in the HPC field. Visit our forum to share your knowledge and join the discussion with other HPC experts!

About us

HPCKP (High-Performance Computing Knowledge Portal) is an Open Knowledge project focused on technology transfer and knowledge sharing in the HPC, AI and Quantum Science fields.
