High-performance computing users generate large datasets by running parallel applications with many processes, up to millions in some cases. Those datasets vary in structure, from large directory trees containing many small files at one extreme to a single large file at the other. However, users often must resort to single-process tools like cp, mv, and rm to manage those massive datasets. This mismatch in scale makes even basic tasks like copying, moving, and deleting datasets painfully slow.
mpiFileUtils provides a library called libmfu and a suite of MPI-based tools to manage large datasets. The suite includes tools for typical jobs like copy, remove, and compare, and these tools achieve speedups of more than 100x over the traditional single-process tools. Furthermore, libmfu simplifies the creation of new tools by consolidating common functionality, data structures, and file formats into a single library, which can also be called directly from HPC applications. Because mpiFileUtils runs on the same scalable HPC resources as the application, basic data management tasks that used to require hours can now be completed in minutes.