Talks > 08/05/2024 Jiri Jaros

Handling C++ Exceptions in MPI Applications

Managing error states in C++ applications is accomplished through exceptions. In distributed applications, it becomes essential to communicate to other processes when an error occurs, giving the application the option to either recover from the faulty state or gracefully report the error and terminate.

Regrettably, the MPI standard does not offer any built-in mechanisms for handling errors in a distributed environment. This paper presents a new approach to exception handling in MPI applications. The goals are to (1) have a single rank report any faulty state to the user in a clearly formatted way, (2) ensure the application never deadlocks, and (3) provide a simple interface that interoperates with other C/C++ libraries.

The proposed method adopts a minimalistic interface and offers several advantages: no rank dedicated to error handling is required, a single reduce operation is sufficient to confirm that the application has passed a checkpoint, a deadlock in the application cannot interrupt the error handling, and the application always terminates gracefully with an appropriate error message.

The code was tested with various MPI implementations on up to 1536 ranks. Two external libraries, the distributed versions of the Fast Fourier Transform library (FFTW) and the HDF5 I/O library, were selected for their extensive use of collective communications. The tests injected several errors into multiple ranks, such as a non-existent input file, an exceeded disk quota, an incorrect rank in an MPI call, and standard system exceptions.

Remarkably, the code demonstrated proper functionality in all tested scenarios. The code can be downloaded from https://github.com/jarosjir/MPIErrorChecker

