05/06/2025 Joshua Mora

Taxonomy of errors for large-scale HPC/AI environments

Operating a large-scale HPC/AI environment presents many challenges, among them getting the highest ROI (Return on Investment).

A wide range of errors stands in the way of achieving a high ROI.

This presentation raises awareness and understanding of the different types of errors that one can run into over the 3-5 year life cycle of an HPC/AI environment, and helps start thinking about avenues for addressing them.

It provides a comprehensive classification of errors by source (hardware, software, user), by type (misconfiguration, unplanned downtime, degradation) and by frequency (rare, occasional, frequent), along with corresponding examples.

The classification is presented in order of increasing severity and impact on ROI.

A reliability metric is also provided to pragmatically assess the likelihood that a set of systems and components does not fail within a window of time (e.g. the duration of a strong-scaling parallel job).
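The abstract does not define the reliability metric, so the following is only an illustrative sketch of one common way to estimate such a likelihood, not the metric from the talk. It assumes that components fail independently at a constant rate (exponential lifetimes), so the probability that none of N identical components fails during a window of t hours is exp(-N * t / MTBF). The function name, parameters, and example figures are hypothetical.

```python
import math


def job_survival_probability(n_components: int, mtbf_hours: float, job_hours: float) -> float:
    """Probability that no component fails during the job window.

    Assumption (not from the talk): components are identical, fail
    independently, and have a constant failure rate of 1 / mtbf_hours.
    """
    if n_components < 0 or mtbf_hours <= 0 or job_hours < 0:
        raise ValueError("need n_components >= 0, mtbf_hours > 0, job_hours >= 0")
    # Independent constant failure rates add up across components.
    combined_rate = n_components / mtbf_hours
    return math.exp(-combined_rate * job_hours)


# Hypothetical example: a 24-hour job spanning 1000 nodes,
# each with an assumed MTBF of 50,000 hours.
p = job_survival_probability(1000, 50_000.0, 24.0)
print(f"Probability the job sees no node failure: {p:.2f}")
```

Under these assumptions the example job has roughly a 62% chance of completing without any node failure, which illustrates how quickly the likelihood drops as a job spans more components or runs longer.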
