Since its release to the HPC community in 2011, the Lawrence Berkeley National Laboratory (LBNL) Node Health Check (NHC) project has gained wide acceptance across the industry and has become the de facto community standard for compute node health checking. It provides a complete, optimized framework for creating and executing node-level checks and ships with more than 40 pre-written checks. It fully supports TORQUE/Moab, SLURM, and SGE, and can also be used with other schedulers and resource managers, or with none at all. In production at LBNL since 2010, NHC has matured into a vital asset for maximizing the integrity and reliability of high-performance computational resources.
In this talk, we’ll discuss what makes LBNL NHC a unique and robust solution to the problem of compute node health, look at its feature set, learn how to configure and deploy it, and survey many of the checks supplied out of the box. Time permitting, we may also include a brief introduction to writing custom or site-specific checks.