Talks > 20-21/04/2016 Michael Jennings

LBNL Node Health Check: Introduction, Configuration, and Customization

Since its release to the HPC community in 2011, the Lawrence Berkeley National Laboratory (LBNL) Node Health Check (NHC) project has gained wide acceptance across the industry and has become the de facto standard community solution for compute node health checking. It provides a complete, optimized framework for creating and executing node-level checks and ships with more than 40 pre-written checks. It fully supports TORQUE/Moab, SLURM, and SGE, and can be used with other schedulers and resource managers as well (or with none at all). In production at LBNL since 2010, NHC has evolved and matured into a vital asset for maximizing the integrity and reliability of high-performance computational resources.

In this talk, we’ll discuss what makes LBNL NHC a unique and robust solution to the problem of compute node health, look at its feature set, learn how to configure and deploy it, and survey many of the checks that are supplied out of the box. Time permitting, a brief introduction to writing custom or site-specific checks may also be included.
