At BSC we use a modified version of Ganglia for cluster monitoring. Ganglia did not scale to thousands of nodes, so BSC developed its own implementation of gmetad, called ggcollector, which splits the collection layer from the presentation layer. It is a fast, memory-efficient implementation that focuses on collecting information and delivering it to clients. We will show the implementation and how it operates, as well as some tools developed to provide near-real-time and historical monitoring information for all the systems.
HPCKP (High-Performance Computing Knowledge Portal) is an Open Knowledge project focused on technology transfer and knowledge sharing in the HPC, AI and Quantum Science fields.