Traditional tools, such as Ganglia or Munin, cannot represent the metrics required to identify inefficient jobs. Nor can they correlate events in order to recognize global issues affecting all jobs in the system.
The new monitoring tools introduced in this talk allow HPC user support teams to identify new opportunities to improve the efficiency of the codes and workflows executed on HPC resources. Thanks to the event correlation capability, these teams can also evaluate how systemic issues, such as a high load on the cluster file system or a high rate of hardware errors in the fabric, affect the performance of running jobs.
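As an illustration of what event correlation means here, the following minimal sketch matches jobs against system-wide events by time overlap. It is not the implementation presented in the talk: the `Job` and `SystemEvent` types, the event labels and the sample timestamps are all hypothetical, and a production system would run this kind of query against the stored metrics rather than in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    start: float  # epoch seconds
    end: float    # epoch seconds


@dataclass(frozen=True)
class SystemEvent:
    kind: str     # hypothetical label, e.g. "fs_high_load"
    start: float
    end: float


def correlate(jobs, events):
    """Map each job_id to the kinds of system events that overlapped its run."""
    affected = {}
    for job in jobs:
        # Two intervals overlap when each one starts before the other ends.
        kinds = sorted({e.kind for e in events
                        if e.start < job.end and job.start < e.end})
        if kinds:
            affected[job.job_id] = kinds
    return affected


# Illustrative data only.
jobs = [Job("1001", 0, 100), Job("1002", 150, 300), Job("1003", 400, 500)]
events = [
    SystemEvent("fs_high_load", 90, 160),
    SystemEvent("fabric_errors", 250, 260),
]
result = correlate(jobs, events)
```

In this example, job 1003 is absent from `result` because no event overlapped its run, which is the distinction a support team needs when deciding whether a slow job was caused by the code or by the system.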
Early adopters have been able to improve the scalability and performance of several codes and workflows. This, in turn, has accelerated research and maximised the return on investment in the HPC facilities adopting this technology.
Over time, this solution has evolved into a mechanism for evaluating the real needs and trends of the end-user community, providing valuable feedback for the procurement of new computational capacity.
The large number of events to be analysed requires the use of Big Data technologies. Data is gathered using custom codes and aggregated into Elasticsearch and InfluxDB. These open-source engines offer high reliability and proven scalability. Finally, the data is visualised with Grafana, a leading tool for querying and visualizing large datasets and metrics.
The talk highlights the most recent developments in proactive job profiling. One of the most time-consuming tasks for end-user support teams is identifying efficiency issues. This usually requires re-running the same job with instrumentation tools, analysing the data and, eventually, fixing the issue. With this solution, the support teams can examine all jobs, including those with efficiency issues that are hard to reproduce.
The dashboards introduced in this talk are designed to accelerate this process by providing a view of the proactive job profile, access to the job submission script used, and other key metrics.
Keywords: Performance Analysis, Efficiency, Scalability, Job profiling