10:45 - 11:15
Jordi Blasco has developed a new open source monitoring tool that allows HPC user support teams to identify opportunities to improve the efficiency of the codes executed on HPC resources. Through continuous monitoring of job efficiency, early adopters of the tool have been able to improve the scalability and performance of several codes and workflows. This, in turn, has accelerated research at the HPC facilities that adopted it.
In addition, the gain in the global efficiency of the system has improved overall resource allocation, allowing HPC users to run more jobs, tackle larger problems, or both.
Traditional tools like Ganglia are not normally capable of representing the metrics required to identify inefficient jobs. Nor are they capable of correlating events in order to identify global issues affecting all jobs in the system. The new monitoring system allows the end-user support team to evaluate the performance impact on running jobs due to systemic issues such as a high load on the cluster file system or a high rate of hardware errors in the fabric.
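As a rough illustration of this kind of correlation, the Python sketch below matches low-efficiency jobs against time windows in which a systemic event was recorded. The `Job` and `Event` types, the `fs_high_load` event name and the 0.5 threshold are assumptions made for the example; the abstract does not describe how the tool itself performs this step.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    start: float            # epoch seconds
    end: float              # epoch seconds
    cpu_efficiency: float   # 0.0 to 1.0, hypothetical metric


@dataclass
class Event:
    name: str               # e.g. "fs_high_load" (hypothetical label)
    start: float
    end: float


def overlaps(job: Job, event: Event) -> bool:
    """True if the job's run time intersects the event window."""
    return job.start < event.end and event.start < job.end


def jobs_affected(jobs, events, threshold=0.5):
    """Map each event name to the sorted ids of low-efficiency jobs
    that were running while the event was active."""
    return {
        ev.name: sorted(
            j.job_id
            for j in jobs
            if j.cpu_efficiency < threshold and overlaps(j, ev)
        )
        for ev in events
    }
```

An overlap in time is only a hint of a possible cause, so a support team would still need to inspect the affected jobs before drawing conclusions.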
Analyzing such a large number of events requires Big Data technologies. Data is gathered using custom codes and aggregated into Elasticsearch and InfluxDB, an open source search and analytics engine and an open source time series database, respectively, both with high reliability and proven scalability. Finally, the data is visualized with Grafana, a leading tool for querying and visualizing large datasets and metrics.
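The gathering step could look something like the following Python sketch, which computes a per-job CPU efficiency and formats it as an InfluxDB line protocol record. The efficiency definition (CPU time divided by elapsed time times allocated cores), the `job_efficiency` measurement name and the tag names are assumptions for illustration, not details taken from the tool. The formatter handles float fields only and does not escape special characters.

```python
def cpu_efficiency(cpu_seconds: float, elapsed_seconds: float, cores: int) -> float:
    """Fraction of the allocated CPU time that the job actually used."""
    if elapsed_seconds <= 0 or cores <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    return cpu_seconds / (elapsed_seconds * cores)


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Format one point as InfluxDB line protocol:
    measurement,tag=value field=value timestamp
    Assumes float field values and names/values without spaces or commas."""
    tag_str = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_str} {field_str} {timestamp_ns}"


def job_point(job_id: str, user: str, cpu_seconds: float,
              elapsed_seconds: float, cores: int, timestamp_ns: int) -> str:
    """Build a line protocol record for one finished job (hypothetical schema)."""
    eff = cpu_efficiency(cpu_seconds, elapsed_seconds, cores)
    return to_line_protocol(
        "job_efficiency",
        {"job_id": job_id, "user": user},
        {"cpu_efficiency": eff},
        timestamp_ns,
    )
```

Records in this form could then be written to InfluxDB and plotted in a Grafana dashboard, with low values of the efficiency field pointing to jobs worth a closer look.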
The talk highlights the most relevant cases where HPC facilities applied the tool to identify tuning opportunities and accelerate research by improving code efficiency.
Keywords: Performance Analysis, Efficiency, Scalability, Job profiling
Jordi holds a Licentiate (BSc + MSc) in physics, with a specialisation in computational physics, from the University of Barcelona (Spain). He has more than 14 years of experience in High Performance Computing (HPC) in industry and academic environments, more than 9 years of experience as a solutions architect for mission-critical environments, and more than 9 years of experience leading and coordinating cutting-edge HPC projects.
Jordi has a solid background in parallel programming, performance analysis, application tuning, HPDA and HPC system administration. He is a very active contributor to several open source projects dealing with High Performance Computing and large-scale systems, and he has worked as an independent HPC advisor for several companies and research centres.
Barcelona Advanced Industry Park, Marie Curie, s/n, 08042 - Barcelona (Spain).