If we ask the people who use or manage an HPC center what it means to have an efficient cluster, we will get a full range of opinions. In general, users are happy when they get access to the resources they need immediately or after a minimal wait.
Thus, according to them, an efficient HPC cluster is one that has resources ready to be used on demand. In contrast, HPC administrators are more concerned with what the cluster delivers, so for them an efficient cluster is one that is working at full capacity, regardless of the number of jobs in the queue. In reality, the efficiency of an HPC cluster is about more than queue times and CPU allocation.
One common reason why HPC centers report high workloads and long wait times is that running jobs often do not use their allocated resources properly. Traditional monitoring tools such as Ganglia, Zabbix or Munin only show the current status of the system, which is not enough to identify inefficient usage. The key parameters to monitor are CPU load and memory efficiency (resources allocated versus resources utilized), plus the reason why a job has finished (completed, canceled, failed, timeout, etc.). At HPCNow!, we developed a tool that uses Grafana for data visualization; InfluxDB, Prometheus and Elasticsearch for data storage; and customized collectd and memory efficiency plugins to provide a full overview of the system.
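To make "resources allocated versus resources utilized" concrete, here is a minimal Python sketch of how the two ratios could be computed for a single job. It is an illustration only, not the implementation of our plugins: the `JobUsage` record and its field names are hypothetical, and the formulas (CPU time consumed divided by allocated CPU time, and peak memory divided by requested memory) are one common way to define these ratios.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    """Hypothetical per-job accounting record (field names are assumptions)."""
    allocated_cpus: int       # CPUs reserved for the job
    elapsed_seconds: float    # wall-clock run time
    cpu_seconds_used: float   # CPU time actually consumed across all CPUs
    requested_mem_mb: float   # memory reserved for the job
    peak_mem_mb: float        # highest memory usage observed


def cpu_efficiency(job: JobUsage) -> float:
    """Fraction of the allocated CPU time that the job actually used."""
    allocated_cpu_seconds = job.allocated_cpus * job.elapsed_seconds
    if allocated_cpu_seconds <= 0:
        return 0.0
    return job.cpu_seconds_used / allocated_cpu_seconds


def memory_efficiency(job: JobUsage) -> float:
    """Fraction of the requested memory that the job actually reached."""
    if job.requested_mem_mb <= 0:
        return 0.0
    return job.peak_mem_mb / job.requested_mem_mb


# Example: 16 CPUs reserved for one hour, but only 4 CPU-hours consumed,
# and 64 GB requested while the job peaked at 8 GB.
job = JobUsage(
    allocated_cpus=16,
    elapsed_seconds=3600,
    cpu_seconds_used=14400,
    requested_mem_mb=64000,
    peak_mem_mb=8000,
)
print(f"CPU efficiency: {cpu_efficiency(job):.0%}")        # CPU efficiency: 25%
print(f"Memory efficiency: {memory_efficiency(job):.1%}")  # Memory efficiency: 12.5%
```

A job like this one looks healthy in a traditional status view, because its nodes are allocated and busy, yet three quarters of its CPU time and most of its memory are unavailable to other users.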
In more detail, the job efficiency monitoring dashboard (see image above) covers several key aspects of the cluster’s CPU usage, such as the total number of CPUs underused by users, the average CPU efficiency of the cluster, and a real-time graph of the CPUs in use and wasted. In terms of memory, it includes a chart of the maximum underused memory per user and job. The exit status of the jobs also provides extremely valuable insight to system administrators and end-user support teams, since they can see, for example, the rate of successful jobs versus the amount of CPU time lost due to failed jobs.
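The exit-status view can be sketched in a few lines of Python. This is a simplified illustration and not the dashboard's actual query: the input tuples, the state names, and the choice to count every job that did not end as `COMPLETED` as lost CPU time are all assumptions made for the example.

```python
from collections import Counter


def summarize_exit_states(jobs):
    """Summarize jobs by final state.

    `jobs` is an iterable of (state, allocated_cpus, elapsed_seconds) tuples.
    Returns (share_of_jobs_per_state, cpu_seconds_lost), where lost CPU time
    is the allocated CPU time of every job that did not end as COMPLETED
    (an assumption; a site may prefer to exclude canceled jobs, for example).
    """
    job_counts = Counter()
    cpu_seconds = Counter()
    for state, allocated_cpus, elapsed_seconds in jobs:
        job_counts[state] += 1
        cpu_seconds[state] += allocated_cpus * elapsed_seconds

    total_jobs = sum(job_counts.values())
    share = {state: n / total_jobs for state, n in job_counts.items()} if total_jobs else {}
    lost = sum(t for state, t in cpu_seconds.items() if state != "COMPLETED")
    return share, lost


# Hypothetical sample: half of the jobs succeed, the rest time out or fail.
sample_jobs = [
    ("COMPLETED", 4, 100),
    ("TIMEOUT", 8, 200),
    ("FAILED", 2, 50),
    ("COMPLETED", 1, 10),
]
share, lost_cpu_seconds = summarize_exit_states(sample_jobs)
print(share)             # {'COMPLETED': 0.5, 'TIMEOUT': 0.25, 'FAILED': 0.25}
print(lost_cpu_seconds)  # 1700
```

Weighting each state by allocated CPU time, rather than only counting jobs, matters because a single large job that times out can waste more of the cluster than many small failures.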
To highlight the importance of the information that can be extracted from these dashboards, let’s take a look at a real example from an HPC center in the image below. The second pie chart shows that 50% of the jobs finished successfully (in green), the other half finished unsuccessfully, mostly due to a timeout (in purple), which is indicative of an underlying human error, and less than 1% finished unsuccessfully due to hardware issues (yellow and orange).
When jobs run in a local cluster, these hidden efficiency issues affect other users and the return on investment of the cluster. Moreover, they could even affect the economics and the sustainability of the service if, due to the poor performance of the cluster, workloads have to be shifted to the cloud as part of a cloud bursting strategy or, potentially, to recover from a disaster.
This new technology is a must for institutions that face cluster congestion issues, want to maximize their return on investment, or need to keep their cloud bursting budget under control. Additionally, it helps the HPC center define what is reasonable in terms of resource usage, and educate users who allocate more resources than they need on how to use the cluster properly.