18-19/06/2020 Benjamin Depardon

Addressing Congestion & Contention issues on HPC Clusters with Analyze-IT

While HPC clusters are designed to run at “peak capacity”, administrators often face Congestion and Contention issues, with jobs piling up in queues. In such cases, whether the cluster is under-utilized (Congestion) or running at full capacity (Contention), delivering good QoS to end users is the administrators’ priority.

UCit’s framework provides a set of customizable tools, such as Analyze-IT and Predict-IT, that help identify the optimum strategies (either on-premises or in the Cloud) for matching capacity to demand in these situations.

Fed by HPC clusters’ logs (accounting, applications…), it offers capabilities to explore the behavior of users and jobs on the cluster and to detect problematic events, with the aim of recommending corrective actions. It also allows specific ML predictors to be trained, providing tailor-made recommendations on jobs’ parameters and feedback to users prior to job submission.
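To make the Congestion/Contention distinction concrete, here is a minimal Python sketch of the kind of check that could be run on accounting records. It is not Analyze-IT's API or method: the `JobRecord` fields, the `classify` function and both thresholds are hypothetical, chosen only to illustrate the definitions given above (long waits on an under-utilized cluster versus long waits on a full one).

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class JobRecord:
    """Hypothetical accounting record; times are in seconds."""
    submit: float
    start: float
    end: float
    cores: int


def utilization(jobs, total_cores, t0, t1):
    """Fraction of available core-seconds used by jobs within [t0, t1)."""
    used = 0.0
    for job in jobs:
        overlap = min(job.end, t1) - max(job.start, t0)
        if overlap > 0:
            used += overlap * job.cores
    return used / (total_cores * (t1 - t0))


def classify(jobs, total_cores, t0, t1,
             wait_threshold=600.0, busy_threshold=0.9):
    """Label a time window as 'ok', 'congestion' or 'contention'.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if not jobs:
        return "ok"
    median_wait = median(job.start - job.submit for job in jobs)
    if median_wait <= wait_threshold:
        return "ok"
    # Jobs wait too long: is the cluster actually full?
    if utilization(jobs, total_cores, t0, t1) >= busy_threshold:
        return "contention"   # full cluster, demand exceeds capacity
    return "congestion"       # jobs queue although resources sit idle


if __name__ == "__main__":
    jobs = [JobRecord(0, 40, 50, 1), JobRecord(0, 60, 70, 1)]
    print(classify(jobs, total_cores=4, t0=0, t1=100, wait_threshold=10))
```

In practice the records would come from the scheduler's accounting logs, and the thresholds would have to be tuned to each site's QoS targets.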

This talk will present the framework’s current capabilities and illustrate, based on real use cases, how to identify problematic behaviors and possible solutions.
