While HPC clusters are designed to run at “peak capacity”, administrators often face Congestion and Contention issues, with jobs piling up in queues. In such cases, whether the cluster is under-utilized (Congestion) or running at full capacity (Contention), delivering good quality of service (QoS) to end users is the administrators’ priority.
UCit’s framework provides a set of customizable tools, such as Analyze-IT and Predict-IT, that help identify the optimal strategies (either on-premises or in the cloud) for matching capacity to demand in these situations.
Fed by HPC cluster logs (accounting, applications…), the framework makes it possible to explore the behavior of users and jobs on the cluster and to detect problematic events, with the aim of recommending corrective actions. It also supports training dedicated ML predictors that give users tailor-made recommendations on job parameters, along with feedback, before they submit a job.
This talk will present the framework’s current capabilities and illustrate, based on real use cases, how to identify problematic behaviors and possible solutions.