Easy and Efficient Energy management with EAR

EAR is an energy management framework for data centers. It offers a complete set of coordinated components covering different layers of the software stack: the EAR runtime; the EAR daemons running on the compute nodes; the EAR DB manager, which handles database accesses; the EAR global manager, which controls the total energy/power consumption; and the EAR plugin, which handles notifications for job submissions.

The EAR daemon and the job submission plugin make EAR transparent to users, delivering the first, and probably main, goal of EAR: it must be EASY to use. Adding the EAR runtime to the equation brings the second main capability: energy efficiency.

The EAR runtime offers energy optimization for MPI applications. It is driven by a highly efficient algorithm that automatically detects application loops. Once the application structure has been detected, an application signature is computed per iteration and used to apply energy models and policies. The output of the energy policy is the optimal CPU frequency for that region of code.
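
EAR's actual loop-detection algorithm and energy models are not described in this abstract. The sketch below only illustrates the two ideas under simple, hypothetical assumptions: a loop is detected as the smallest period of a repeating sequence of event identifiers (for example, MPI call names), and a minimum-energy policy picks, from per-frequency predictions of iteration time and power, the frequency with the lowest predicted energy whose slowdown stays within a limit. The function names, the prediction tables and the 10% limit are illustrative, not EAR's API.

```python
from typing import Dict, Hashable, Optional, Sequence


def detect_period(events: Sequence[Hashable], min_reps: int = 2) -> Optional[int]:
    """Smallest period p such that the last min_reps * p events are the same
    p-event pattern repeated; None if no such repetition exists yet."""
    n = len(events)
    for p in range(1, n // min_reps + 1):
        tail = list(events[n - min_reps * p:])
        # Every event must match the event p positions earlier in the pattern.
        if all(tail[i] == tail[i % p] for i in range(len(tail))):
            return p
    return None


def pick_frequency(time_s: Dict[float, float], power_w: Dict[float, float],
                   nominal_ghz: float, max_slowdown: float = 0.10) -> float:
    """Minimum-energy policy: among frequencies whose predicted iteration time
    is within max_slowdown of the nominal one, return the frequency with the
    lowest predicted energy (time * power)."""
    t_ref = time_s[nominal_ghz]
    allowed = [f for f in time_s if time_s[f] <= t_ref * (1 + max_slowdown)]
    return min(allowed, key=lambda f: time_s[f] * power_w[f])
```

For example, with made-up predictions of 100 s, 105 s and 130 s and 300 W, 240 W and 200 W at 2.4, 2.0 and 1.6 GHz, the 10% limit excludes 1.6 GHz and the policy returns 2.0 GHz (25,200 J instead of 30,000 J).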

The presentation will provide an overview of EAR, focusing on the EAR runtime and on how data center users work with EAR. Because EAR reports data to a database, EAR-aware users can easily collect advanced information about their jobs with simple command-line programs, without having to use multiple external tools that often need root privileges or are difficult to understand.

Addressing Congestion & Contention issues on HPC Clusters with Analyze-IT

While HPC clusters are designed to run at “peak capacity”, administrators often face Congestion and Contention issues, with jobs piling up in queues. In such cases, whether the cluster is under-utilized (Congestion) or running at full capacity (Contention), delivering good QoS to end users is the administrators’ priority.
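
The Congestion/Contention distinction above can be expressed as a simple rule over a snapshot of the cluster. This is a minimal sketch, not Analyze-IT's logic: the `ClusterSnapshot` fields and the 95% "full capacity" threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    busy_cores: int
    total_cores: int
    pending_jobs: int


def classify(s: ClusterSnapshot, full_threshold: float = 0.95) -> str:
    """Return "ok", "congestion" or "contention" for one point in time."""
    if s.pending_jobs == 0:
        return "ok"  # nothing is waiting, so there is no queueing problem
    utilization = s.busy_cores / s.total_cores
    # Jobs wait although cores are idle -> congestion;
    # jobs wait because the machine is full -> contention.
    return "contention" if utilization >= full_threshold else "congestion"
```

Applied over a time series of snapshots built from accounting logs, the share of time spent in each state hints at whether scheduling policy (congestion) or raw capacity (contention) is the bottleneck.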

UCit’s framework provides a set of customizable tools, such as Analyze-IT and Predict-IT, designed to help identify the optimal strategies (on-premises or in the Cloud) for matching capacity to demand and responding properly to these situations.

Fed by HPC cluster logs (accounting, applications…), it offers capabilities to explore the behavior of users and jobs on the cluster and to detect problematic events, with the aim of recommending corrective actions. It also allows training specific ML predictors that give users tailor-made recommendations on job parameters, and feedback, before they submit their jobs.
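
As an illustration of a per-user recommendation on job parameters, the sketch below suggests a walltime from a user's past accounting records. It deliberately uses a simple statistical baseline (the largest observed used/requested ratio plus a safety margin) instead of a trained ML model; the record layout and the 20% margin are hypothetical, and this is not how Predict-IT works.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical accounting record: (user, requested_hours, used_hours).
Record = Tuple[str, float, float]


def recommend_walltime(history: Iterable[Record], user: str,
                       requested_h: float, margin: float = 1.2) -> float:
    """Suggest a walltime for `user`: scale the request by the largest
    used/requested ratio seen in that user's past jobs, add a safety margin,
    and never exceed the original request."""
    ratios: Dict[str, List[float]] = defaultdict(list)
    for u, req, used in history:
        if req > 0:
            ratios[u].append(used / req)
    past = ratios.get(user)
    if not past:
        return requested_h  # no history: keep the user's own request
    return min(requested_h, requested_h * max(past) * margin)
```

Tighter walltime requests help the scheduler backfill, which is one of the levers against queue congestion.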

This talk will present the framework’s current capabilities and, based on real use cases, illustrate how to identify problematic behaviors and possible solutions.

GekkoFS – A Temporary Burst Buffer File System for HPC Applications

Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data, while storage systems in today’s HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, and randomized file I/O. GekkoFS is a temporary, highly scalable burst buffer file system that pools the node-local storage available within compute nodes and has been optimized for these use cases. The file system provides relaxed POSIX semantics, offering only the features that most (though not all) applications actually require. GekkoFS provides scalable I/O performance and reaches millions of metadata operations per second with only a small number of nodes, significantly outperforming conventional parallel file systems. In this talk, we will present GekkoFS’ motivation and the fundamental design considerations that result in its almost linear scalability. Further, we will discuss its performance and future goals.
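
The abstract does not detail how GekkoFS spreads data over the pooled node-local storage. The sketch below illustrates one common approach for such designs, assumed here rather than taken from GekkoFS: hash the path to choose the node holding its metadata, and hash the path plus chunk index to spread file data. The function names, the 512 KiB chunk size and the SHA-256 choice are assumptions.

```python
import hashlib
from typing import List, Tuple

CHUNK_SIZE = 512 * 1024  # assumed chunk size in bytes (illustrative)


def node_for(key: str, num_nodes: int) -> int:
    """Map a key to a node with a stable hash (Python's built-in hash() is
    salted per process, so it would not agree across clients)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_nodes


def metadata_node(path: str, num_nodes: int) -> int:
    """Node responsible for the metadata of `path`."""
    return node_for(path, num_nodes)


def chunk_placement(path: str, offset: int, length: int,
                    num_nodes: int) -> List[Tuple[int, int]]:
    """(chunk_id, node) pairs for every chunk touched by [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return [(c, node_for(f"{path}#{c}", num_nodes)) for c in range(first, last + 1)]
```

Because every client computes the same placement independently, no central metadata server is consulted on the I/O path, which is one way such a file system can scale metadata operations with the number of nodes.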

XALT: Job-Level Usage Data on Today’s Supercomputers

We’re interested in what users are actually doing: everything from which applications, libraries, and individual functions are in demand, to preventing the problems that get in the way of successful computational research. This year we’re especially interested in some of the next great challenges, including:

  1. Understanding the needs of formerly non-traditional research communities that comprise half the user community and whose non-MPI workflows consume more than a third of the computing cycles
  2. Tracking container usage on HPC systems
  3. Tracking the use of GPUs

We are now tracking individual Python packages, and similar usage within other frameworks such as R and MATLAB. XALT (xalt.readthedocs.org) is a battle-tested tool focused on job-level usage data; it has a well-documented history of helping administrators, support staff, and decision makers manage and improve their operations. The small but growing list of centers that run XALT includes NCSA, UK NCC, KAUST, NICS, the University of Utah, and TACC. Join us for a far-ranging discussion that will begin with an overview of new XALT capabilities before venturing into broader strategic and technical issues related to job-level activity tracking.
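
Once package-level data is collected per job, a question such as "which Python packages are in demand?" becomes a small aggregation. The sketch below assumes a hypothetical record of (job ID, user, imported packages); XALT's real record format and database schema are not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical job record: (job_id, user, Python packages imported by the job).
Record = Tuple[str, str, List[str]]


def package_popularity(records: Iterable[Record]) -> List[Tuple[str, int]]:
    """Rank packages by the number of distinct users whose jobs imported them;
    ties are broken alphabetically."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for _job_id, user, packages in records:
        for pkg in packages:
            users[pkg].add(user)
    return sorted(((pkg, len(u)) for pkg, u in users.items()),
                  key=lambda item: (-item[1], item[0]))
```

Counting distinct users rather than jobs keeps a single heavy user's thousands of jobs from dominating the ranking.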

Genomics Optimized Servers Accelerate COVID-19 Discovery with High-Throughput Analytics

As of today, Johns Hopkins University reports over 7 million confirmed cases of COVID-19 and over 400K deaths in 188 countries around the world. With the statistics still climbing, biomedical researchers and clinicians around the world are racing to better understand the novel coronavirus and to develop innovative ways to combat the pandemic. Have you ever wondered:

  • What type of insights are COVID-19 researchers interested in?
  • What is the role of HPC in enabling COVID-19 discoveries?
  • Or, what can scientific computing teams do to support their Life Sciences researchers in these efforts?

In this seminar, I will address these questions and issue a call to action, pointing to specific areas where HPC-savvy people can contribute. I will also show the work our team is spearheading to accelerate high-throughput genomics analytics, reducing the execution time of genomics workflows from days to minutes and thereby accelerating the path to COVID-19 discoveries.

European Environment for Scientific Software Installations (EESSI)

What if there was a way to avoid having to install a broad range of scientific software from scratch on every HPC cluster or cloud instance you use or maintain, without compromising on performance?

The European Environment for Scientific Software Installations (EESSI, pronounced as “easy”) is a brand new collaboration between European HPC sites and industry partners, with the common goal of setting up a shared repository of scientific software installations that can be used on a variety of systems, regardless of which flavor/version of Linux distribution or processor architecture is used, or whether it’s a full-size HPC cluster, a cloud environment, or a personal workstation.
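
How can one shared repository serve different processor architectures without compromising on performance? A minimal sketch, assuming the repository holds one optimized build per CPU target and each client picks the most specific build its CPU can run, falling back to a generic build. The target names and the compatibility table are hypothetical, not EESSI's actual layout.

```python
from typing import Dict, List, Optional

# Hypothetical compatibility table: for each host CPU target, the build
# targets whose binaries it can run, ordered from most to least specific.
COMPATIBLE: Dict[str, List[str]] = {
    "x86_64/intel/skylake_avx512": [
        "x86_64/intel/skylake_avx512", "x86_64/intel/haswell", "x86_64/generic"],
    "x86_64/intel/haswell": ["x86_64/intel/haswell", "x86_64/generic"],
    "x86_64/amd/zen2": ["x86_64/amd/zen2", "x86_64/generic"],
    "aarch64/generic": ["aarch64/generic"],
}


def select_build(host: str, available: List[str]) -> Optional[str]:
    """Return the most specific available build the host can run, or None."""
    for target in COMPATIBLE.get(host, []):
        if target in available:
            return target
    return None
```

The same software tree can then be mounted everywhere, while each machine transparently uses the most optimized binaries it supports.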

The concept is heavily inspired by the Compute Canada software stack, which was presented at PEARC’19 under the title “Providing a Unified Software Environment for Canada’s National Advanced Computing Centers”.

It consists of three layers:

In this talk, we will present how the EESSI project grew out of a need for more collaboration to tackle the challenges in the changing landscape of scientific software and HPC system architectures. The project structure will be explained in more detail, covering the motivation for the layered approach and the choice of tools, as well as the lessons learned from the Compute Canada approach. Finally, we will outline the goals we have in mind and how we plan to achieve them going forward.

===
For more information about the EESSI project:

* website: https://www.eessi-hpc.org
* GitHub: https://github.com/EESSI
* documentation: https://eessi.github.io/docs
* Twitter: https://twitter.com/eessi_hpc
===

On-Prem performances with the flexibility of the cloud

HPC has been a late mover to the cloud. Many companies tried to get their software running on the public cloud, but there was always one main problem: performance. Oracle Cloud brings on-prem performance in an elastic model. Thanks to bare metal instances and ultra-low-latency RDMA, an HPC workload runs as fast as, and often faster than, on an on-premises cluster. This presentation will show how we cracked it and demo how to spin up a cluster that is ready to go in a few minutes.

Scalable Management of HPC Datasets with mpiFileUtils

High-performance computing users generate large datasets by executing parallel applications running many processes, up to millions in some cases. Those datasets vary in structure from one extreme of large directory trees with many small files to the other extreme of just a single large file. However, users often must resort to single-process tools like cp, mv, and rm to manage those massive datasets. This mismatch in scale makes even basic tasks like copying, moving, and deleting datasets painfully slow.

mpiFileUtils provides a library called libmfu and a suite of MPI-based tools to manage large datasets. The mpiFileUtils suite provides tools to handle typical jobs like copy, remove, and compare. It achieves speedups of more than 100x over the traditional single-process tools. Furthermore, libmfu facilitates easy creation of new tools by consolidating common functionality, data structures, and file formats into a common library. The library can even be called directly from HPC applications if so desired. mpiFileUtils runs on the same scalable HPC resources as the application, and as a result, basic data management tasks that used to require hours of time can now be completed in minutes.
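
mpiFileUtils spreads the work across MPI processes on many nodes. As a much smaller, single-node analogy of the same idea, the sketch below walks a directory tree once and then copies the files concurrently with a thread pool. It is not mpiFileUtils' implementation or API; `plan_copy` and `parallel_copy` are hypothetical names.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def plan_copy(src: str, dst: str) -> List[Tuple[str, str]]:
    """Walk src, create the matching directory tree under dst, and return
    (source, destination) pairs for every file entry."""
    pairs = []
    for root, _dirs, files in os.walk(src):
        target_dir = os.path.normpath(os.path.join(dst, os.path.relpath(root, src)))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            pairs.append((os.path.join(root, name), os.path.join(target_dir, name)))
    return pairs


def parallel_copy(src: str, dst: str, workers: int = 8) -> int:
    """Copy the tree at src into dst with a pool of worker threads;
    return the number of files copied."""
    pairs = plan_copy(src, dst)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consuming the iterator waits for all copies and re-raises any error.
        for _ in pool.map(lambda p: shutil.copy2(p[0], p[1]), pairs):
            pass
    return len(pairs)
```

Within a single node the gain is limited by local I/O bandwidth; distributing both the tree walk and the copies across many nodes, as an MPI-based tool can, is what allows the much larger speedups quoted above.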

BeeOND HPC with BeeGFS – performance consideration to set up a parallel file system

This session will provide a comprehensive, hands-on overview of how to set up and get the most out of your parallel file system, including benchmarking, performance, and overall system optimization.

The presentation will also profile customer case studies, highlighting how users are solving their challenges today by deploying a parallel file system and how they accelerate their HPC scale-out environments.

UnifyFS: A file system for burst buffers

UnifyFS is a user-level file system that is highly specialized for fast shared file access on high performance computing (HPC) systems with distributed burst buffers. UnifyFS delivers significant performance improvements over general-purpose file systems by supporting the specific needs of HPC workloads with a reduced set of POSIX semantics called “lamination semantics”. In this talk, we will give an introductory overview of how to use the lightweight UnifyFS file system to improve the I/O performance of HPC applications. We will describe how UnifyFS works with burst buffers, the benefits and limitations of lamination semantics, and how users can incorporate UnifyFS into their jobs. Finally, we will detail the current implementation status of UnifyFS and our plans for the future.
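
The lamination contract can be sketched as a toy, single-process Python class: writes are buffered, `laminate()` merges them into one consistent view and makes the file permanently read-only. The class and method names are hypothetical, and this simplified model forbids reads before lamination, which may be stricter than what UnifyFS itself allows; it illustrates the contract, not UnifyFS's API.

```python
class LaminatedFile:
    """Toy model of lamination semantics: buffered writes, then a one-way
    transition to a read-only file with a globally consistent view."""

    def __init__(self) -> None:
        self._pending = []          # (offset, data) pairs in write order
        self._data = bytearray()
        self.laminated = False

    def write(self, offset: int, data: bytes) -> None:
        if self.laminated:
            raise PermissionError("file is laminated (read-only)")
        self._pending.append((offset, bytes(data)))

    def laminate(self) -> None:
        """Apply buffered writes in order (later writes win) and freeze the file."""
        for offset, data in self._pending:
            end = offset + len(data)
            if end > len(self._data):
                self._data.extend(b"\0" * (end - len(self._data)))
            self._data[offset:end] = data
        self._pending.clear()
        self.laminated = True

    def read(self, offset: int, length: int) -> bytes:
        if not self.laminated:
            raise RuntimeError("contents are only guaranteed visible after lamination")
        return bytes(self._data[offset:offset + length])
```

Because no process can modify a laminated file, readers never need locks or cache invalidation, which is the kind of relaxation that lets a burst-buffer file system skip costly POSIX consistency work.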

Running GPU workloads on OpenShift

Using generally available packages (in the form of container images) from an official source or a certified provider comes with a big caveat for performance-sensitive workloads. These packages may provide ABI compatibility, but they are not optimized for our specialized hardware (like GPUs or high-performance NICs), nor for our CPU architecture. The best way to address this is to compile your packages (build your images) on your own deployment.

OpenShift provides a way to seamlessly build images in response to defined events, through objects called builds. A build is the process of transforming input parameters into a resulting object. Most often, the process is used to transform input parameters or source code into a runnable image. A BuildConfig object is the definition of the entire build process.

The missing piece in building hardware-specific images is orchestrating the build process across the different available resources. In this presentation, attendees will learn about the Node Feature Discovery (NFD) operator and how to tie it to OpenShift builds to produce hardware-specific images.
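
A minimal sketch, assuming NFD has already labelled the nodes: the Python function below generates a BuildConfig manifest whose `spec.nodeSelector` pins the build pods to nodes carrying specific hardware labels, so the image is compiled on the hardware it targets. The label keys, repository URL and image tag are assumptions for illustration; check which labels NFD actually publishes on your cluster with `oc get nodes --show-labels`.

```python
import json
from typing import Dict


def hardware_buildconfig(name: str, git_uri: str, output_tag: str,
                         node_labels: Dict[str, str]) -> Dict:
    """Return a BuildConfig manifest (as a dict) whose build pods only run on
    nodes carrying all of the given labels."""
    return {
        "apiVersion": "build.openshift.io/v1",
        "kind": "BuildConfig",
        "metadata": {"name": name},
        "spec": {
            "source": {"type": "Git", "git": {"uri": git_uri}},
            "strategy": {"type": "Docker", "dockerStrategy": {}},
            "output": {"to": {"kind": "ImageStreamTag", "name": output_tag}},
            # Build pods are scheduled only on nodes matching every label below.
            "nodeSelector": dict(node_labels),
        },
    }


if __name__ == "__main__":
    labels = {
        # Assumed NFD label keys (NVIDIA PCI vendor ID, AVX-512 CPU flag).
        "feature.node.kubernetes.io/pci-10de.present": "true",
        "feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
    }
    manifest = hardware_buildconfig(
        "app-gpu-avx512", "https://example.com/org/app.git", "app:gpu-avx512", labels)
    print(json.dumps(manifest, indent=2))
```

The JSON output can be applied with `oc apply -f -`. Creating one BuildConfig per hardware flavor yields a family of images, each compiled on matching nodes.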

Performance analysis and optimization of GPU based large scale deep learning training workloads

A review of the large-scale optimizations performed on the well-known ResNet-50 training workload on DGX-based clusters. The optimization concepts, which combine profiling and modeling of the workload’s execution at scale, can be applied to other deep learning neural networks running on GPU-based clusters.
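
A common first-order model for data-parallel training at scale, used here as an illustration and not necessarily the model from the talk, adds the per-step compute time to the bandwidth term of a ring all-reduce over the gradients (latency ignored). ResNet-50 has roughly 25.6 million parameters, i.e. about 102 MB of FP32 gradients per step; the compute time, bandwidth and overlap values plugged in are assumptions to be replaced by profiled numbers.

```python
def ring_allreduce_seconds(n_gpus: int, grad_bytes: float, bw_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce: each GPU sends and receives
    2 * (N - 1) / N times the gradient size."""
    if n_gpus <= 1:
        return 0.0
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / bw_bytes_per_s


def step_seconds(n_gpus: int, compute_s: float, grad_bytes: float,
                 bw_bytes_per_s: float, overlap: float = 0.0) -> float:
    """Per-iteration time when a fraction `overlap` of the communication
    is hidden behind backpropagation."""
    comm = ring_allreduce_seconds(n_gpus, grad_bytes, bw_bytes_per_s)
    return compute_s + (1.0 - overlap) * comm


def scaling_efficiency(n_gpus: int, compute_s: float, grad_bytes: float,
                       bw_bytes_per_s: float, overlap: float = 0.0) -> float:
    """Throughput on N GPUs divided by N times single-GPU throughput
    (weak scaling with a fixed per-GPU batch)."""
    return compute_s / step_seconds(n_gpus, compute_s, grad_bytes, bw_bytes_per_s, overlap)
```

Comparing the modeled step time with the profiled one shows whether the remaining gap comes from communication (raise overlap, fuse gradient buckets) or from elsewhere in the pipeline.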

Automated benchmarking with JUBE

Benchmarking is a common task for evaluating the hardware and software environment of an HPC system during its procurement phase. HPC systems also evolve over time as libraries and software packages are updated or new hardware components are installed, and all of these changes can influence the performance of user applications as well. This creates the need for an automated benchmarking environment that allows continuous performance evaluation.

The JUBE benchmarking environment provides a flexible, lightweight, script-based framework to set up benchmark tasks on top of generic benchmark applications or using full user applications. The environment allows controlling major aspects of the benchmark execution, such as parameter variation handling, workflow execution, data handling, asynchronous execution to support HPC job submission, and benchmark result extraction.
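
JUBE benchmarks are described in JUBE's own configuration files; the Python sketch below does not use JUBE but mimics three of the concepts named above under simple assumptions: parameter variation as the Cartesian product of value lists, `$name` substitution into a job command, and regex-based result extraction. The command line and the output format are hypothetical.

```python
import itertools
import re
from string import Template
from typing import Dict, Iterator, List


def expand(parameters: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one parameter set per combination of the value lists."""
    names = list(parameters)
    for values in itertools.product(*(parameters[n] for n in names)):
        yield dict(zip(names, values))


def render(command: str, params: Dict[str, str]) -> str:
    """Substitute $name placeholders in a command template."""
    return Template(command).substitute(params)


def extract(pattern: str, output: str) -> List[float]:
    """Pull every match of a regex with one numeric capture group from a run's output."""
    return [float(m) for m in re.findall(pattern, output)]


if __name__ == "__main__":
    for p in expand({"nodes": ["1", "2"], "tasks_per_node": ["4", "8"]}):
        print(render("srun -N $nodes --ntasks-per-node=$tasks_per_node ./app", p))
```

Running such a parameter sweep on a schedule and storing the extracted values per run is what turns one-off benchmarking into continuous performance evaluation.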

The talk will present the capabilities of the current generation of the JUBE environment. It will cover the basics of how to port and configure a benchmark application to make it available within JUBE, and will discuss possible use cases such as fully automated, scheduled benchmark setups.

Enabling AI/DL on HPC Infrastructure through Containers and Open OnDemand

Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) have become pervasive in our everyday lives. As fields outside computer science and other computationally intensive programs find themselves inundated with data suited to these machine-driven learning algorithms, it has become increasingly necessary to support these non-traditional users on high performance computing (HPC) clusters designed by and for users with a heavy computer science background. Here we present a platform combining Open OnDemand and containers that enables all users to make use of available high performance compute clusters more rapidly. Importantly, the time to science (the dead time between an identified computational need, the account request, the first logon, and successful job submission) is reduced to close to zero. Additionally, we show how OnDemand, in combination with containers, reduces user support needs through an intuitive interface.

In a word, AI is impacting HPC “everywhere”!

The convergence of AI and HPC has created fertile ground for imaginative researchers versed in AI technology to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep learning and machine learning on HPC is, in a word, “everywhere”.

Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques, including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced-precision math libraries such as NLAFET and HiCMA, to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services.

Job Efficiency Monitoring in HPC Clusters

Traditional tools, like Ganglia or Munin, are not capable of representing the metrics required to identify inefficient jobs. Nor are they able to correlate events in order to recognize global issues affecting all jobs in the system.

The new monitoring tools introduced in this talk allow the HPC user support teams to identify new opportunities to improve the efficiency of the codes and/or workflows being executed on HPC resources. Thanks to the event correlation capability, the HPC user support teams are also able to evaluate the performance impact on running jobs due to systemic issues such as a high load on the cluster file system or a high rate of hardware errors in the fabric.
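
The difference between an inefficient job and a systemic issue can be sketched with one metric. A minimal sketch, assuming per-job samples of allocated cores, elapsed time and consumed CPU time over the same window; the `JobSample` fields, the 50% efficiency threshold and the 80% "most jobs" fraction are hypothetical, not the tools presented in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobSample:
    job_id: str
    cores: int
    wall_s: float  # elapsed time in the sampling window
    cpu_s: float   # CPU time consumed by all job processes in that window


def cpu_efficiency(job: JobSample) -> float:
    """Fraction of allocated core-seconds actually spent on the CPU."""
    return job.cpu_s / (job.wall_s * job.cores)


def inefficient_jobs(jobs: List[JobSample], threshold: float = 0.5) -> List[str]:
    return [j.job_id for j in jobs if cpu_efficiency(j) < threshold]


def looks_systemic(jobs: List[JobSample], threshold: float = 0.5,
                   fraction: float = 0.8) -> bool:
    """If most jobs sampled in the same window are inefficient at once, suspect
    a shared cause (e.g. file system load) rather than individual codes."""
    if not jobs:
        return False
    return len(inefficient_jobs(jobs, threshold)) / len(jobs) >= fraction
```

A single inefficient job points to the code or workflow; a simultaneous drop across most jobs points to the infrastructure, which is where event correlation comes in.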

Early adopters have been able to improve the scalability and performance of several codes and workflows. This, in turn, has accelerated research and maximised the return on investment in the HPC facilities adopting this technology.

Over time, this solution has evolved into a mechanism for evaluating real needs and trends of the end-user community, providing valuable feedback in the procurement of new computational capacity.

The large number of events to be analyzed requires the use of Big Data technologies. Data is gathered using custom codes and aggregated into Elasticsearch and InfluxDB. These open-source search and analytics engines offer high reliability and proven scalability. Finally, the data is represented by means of Grafana, a leading tool for querying and visualizing large datasets and metrics.
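
The abstract does not say how the custom collectors feed InfluxDB. One common path is InfluxDB's text line protocol (`measurement,tag=value field=value timestamp`); the sketch below formats one per-job sample that way. The measurement, tag and field names are hypothetical.

```python
from typing import Dict, Optional, Union

Field = Union[bool, int, float, str]


def _escape(s: str) -> str:
    """Tag keys, tag values and field keys escape commas, equals signs and spaces."""
    return s.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def _field_value(v: Field) -> str:
    if isinstance(v, bool):  # check bool first: bool is a subclass of int
        return "true" if v else "false"
    if isinstance(v, int):
        return f"{v}i"       # integer fields carry an 'i' suffix
    if isinstance(v, float):
        return repr(v)
    return '"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     fields: Dict[str, Field], ts_ns: Optional[int] = None) -> str:
    """Format one point; tags are sorted by key, fields keep their order."""
    if not fields:
        raise ValueError("at least one field is required")
    head = [measurement.replace(",", r"\,").replace(" ", r"\ ")]
    head += [f"{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items())]
    body = ",".join(f"{_escape(k)}={_field_value(v)}" for k, v in fields.items())
    line = ",".join(head) + " " + body
    return line if ts_ns is None else f"{line} {ts_ns}"
```

Using the job ID and host as tags makes them indexed, so a Grafana panel can filter or group per job and per node cheaply.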

The talk highlights the most recent developments in proactive job profiling. One of the most time-consuming tasks for end-user support teams is identifying efficiency issues. It usually requires re-running the same job with instrumentation tools, analysing the data and, eventually, fixing the issue. With this solution, support teams can examine all jobs, including those with efficiency issues that are really hard to reproduce.

The dashboards introduced in this talk are designed to accelerate this process by providing a proactive job profile, access to the job submit script used, and other key metrics.

Keywords: Performance Analysis, Efficiency, Scalability, Job profiling

Sarus: Highly Scalable Docker Containers for HPC Systems

Sarus is a container engine for HPC systems that provides the means to instantiate high-performance containers from Docker images. It has been designed to address the unique requirements of HPC containers, such as integration with hardware accelerators, quick deployments at scale, security and permissions enforcement on multi-tenant hosts, and parallel filesystems. Sarus leverages the Open Container Initiative (OCI) specifications to extend the capabilities of a standard runtime through dedicated hook programs, implementing container customization at deployment time and without user intervention.

This presentation will highlight how OCI hooks can enable portable, infrastructure-agnostic images to achieve native performance on top of HPC-specific devices such as GPUs and high-speed interconnects. Thanks to their standalone nature and standard interface, OCI hooks can be independently developed to target specific features, and can be configured according to the characteristics of particular host systems. The same container image can thus be used across the whole development workflow, from early tests on a personal workstation to deployments at scale on flagship systems, while benefiting from the advantages of each platform.
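
Under the OCI runtime specification, a hook is a separate program that the runtime invokes with the container state (including `id`, `pid` and `bundle`) as JSON on its standard input; the bundle directory contains the container's `config.json`. The Python sketch below shows only that interface. It is not one of Sarus's hooks, and the `EXAMPLE_HOOK_ENABLE` opt-in variable is hypothetical.

```python
import json
import os
import sys
from typing import Dict, IO


def read_state(stream: IO[str]) -> Dict:
    """Parse the container state JSON that the OCI runtime writes to the hook's stdin."""
    return json.load(stream)


def container_env(bundle: str) -> Dict[str, str]:
    """Read process.env ("KEY=VALUE" strings) from the bundle's config.json."""
    with open(os.path.join(bundle, "config.json")) as f:
        config = json.load(f)
    env: Dict[str, str] = {}
    for entry in config.get("process", {}).get("env", []):
        key, _, value = entry.partition("=")
        env[key] = value
    return env


def main(stdin: IO[str]) -> int:
    state = read_state(stdin)
    env = container_env(state["bundle"])
    # Hypothetical opt-in switch: do nothing unless the container asked for it.
    if env.get("EXAMPLE_HOOK_ENABLE") != "1":
        return 0
    # A real hook would now bind-mount devices or libraries into the container
    # (state["pid"] identifies its init process); this sketch only reports.
    print(f"would customize container {state['id']} (pid {state['pid']})", file=sys.stderr)
    return 0
```

Installed as a hook, the script would end with `sys.exit(main(sys.stdin))`. Keeping the decision logic driven by the container's own configuration is what lets the same image run unchanged on hosts with or without the hook.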

Parallel coupling strategy for multi-physics applications in eXtended Discrete Element Method

Multi-physics problems involving discrete particles that interact with fluid phases are common in industry, for example in biomass combustion on a moving grate, particle sedimentation, iron production within a blast furnace, and selective laser melting for additive manufacturing.

The eXtended Discrete Element Method (XDEM) uses a coupled Eulerian-Lagrangian approach to simulate these complex phenomena, relying on the Discrete Element Method (DEM) to model the particle phase and on Computational Fluid Dynamics (CFD) for the fluid phases, solved respectively with XDEM and OpenFOAM. However, such simulations are very computationally intensive. Additionally, because the DEM particles move within the CFD phases, a 3D volume coupling is required, which implies a large amount of data to be exchanged. This volume of communication can have a considerable impact on the performance of the parallel execution.

To address this issue, XDEM proposes a coupling strategy relying on co-located partitioning. This approach coordinates the domain decomposition of the two independent solvers, XDEM and OpenFOAM, imposing co-location constraints that reduce the overhead of the coupling data exchange. This parallel CFD-DEM coupling strategy has been evaluated on large-scale simulations of debris within a dam break flow.
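
To see why co-location reduces coupling traffic, consider a toy 1-D model: the CFD mesh is split into equal slabs, and each particle must exchange coupling data with the process owning the cell that contains it. The sketch counts particles whose DEM owner differs from that CFD owner. It is an illustration under these simplifying assumptions, not XDEM's partitioner.

```python
from typing import Callable, List


def slab_owner(x: float, nprocs: int) -> int:
    """Rank owning position x in [0, 1) under an equal 1-D slab decomposition."""
    return min(int(x * nprocs), nprocs - 1)


def remote_couplings(positions: List[float], nprocs: int,
                     dem_owner: Callable[[int, float], int]) -> int:
    """Count particles whose DEM rank differs from the CFD rank owning the
    cell that contains them; each one needs inter-process coupling data."""
    return sum(1 for i, x in enumerate(positions)
               if dem_owner(i, x) != slab_owner(x, nprocs))
```

With a co-located DEM decomposition (`lambda i, x: slab_owner(x, nprocs)`) the count is zero and all coupling stays in local memory; with an independent decomposition, for example by particle index, many particles need remote exchanges, and the imbalance grows as particles move.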
