Cool Computing. Challenges for ultra-high density compute clusters

Miguel Terol (Lenovo)

With the normalization of HPC in industry, outside research and academic environments, as well as the burst of Big Data and AI use cases across all sectors, the demand for resources from compute- and data-hungry applications is increasing exponentially. Multiplying the horsepower of our compute clusters is therefore a must. To make that happen, technology players are refining their chip and platform designs to enable much denser systems.

The trade-off of this trend is that chips are becoming increasingly power-hungry, and cooling those components becomes a sustainability challenge, both environmental and economic. In this talk we will present the high-density technology landscape and different approaches to addressing the cooling challenges.

ReFrame: A Regression Testing and Continuous Integration Framework for HPC systems

Vasileios Karakasis (CSCS)

Regression testing of HPC systems is of crucial importance when it comes to ensuring the quality of service offered to end users. At the same time, it poses a great challenge to systems and application engineers to continuously maintain regression tests that cover as many aspects of the user experience as possible. In this presentation, we introduce ReFrame, a new framework for writing regression tests for HPC systems. ReFrame is designed to abstract away the complexity of the interactions with the system and to separate the logic of a regression test from the low-level details, which pertain to the system configuration and setup. Regression tests in ReFrame are simple Python classes that specify the basic parameters of the test plus any additional logic. The framework loads the test and sends it down a well-defined pipeline that takes care of its execution. All the system interaction details, such as programming environment switching, compilation, job submission, job status queries, sanity checking and performance assessment, are performed by the different pipeline stages. Thanks to its high-level abstractions and modular design, ReFrame can also serve as a tool for continuous integration (CI) of scientific software, complementary to other well-known CI solutions. Finally, we present the use cases of two large HPC centers that have adopted or are now adopting ReFrame for regression testing of their computing facilities.
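
As a flavour of what such a test can look like, here is a minimal sketch written in the style of ReFrame's public tutorials; the class-based API shown is ReFrame's, but the specific check (compiling and running a "hello world" program) and the regex are illustrative and may differ slightly between ReFrame versions:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class HelloTest(rfm.RegressionTest):
    def __init__(self):
        super().__init__()
        self.descr = 'Compile and run a minimal hello-world check'
        # Where the test may run and which programming environments it supports
        self.valid_systems = ['*']
        self.valid_prog_environs = ['*']
        # Source file to compile; compilation and job submission are handled by the pipeline
        self.sourcepath = 'hello.c'
        # Sanity check applied to the job output by the sanity stage
        self.sanity_patterns = sn.assert_found(r'Hello, World!', self.stdout)
```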

Taming I/O-hungry application beasts with NVMesh & BeeGFS

Sven Breuner (Excelero)

This presentation will show the details of NVMesh, a new generation of software-defined storage designed specifically for NVMe, and how to combine NVMesh with the HPC cluster file system BeeGFS.

As a hardware technology, NVMe has the potential to solve I/O starvation problems and to vastly increase the efficiency of today’s most demanding applications, enabling new I/O patterns and new algorithms. However, while adding NVMe drives to a server is easy and increasingly affordable nowadays, sharing NVMe over the network for clustered applications without sacrificing much of their performance is often considered a big challenge.

An Update on Singularity Containers, and a Peek into the Future Roadmap

Eduardo Arango (Sylabs)

Singularity is increasingly recognized as the ideal container technology for AI, Machine/Deep Learning, compute-driven analytics, and Data Science. The recently released version 3.0 of this open source software incorporates a number of significant enhancements that span from the core of the software itself to the enabling ecosystem that surrounds it. The purpose of this presentation is therefore to provide a technical overview of the following enhancements: the reimplementation of the Singularity core in a combination of Go and C; the introduction of the Singularity Image Format (SIF) as a file-based paradigm for encapsulating cryptographically signable and verifiable container images; the expansion of the Singularity ecosystem through cloud-hosted services for signing and verifying cryptographic keys for SIF images, for remotely building images, and a repository for storing and sharing images; plus miscellaneous enhancements regarding instance support and networking management. Platform enhancements, together with an expanded and better-enabled container ecosystem, combine to set Singularity apart as the optimal choice for compute-driven workloads wherever they exist. Because the Go-based core and SIF enhancements are essential to the roadmap for Singularity, allusions are made here to standards compliance as well as integration with Kubernetes for container orchestration.

Containerized Convergence of Big Data and Big Compute

Christian Kniep (QNIB Solution)

Since the early days of High Performance Computing, the operation and usage of such systems has been highly dependent on the vendor, the discipline, the diversity of use-cases and the community it was embedded in.

The use-cases and communities of Big Data and Big Compute are converging, due to commoditisation in hardware (x86) and software (Linux), the growing importance of big enterprise IT, and the advent of HPC characteristics in AI/ML workloads.

With the introduction of containers as a distribution artefact and Kubernetes as an orchestration substrate, this convergence may have received its final push.

This talk will dissect the convergence by refreshing the audience’s memory on what containerization is about, segueing into why AI/ML workloads will eventually lead to fully fledged HPC applications, and how this will inform the way forward.

In conclusion, Christian will discuss the three main challenges in container technology, `Hardware Access`, `Data Access` and `Distributed Computing`, and how they can be tackled by the power of open source, focusing on the first.

Containing Without Containers

Robert Tracey (IBM)

More and more researchers are expanding their workflows to include multiple applications that would normally require an administrator to install, as these applications traditionally install into all parts of the operating system. This causes delays to research and increases the administrator’s workload, both with the installation and the upkeep of each new application. Containers have been a great way to solve this, but what happens when running on an HPC cluster without containers?

This talk is part of a wider data-centric set of activities looking at workflows, and presents a way to build and use contained applications without requiring an administrator to install additional components. By utilizing techniques usually employed in High Availability clusters, researchers are able to install applications for Machine Learning and Big Data jobs in a single location, while still making them available to the whole HPC cluster, if needed, via the job scheduling tool used in their environment, such as IBM Spectrum LSF, Slurm or Flux.

OpenHPC: Community Building Blocks for HPC Systems

Karl W. Schulz (OpenHPC)

Over the last several years, OpenHPC has emerged as a community-driven stack providing a variety of common, pre-built ingredients to deploy and manage an HPC Linux cluster, including provisioning tools, resource management, I/O clients, runtimes, development tools, containers, and a variety of scientific libraries. Formed initially in November 2015 and formalized as a Linux Foundation project in June 2016, OpenHPC has been adding new software components and now supports multiple OSes and architectures. This presentation will give an overview of the project and the currently available software, and will highlight recent changes along with general project updates and future plans.

Immersion cooling, High Performance Cooling for HPC

Daniele Rispoli (Submer)

The design of modern HPC centers is mostly dictated by legacy constraints, both in terms of hardware and of the infrastructure necessary to host and keep the machinery running in a safe operational environment.

The time it takes to prepare the server halls, the space wasted because power dissipation limits impose a sparsely populated setup, and the sheer amount of electricity required just to cool down the hardware all greatly impact the TCO of anyone in need of HPC capabilities.

But what if there were a better way that would allow everyone to consume less energy, save space and therefore be more eco-friendly? An option that would allow those building data centers to decrease capital costs and allow their customers to deploy any type of IT hardware in a faster, easier, safer and more scalable way?

We, at Submer Immersion Cooling, believe we have developed such a solution through our revolutionary SmartPod technology, which leverages our uniquely designed synthetic liquid, the SmartCoolant, and its Cooling Distribution Unit (CDU) companion to achieve an unprecedented level of:

  • Energy efficiency, by consuming less than 50% of the energy of an air-cooled data center, thus reaching a Power Usage Effectiveness (PUE) coefficient of <1.03;
  • Density, thanks to a dissipation capacity of over 50 kW in the space of two standard racks, saving >85% of physical space;
  • Eco-friendliness, not only due to the energy and space savings but also thanks to the biodegradable coolant, which not only cools the servers but also transports the heat efficiently so it can easily be reused for other purposes.

Our product is also modular and composable, allowing much faster deployment, scaling and servicing of installations of any size.

Born of HPC and made for HPC: we developed our solution using extensive CFD analysis to ensure a homogeneous operating environment for all components of a server, with 3 °C ≤ ∆T ≤ 5 °C, thus providing a uniform and protected medium for the machines to compute in.

Our SmartCoolant is kind to the environment and kind to hardware, extending hardware life by >80%.

With concrete plans to “scale out” and “scale up” our immersion cooling solution, from containerized edge deployments to complete data centers, we firmly believe we are addressing the most pressing needs of HPC centers around the globe, enabling them to produce valuable scientific discoveries in a faster, cheaper and more environmentally friendly way.

How BeeGFS excels in extreme HPC scale-out environments

Alexander Eekhoff (ThinkParQ)

BeeGFS is an open-source parallel file system and one of the fastest-growing middleware products for HPC and other performance-oriented environments. Deployed by thousands of users around the globe, BeeGFS is strongly favored by the HPC, AI, Deep Learning, Life Science and Oil and Gas communities.

This session will provide an architectural overview of BeeGFS, including a sneak peek into the product development plan, a live demonstration of BeeOND (BeeGFS on Demand) showing by example how the burst-buffer functionality of BeeGFS can be used, and a look at BeeGFS use cases, and will further demonstrate how BeeGFS excels in extreme HPC scale-out environments.

Need of HPC in the Himalayas

Umesh Upadhyaya (HPCNepal)

Scientific computing is an exciting realm of technology, and there is a severe lack of skills in this particular area in Nepal. Nepal is lagging far behind the rest of the world in scientific research, and it also lacks the human resources and investment to leapfrog past its global peers by adopting cutting-edge technologies. Having a first good-sized HPC resource at Kathmandu University will attract new postgrad researchers, entire research projects and, critically, larger grants. Nepal can also be a place to host HPC data centers, as it has reliable and highly affordable renewable energy resources.

The most important areas where Nepal needs HPC are:

  • Weather Forecasting and Climate Change
  • Seismic Data Processing
  • Hydroelectric potential of different rivers (Nepal has 40,000 MW of economically feasible hydropower potential and holds 2.2% of the world’s water resources)
  • Extensive research in Ayurveda

HPCNepal.org, a not-for-profit organization, is currently working pro bono to configure Nepal’s first supercomputing facility at Kathmandu University. As HPC resources are extremely scarce, the organization plans to introduce community outreach, provide technology solutions in scientific computing, and offer HPC consulting services to government agencies and institutions.

Experiences developing and running numerical simulations on HPC platforms: BSIT and GeNESiS

Claudia Rosas (BSC-CNS)

Developing efficient HPC software, especially numerical simulations, is a challenging task for scientists and engineers. Sometimes not-so-complex algorithms require hundreds of lines of code and many programming hours just to run correctly, which does not always mean they run efficiently. Without modular programming, the code may grow in all directions, becoming harder to read, debug and understand, not to mention the slower programming pace for new developers joining the project.

When aiming to run numerical simulations on HPC platforms, developers can benefit from the standards and the low-level details of providing a working system that are inherent to such frameworks, thereby reducing overall development time.

The Barcelona Subsurface Imaging Tool (BSIT) and the General Numerical Engine and Simulation System (GeNESiS) are two of the frameworks developed at BSC to provide the user with a flexible environment for handling every level of the development, from hardware management to the problem itself. In this talk, we present the general benefits of using these frameworks on current HPC platforms and describe how both are applied to develop and deploy wave-propagation simulations, which are highly valuable for the Oil & Gas industry. The knowledge gained from working with BSIT has motivated the concepts and methodology behind GeNESiS, consolidating years of experience into one robust, flexible and modern tool for numerical simulations.

Chapel Comes of Age: a Language for Productivity, Parallelism, and Performance

Brad Chamberlain (Cray)

Chapel is a programming language that supports productive, general-purpose parallel computing at scale. Chapel’s approach can be thought of as striving to combine the strengths of approaches like Python, Fortran, C, C++, MPI, and OpenMP, yet in a single, attractive language. Though Chapel has been under development for some time now, its performance and feature set have only recently reached the point where it can seriously be considered by users with HPC-scale scientific, data analytic, and artificial intelligence workloads.

In this talk, I will introduce Chapel for those who are new to the language, and cover recent advances, milestones, and performance results for those who are already familiar with it.

Identifying Opportunities to Improve Efficiency in HPC Clusters

Jordi Blasco (HPCNow!)

Jordi Blasco has developed a new open source monitoring tool which allows HPC user support teams to identify new opportunities to improve the efficiency of the codes being executed on HPC resources. Through continuous monitoring of job efficiency, early adopters of this new tool have been able to improve the scalability and performance of several codes and workflows. This, in turn, has accelerated research at the HPC facilities adopting this technology.

In addition, the improvement in the global efficiency of the system has had an effect on overall resource allocation, allowing HPC users to run more jobs and/or to deal with bigger problems.

Traditional tools like Ganglia are not normally capable of representing the metrics required to identify inefficient jobs. Nor are they capable of correlating events in order to identify global issues affecting all jobs in the system. The new monitoring system allows the end-user support team to evaluate the performance impact on running jobs due to systemic issues such as a high load on the cluster file system or a high rate of hardware errors in the fabric.

The large number of events to analyze requires the use of Big Data technologies. Data is gathered using custom codes and aggregated into ElasticSearch and InfluxDB; these open source search and analytics engines offer high reliability and proven scalability. Finally, the data is visualized through Grafana, a leading tool for querying and visualizing large datasets and metrics.
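
The abstract does not name the collector itself, so the following is only a hypothetical sketch of the kind of ingestion step described: one job-efficiency sample pushed into InfluxDB with the Python influxdb client, which Grafana can then query. The host name, database, measurement, tags and field names are invented for illustration.

```python
from influxdb import InfluxDBClient

# Hypothetical monitoring host and database (illustrative values only)
client = InfluxDBClient(host='monitor01', port=8086, database='hpc_jobs')

# One efficiency sample for a finished job, e.g. derived from scheduler
# accounting data and node-level counters
point = {
    'measurement': 'job_efficiency',
    'tags': {'jobid': '123456', 'partition': 'compute', 'user': 'alice'},
    'fields': {'cpu_util': 0.42, 'mem_util': 0.63, 'io_wait': 0.08},
}

client.write_points([point])  # Grafana dashboards can then plot these series
```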

The talk highlights the most relevant cases where HPC facilities applied the tool to identify tuning opportunities and accelerate research by improving code efficiency.

Keywords: Performance Analysis, Efficiency, Scalability, Job profiling.

Accelerating Earth diagnostics and metrics with Python

Saskia Loosveldt (BSC-CNS)

In order to develop metrics and diagnostics to assess the reliability of climate models, the output variables of those models need to be post-processed. This usually involves simple mathematical operations, but the large amounts of data that need to be handled slow down the computations. At the Computational Earth Sciences group, within the BSC-CNS Earth Sciences department, we are exploring the use of the Numba compiler for Python to improve the performance of the metrics and diagnostics, targeting both the CPUs and GPUs available on the CTE-POWER cluster.
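
As an illustration of the approach, here is a minimal, hypothetical Numba kernel for a typical post-processing operation; the function, field and grid are invented for this sketch and are not the group's actual diagnostics. A similar kernel can also be written for GPUs with Numba's CUDA support.

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True, fastmath=True)
def area_weighted_mean(field, weights):
    """Area-weighted spatial mean of a 2-D model field (lat x lon)."""
    total = 0.0
    wsum = 0.0
    for i in prange(field.shape[0]):      # parallel loop over latitudes
        for j in range(field.shape[1]):
            total += field[i, j] * weights[i, j]
            wsum += weights[i, j]
    return total / wsum


# Illustrative use on a random 1-degree field with cosine-latitude weights
field = np.random.rand(180, 360)
weights = np.repeat(np.cos(np.deg2rad(np.linspace(-89.5, 89.5, 180)))[:, None], 360, axis=1)
print(area_weighted_mean(field, weights))
```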

Spack and the U.S. Exascale Computing Project (ECP)

Todd Gamblin (LLNL)

The U.S. Exascale Computing Project aims to produce an exascale-ready software ecosystem by the time the first exascale systems arrive in 2021. The software stack includes applications, software packages, and libraries from across the DOE, as well as their dependencies. The stack must be built in many different configurations, and it must be simple to deploy for users, developers, and HPC administrators in many fields. To satisfy these needs, ECP has chosen Spack as its software deployment tool. Spack is an open-source package manager for HPC. Its simple, templated Python DSL allows the same package to be built in many configurations, with different compilers, flags, dependencies, and dependency versions. It is used on laptops and on the world’s largest supercomputers.

This talk will focus on Spack and the many deployment activities currently surrounding it in ECP, from coordinated software releases, to facility deployment, containerization, and continuous integration. The talk will give a basic overview of Spack, an in-depth look at deployment efforts, and a near-term Spack development roadmap.
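
To give a feel for the Python DSL mentioned above, here is a minimal, hypothetical `package.py` written in the style of Spack recipes of this era; the package name, URL, checksum and variant are invented purely for illustration.

```python
# Hypothetical Spack package recipe (package.py) illustrating the templated DSL
from spack import *


class Hello(AutotoolsPackage):
    """Invented autotools-based library used only to illustrate Spack's DSL."""

    homepage = "https://example.org/hello"
    url = "https://example.org/hello-1.0.tar.gz"

    version('1.0', sha256='0' * 64)  # placeholder checksum

    variant('shared', default=True, description='Build shared libraries')
    depends_on('zlib')

    def configure_args(self):
        # The same recipe serves many configurations, e.g. `spack install hello+shared %gcc`
        return ['--enable-shared' if '+shared' in self.spec else '--disable-shared']
```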

10 years of EasyBuild, and the Road Ahead

Kenneth Hoste (Ghent University)

Since its creation in the summer of 2009, EasyBuild has evolved into a standard tool for installing scientific software on HPC systems, backed by an active and engaged worldwide community.

Hence, it is time to look back at how we got to this point, the major developments made in the last couple of years, the surprises we ran into along the way, and the challenges and opportunities that lie ahead.

The changes in the upcoming EasyBuild version 4.0 will be discussed, as well as the integration with container technologies like Singularity & Docker.
