Evolution of the HPC Landscape in the Coming Future

Jordi Blasco (HPCNow!)

Over the next couple of years, exciting new hardware and software technologies will define a completely new HPC landscape.

New architectures will dramatically change the way we use HPC resources and the way we write code. New fabric technologies will enable new ways for nodes to communicate and will redefine the concept of scalability.

Linux containers will play a key role in application portability without compromising the performance required by HPC workloads. At the same time, new hypervisor technologies will open up opportunities to accommodate security and privacy requirements, as well as features never before explored in HPC environments.

This talk will highlight the benefits of these new technologies and also the challenges that the HPC community will need to face.

Containers & HPC: What is the big deal?

Olli-Pekka Lehto (CSC)

In the last few years, containers have become one of the hottest topics of discussion in IT, and interest is also growing in HPC.

This talk will introduce various prospective use cases for containers and how they can fit into the HPC ecosystem: what the state of the art in containerised HPC is, what kinds of challenges we are facing, and how they could be addressed.

LXD: Managing containers like virtual machines but without the overhead

Stéphane Graber (Canonical Ltd.)

LXD is a container management tool built by the LXC team on top of LXC. It offers a complete REST API to manage containers across the network and supports advanced features such as fine-grained resource control and live migration.

The LXD container manager is a natural fit for anyone who needs to manage a large number of containers spread across a vast number of hosts. Its advanced resource management capabilities and support for runtime snapshotting of both the underlying filesystem and the running state make it perfect for environments where performance is critical and where the ability to suspend, migrate or abort a task is essential.
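As a flavour of what this looks like in practice, the sketch below drives a few of these operations from Python through the lxc command-line client (which itself talks to LXD's REST API). The container name, the resource limits and the remote host "host2" are made-up examples, and live migration of running state additionally requires CRIU support on both hosts.

    import subprocess

    def lxc(*args):
        # Run an `lxc` subcommand, failing loudly on a non-zero exit code.
        subprocess.run(["lxc", *args], check=True)

    # Launch a container from an Ubuntu image.
    lxc("launch", "ubuntu:16.04", "c1")

    # Fine-grained resource control, applied at runtime.
    lxc("config", "set", "c1", "limits.cpu", "2")
    lxc("config", "set", "c1", "limits.memory", "4GB")

    # Snapshot the container's filesystem state.
    lxc("snapshot", "c1", "snap0")

    # Move the container to another LXD host registered as the remote "host2".
    lxc("move", "c1", "host2:c1")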

This talk will go over the main features of LXD, give a fairly thorough demonstration of its abilities, run through the work currently being done, and answer any questions the audience may have about LXD.

A closer look at Intel Xeon and Xeon Phi (KNL) for HPC developers

Janko Strassburg (Bayncore Ltd.)

This session will present the latest HPC processors from Intel, Intel Xeon (Broadwell) and Xeon Phi (Knights Landing, KNL), from the perspective of HPC software developers, focussing especially on vectorization as well as performance optimization best practices and tools.

Lenovo update on HPC Open Source Tools

Miguel Terol (Lenovo)

In line with its philosophy of open architecture, Lenovo is committed to the development of Open Source software for HPC environments. As a founding member of the OpenHPC Community, Lenovo is contributing to several HPC Open Source tools, especially in cluster management and software deployment.

In this presentation, Lenovo introduces the latest updates on xCAT and other complementary tools (Confluent, OSMWC…) that aim to simplify and improve HPC cluster management and deployment in heterogeneous environments, both physical and virtual.

Software Carpentry: teaching computing skills to researchers

Iñigo Aldazabal (CSIC-UPV/EHU)

Software Carpentry's mission is to help scientists and engineers get more research done in less time and with less pain by teaching them basic lab skills for scientific computing. The hands-on workshops cover basic concepts and tools, including program design, version control, data management, and task automation. Participants are encouraged to help one another and to apply what they have learned to their own research problems.

These courses are typically aimed at graduate students, post-doctoral researchers and other researchers.

I will present the story and motivations behind the Software Carpentry initiative and explain how the lessons and classes are built, how the workshops are set up, and the methodology behind them. I will also talk about my experience organizing a workshop and everything that makes a Software Carpentry workshop what it is.

The challenges of Big Data: a research perspective

David Carrera (BSC)

Big Data is a hot topic these days, very relevant for the scientific community and for many commercial services. It has emerged as a new area of research and applications that goes beyond traditional data problems: the large amount of data available nowadays presents many new opportunities for analysis while also requiring new modes of thinking.

This talk will address the technological challenges associated with Big Data problems, and will also try to cover some of the hype usually associated with this field of research. The convergence between HPC and Big Data will be covered as well, discussing some of the most relevant Big Data projects in the scientific community. The talk will include details about the software and hardware approaches used in different domains, and will also look at some of the commercial solutions that address the challenges of Big Data.

Introduction to EasyBuild

Alan O'Cais (Forschungszentrum Jülich)

One unnecessarily time-consuming task for HPC user support teams is installing software for users. Due to the advanced nature of a supercomputing system (think: multiple modern multi-core microprocessors, possibly alongside co-processors like GPUs, a high-performance network interconnect, bleeding-edge compilers and libraries, etc.), compiling the software from source on the actual operating system and system architecture it is going to be used on is typically highly preferred over using readily available binary packages that were built in a generic way. Combine this environment with software applications that are developed by research scientists, who typically lack an extensive background in software development and computer science (“it wouldn’t be called research if we knew what we were doing”), and that require a multitude of (typically open source) libraries and tools as dependencies, and you have a recipe for potential disaster.

Moreover, HPC user support teams typically need to provide a variety of builds and versions for most software packages. Since supercomputers are simultaneously used by many users from different scientific domains with conflicting needs, just having a single software version installed and updating software installations ‘in place’ is simply not good enough. For a variety of reasons, traditional well-established packaging tools fall short in dealing with scientific software. Until recently, HPC sites typically invested large amounts of time, manpower and money, possibly combined with crude in-house scripting, to tackle this tedious task while trying to provide a coherent software stack. Consequently, a huge amount of duplicate work was being done. Although each system has its own specific characteristics which warrant compilation from source whenever possible, the build and install procedures that need to be followed are usually very similar across sites. Even though this was well known and recognised, there was a sheer lack of tools to automate this ubiquitous burden on HPC user support teams.
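EasyBuild, the tool introduced in this talk, addresses exactly this duplication by capturing each build and install procedure in a small, shareable "easyconfig" file written in Python syntax. The sketch below is a hypothetical easyconfig for an imaginary package: the parameter names (easyblock, name, version, toolchain, sources, dependencies, moduleclass) are EasyBuild's own, but the package, URL and dependency version are invented purely for illustration.

    # examplelib-1.0-foss-2016a.eb -- hypothetical easyconfig (Python syntax)
    easyblock = 'ConfigureMake'            # generic configure / make / make install

    name = 'examplelib'
    version = '1.0'

    homepage = 'https://example.org/examplelib'
    description = "Imaginary numerical library, used only to illustrate the format."

    # Compiler toolchain to build against (the GCC-based free & open source toolchain).
    toolchain = {'name': 'foss', 'version': '2016a'}

    source_urls = ['https://example.org/downloads/']
    sources = [SOURCE_TAR_GZ]              # EasyBuild template: examplelib-1.0.tar.gz

    # Dependencies are resolved (and, with --robot, built) by EasyBuild itself.
    dependencies = [('zlib', '1.2.8')]

    moduleclass = 'lib'

Given such a file, a single "eb examplelib-1.0-foss-2016a.eb --robot" invocation builds the package and any missing dependencies with the chosen toolchain, and generates a matching environment module automatically.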

Decrypting your genomic data privately in the cloud

Marc Sitges & Enrique Gonzalez (Made of Genes)

The use of genomic information has led to unprecedented advances in research and to the development of personalized medicine and wellness. However, harnessing the full potential of that information while guaranteeing patients’ data privacy and security poses big challenges for its incorporation into a clinical setting. At Made of Genes, we have developed a platform for scientists, healthcare providers and customers that brings together genomic data, bioinformatics expertise and medical advice to deliver unique personalized products and services.

Genomic data collection and analysis have historically been run on in-house HPC environments, making it difficult to match computing capacity with variable demand. In contrast, storing and processing genomic data in the cloud provides greater flexibility and cost-effectiveness by allocating and paying for resources as we need them.

We have set up a hybrid-cloud environment to get the best of both approaches. On the one hand, personal health information, including genomic data, is stored in our private computing infrastructure in compliance with international laws and regulations. On the other hand, de-identified copies of the genomic data are also hosted in the cloud, so cloud bursting can provide additional computing resources ad hoc without compromising customers’ privacy.

A view inside your HPC infrastructure with elasticsearch, logstash and kibana

Jo Vanvoorden (KU Leuven)

At KU Leuven we set up a complete Elasticsearch, Logstash and Kibana (ELK) stack to monitor the HPC infrastructure.

Logfiles are shipped to a central location where they are parsed and indexed into our central Elasticsearch cluster. This information is then presented through a Kibana web frontend. The web frontend allows system administrators and support engineers to easily search the logfiles of all compute nodes and login nodes. Job scheduling information, abnormal behavior, executed commands, loaded modules and much more is now searchable through a simple web interface. Making a visual representation of the results is now only a matter of a few mouse clicks. Dashboards can be created for reporting and debugging purposes. All information can be accessed and used in real time.
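As a rough illustration of what this enables, the sketch below queries such a central Elasticsearch cluster directly with the official Python client. The host name, index pattern and field names are examples only; the real fields depend on how Logstash parses the shipped logs.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://logsearch.example.org:9200"])

    # Last 20 log lines from the past hour that mention "out of memory".
    resp = es.search(
        index="logstash-*",
        body={
            "query": {
                "bool": {
                    "must": [{"match_phrase": {"message": "out of memory"}}],
                    "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
                }
            },
            "sort": [{"@timestamp": {"order": "desc"}}],
            "size": 20,
        },
    )

    for hit in resp["hits"]["hits"]:
        src = hit["_source"]
        print(src.get("host"), src.get("message"))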

This talk will give an overview of the technical setup and design constraints, cover deployment using our Puppet- and Docker-based infrastructure, and end with a real-time demo of the analytics.

EGA, European Genome-phenome Archive

Angel Carreño & Alfred Gil (CRG)

The European Genome-phenome Archive (EGA) is a permanent archive that promotes the distribution and sharing of genetic and phenotypic data consented for specific approved uses but not fully open, public distribution. The EGA follows strict protocols for information management, data storage, security and dissemination. Authorized access to the data is managed in partnership with the data-providing organizations. The EGA includes major reference data collections for human genetics research.

In this talk, some organizational topics and technical issues concerning the EGA configuration will be addressed.

The remote GPU virtualization from the rCUDA point of view

Federico Silla (UPV/DISCA)

GPUs are widely used to accelerate scientific applications, but their adoption in HPC clusters presents several drawbacks. First, in addition to increasing acquisition costs, using accelerators also increases maintenance and space costs. Second, energy consumption is increased. Third, GPUs in a cluster may present a low utilization rate. Consequently, virtualizing the GPUs of the cluster is an appealing strategy for dealing with all these drawbacks simultaneously. Additionally, cluster throughput is increased while costs and energy consumption are reduced.

In this talk, the remote GPU virtualization technique and its benefits will be presented. The talk will also introduce one of the frameworks that implement this virtualization mechanism: the rCUDA middleware. By using the rCUDA framework over a high-performance interconnect such as InfiniBand, the overhead of remote GPU virtualization is reduced to negligible values, with the net result that local and remote GPUs deliver similar performance. The rCUDA framework will be used as a case study to show that the remote GPU virtualization mechanism provides many benefits to clusters, such as doubling cluster throughput (in jobs/hour), reducing overall energy consumption by more than 40%, providing a flexible way of exposing GPUs to virtual machines in a cloud computing facility, providing a large number of GPUs to a single-node application, and more.
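As a rough sketch of how this looks from the user's side, the snippet below launches an unmodified CUDA application against two remote GPUs. The environment variable scheme (RCUDA_DEVICE_COUNT, RCUDA_DEVICE_n) follows rCUDA's documentation, while the library path, host names and application binary are invented examples.

    import os
    import subprocess

    env = dict(os.environ)

    # Make the application load rCUDA's CUDA runtime replacement instead of the
    # local one (path is an example).
    env["LD_LIBRARY_PATH"] = "/opt/rcuda/lib:" + env.get("LD_LIBRARY_PATH", "")

    # Two remote GPUs, each served by a different GPU node over InfiniBand.
    env["RCUDA_DEVICE_COUNT"] = "2"
    env["RCUDA_DEVICE_0"] = "gpunode01:0"
    env["RCUDA_DEVICE_1"] = "gpunode02:0"

    # The CUDA application itself is unchanged; it simply sees two "local" GPUs.
    subprocess.run(["./my_cuda_app"], env=env, check=True)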

LBNL Node Health Check: Introduction, Configuration, and Customization

Michael Jennings (LBNL)

Since its release to the HPC community in 2011, the Lawrence Berkeley National Laboratory (LBNL) Node Health Check (NHC) project has gained wide acceptance across the industry and has become the de facto standard community solution for compute node health checking. It provides a complete, optimized framework for creating and executing node-level checks and already comes with more than 40 of its own pre-written checks. It fully supports TORQUE/Moab, SLURM, and SGE, and can be used with other schedulers/resource managers as well (or none at all). In production at LBNL since 2010, NHC has evolved and matured to become a vital asset in maximizing the integrity and reliability of high-performance computational resources.

In this talk, we’ll discuss what makes LBNL NHC such a unique and robust solution to the problem of compute node health, look at the feature set of NHC, learn how to configure and deploy NHC, and survey many of the available checks that are supplied out-of-the-box. Time permitting, a brief introduction to writing custom or site-specific checks may also be included.
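To give a flavour of what configuration looks like, here is a minimal, hypothetical sketch of deploying and running NHC from a Python management script. The check names (check_fs_mount_rw, check_ps_service, check_hw_physmem) are among NHC's stock checks, while the thresholds, the monitored service and the catch-all "*" node match are site-specific examples.

    import subprocess
    from pathlib import Path

    # One "<node match> || <check>" rule per line; "*" applies to every node.
    NHC_CONF = """\
    * || check_fs_mount_rw /tmp
    * || check_ps_service -u root -S sshd
    * || check_hw_physmem 1024 1073741824
    """

    Path("/etc/nhc/nhc.conf").write_text(NHC_CONF)

    # Run the checks; a non-zero exit code (or the configured offline action when
    # invoked from the resource manager's health-check hook) marks the node unhealthy.
    result = subprocess.run(["nhc"], capture_output=True, text=True)
    print(result.returncode, result.stdout, result.stderr)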

Into the Job: Gaining Insight into Your Workloads Using OGRT

Georg Rath & Petar Forai (IMP/IMBA)

With the advent of modern package managers for HPC (EasyBuild, Spack, etc.), automated building of large amounts of software is becoming easier, quickly giving rise to issues related to the life-cycle management of applications. This makes tracking the applications and libraries that actually get used considerably more important. Existing solutions (module load hooks, launch wrappers) do not account for user-built software, are hard to deploy, or produce inconclusive results.

OGRT introduces a way to track the execution of programs and the shared objects they load in a lightweight manner and without launch wrappers. It supports watermarking of binaries and capturing the environment of tracked processes, and it is transparent to the user. Data is aggregated and persisted into configurable backends (currently Elasticsearch/Splunk).

OGRT is a versatile tool, which can be used to:

  • provide a census of used software (including user-built)
  • troubleshoot problems with users’ programs picking up unexpected shared libraries
  • retroactively inform users about buggy libraries
  • contribute to reproducibility of application runs
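To give a flavour of the kind of record such a tracker collects, the sketch below builds a similar record with standard tools and pushes it to an Elasticsearch backend. This is not OGRT's own lightweight in-process mechanism, just a simplified stand-in: the shared objects are resolved up front with ldd, and the index name and host are made up.

    import os
    import subprocess
    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch

    binary = "/usr/bin/python3"

    # Shared objects the dynamic linker resolves for this binary.
    ldd_out = subprocess.run(["ldd", binary], capture_output=True, text=True).stdout
    shared_objects = [line.split()[0] for line in ldd_out.splitlines() if line.strip()]

    record = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "binary": binary,
        "shared_objects": shared_objects,
        "environment": {k: os.environ[k] for k in ("USER", "HOSTNAME") if k in os.environ},
    }

    Elasticsearch(["http://logsearch.example.org:9200"]).index(
        index="software-usage", body=record
    )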

This presentation will show how easy it is to deploy OGRT and give a demo of its capabilities when plugged into an Elasticsearch backend. The production deployment within a bioinformatics-focused environment and the insights gained from analyzing the data obtained through OGRT will also be discussed.

Growth & Innovation with Huawei High Performance Computing

Antonio Garrido (Huawei)

High Performance Computing brings many challenges. How do you sustain performance? How do you cool a huge infrastructure effectively? Which components should you choose to run workloads that demand different resources? How do you predict the future? What innovations could this bring to your environment? You can find the answers to these questions in this short introduction to Huawei's IT and data centre product portfolio.

Huawei is a leading global information and communications technology (ICT) solutions provider. Through its dedication to customer-centric innovation and strong partnerships, Huawei has established end-to-end advantages in telecom networks, HPC and cloud computing. Huawei is committed to creating maximum value for customers by providing competitive solutions and services.

Directions in Workload Management

Alejandro Sanchez (SchedMD)

SchedMD is the core company behind Slurm, a free, open source workload manager designed specifically to satisfy the demanding needs of high performance computing.

The purpose of this presentation is to raise awareness of some directions in the field of HPC workload management. The areas of focus are scalability, data management and new architectures. Specifically, in the area of scalability we will talk about issues and features such as large node and core counts, power management, failure management and federated clusters. In the area of data management, we’ll focus on the burst buffer, a high-speed data store. Finally, in the area of new architectures we’ll talk about KNL (Intel Knights Landing) support in the Slurm workload manager.

Slurm Workload Simulator for Slurm 15.08.6

Massimo Benini (CSCS)

The aim of the Slurm Workload Simulator is to provide a means of executing a set of jobs, a workload, in the Slurm system without executing actual programs. The idea is to see how Slurm handles and schedules various workloads under different configurations. For instance, an administrator may be interested in knowing how a job of a given size from a particular group would be scheduled given the current workload on that system. They could set up a simulator environment with a workload representing that of the real system plus the hypothetical job, submit it to the simulator, and watch Slurm at work in this simulated environment, faster than in real time.

Introduction of a near real-time monitoring plugin in the Slurm open source software

Carlos Fenoy (F. Hoffmann - La Roche)

With the increasing number of CPU cores in the compute nodes of high performance clusters, proper monitoring tools become essential to understand the usage and behavior of the applications running in the cluster.

In this work, a new approach to near real-time monitoring is presented, using the Slurm profiling plugin to display resource usage information for each of the processes running in the cluster. This data improves understanding of the running applications and can help highlight any application-related issues to the user.
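As context, the sketch below shows the existing Slurm profiling machinery that such a plugin builds on: a job is submitted with per-task profiling enabled, and the resulting per-node HDF5 profiles are later merged with sh5util. The job script, the task count and the assumption that slurm.conf sets AcctGatherProfileType=acct_gather_profile/hdf5 are examples.

    import subprocess

    # Ask Slurm to sample task-level resource usage (CPU, memory, I/O) for the job.
    submit = subprocess.run(
        ["sbatch", "--profile=task", "--ntasks=4", "job_script.sh"],
        capture_output=True, text=True, check=True,
    )
    job_id = submit.stdout.split()[-1]          # "Submitted batch job <id>"

    # Once the job has finished, its per-node HDF5 profiles can be merged into a
    # single file for inspection (e.g. with h5py) or for a monitoring frontend:
    print("run after completion:", " ".join(["sh5util", "-j", job_id]))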
