Portable containers orchestration at scale with Nextflow

Paolo Di Tommaso (CRG)

Reproducibility has become one of the most pressing issues in biology and many other computation-based research fields. This impasse has been fuelled by the combined reliance on increasingly complex data analysis methods and the exponential growth of big data. When considering the installation, deployment, and maintenance of computational data-analysis pipelines, an even more challenging picture emerges due to the lack of community standards. Moreover, the effect of limited standards on reproducibility is amplified by the very diverse range of computational platforms and configurations on which these applications are expected to run (workstations, clusters, HPC, clouds, etc.).

Software containers are gaining broad acceptance as a solution to the problem of reproducibility of computational workflows. However, the orchestration of large containerised workloads at scale, and in a portable manner across different platforms and runtimes, poses new challenges.

This presentation will give an introduction to Nextflow, a pipeline orchestration tool that has been designed to address exactly these issues. Nextflow is a computational environment that provides a domain-specific language (DSL) meant to simplify the implementation and deployment of complex, large-scale containerised workloads in a portable and replicable manner. It allows the seamless parallelization and deployment of any existing application with minimal development and maintenance overhead, irrespective of the original programming language.
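To make the underlying idea concrete, the sketch below expresses a toy two-step workflow in plain Python, where each step is a command executed inside a container image so that it runs identically on a workstation, a cluster node, or a cloud VM. This is only an illustration of the containerised-step concept, not Nextflow's DSL, and the image names and commands are hypothetical.

```python
# Conceptual sketch only: Nextflow's DSL expresses this far more concisely and adds
# scheduling, caching, dataflow channels and executor abstraction on top.
# Assumes Docker is available; image names and commands are hypothetical placeholders.
import subprocess

def run_step(name, image, command, workdir="/data"):
    """Run one pipeline step inside a container so it behaves the same on any host."""
    print(f"[{name}] running in {image}")
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{workdir}:{workdir}", "-w", workdir,
         image, "bash", "-c", command],
        check=True,
    )

# A toy two-step workflow: the second step consumes the output of the first.
run_step("index", "example.org/tools/aligner:1.0", "aligner index genome.fa")
run_step("align", "example.org/tools/aligner:1.0", "aligner map genome.fa reads.fq > aln.sam")
```

In Nextflow the same steps would be declared as processes connected by channels, and the tool takes care of parallelising independent tasks and dispatching them to a local machine, an HPC scheduler or a cloud backend without changing the pipeline code.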

Download PDF

Field Notes From the Frontlines of Slurm Support

Alejandro Sanchez (SchedMD)

SchedMD is the core company behind Slurm, a free, open-source workload manager and scheduler designed specifically to satisfy the demanding needs of high-performance computing. Slurm is in widespread use at government laboratories, universities and companies worldwide. The goal of this presentation is to share knowledge, random notes, observations, and configuration preferences acquired over several years working on the frontlines of Slurm support.

Download PDF

LiCO: AI orchestration and Energy Aware Runtime on top of OpenHPC

Miguel Terol (Lenovo)

Deep Learning techniques are bursting into today’s landscape for predictive analytics. The highly parallel characteristics of DL algorithms make HPC cluster architectures well suited to running those workloads. However, the large variety of approaches, implementations and parameterizations for training models and running inference sometimes makes it difficult to implement such techniques on a cluster infrastructure.

Lenovo Intelligent Computing Orchestrator (LiCO), a cluster management tool built on top of the OpenHPC suite, addresses these issues by adding features that help data scientists train their models and run inference without writing additional code, while making good use of the cluster hardware.

Another innovative feature of LiCO is the Energy Aware Runtime (EAR), which provides a way to orchestrate HPC workloads while keeping the cluster’s power consumption in mind.

Deploying containerized applications on HPC. A machine learning for cybersecurity example

Vicente Matellán (SCAYLE)

Cybersecurity is a growing research field in which different methods and techniques are being tested, many of them based on machine-learning approaches. Machine-learning algorithms are highly computationally demanding during the learning phase, which requires services from HPC centers to obtain the optimized models that can later be used in real environments. However, the cybersecurity field and its associated tools are not always welcome in HPC facilities. Containerization is a powerful tool for solving this problem: it allows researchers to define their own run-time environments. Unfortunately, not all containerization solutions are well suited to HPC. Singularity avoids some of the problems that other container solutions, such as Docker, face in HPC. It has been designed for use in supercomputing environments and includes native support for some of the most widely used technologies, such as InfiniBand or Lustre. In this work, we present a CNN-based cybersecurity system optimized for an HPC environment. Selecting the optimal architecture for the neural network requires evaluating several alternatives. The method described allows not only evaluating the best deep-learning framework (TensorFlow, Theano, etc.) but also selecting the optimal architecture for the CNN, by using Singularity containers.
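A minimal sketch of the kind of evaluation loop this suggests is shown below, assuming one Singularity image per framework and a hypothetical train.py that prints a validation accuracy as its last output line; the image names, script and flags are illustrative, not the authors' actual setup.

```python
# Illustrative sketch: sweep framework images and CNN configurations by running
# each combination inside a Singularity container on a GPU node.
# Image names, train.py and its flags are hypothetical.
import itertools
import subprocess

framework_images = {
    "tensorflow": "tensorflow.sif",   # hypothetical single-file images
    "theano": "theano.sif",
}
cnn_configs = [
    {"layers": 3, "filters": 32},
    {"layers": 5, "filters": 64},
]

results = []
for (framework, image), cfg in itertools.product(framework_images.items(), cnn_configs):
    cmd = [
        "singularity", "exec", "--nv", image,   # --nv exposes the host GPU driver
        "python", "train.py",
        "--layers", str(cfg["layers"]),
        "--filters", str(cfg["filters"]),
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    accuracy = float(out.stdout.strip().splitlines()[-1])  # train.py prints accuracy last
    results.append((framework, cfg, accuracy))

print("best combination:", max(results, key=lambda r: r[2]))
```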

CernVM-FS: Distributing Complex Application Stacks to HPC Compute Nodes

Jakob Blomer (CERN)

Many high-performance scientific and data-analytics applications are subject to an ever-growing complexity of their software stacks. We experienced that first hand within the experiments at the Large Hadron Collider (LHC) at CERN. LHC experiment applications allow hundreds of researchers to plug in their specific algorithms. The software stacks comprise hundreds of thousands of small files and binaries, and they often change on a daily basis. Distributing such applications from a shared software area or containers can be challenging, in particular in HPC environments tuned for parallel writing rather than for high-frequency (meta-)data reading. This talk presents the status and strategic directions of the CernVM File System, a purpose-built file system that addresses the problem of software distribution. The CernVM File System emerged from the high-throughput and cloud computing environment, and for several years it has been a mission-critical system for the worldwide computing operations of the LHC experiments. Recent targeted developments have made it more tractable in pure HPC environments, such as Cori at NERSC in Berkeley and Piz Daint at CSCS in Lugano (#3 of the TOP500). The talk will outline the experience from these installations and future plans for CernVM-FS in HPC environments.

Download PDF

A personal CPU centric view into 2020’s HPC market

Joshua Mora (FutureWei Technologies)

A briefing on CPU industry technology trends, with a personal technical analysis of processing features, how they will lead to an expansion of the server offering, and how they will impact the HPC market in 2018-2020.

OpenNebula’s LXD driver development: LXDoNe

Sergio José Vega (CUJAE)

Operating-system-level virtualization is an emerging technology capable of delivering superior performance and scalability compared to other mechanisms such as hardware-assisted virtualization. It is currently making its way into cloud infrastructures, where infrastructure service providers such as Amazon already provide container-based services on virtual machines with solutions such as Docker, or LXC-inspired proprietary offerings such as Kubernetes in the Google Container Engine. A few, like Joyent, provide infrastructure as a service on a bare-metal container platform, on which the advantages of operating-system-level virtualization can be exploited to the maximum. In private clouds, however, infrastructure managers such as OpenStack and CloudStack provide little, no, or only third-party support. OpenNebula has already been enriched with the integration of LXC. However, LXD is a very recent technology that acts as an interface over LXC, allowing greater functionality, usability and security. The objective of the present work was therefore to develop a driver for OpenNebula that allows it to support LXD. The developed driver supports functionality such as deploying containers on file systems and distributed storage, adding and removing network interfaces and disks on containers, and limiting resource usage across containers.

Download PDF

Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project

Stig Telfer (StackHPC)

The HBP is an EU flagship project that seeks to “provide researchers worldwide with tools and mathematical models for sharing and analysing large brain data they need for understanding how the human brain works”. The HBP encompasses massively parallel applications in neuro-simulation, with advanced computational and data-movement requirements far beyond current technological capabilities, and it is therefore driving innovation in HPC infrastructure.

As part of this R&D co-design, the HBP sub-contracted Cray to develop a pilot next-generation supercomputer system and software stack to address cutting-edge brain simulation and analysis activities. The system features a novel memory/storage hierarchy consisting of new memory technologies, NVRAM and SSDs, accessed through Ceph over the Intel Omni-Path interconnect. On the software side, Cray developed a coarse-grained data-movement framework, the universal-data-junction, in support of this activity. This talk will describe how Cray, StackHPC and the HBP collaborated on the design project, demonstrating work-in-progress results for the HPC use cases on the experimental hardware and software setup.

Download PDF

Dynamic Provisioning with sNow! Cluster Manager

Jordi Blasco (HPCNow!)

A common HPC hardware setup is not suitable for providing private cloud environments or container-centric clusters.

Traditional HPC cluster managers are not able to manage cloud solutions or container-centric clusters. In addition, common deployment solutions for cloud and containers cannot operate at the scale required for HPC, nor can they re-provision nodes dynamically in a reasonable timeframe.

Traditional tools require supercomputing facilities to procure different hardware for each activity (HPC, containers, cloud) and to adopt different provisioning technologies for each solution. This approach is expensive and usually involves significant administrative overhead.

HPCNow! has redefined this approach in order to respond to the long tail of research and engineering needs.

The sNow! cluster manager makes it possible to dynamically re-architect and re-purpose the HPC solution to provision Singularity containers, Docker Swarm clusters, and OpenNebula private clouds. This new approach accommodates those needs with zero additional investment in hardware.

Thanks to improvements in cluster provisioning, sNow! is able to re-provision bare metal in less than one minute at any scale. This is key not only for HPC: environments that rely on container-centric solutions based on Docker Swarm, or on private clouds based on OpenNebula, can take advantage of this technology and achieve the velocity and resilience provided by the sNow! cluster manager.

Download PDF

HPC Workforce Development: Chasing Unicorns in the Global Gig Economy

Elizabeth Leake (STEM-Trek)

Elizabeth Leake is an external relations specialist and storyteller. After serving 20 years in communications and technology administration roles at public universities, Leake joined the US National Science Foundation’s TeraGrid project in 2008. As TeraGrid’s first external relations coordinator, she led a nationally-distributed team of communicators who chronicled research discoveries that were enabled by the public investment in advanced cyberinfrastructure. Her stories have been featured by HPCwire, International Science Grid This Week (iSGTW/ScienceNode), InsideHPC, the Chicago Council on Global Affairs Global Food for Thought blog and others.

In 2012, Leake founded STEM-Trek, a global, grassroots nonprofit organization that supports travel, mentoring and high performance computing (HPC) workforce development opportunities for scholars from underrepresented groups and regions. Through STEM-Trek’s NGO platform, Leake serves as an industry voice and advocate for HPC-curious scholars everywhere.

Leake’s interest in global eInfrastructure was ignited when she served as a point facilitator for DEISA/PRACE and TeraGrid/XSEDE HPC Summer Schools in Catania, Italy and South Lake Tahoe, California in the US. She has since organized many more advanced skills workshops, and as a correspondent has covered a variety of technical meetings and conferences, including U.S. Open Science Grid All Hands meetings; European Grid Infrastructure (EGEE/EGI) Community Forums; South African Center for High Performance Computing (CHPC) national meetings; the First International Conference on the Internet, Cybersecurity and Information Systems in Gaborone, Botswana; and the Southern African Development Community (SADC) HPC Forums. In 2018, she is covering PRACEdays18 (for HPCwire) in Ljubljana, Slovenia; the HPC Knowledge Meeting in Barcelona, Spain; the International Supercomputing Conference in Frankfurt, Germany; the International Data Week/SciDataCon conference in Gaborone, Botswana; and the South African Centre for HPC National Meeting in Cape Town, South Africa.

Leake’s HPC Knowledge Meeting presentation is titled, “Chasing Unicorns in the Global Gig Economy.” She will share strategies to recruit, retain and foster teams that possess the hard and soft skills necessary to support advanced technologies, while serving the ever-diversifying communities of practice who rely on HPC and data science to power transformative discoveries.

Download PDF

Is it me, or is it the machine?

Judit Gimenez (BSC)

Sometimes programmers do a good job but the execution does not deliver the expected performance. The first step towards improving performance is to understand what is going on, and flexible tools like Paraver are key to facing unknown problems that can be anywhere. During tools trainings I usually ask attendees for a code to be used in the demo, but sometimes no example is provided. In these cases, I run the Lulesh benchmark. As a result, I have collected Paraver traces of Lulesh running on many different systems. The talk will present the high variability I have observed in the achieved performance with the same source code, as well as the ability of our performance tools to pinpoint what is causing the loss of performance.

Download PDF

HPC and Deep Learning

Gunter Roeth (NVIDIA)

In recent years, major breakthroughs have been achieved in computer vision using Deep Neural Networks (DNNs).

The performance of image classification, segmentation and localization has reached levels not seen before.

After a brief introduction to deep learning on GPUs, we will address a selection of open questions physicists may face when using deep learning for their HPC work. Research is making progress towards answering these questions but there remains plenty to be done in the field by the deep learning and HPC communities.

Gunter Roeth joined NVIDIA as a Solution Architect in October 2014, having previously worked at Cray, HP, Sun Microsystems and, most recently, BULL. He holds a Master’s degree in geophysics from the Institut de Physique du Globe (IPG) in Paris and completed a PhD in seismology on the use of neural networks (artificial intelligence) for interpreting geophysical data.

Download PDF

Software Stack Testing with buildtest

Shahzeb Siddiqui (Pfizer)

A typical HPC facility supports hundreds of applications that are maintained by the HPC team. Building these software packages is a challenge, as is figuring out how the software stack behaves after system changes (OS release, kernel patch, glibc, etc.).

Application testing is difficult. Commercial and open-source applications typically provide test scripts, such as make test or ctest, that exercise the software after the build (make) step. Unfortunately, these methods run the tests prior to installation, so testing the software in production is not possible. One could try to point the vendor test script at the install path, but this requires significant changes to a complex makefile.

Writing test scripts manually can be tedious; there is little sharing of tests, and they are most likely not compatible with other HPC sites because of differing software stacks and site-specific hardcoded paths. EasyBuild takes a step towards improving the application build process by automating the entire software build workflow so that it can be reproduced at any HPC site.

Buildtest takes a similar approach to EasyBuild but focuses on application testing.
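By way of illustration only (this is not buildtest's syntax or configuration format), the sketch below shows the kind of post-installation check such a tool automates: load an environment module for an installed package and verify that one of its binaries runs. It assumes an environment-modules setup, and the module and binary names are hypothetical.

```python
# Illustrative only -- not buildtest syntax. A post-installation smoke test that
# checks an environment module loads and one of its binaries runs from the install path.
# Module and binary names are hypothetical.
import subprocess

def module_smoke_test(module_name, binary, args=("--version",)):
    """Load a module in a login shell and run one of its binaries."""
    shell_cmd = f"module load {module_name} && {binary} {' '.join(args)}"
    result = subprocess.run(["bash", "-lc", shell_cmd],
                            capture_output=True, text=True)
    status = "PASS" if result.returncode == 0 else "FAIL"
    print(f"[{status}] {module_name}: {binary} {' '.join(args)}")
    return result.returncode == 0

checks = [
    ("gcc/9.3.0", "gcc"),        # hypothetical module/binary pairs
    ("openmpi/4.0.5", "mpirun"),
]
passed = sum(module_smoke_test(mod, binary) for mod, binary in checks)
print(f"{passed}/{len(checks)} tests passed")
```

A real test framework adds what this sketch lacks: a shared, site-independent way to describe tests, scheduler integration, and reporting across the whole software stack.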

Download PDF

Using machine learning to predict and analyze jobs’ behavior

Benjamin Depardon (UCit)

Cluster logs contain historical data that relate job submission parameters to job execution time, final state, consumed memory, and so on. We apply machine-learning techniques to unveil the information hidden in these logs and to predict a job’s behavior prior to submission, in order to reduce wasted resources and improve the efficiency of the cluster.
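As a minimal sketch of this idea (not UCit's actual implementation), one could train a classifier on exported accounting records, using submission-time parameters as features and the recorded final state as the label; the file layout and column names below are hypothetical.

```python
# Minimal sketch, assuming a CSV export of historical accounting data with
# hypothetical column names; not the Predict-IT implementation.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

logs = pd.read_csv("job_history.csv")
features = pd.get_dummies(
    logs[["partition", "requested_cpus", "requested_mem_gb", "requested_walltime_s"]]
)
labels = logs["final_state"]          # e.g. COMPLETED, FAILED, TIMEOUT

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")

# At submission time, the same features computed for a new job yield a predicted
# final state, which can be used to warn the user or hold a job likely to fail.
```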

In this talk we’ll present two tools that help users understand and predict the behavior of jobs on clusters:

1. Predict-IT: predicts jobs’ behavior in order to ensure that submitted jobs complete correctly, which increases cluster production and profitability

2. Analyze-IT: helps understand cluster behavior in order to find ways to improve its efficiency

Singularity containers for Enterprise Performance Computing (EPC)

Eduardo Arango (Sylabs)

Singularity is the most widely used container solution in high-performance computing (HPC). Enterprise users interested in AI, deep learning, compute-driven analytics, and IoT are increasingly demanding HPC-like resources. Singularity has many features that make it the preferred container solution for this new type of “Enterprise Performance Computing” (EPC) workload. Instead of a layered filesystem, a Singularity container is stored in a single file. This simplifies the container management lifecycle and facilitates features such as image signing and encryption to produce trusted containers. At runtime, Singularity blurs the lines between the container and the host system, allowing users to read and write persistent data and leverage hardware like GPUs and InfiniBand with ease. The Singularity security model is also unique among container solutions. Users build containers on resources they control or by using a service like Singularity Hub. They then move their containers to a production environment where they may or may not have administrative access, and the Linux kernel enforces privileges as it does with any other application. These features make Singularity a simple, secure container solution perfect for HPC and EPC workloads.
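A brief sketch of the single-file lifecycle described above, driven from Python, is shown below. It assumes Singularity 3.x with a signing keypair already created (e.g. via singularity key newpair); the image name, source and command are placeholders.

```python
# Sketch of the single-file container lifecycle: pull a SIF image, sign it,
# verify the signature, then run a GPU-enabled command with the host driver exposed.
# Assumes Singularity 3.x and an existing signing key; names are placeholders.
import subprocess

def run(*cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

image = "myapp.sif"                                         # single-file container image
run("singularity", "pull", image, "docker://python:3.8")    # build a SIF from a Docker source
run("singularity", "sign", image)                           # attach a cryptographic signature
run("singularity", "verify", image)                         # confirm the image is unmodified and trusted
run("singularity", "exec", "--nv", image,                   # --nv exposes host GPUs inside the container
    "python", "-c", "print('hello from a signed container')")
```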