Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system used on many of the largest computers in the world, including five of the top ten systems on the TOP500 supercomputer list. This presentation will describe recent Slurm development, including support for scheduling across a federation of clusters, support for the KNL processor, and other enhancements. A roadmap will also be presented.
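As a rough sketch of the federation feature described above, the following commands show how a federation might be created and used; the cluster and federation names are illustrative, and setting up a federation requires Slurm 17.02 or later plus administrator privileges:

```shell
# Create a federation joining two existing clusters (illustrative names)
sacctmgr add federation myfed clusters=clusterA,clusterB

# Jobs submitted on any member cluster may be scheduled across the federation
sbatch job.sh

# View jobs across all federation members
squeue --federation
```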
The talk will introduce POP, a European project financed by the European Union’s Horizon 2020 research and innovation programme. POP is the Performance Optimisation and Productivity Centre of Excellence in Computing Applications and provides performance assessment, optimisation and productivity services for academic and industrial codes in all domains. POP services are delivered free of charge to organisations based within the European Union or in the Horizon 2020 Associated Countries.
The talk will include a description of the project and its services, as well as statistics on the studies carried out and some examples of codes analysed and the results obtained.
Lustre* is the predominant HPC filesystem, used today by over 70% of the TOP100 systems. Intel® has been driving Lustre* development since the Whamcloud acquisition in 2012. This talk will focus on how Intel® is changing its distribution model and how this is positively affecting the quantity and quality of the features going into Lustre*. The technical aspects of some of the new 2.10 LTS features will be discussed, along with how these can be leveraged by the entire community.
Multi-rail LNet, progressive file layouts, Lustre* snapshots, and some of the work we are doing with ZFS as a backend for Lustre* are just some of the features that will be discussed, along with the ongoing efforts to open source the proprietary parts of the Intel® EE for Lustre* code, such as Intel® Manager for Lustre* and the Hadoop adapters (HAL and HAM).
We used to think of High Performance Computing (HPC) as Huge systems, Profuse investments and Complex scientific applications only. The goal used to be ranking high in the TOP10/50/500 in the Tera-Peta Flops race, but things seem to be changing. There are new market drivers, we are approaching technology limits, and new trends arise to work around such “keep up with Moore’s law” inhibitors. HPC is now seen as a special kind of Analytics (or just the other way around). All in all, this change in goals forces a change in strategy. The IT industry is also facing the explosion of AI and Machine Learning, which translates into new requirements and pressure on the infrastructure. Moreover, in this new world of Smart devices and Intelligent systems, data is exploding and we require new ways to manage such data lakes, or even data oceans. The newest IBM POWER servers and Storage solutions shine in those areas thanks to the rich ecosystem provided by the OpenPOWER Foundation initiative and the revolution around All-Flash Arrays.
Come and see what’s new in this HPC/HPDA/AI arena!
One of the system integration team's missions at CSCS is to evaluate new technologies, spanning computing, storage and networking. The parallel filesystem is a fundamental component of a supercomputing architecture. We recently evaluated the parallel file system BeeGFS at CSCS. The goal of this evaluation was to study the concepts behind this parallel filesystem, its installation procedure and key components, and to gather results using different benchmark tools.
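A typical way to benchmark such a filesystem is the IOR parallel I/O benchmark; the sketch below shows an illustrative invocation (block/transfer sizes, rank count and the BeeGFS mount path are assumptions, not the actual CSCS test parameters):

```shell
# Write then read a file per process on the BeeGFS mount, 1 GiB per rank
# in 4 MiB transfers (-w write, -r read, -F file-per-process)
mpirun -np 16 ior -a POSIX -b 1g -t 4m -w -r -F -o /beegfs/testdir/iorfile
```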
Energy Aware Computing is very important for HPC, since many data centers face power caps or electricity bill limitations. This talk will present how Lenovo is working on all aspects of this problem by developing servers, cooling solutions and software to reduce, monitor and control power and energy end to end. It will explain what we do to optimize PUE, ITUE and ERE. It will also present what we do to control power and energy while applications are executing on a system, with a new technology called EAR (Energy Aware Runtime).
Software distribution across multiple running environments can be tricky these days. There are multiple locations where users could run their workloads (multiple HPC systems, cloud, containers, …) and it can be difficult to keep all those environments in sync. It is also difficult for users in big companies to run their applications in other business units' compute facilities, as not all the software they use exists there.
CernVM File System (CernVM-FS) provides a scalable, reliable and low-maintenance software distribution service. By using aggressive caching on the clients, it reduces the number of accesses to the central application repository while ensuring availability of the whole application stack in any environment. As CernVM-FS is based on a union file system, it is also possible to present multiple versions of the same repository to the final customer, ensuring reproducibility (from the software point of view) of workloads at any time.
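To make this concrete, a minimal CernVM-FS client setup looks roughly like the following; the repository name and proxy URL are example values, and the package installation step depends on the distribution:

```shell
# Install the client (assumes the CernVM-FS package repository is configured)
sudo yum install -y cvmfs

# Point the client at a repository and a local caching proxy (example values)
cat >/etc/cvmfs/default.local <<'EOF'
CVMFS_REPOSITORIES=atlas.cern.ch
CVMFS_HTTP_PROXY=http://my-squid.example.org:3128
EOF

# Configure autofs; the repository is then mounted on first access
cvmfs_config setup
ls /cvmfs/atlas.cern.ch
```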
This presentation covers the improvements we made to our HPC setup and workflow by implementing best practices from the DevOps toolchain. The age of bash scripting for system setup and configuration has come to an end with the rise of configuration management systems like Puppet, Chef, and Ansible.
After we switched from shell scripts to Puppet, we felt the need to take another step in automating our workflow. We implemented an automated validation and build setup using Git for version control, Jenkins for automation, and Docker containers for reproducible builds of our HPC master package.
Deploying our code more often and in a more controlled way gives us a more consistent code base for our HPC environment. It improved our code quality and reduced the bugs and mistakes that made it into our production package.
The presentation will focus on how we used tools like Jenkins to improve the reliability of our Puppet code base, and how some simple improvements (Git pre-commit hooks, syntax validation, test deployments, …) can help in the daily management of a Linux HPC environment.
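As one example of the kind of simple improvement mentioned above, a Git pre-commit hook can refuse commits containing Puppet manifests that fail syntax validation. This is a minimal sketch, not the presenters' actual hook; real setups often add puppet-lint and ERB template checks:

```shell
#!/bin/sh
# .git/hooks/pre-commit — block commits with syntactically invalid manifests
for f in $(git diff --cached --name-only --diff-filter=ACM | grep '\.pp$'); do
    puppet parser validate "$f" || exit 1
done
```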
In this talk I will present an overview of the features of the OpenHPC stack, whose first version was released last year. I will also review our experience using it since then, and how we decided to migrate our whole HPC cluster (made of 100 nodes, a heterogeneous InfiniBand network and a Lustre filesystem) to OpenHPC, highlighting the advantages of the OpenHPC stack for small- to medium-size HPC systems like ours. I will review some highlights of the installation process, and a small demo will be presented. A few cons will also be exposed, for the sake of fairness.
Building software for HPC environments is now easy with tools like EasyBuild or Spack. Testing these applications can be done through vendor scripts, such as make test or ctest, that are part of the software. Most of these tests run on binaries in the build directory rather than the install path where the binaries will ultimately reside. We can repoint the vendor tests to the install path, but this is very tricky with recursive Makefiles or CMakeLists.txt files. Furthermore, there is no universal HPC test toolkit that the entire HPC community can use to conduct tests through one medium. A project called buildtest aims to help share test scripts among the HPC community. buildtest is a Python-based automatic test generation framework that uses YAML configurations to generate shell scripts (.sh) that can run independently or within the CTest framework. buildtest is compatible with applications built with EasyBuild. buildtest can write binary, compilation, and scripting (R, Perl, Python, Ruby) tests, and can also be used for testing system packages. The goal of this project is to give HPC sites access to a single test toolkit that can build tests according to their software collection. The toolkit will provide HPC engineers a means to check their software and quickly detect broken functionality. buildtest can also be used for educational purposes, helping users learn about an application through its test scripts.
HPC has quickly adopted agile strategies and DevOps technologies for cluster provisioning. While the standard deployment systems are DevOps friendly, image-based provisioning systems are usually much faster. The BitTorrent protocol has been key to accelerating OS image propagation and reducing the time to production. On top of both systems, configuration managers like Puppet, Ansible or CFEngine can operate in order to provide consistency across the cluster.
OS provisioning based on local disks is often not reliable and is usually more expensive for a reasonably sized cluster. Read-only NFSROOT provisioning allows one to interact with the image, but the NFS server becomes a very critical single point of failure (SPOF). Stateless solutions are more reliable but less flexible, with potentially high memory footprints.
Experience has shown that none of these strategies is sufficient to cover all needs, or flexible enough to accommodate changes online without breaking the valuable DevOps approach.
In this talk I’m going to introduce a new technology developed by HPCNow! as part of the sNow! cluster manager which provides the flexibility of read-only NFSROOT image provisioning, the scalability of diskless provisioning, the reliability of HPC cluster file systems, and the ability to incorporate DevOps and continuous integration strategies.
Industry and Wall Street projections indicate that Machine Learning will touch every piece of data in the data center by 2020. This has created a technology arms race and algorithmic competition as IBM, NVIDIA, Intel, and ARM strive to dominate the retooling of the computer industry to support ubiquitous machine learning workloads over the next 3-4 years. Similarly, algorithm designers compete to create faster and more accurate training and inference techniques that can address complex problems spanning speech, image recognition, image tagging, self-driving cars, data analytics and more. The challenges for researchers and technology providers encompass big data, massive parallelism, distributed processing, and real-time processing. Deep-learning and low-precision inference (based on INT8 and FP16 arithmetic) are current hot topics.
This talk will merge two state-of-the-art briefings.
The goal is to give attendees a sense of the fast-track algorithm + technology combinations for both research and commercial success as well as an overview of the state-of-the-industry and near-term industry directions.
Lmod is a modern environment module system for providing software to HPC users. XALT is a tool to track software usage in a lightweight manner. Both tools help in managing software on your HPC system. Lmod can track the modules users load; XALT knows what programs and libraries your users execute.
Lmod helps users handle complex software stacks by supporting a software hierarchy. Lmod supports a cache to speed up module access, and a long list of other features such as user module collections, module properties, and semantic versioning.
XALT tracks both MPI and non-MPI programs, as well as the libraries that users run. Sites can know exactly what kinds of programs users are running: What percentage of programs are MPI-based or non-MPI-based? Are the programs on your cluster solving Chemistry, Physics, Biology, or something else?
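To illustrate the features mentioned above, a typical Lmod session with a hierarchical stack might look like the following (the module names and versions are illustrative, not a specific site's stack):

```shell
# Browse and load modules in a hierarchy: loading a compiler exposes the
# modules built with it
module avail
module load gcc/7.3.0
module load openmpi      # visible only after a compiler is loaded

# User module collections: save the current set and restore it later
module save mystack
module restore mystack
```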
One of the biggest challenges when procuring High Performance Computing systems is to ensure not only that a faster machine than the previous one is bought, but that the new system is well suited to the organization's needs, fits within a limited budget, and proves value for money. This is not a simple task, and failing to buy the right HPC system can have tremendous consequences for an organization.
The acquisition of HPC systems is a complex and time-consuming process in which different people inside and outside the organization are involved, from legal, management and technical departments to end users and suppliers. Typically, an HPC procurement takes between 1 and 2 years from the initiation of the project to the system entering production, and in most cases the following steps are needed: search for funding, gather the needs of the users, decide the system requirements, start the purchase procedure, order the system and install it. During this time, the organization's and users' needs and requirements can evolve, and technologies change as well, adding even more complexity and uncertainty to the process.
This talk will provide attendees with an overview of the whole process of purchasing an HPC system and the different challenges that need to be addressed. The presenter's experience and lessons learned from having actively participated in more than 15 HPC purchase procedures of different sizes over the last 20 years will be shared.
Many scientific fields have become highly data-driven with the development of computer sciences. For instance, astronomy, meteorology, social computing and bioinformatics rely heavily on data-intensive scientific discovery, as large volumes of data of various types are generated in these fields. How to extract knowledge from the data produced by large-scale scientific simulation is itself a data-intensive problem. One point these disciplines have in common is that they generate enormous data sets for which automated analysis is essential; this is a demanding data-intensive stage in many scientific methods. There is commensurate growth in expectations about what can be achieved with this wealth of data and computational power. To meet these expectations with the available expertise requires new frameworks that make it easier to reliably formalise data-driven methods that exploit high-end architectures to meet the needs of science, industry and society. In this work we present a new data-driven framework, called Asterism, which aims to simplify the effort required to develop data-intensive applications that run across multiple heterogeneous resources, without users having to: re-formulate their methods according to different enactment systems; manage the data distribution across systems; parallelize their methods; co-place and schedule their methods with computing resources; or store and transfer large/small volumes of data.
EasyBuild is a framework for building and installing (scientific) software on HPC systems.
Over time, it has grown into a well-established tool across HPC sites worldwide that helps to alleviate
the burden of providing a consistent stack of scientific software to end users in a robust and reproducible way.
Stable versions of EasyBuild have been released frequently ever since the first public release in April’12
and the version 1.0 milestone release in November’12.
In this talk we will look back at the evolution of EasyBuild over the years,
highlighting the important aspects and features that were developed over time.
Covered topics will include:
We will also look forward at ongoing developments and future plans.
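For readers unfamiliar with EasyBuild, a typical interaction looks like the following; the easyconfig file name and toolchain are illustrative examples, not a recommendation:

```shell
# Search for available easyconfigs matching a package name
eb --search hdf5

# Build and install a package, letting --robot resolve all dependencies
eb HDF5-1.10.1-foss-2017b.eb --robot

# Installed software is exposed to users via environment modules
module avail hdf5
```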
HPC software is becoming increasingly complex. The largest applications require over 100 dependency libraries, and they combine interpreted languages like Python with lower-level C, C++, and Fortran libraries. To achieve good performance, developers must tune for multiple compilers, build options, and implementations of dependency libraries like MPI, BLAS, and LAPACK. The space of possible build configurations is combinatorial, and developers waste countless hours rebuilding software instead of producing new scientific results.
This tutorial focuses on Spack, an open-source tool for HPC package management. Spack uses concise package recipes written in Python to automate builds with arbitrary combinations of compilers, MPI versions, and dependency libraries. With Spack, users can install over 1,400 community-maintained packages without knowing how to build them; developers can efficiently automate builds of tens or hundreds of dependency libraries; and HPC center staff can deploy many versions of software for thousands of users. We provide a thorough introduction to Spack’s capabilities: basic software installation, creating new packages, and advanced multi-user deployment. Attendees should bring a laptop computer to follow along with hands-on sessions.
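As a taste of the hands-on sessions, Spack's spec syntax expresses the combinatorial build configurations described above directly on the command line; the versions below are illustrative:

```shell
# Install with default settings
spack install hdf5

# Pin a package version (@) and a compiler (%)
spack install hdf5@1.10.1 %gcc@7.3.0

# Enable a variant (+mpi) and choose a specific dependency (^)
spack install hdf5 +mpi ^openmpi@3.0.0

# List everything installed so far
spack find
```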