Applying clustering and folding techniques to study performance issues on the NEMO global ocean model

Miguel Castrillo (BSC)

Understanding the performance of a parallel application can be a difficult and time-consuming task. The Paraver tool provides performance insight into an application and allows its bottlenecks to be identified. However, once the functions with lower-than-expected performance have been detected, figuring out which parts of the code produce the degradation, especially in routines of thousands of lines, can consume a lot of time. In this work we present a methodology based on clustering and folding tools, applied to the Nucleus for European Modelling of the Ocean (NEMO) model, which is known for its computational problems; this is the first study of NEMO using this approach. These tools are developed by the Computer Sciences department of the Barcelona Supercomputing Center in order to identify the parts of the code that should be improved to increase application performance. We first apply the clustering tool to group the computation phases between MPI calls with similar properties into clusters; then, with the folding tool, we examine the internals of each cluster and correlate them with the lines of code using Paraver. Finally, by combining various hardware counters, the reason for the low performance is discovered.
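
As a toy illustration of the clustering step described above, the sketch below groups synthetic computation bursts (duration/IPC pairs, as would be derived from hardware counters) with a plain k-means. The burst values and the algorithm choice are illustrative assumptions, not the actual BSC clustering tool.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two metric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iters=20, seed=1):
    """Plain k-means: assign each burst to its nearest centroid, then refit."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: dist2(p, centroids[c]))
            groups[idx].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups

# Each burst: (duration in us, IPC). Values are made up: long low-IPC bursts
# versus short high-IPC ones, the kind of separation the real tool exposes.
bursts = [(900, 0.4), (950, 0.45), (120, 1.8), (110, 1.7), (980, 0.5), (100, 1.9)]
centroids, groups = kmeans(bursts, 2)
```

On data this well separated, the two clusters recover the long low-IPC phases and the short high-IPC phases, which is the starting point for folding each cluster.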

Download PDF

Bring your application to a new era: learning by example how to parallelize and optimize for Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor

Manel Fernández (Bayncore Ltd.)

As the number of transistors on a chip increases with every generation, old processor design recipes are no longer enough to keep power consumption at reasonable levels. Today’s processors are less focused on clock frequency, ILP (instruction-level parallelism) and single-thread performance, in favor of other types of parallelism such as DLP (data-level parallelism) and TLP (thread-level parallelism). As a result, HPC applications can no longer rely only on the compiler and the micro-architecture: it is the programmer’s responsibility to explicitly express parallelism in order to exploit the full performance capabilities of the underlying system, even on a single node.

In this work we will learn tips, advice, and best known methods for parallelizing and optimizing existing HPC applications on the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor. We will see how to apply different programming models to get the best performance out of these platforms, and also how to identify when a particular application would perform better on the processor, the coprocessor, or in a hybrid scheme.

Download PDF

Optimizing an Earth Science Atmospheric Application with the OmpSs Programming Model

George S. Markomanolis (BSC)

The Earth Sciences Department of the Barcelona Supercomputing Center (BSC) is working on the development of a new chemical weather forecasting system based on the NCEP/NMMB multi-scale meteorological model. We present our efforts in porting and optimizing the NMMB/BSC-CTM model, with the purpose of preparing it for large-scale experiments and increasing the resolution of the executed domain. The OmpSs programming model, developed at BSC-CNS, enables asynchronous parallelism through data dependencies between tasks and is built on top of the Mercurium compiler and the Nanos++ runtime system. In this work we describe how we used this programming model to improve computation functions by creating tasks, using threads, and parallelizing only the most significant computation loops. Moreover, we use OmpSs to overlap communication with computation, which is crucial since in the future we expect to scale the application to thousands of cores. Part of the code had to be refactored because of reentrancy issues. For all the results we use the well-established Extrae/Paraver tools to instrument, visualize and understand various performance issues. We present preliminary results and describe our porting efforts in order to help other scientists port their applications using OmpSs.
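
The communication/computation overlap mentioned above can be sketched with ordinary Python threads. OmpSs itself annotates C/Fortran code with task pragmas and data dependencies; the functions below are hypothetical stand-ins used only to illustrate the overlap pattern.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def halo_exchange(field):
    """Stand-in for an MPI halo exchange; the sleep mimics network latency."""
    time.sleep(0.05)
    return field

def interior_compute(field):
    """Work on interior points, which does not depend on the halo data."""
    return [x * 2 for x in field]

field = [1, 2, 3, 4]
with ThreadPoolExecutor() as pool:
    comm = pool.submit(halo_exchange, field)  # "task": communication in flight
    interior = interior_compute(field)        # overlapped computation
    halo = comm.result()                      # dependency: wait before boundary work
```

In OmpSs the runtime derives this schedule automatically from declared task dependencies; here the ordering is written out by hand to make the overlap visible.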

Download PDF

IRPF90 : a Fortran code generator for HPC

Anthony Scemama (CNRS/Universite de Toulouse)

IRPF90 is a Fortran code generator that helps the development of large Fortran codes. In Fortran programs, the programmer has to focus on the order of the instructions: before using a variable, the programmer has to be sure that it has already been computed in all possible situations. For large codes, this is a common source of errors.

With IRPF90, most of the instruction ordering is handled by the pre-processor, and an automatic mechanism guarantees that every entity is built before being used. This mechanism relies on the needs/needed-by relations between entities, which are built automatically. As a consequence, the programmer doesn’t need to know the production tree of each entity.

Codes written with IRPF90 usually execute faster than standard Fortran programs, and they are faster to write and easier to maintain.

Download PDF

Human Centric Innovation for one Hyperconnected World

Adriano Galano (Fujitsu)

Fujitsu Technology and Service Vision sets out and globally communicates Fujitsu’s vision of a Human Centric Intelligent Society and how Fujitsu will achieve this vision in partnership with our customers, leveraging our technologies and services. We have proposed Human Centric Innovation, a new approach for our customers to realize innovation.

Understanding applications using the BSC performance tool suite

Harald Servat & Judit Giménez (BSC)

The BSC performance tools team develops a performance tool suite that helps pinpoint the performance bottlenecks an application experiences. The suite consists of three principal applications, Extrae, Paraver and Dimemas, as well as additional satellite tools that intelligently extract and summarise information from the captured metrics.

This tutorial will give an introductory tour of the BSC performance tool suite for the analysis of parallel applications. It is aimed at anyone interested in learning about the performance issues that applications experience on an HPC system but who has not used the BSC performance tools before. The audience should be familiar with terms like IPC (instructions per cycle), memory hierarchy and caches, and work imbalance, as well as key concepts of the MPI and OpenMP parallel programming paradigms (such as task/rank, thread, message, and send/receive).

We plan to divide this tutorial into two parts. The first part serves as a theoretical introduction (supported by slides) to the different tools that compose the suite. Then, we will demonstrate how to use the Paraver tool and show its flexibility by analysing several aspects of the performance behaviour of a selected application.

Download PDF

Fault Tolerance Interface Tutorial – Part 1

Leonardo Bautista (Argonne National Laboratory)

FTI stands for Fault Tolerance Interface; it is a library that aims to give computational scientists the means to perform fast and efficient multilevel checkpointing on large-scale supercomputers. FTI leverages local storage plus data replication and erasure codes to provide several levels of reliability and performance. FTI performs application-level checkpointing and allows users to select which datasets need to be protected, in order to improve efficiency and avoid wasting space, time and energy. In addition, it offers a direct data interface so that users do not need to deal with file or directory names: all metadata is managed by FTI transparently for the user. If desired, users can dedicate one process per node to overlap the fault-tolerance workload with the scientific computation, so that post-checkpoint tasks are executed asynchronously.
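
A minimal sketch of the idea of protecting only selected datasets, in Python. FTI itself is a C library; the function names below are made up for illustration and are not the FTI API.

```python
import pickle, os, tempfile

protected = {}

def protect(name, obj):
    """Register a dataset; only registered data ends up in checkpoints."""
    protected[name] = obj

def checkpoint(path):
    """Write all protected data, using an atomic rename so a crash mid-write
    never leaves a torn checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(protected, f)
    os.replace(tmp, path)

def restart(path):
    """Reload the protected datasets from the last checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Illustrative solver state: only this dict is protected, not the whole heap.
state = {"step": 42, "field": [0.1, 0.2]}
protect("solver_state", state)
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
checkpoint(ckpt)
restored = restart(ckpt)
```

FTI layers replication and erasure codes on top of this basic pattern to get several reliability levels; the sketch shows only the selective-protection idea.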

Download PDF

Fault Tolerance Interface Tutorial – Part 2

Leonardo Bautista (Argonne National Laboratory)

FTI stands for Fault Tolerance Interface; it is a library that aims to give computational scientists the means to perform fast and efficient multilevel checkpointing on large-scale supercomputers. FTI leverages local storage plus data replication and erasure codes to provide several levels of reliability and performance. FTI performs application-level checkpointing and allows users to select which datasets need to be protected, in order to improve efficiency and avoid wasting space, time and energy. In addition, it offers a direct data interface so that users do not need to deal with file or directory names: all metadata is managed by FTI transparently for the user. If desired, users can dedicate one process per node to overlap the fault-tolerance workload with the scientific computation, so that post-checkpoint tasks are executed asynchronously.

Download PDF

Package Once / Run Anywhere Big Data and HPC workloads

Tryggvi Lárusson (GreenQloud)

Big data operations and high performance computing are among the largest beneficiaries of, and driving forces behind, the shift toward hybrid and distributed cloud deployments.

In this talk, Tryggvi Lárusson, CTO of GreenQloud, will discuss how big data and HPC environments can benefit from streamlining and from emerging solutions designed specifically to cut down the limitations and bloat of existing technology.

Tryggvi will showcase why handling bare metal and virtual machines in the same manner, both in terms of management through services such as Advania’s HPC cloud powered by QStack™, and operations via CoreOS, greatly enhances productivity and performance.

He’ll give examples of implementations of Hadoop/HDFS clusters, and show how “package once/run anywhere” makes it possible to run workloads transparently on cloud, bare metal or virtual machines via newer containerization concepts such as Docker.

Download PDF

Harnessing the High Performance Capabilities of Cloud over the Internet

Jaison Paul Mulerikkal (Rajagiritech)

The most powerful feature of cloud computing is its capacity to deliver computing as a fifth utility, after water, electricity, gas, and telephony. The 2011 Gartner CIO survey predicts that 23% of computing activity will never move to the cloud, even though 43% will have moved by 2015 and another 31% by 2020. Some High Performance Computing (HPC) applications will be among the 23% that never move. Ian Foster et al. pinpointed the core reason:

“The one exception that will likely be hard to achieve in cloud computing (but has had much success in Grids) are HPC applications that require fast and low latency network interconnects for efficient scaling to many processors.”

However, the future of high performance computing in the cloud is not that bleak. It may be true that some HPC applications whose parallel tasks are too interdependent (and not embarrassingly parallel) may find it difficult to take off on a generic public or even a hybrid cloud. But specialized clouds and providers will emerge, with new tools and technology to enable most of those applications on the cloud with an acceptable level of speed and efficiency. Science Clouds, supported by the Nimbus project, is an early indication of that trend.

Research at the Australian National University has produced an SOA middleware, ANU-SOAM, intended to enable high performance outcomes for not-so-embarrassingly-parallel scientific applications. The execution of such an application can be considered as a series of generations of a set of pure computation tasks; the execution of each set is separated by a phase of communication, so all tasks within a set can execute independently. ANU-SOAM supports this model by introducing a Data Service to implement the communication phase. The Common Data (a one- or two-dimensional array) in the Data Service can be accessed, modified and synchronized (add, get, put and sync) by the compute processes (service instances, SIs) and can be used for successive generations of tasks without communicating the updates back to the host process (client). This helps reduce communication and the resulting overheads. Early experiments show that this programming model is effective in harnessing cloud-computing resources over slow networks like the Internet, compared to other existing paradigms.
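
The Data Service model described above can be sketched as follows (illustrative Python, not the ANU-SOAM API): updates accumulate locally during a generation and are applied to the Common Data only at the sync point.

```python
class DataService:
    """Toy Data Service: get/add operate on the Common Data array, but adds
    are buffered and only applied at sync, i.e. one communication phase per
    generation of tasks."""

    def __init__(self, data):
        self.data = list(data)   # the Common Data array
        self.pending = []        # updates deferred until the sync point

    def get(self, i):
        return self.data[i]

    def add(self, i, delta):
        self.pending.append((i, delta))

    def sync(self):
        for i, delta in self.pending:
            self.data[i] += delta
        self.pending = []

svc = DataService([0, 0, 0])
for rank in range(3):            # each "service instance" contributes locally
    svc.add(rank, rank + 1)
svc.sync()                       # single communication phase between generations
```

Batching the updates like this is what lets the model tolerate slow, high-latency links such as the public Internet.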

Download PDF

sNow! new Features and Roadmap

Jordi Blasco (HPCNow!)

sNow! is a suite based on open source software designed to manage and administer HPC infrastructures. It is developed and maintained by the HPCNow! team, which has solid experience in the use, management and administration of supercomputers in scientific and engineering environments. sNow! is easy to use and installs software that provides all the necessary tools to deploy and operate a computing cluster, such as the OS, monitoring tools, a cluster filesystem, a batch queue system, and parallel and mathematical libraries.

sNow! is specifically designed to obtain maximum performance from the cluster, placing the most critical services of HPC environments into a high-availability and load-balancing layer that provides resilience and scalability for the most demanding HPC environments. The software also offers the possibility to replicate all the configuration and the most critical data over the Internet for disaster recovery.

sNow! is open source software licensed under the GPLv3. The application suite is developed and maintained by HPCNow!, which also provides training and professional support services for it.

Download PDF

DRM4G: an open source framework for distributed computing

Carlos Blanco (Santander Meteorology Group)

Running computational jobs on heterogeneous computing resources can be difficult because of the variety of middleware available (e.g. PBS/Torque, SGE, LSF, SLURM, Globus, CREAM). These middlewares, with their different interfaces, are seldom compatible with each other, creating substantial barriers for users.

To deal with this issue, DRM4G can define, submit, and manage jobs across cluster, grid and cloud resources, providing a single point of control for all of them. Furthermore, DRM4G provides adaptive scheduling, with fault-recovery mechanisms and on-request and opportunistic job migration, which can operate independently of the resources.
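
The single-point-of-control idea can be sketched as a thin adapter that renders one generic job description for whichever middleware backs a resource (hypothetical code, not the DRM4G API):

```python
def render_submit(job, middleware):
    """Translate one generic job description into the submit command of a
    concrete middleware. Only two backends are shown for illustration."""
    if middleware == "slurm":
        return f"sbatch --ntasks={job['tasks']} {job['script']}"
    if middleware == "pbs":
        return f"qsub -l nodes={job['tasks']} {job['script']}"
    raise ValueError(f"unsupported middleware: {middleware}")

# One description, many targets: the user never writes scheduler-specific flags.
job = {"tasks": 4, "script": "run.sh"}
slurm_cmd = render_submit(job, "slurm")
pbs_cmd = render_submit(job, "pbs")
```

DRM4G adds scheduling, monitoring and migration on top; the adapter layer is what removes the per-middleware barrier mentioned above.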

Download PDF

Harnessing your cluster with Ansible

Iñigo Aldazabal (CSIC-UPV/EHU)

The configuration of computing nodes in HPC clusters is normally managed, first, through some kind of master image deployed to the nodes and, second, through a “post-configuration” stage in which the installed system is modified to adapt it to the changes made to this base image: modified SLURM configuration files, new filesystems to be mounted, updated packages, new monitoring tools to be installed, etc.

One way to deal with this post-configuration stage, and with the further changes that happen over the life of a computing node, is to use a Configuration Management System (CMS). CMSs such as CFEngine, Puppet, Chef or Salt are specifically designed to handle system configuration changes and to maintain consistency in complex systems: they allow us to define the nodes’ service states, configuration files, installed packages, mount points, security policies and much more.

But this also comes at a price: a steep learning curve and the setup of the CMS itself. Here we present Ansible, a very easy-to-use CMS which, with its clientless push model (zero initial setup on the nodes) and the simple, human-readable syntax of its YAML configuration files, perfectly fits the mindset of HPC cluster administrators.
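
As a flavour of what such a playbook looks like, here is a minimal example of the kind described above; the group name, package names, paths and service are illustrative, not taken from the talk.

```yaml
# site.yml -- minimal illustrative playbook for compute nodes
- hosts: compute
  become: yes
  tasks:
    - name: Install monitoring tools
      yum:
        name: [ganglia-gmond, lm_sensors]
        state: present

    - name: Deploy SLURM configuration
      copy:
        src: files/slurm.conf
        dest: /etc/slurm/slurm.conf
      notify: restart slurmd

  handlers:
    - name: restart slurmd
      service:
        name: slurmd
        state: restarted
```

A single `ansible-playbook site.yml` pushed from the head node applies this state to every compute node, with no agent installed on the nodes beforehand.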

Download PDF

Harnessing your cluster with Ansible (hands-on)

Iñigo Aldazabal (CSIC-UPV/EHU)

After the previous introductory talk on Ansible, we will walk through setting up a basic configuration for a simple computing cluster built from virtual machines. As a first step we will briefly present Vagrant, a tool that allows us to create reproducible and portable virtual-machine development and testing environments.

Using a very simple Vagrant configuration file we will set up a test cluster with a head node and a few computing nodes. On the head node we will download Ansible, build an RPM package and install it. We will then write the inventory file with nodes of different types, make some test runs, and create a basic Ansible playbook to, for example, install some packages, distribute some configuration files, and configure some services on both the head and the computing nodes.

The Vagrant configuration file and virtual-machine template (Vagrant “box”), as well as the base Ansible files we will use, will be provided beforehand so that anyone interested can follow along during the session on their own laptop.

Download PDF

Taming the Big Data in Computational Chemistry

Carles Bo (ICIQ)

The massive use of simulation techniques in chemical research generates huge amounts of information, which is starting to be recognized as a Big Data problem. The main obstacle to managing big volumes of information is storing them in a way that facilitates data mining, as a strategy to optimize the processes that enable scientists to face the challenges of the new sustainable society, based on knowledge and the rational use of existing resources.

The present project aims at creating a platform of cloud services to manage computational chemistry data. As in other related projects, the concepts underlying our platform rely on well-defined standards, and it implements treatment, hierarchical storage and data-recovery tools to facilitate data mining of Theoretical and Computational Chemistry’s Big Data. Its main goal is the creation of new methodological strategies that promote an optimal reuse of results and accumulated knowledge and enhance researchers’ daily productivity.

This proposal automates the extraction of relevant data and transforms numerical data into labelled data in a database. The platform provides tools for researchers to validate, enrich, publish and share information, and cloud tools to access and visualize data. Other tools permit, for instance, the creation of reaction-energy-profile plots by combining data from a set of molecular entities, or the automatic creation of Supporting Information files. The final goal is to build a new reference tool for computational chemistry research, bibliography management and services to third parties. Potential users include computational chemistry research groups worldwide, university libraries and related services, and high performance supercomputing centers.

Deploying a Hadoop cluster with sNow! in less than 15 minutes

Alfred Gil (HPCNow!)

The main goal of this presentation is to show how we have built sNow!, a Linux distribution capable of deploying a Hadoop cluster in the most comfortable way a sysadmin can imagine (the myth about lazy sysadmins is well known, and although it is only a myth, we are proud to help strengthen it). To achieve this objective, the installation of the whole system should be as unattended as possible and must provide all the tools the sysadmin may need along the way.

We have chosen the Cloudera package, since it bundles Apache Hadoop together with an extended collection of complementary tools, including a powerful management dashboard. Cloudera must be installed on an existing operating system, so the setup has to be done in two steps: first, the installation and configuration of the cluster, and second, the deployment of the Cloudera application.

Aiming to join these two steps into a unified process, we have developed sNow!, a Linux distribution that integrates both the initial cluster installation and the deployment of the Hadoop ecosystem. As a result, this tool simplifies and automates all the tasks involved in correctly setting up the systems that are part of a Hadoop cluster.

We use Debian as the base operating system, since it is one of the most stable distributions available. The final product is an installation ISO, which installs and properly configures the management node. Once the management node is up and running, the operating systems on the computing nodes are installed automatically simply by booting them. Finally, the configuration and deployment of the Hadoop cluster is done via a web interface on the management node.

In this presentation, we will show a live demo of the system, executed on a virtualized environment.

Bring your application to a new era: parallelization and optimization for the Intel® Xeon Phi™ coprocessor – Part 1

Manel Fernández (Bayncore Ltd.)

As the number of transistors on a chip increases with every generation, old processor design recipes are no longer enough to keep power consumption at reasonable levels. Today’s processors are less focused on clock frequency, ILP (instruction-level parallelism) and single-thread performance, in favor of other types of parallelism such as DLP (data-level parallelism) and TLP (thread-level parallelism). As a result, HPC applications can no longer rely only on the compiler and the micro-architecture: it is the programmer’s responsibility to explicitly express parallelism in order to exploit the full performance capabilities of the underlying system, even on a single node.

In this tutorial we will learn about the Many Integrated Core (MIC) architecture and Intel® Xeon Phi™ coprocessors, and about the key pillars that make this architecture highly suitable for parallel applications. We will also review the different parallel programming models available for native and offload execution, as well as best known methods for parallelization and optimization that enable the programmer to achieve the best application performance on this platform. Finally, we will provide a comprehensive overview of the Intel® Parallel Studio XE 2015 tool suite, which simplifies the design, development, and tuning of parallel applications running on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. The tutorial will be accompanied by a demo of the Intel Parallel Studio XE 2015 tool suite (if time permits).

We have three major goals for this tutorial:

  1. Provide attendees with a clear overview of the Intel Many Integrated Core (MIC) architecture and the Intel® Xeon Phi™ coprocessor, and the main differences with respect to the Intel® Xeon® processor architecture.
  2. Review in some depth the parallelism fundamentals that allow extracting maximum application performance from an Intel® Xeon Phi™ coprocessor.
  3. Cover the latest release of the Intel Parallel Studio XE tool suite, which simplifies the design, development, and tuning of parallel applications running on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.

The concepts presented in this tutorial are not new, and they are inspired by existing online material available to everyone. Nevertheless, the actual tutorial contents, screenshots, and case studies have been developed by the authors of this tutorial.

Download PDF

Bring your application to a new era: parallelization and optimization for the Intel® Xeon Phi™ coprocessor – Part 2

Manel Fernández (Bayncore Ltd.)

As the number of transistors on a chip increases with every generation, old processor design recipes are no longer enough to keep power consumption at reasonable levels. Today’s processors are less focused on clock frequency, ILP (instruction-level parallelism) and single-thread performance, in favor of other types of parallelism such as DLP (data-level parallelism) and TLP (thread-level parallelism). As a result, HPC applications can no longer rely only on the compiler and the micro-architecture: it is the programmer’s responsibility to explicitly express parallelism in order to exploit the full performance capabilities of the underlying system, even on a single node.

In this tutorial we will learn about the Many Integrated Core (MIC) architecture and Intel® Xeon Phi™ coprocessors, and about the key pillars that make this architecture highly suitable for parallel applications. We will also review the different parallel programming models available for native and offload execution, as well as best known methods for parallelization and optimization that enable the programmer to achieve the best application performance on this platform. Finally, we will provide a comprehensive overview of the Intel® Parallel Studio XE 2015 tool suite, which simplifies the design, development, and tuning of parallel applications running on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. The tutorial will be accompanied by a demo of the Intel Parallel Studio XE 2015 tool suite (if time permits).

We have three major goals for this tutorial:

  1. Provide attendees with a clear overview of the Intel Many Integrated Core (MIC) architecture and the Intel® Xeon Phi™ coprocessor, and the main differences with respect to the Intel® Xeon® processor architecture.
  2. Review in some depth the parallelism fundamentals that allow extracting maximum application performance from an Intel® Xeon Phi™ coprocessor.
  3. Cover the latest release of the Intel Parallel Studio XE tool suite, which simplifies the design, development, and tuning of parallel applications running on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.

The concepts presented in this tutorial are not new, and they are inspired by existing online material available to everyone. Nevertheless, the actual tutorial contents, screenshots, and case studies have been developed by the authors of this tutorial.

Download PDF