HPC Knowledge Portal (Keynote)

Jordi Blasco (HPCKP/NeSI)

HPC Future trends, my 2 cents

Gabriel Verdejo (RDLab)

Exascale: Challenges and Opportunities for the Real World

David Tur (CESCA/HPCNow!)

Stay Tuned

Jordi Blasco (HPCKP/NeSI)

HPC Knowledge Portal (Keynote)

Jordi Blasco (HPCKP/NeSI)

HPCKP (High-Performance Computing Knowledge Portal) is an Open Knowledge project focused on technology transfer and knowledge sharing in the HPC science field. The HPCKP project was founded in late 2010 with the idea of sharing the deep knowledge acquired by people in the HPC field about how to install and optimize specific applications.

The project has since grown and now provides articles, tools, conferences, seminars, training sessions and other activities to accomplish the objectives mentioned above. We are proud to have international contributions and collaborations from very active members of the HPC community, and we expect that more people will get involved in the future. Maybe you? 🙂

The main goals of this project are impartiality and objectivity. For that reason, there are no commercial banners or donation forms. HPCKP is developed by people who enjoy the work they do and want to share their knowledge with the world.

Download PDF

Exascale: Challenges and Opportunities for the Real World

David Tur (CESCA/HPCNow!)

On the 2020 horizon, it is likely that exascale systems will be available. Many papers and talks regarding exascale are appearing at workshops and conferences in the computing field. Most of them focus on the software and hardware problems and challenges the HPC community will face when scaling to systems with such a huge number of processors (on the order of 10^5).

Supercomputing is well established and mature in engineering and the sciences, as computer simulation-based methods have become the third pillar of science (alongside experiment and theory). This talk will try to go one step further, showing how HPC is also, and especially will be in the near future, a key element in the economic competitiveness of industries and countries.

Download PDF

Stay Tuned

Jordi Blasco (HPCKP/NeSI)

There was a time when everyone took care of performance, scalability and efficiency. Nowadays, with incredibly powerful architectures in our hands, a new and growing HPC community is demanding platforms to run simulations that remain far from exploiting the features that the current hardware can provide. Some NeSI (New Zealand eScience Infrastructure) case studies will be used to show the importance of tuning HPC applications.

Download PDF

SUSE Linux Enterprise Server Best Choice for High Performance Computing

Alberto Esteban (SuSE)

SIE LadonOS: Kickstart Install System

David Ramírez (SIE)

Log managing at PIC

A. Bruno Rodríguez (PIC)

GPI-2: a PGAS API for asynchronous and scalable parallel applications

Rui Machado (CC-HPC, Fraunhofer ITWM)

SUSE Linux Enterprise Server Best Choice for High Performance Computing

Alberto Esteban (SuSE)

SUSE Linux Enterprise Server has been designed to handle mission-critical workloads in the data center. It ships with file systems perfectly suited for large-scale environments. A look at the Top500.org web site shows that SUSE Linux Enterprise Server is the operating system of choice on the world’s largest HPC supercomputers in use today. SUSE Linux Enterprise Server helps you save money, deliver mission-critical data center services reliably and securely, and get the most out of your mixed IT environment.

SIE LadonOS: Kickstart Install System

David Ramírez (SIE)

I would like to share with all of you my method for installing and configuring HPC systems. For the last two years I have been upgrading and recycling a great number of clusters. Thanks to the HPC Knowledge Portal, I have developed a system called “Ladon OS” that allows me to upgrade HPC systems. Its main advantage is the node installation system based on Kickstart, which is my greatest success and which I am very pleased to share with all of you: it enables me to install many nodes in a very short period of time (yes indeed, it is possible and it is true!). I would like to explain, live, how it works. To do this we will take a Kickstart file apart so that we can study its sections. Afterwards, I will show you how to customize a Kickstart configuration file. And best of all: it is utterly simple. You will see.
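
To make this concrete, here is a small, purely illustrative Python sketch that loads a Kickstart file and inspects a few of its parts with the pykickstart library. This is just one possible way to dissect such a file (the file name and fields shown are placeholders), not necessarily how Ladon OS itself does it.

    # Hypothetical sketch: parse a Kickstart file with the pykickstart library
    # and look at some of its parts. The file name 'node.ks' is a placeholder.
    from pykickstart.parser import KickstartParser
    from pykickstart.version import makeVersion

    handler = makeVersion()              # command handler for the default syntax version
    parser = KickstartParser(handler)
    parser.readKickstart("node.ks")      # read and parse the node's Kickstart file

    print(handler.timezone.timezone)     # e.g. the configured timezone
    print(handler.packages.packageList)  # packages selected in the %packages section

    print(handler)                       # dump the normalized Kickstart back as text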

Download PDF

Log managing at PIC

A. Bruno Rodríguez (PIC)

As a distributed environment grows in number of components and servers, extracting and analyzing relevant information from system and application logs becomes an increasingly complex task.

In this presentation we show our approach to an indexed, centralized log storage, composed of three components:

  • Logstash, as a log collector, parser, converter and forwarder;
  • Elasticsearch, as the search and storage engine; and
  • Kibana, as the visualization interface.

The chosen implementation performs well for searching, is distributed in terms of both storage and search computation and, as most subsystems can be replicated, is scalable and provides high availability.
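
To illustrate the searching side, a minimal sketch using the Elasticsearch Python client to query logs that Logstash has indexed; the index pattern, field names and query string below are assumptions for illustration and do not necessarily reflect the actual setup at PIC.

    # Minimal sketch: query Logstash-indexed logs in Elasticsearch.
    # Index pattern, fields and query string are illustrative assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])      # address of an Elasticsearch node

    # Full-text query over the daily Logstash indices, newest entries first
    result = es.search(
        index="logstash-*",
        body={
            "query": {"query_string": {"query": "sshd AND failed"}},
            "sort": [{"@timestamp": {"order": "desc"}}],
            "size": 10,
        },
    )

    for hit in result["hits"]["hits"]:
        print(hit["_source"]["@timestamp"], hit["_source"]["message"])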

Download PDF

GPI-2: a PGAS API for asynchronous and scalable parallel applications

Rui Machado (CC-HPC, Fraunhofer ITWM)

In this talk we will present GPI-2, the second generation of a PGAS API for the development of scalable parallel applications (GPI stands for Global address space Programming Interface). It focuses on asynchronous, one-sided communication as provided by RDMA interconnects such as InfiniBand. GPI-2 has two main goals. The first is to increase communication performance by using the network interconnect directly and minimizing the communication overhead, truly enabling a complete overlap of communication and computation. The second is to provide a simple API for the development of parallel applications based on more asynchronous algorithms. In the talk we will give an overview of its functionality and discuss some of GPI’s unique features which enable scalable and high-performance applications. This will be supported by performance results on different kinds of applications and benchmarks.

Download PDF

Asynchronous Job Operator (AJO)

Fernando Galindo & Gabriel Verdejo (RDLab-LSI)

Cluster Deployment with FAI

Alfred Gil (CESCA/HPCNow!)

Sex, lies and queues: Confessions of an HPC sysadmin

Antonio Sanz (I3A/University of Zaragoza)

EasyBuild: Getting Scientific Software Installed

Jens Timmerman (UGhent)

Asynchronous Job Operator (AJO)

Fernando Galindo & Gabriel Verdejo (RDLab-LSI)

AJO (“garlic” in Spanish) stands for Asynchronous Job Operator. This useful tool was designed and developed to provide a transparent gateway between your web, application or service and an HPC system. AJO moves your executions to a Distributed Resource Management Application API (DRMAA) compatible queue system, such as the Grid Engine family or TORQUE, allowing you to submit, execute and retrieve any kind of greedy task in an easy and fast way. AJO is released under the GNU license and can be found at http://rdlab.lsi.upc.edu/ajo
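
As an illustration of the DRMAA layer that AJO builds on, here is a minimal sketch using the Python drmaa bindings to submit a job and wait for it; the command and arguments are placeholders, and AJO’s own interface is documented at the URL above.

    # Minimal DRMAA sketch: submit a job to a DRMAA-compatible queue system
    # (e.g. Grid Engine or TORQUE) and wait for it to finish.
    # The command and its arguments are placeholders.
    import drmaa

    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = "/bin/sleep"   # executable to run on the cluster
        jt.args = ["60"]                  # its arguments
        jt.joinFiles = True               # merge stdout and stderr

        job_id = session.runJob(jt)
        print("Submitted job:", job_id)

        # Block until the job finishes and report its exit status
        info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        print("Job", info.jobId, "finished with exit status", info.exitStatus)

        session.deleteJobTemplate(jt)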

Download PDF

Cluster Deployment with FAI

Alfred Gil (CESCA/HPCNow!)

FAI (Fully Automatic Installation) is a non-interactive tool to install, customize and manage Linux boxes and software configurations. It allows you to deploy, in a matter of minutes, any Linux distribution on a set of computers, from a few single systems to a cluster or a cloud with several thousand, all of it requiring zero interaction. FAI is flexible: you can define multiple “classes” of systems to manage heterogeneous infrastructures and provision different software to systems with different architectures or functions. FAI is lightweight: it is based on shell and Perl scripting, uses no special daemons and no database, and it is fully managed from the command line. FAI is customizable: it allows you to tune every detail of the system provisioning by using “hooks”.

FAI was started in 1999 by Thomas Lange at the University of Cologne, and is nowadays developed by a team of six people. In this talk we will give a very brief introduction to the main features of FAI and the basic configuration needed to deploy and update a Linux cluster.

Download PDF

Sex, lies and queues: Confessions of an HPC sysadmin

Antonio Sanz (I3A/University of Zaragoza)

System administration is tough. HPC system administration is tougher. Most of the time you have to juggle computing, storage, networks, software and code in order to get the required performance (of course with 100% availability & security). Do you have a screwdriver and some chewing gum? Technology is only one leg of the tripod. You have to deal with users (who always want more), bosses (who always want to pay less) and consultants (who always want your money). People have to be sheph…sorry, managed properly. And don’t forget that all the big changes you make to your HPC infrastructure can (and should) be treated as projects. Although project management can be hell, done well it can also be the key to heaven (or at least to some inner zen-like peace). I’ve been managing a midsized HPC cluster for more than 10 years, and I’d like to give back some tips and tricks learned along the way (most of the time by trial and error, or by utmost failure) to make this challenging task lighter. The tips will be split 50/50 between technology and management, and black humour will be all around.

Download PDF

EasyBuild: Getting Scientific Software Installed

Jens Timmerman (UGhent)

Maintaining a collection of software installations for a diverse user base can be a tedious, repetitive, error-prone and time-consuming task. Because most end-user software packages for an HPC environment are not readily available in existing OS package managers, they require significant extra effort from the user support team. Reducing this effort would free up a large amount of time for tackling more urgent tasks. In this presentation we present EasyBuild [1], a software installation framework written in Python that aims to support the various installation procedures used by the vast collection of software packages that are typically installed in an HPC environment, catering to widely different user profiles. It is built on top of existing tools, and provides support for well-established installation procedures. Supporting customised installation procedures requires little effort, and sharing implementations of installation procedures becomes very easy. Installing supported software packages can be done by issuing a single command, even if their dependencies are not installed yet. Hence, it simplifies the task of HPC site support teams, and even allows end-users to keep their software installations consistent and up to date.
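
To give a flavour of how builds are specified, here is a hypothetical easyconfig file in EasyBuild’s Python-based syntax; the package name, version, toolchain and URLs are made up for illustration, and the exact set of parameters depends on the EasyBuild version.

    # Hypothetical easyconfig sketch (EasyBuild build specifications use Python
    # syntax). Name, version, toolchain and URLs are illustrative only.
    easyblock = 'ConfigureMake'          # generic configure / make / make install procedure

    name = 'example-app'
    version = '1.0'

    homepage = 'http://www.example.org/'
    description = "Example application used to illustrate the easyconfig format."

    # Compiler toolchain (compilers, MPI, math libraries) to build with
    toolchain = {'name': 'goolf', 'version': '1.4.10'}

    source_urls = ['http://www.example.org/downloads/']
    sources = ['example-app-%(version)s.tar.gz']

    # Dependencies, which --robot can resolve and build automatically
    dependencies = [('zlib', '1.2.7')]

    moduleclass = 'tools'

With such a file in place, a single command like "eb example-app-1.0-goolf-1.4.10.eb --robot" would, in this hypothetical setup, build and install the package together with any missing dependencies.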

Download PDF

On the way to Exascale Computing

Adriano M. Galano (Fujitsu)

How to Setup a Cluster File System in Less Than 15 Minutes

Alfred Gil (CESCA/HPCNow!) & Salvador Martín (HPCNow!)

Introduction to system monitoring with Nagios, Check_MK and OMD

Iñigo Aldazabal (CSIC-UPV/EHU)

Hands-on: System monitoring with Open Monitoring Distribution

Iñigo Aldazabal (CSIC-UPV/EHU)

On the way to Exascale Computing

Adriano M. Galano (Fujitsu)

Critical challenges must be overcome to reach the exaflop range of supercomputers in the 2018-2022 timeframe. Fujitsu will share with attendees the key software and hardware technologies in which the Japanese IT giant is innovating to develop the next generation of supercomputers, and how this helps to address society’s needs. Building on our long-standing history of innovation, 30 years of experience in the development of supercomputers and the exceptional depth and breadth of our offering, we provide the enabling technologies and services for a wide range of aerospace, meteorology, astronomy, healthcare and industrial projects.

How to Setup a Cluster File System in Less Than 15 Minutes

Alfred Gil (CESCA/HPCNow!) & Salvador Martín (HPCNow!)

File system I/O is the main bottleneck for many HPC applications. In the current multi-core era, the local file system, even on SSDs, is not enough to provide the expected performance for the most demanding applications, and a cluster file system may seem expensive and hard to manage. In this 15-minute talk we are going to explain how to create a cluster file system designed for temporary I/O on commodity hardware, covering high-bandwidth needs and heavy metadata usage. The Institute of Theoretical and Computational Chemistry of the Universitat de Barcelona (IQTCUB) will provide the hardware resources needed for this live demo.

Download PDF

Introduction to system monitoring with Nagios, Check_MK and OMD

Iñigo Aldazabal (CSIC-UPV/EHU)

System monitoring not only alerts us when certain parameters go out of tolerance or when malfunctions occur in our infrastructure, but also helps us analyze resource trends and gives clues about present system problems.

Traditionally, Nagios has been the “de facto” industry standard for IT infrastructure monitoring due, amongst other merits, to its flexible notification system, simple plugin design and open-source nature. However, the power and flexibility Nagios offers come at a price: a steep learning curve and complexity in its setup and configuration.
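
As an illustration of how simple that plugin design is, here is a minimal, hypothetical check written in Python: a Nagios plugin is just a program that prints one status line (optionally with performance data after a “|”) and exits with 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN. The load-average thresholds below are made up.

    #!/usr/bin/env python
    # Minimal, hypothetical Nagios-style plugin: check the 1-minute load average.
    # Plugin contract: print one status line and exit with
    # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
    import os
    import sys

    WARN, CRIT = 8.0, 16.0   # illustrative thresholds for a compute node

    try:
        load1 = os.getloadavg()[0]
    except OSError:
        print("LOAD UNKNOWN - could not read the load average")
        sys.exit(3)

    if load1 >= CRIT:
        status, code = "CRITICAL", 2
    elif load1 >= WARN:
        status, code = "WARNING", 1
    else:
        status, code = "OK", 0

    # Performance data after the '|' lets Nagios/OMD graph the value over time
    print("LOAD %s - load1=%.2f | load1=%.2f;%.1f;%.1f" % (status, load1, load1, WARN, CRIT))
    sys.exit(code)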

Nowadays the Open Monitoring Distribution (OMD) comes to the rescue, offering a pre-packaged Nagios system for a variety of GNU/Linux distributions. Building on top of Nagios and the Check_MK plugin ecosystem, it allows us to deploy, for example, a fairly complete monitoring solution for a medium-sized HPC cluster, with notifications, trend visualization, etc., in a matter of hours.

In this talk we will take a look at the basics of Nagios monitoring (types of monitoring, notifications, plugin system, etc.), its advantages and main problems and how the latter are solved, or at least greatly mitigated, by Check_MK and OMD. We will also examine the interaction between these three layers.

Download PDF

Hands-on: System monitoring with Open Monitoring Distribution

Iñigo Aldazabal (CSIC-UPV/EHU)

Some kind of system monitoring software is essential in any system administrator’s toolkit. It allows us to be alerted when (or before) problems occur, to get a general overview of the system’s health and to examine resource trends and historical records.

Nagios, the “de facto” IT monitoring solution, is well known both for its power and extensibility and for its notorious difficulty to set up and configure. The Open Monitoring Distribution (OMD) offers a pre-packaged, fully working Nagios system which, built on top of Nagios and the Check_MK ecosystem, together with many other standard Nagios plugins and extensions, makes the installation, setup and maintenance of a full monitoring solution a trivial task.

In this hands-on tutorial we will start from two bare CentOS virtual machines and carry out a step-by-step procedure that takes us from zero configuration to a full monitoring system with email notifications, graphs, trends, etc., in which one of the machines becomes the Nagios/OMD server monitoring the status of both. This procedure, for which detailed notes will be provided, should allow us to set up, in a similar fashion, a full monitoring infrastructure for a regular-sized (tens to hundreds of nodes) HPC cluster in a matter of hours.

Download PDF