HPCKP (High-Performance Computing Knowledge Portal) is an Open Knowledge project focused on technology transfer and knowledge sharing in the field of HPC science.
The HPCKP project was founded in late 2010 by Jordi Blasco as an initiative of the Reference Network on Theoretical and Computational Chemistry (XRQTC), with the aim of sharing in-depth knowledge about how to install and optimize specific applications in Computational Chemistry.
Since then, the HPCKP project has grown and now provides articles, tools, conferences, seminars, training sessions and other activities in pursuit of its main objectives.
The scientific problems that researchers face every day usually have a solution. However, finding that solution is often limited by the sheer complexity of the problem, which cannot be tackled on a personal computer.
Using a computing cluster is a great help… but every cluster has its own features, quirks and hacks. In this talk we present Condor, a queueing system used in many HPC clusters (including HERMES, the I3A’s cluster). We will go over the actual setup of our Condor deployment, all the configuration tuning, and some success stories (and pitfalls) that we have come across during our life with this big bird.
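For readers unfamiliar with Condor, a job is described in a submit file and handed to the scheduler. The sketch below uses illustrative names and resource requests, not our actual HERMES configuration:

```
# sketch.sub - minimal HTCondor submit description (illustrative names)
universe       = vanilla
executable     = simulate            # hypothetical user binary
arguments      = --input data.$(Process)
output         = out.$(Process)
error          = err.$(Process)
log            = sketch.log
request_cpus   = 1
request_memory = 2GB
queue 10                             # ten independent jobs
```

Such a file would be submitted with condor_submit sketch.sub and the resulting queue inspected with condor_q.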
The MapReduce paradigm has proven to be a simple and feasible way of filtering and analyzing large data sets in cloud and cluster systems. Algorithms designed for the paradigm must implement regular data distribution patterns so that appropriate use of resources is ensured. Good scalability and performance in MapReduce applications greatly depend on designing regular intermediate data generation and consumption patterns at the map and reduce phases. We describe the data distribution patterns found in current MapReduce read-mapping bioinformatics applications and show some data decomposition principles that greatly improve their scalability and performance.
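The map/shuffle/reduce data flow described above can be sketched with the classic Unix pipeline analogy; word counting stands in for counting intermediate keys emitted by a read mapper, and the sample data is made up:

```shell
# Count occurrences of each key: the canonical map-reduce pattern.
printf 'read1 read2 read1\nread3 read1\n' |
  tr ' ' '\n' |   # map: emit one intermediate key per line
  sort |          # shuffle: group identical keys together
  uniq -c |       # reduce: aggregate each group into a count
  sort -rn        # present keys by decreasing frequency
```

The shuffle (sort) stage is where irregular key distributions hurt: one over-represented key funnels most of the data to a single reducer, which is exactly the imbalance that regular decomposition aims to avoid.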
The goal of this talk is to present, in a practical way, how the latest AMD technology works and meets current high-performance computing requirements. We will review concepts such as the performance metrics GFLOPS and GB/s; the efficiency of the FPUs and of the memory controllers and channels; the scalability of multi-socket platforms; tuning tips such as process/thread affinity and I/O affinity for multiple InfiniBand adapters and GPUs; the impact of appropriate math libraries and compilers; and the power-consumption characteristics of a system heavily stressed with different HPC workloads. By the end of the session you should walk away with a good foundation on which building-block technologies matter to you and how to design and exploit your own HPC solutions.
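As a taste of the affinity tuning mentioned above, on Linux the placement of processes, threads and memory can be controlled from the command line; the binary name and core numbers below are purely illustrative:

```shell
# Pin an application to cores 0-7 (e.g. the first socket)
taskset -c 0-7 ./my_hpc_app

# Bind both CPU and memory allocation to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my_hpc_app

# For GNU OpenMP codes, limit and place the threads explicitly
OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0-7" ./my_hpc_app
```

Keeping a process and its memory on the same NUMA node avoids remote-memory accesses, which is one of the main reasons affinity matters on multi-socket platforms.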
Falling computer prices in the last few years have promoted the appearance of many HPC clusters, mostly devoted to scientific research. Aiming to make their installation and setup easier, the San Diego Supercomputer Center (SDSC) has created a Linux distribution, Rocks, which is based on CentOS and allows sysadmins to have a computer cluster up and running in a single day.
Rocks automatically installs and configures the Sun Grid Engine queue system, the Ganglia monitoring system, an MPI environment, a directory service similar to NIS, network installation software based on PXE and Anaconda, and multiple software packages for different purposes (compilers, biological data analysis, web servers, etc.). After installing the first node (called the frontend or head node) through a procedure very similar to a standard CentOS installation, a single command installs all the other nodes in the cluster at once; they receive all the required software and are integrated into the cluster environment without any manual intervention by the sysadmin.
Once the cluster is working, the distribution provides all the tools needed to administer the nodes from the frontend, without logging into individual compute nodes. New software can likewise be installed on all nodes through different mechanisms: installation in a shared directory, staging new RPMs that are installed on the nodes at the next reboot, or Rolls. Rolls are packages of packages designed to integrate with the management system in the same way as the base software; some are provided by the distribution developers, and extensive documentation on how to create new ones has encouraged the community to build others. Thanks to all of this, and to the fact that it is completely open source, Rocks is currently used in at least 1871 computer clusters around the world.
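As an illustration of the workflow, the commands below (as found in recent Rocks releases) cover node installation and day-to-day administration from the frontend; the appliance type for each new node is chosen interactively:

```shell
# On the frontend: capture and install compute nodes as they PXE-boot
insert-ethers

# Run a command on every compute node, from the frontend
rocks run host compute 'uptime'

# List the hosts registered in the cluster database
rocks list host
```

This is the "single command" node installation mentioned above: insert-ethers records each booting node in the database and kicks off its unattended installation.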
Some users are liars, cheaters, or at best do not fully understand what they are doing. Luckily, only a small fraction of a cluster’s users behave this way, but they can adversely affect the work of the other users who play by the rules.
The jobs executed in the LSI cluster are very diverse: commercial software such as Matlab, Maple and CPLEX, and proprietary software written in C, Java, Python, … Our current policy states that a user may reserve as many slots as the number of real cores needed to complete the work. As a result, we have a 1:1 slot-to-CPU ratio.
The execution of single-process, single-threaded jobs is not a problem: they request one slot and consume one core. However, for processes with some level of parallelism, sometimes hidden from the regular user (some Matlab libraries are parallel, Java uses threads, …), core usage does not correspond to the number of requested slots.
We can set load limits to discourage overloading nodes, but this does not solve the underlying problem, nor does it rule out the collapse of a node; moreover, load limits do not prevent honest users from being penalized.
Using the core binding technique, the CPU affinity mechanism implemented in all the forks of Grid Engine, we can force jobs to use exactly as many cores as they have reserved slots. A job’s processes remain confined to their assigned cores.
We thus achieve a dual goal: we solve the problems caused by parallel processes (accounting, overloading, cheating), and we improve the execution time of many processes, since core confinement avoids system context switches.
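As a sketch, a core-bound submission in Grid Engine could look as follows; the parallel environment name smp and the script name are assumptions, not our actual configuration:

```shell
# Reserve 4 slots and confine the job's processes to exactly 4 cores
qsub -pe smp 4 -binding linear:4 job.sh

# Equivalently, as directives inside the job script itself:
#$ -pe smp 4
#$ -binding linear:4
```

With linear binding the scheduler assigns the requested number of consecutive cores, so even a job that secretly spawns extra threads cannot spill onto cores reserved for other users.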
This presentation will describe the opportunities and challenges involved in running climate predictions at very high horizontal resolution. The scaling of the EC-Earth code at the highest resolution possible, and the solutions adopted to handle the large output from the different model components (ocean, atmosphere, land and sea ice) that is typical of climate simulations, will be discussed. The submission and monitoring system (Autosubmit) that allows climate simulations to be executed on different platforms in a way that is transparent to the user will be introduced. Autosubmit acts as a wrapper over the queue system and HPC scheduler and enables efficient execution of experiments by pooling together several ensemble members and/or start dates, all independent parallelized climate simulations, into a single multi-thousand-core MPI job.
The GridWay Metascheduler enables large-scale, reliable and efficient sharing of computing resources across different grid middlewares. GridWay allows unattended, reliable and efficient execution of single, array, or complex jobs, whether sequential or parallel, on heterogeneous and dynamic grids.
GridWay performs all the job scheduling and submission steps transparently to the end user and adapts job execution to changing grid conditions. The aim of the tutorial is to provide a global overview of the process of installing, configuring and using GridWay. During the tutorial, participants will receive a practical overview of the agenda topics and will have the opportunity to install their own GridWay instance and to exercise GridWay functionality with examples on a real grid infrastructure.
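As a minimal sketch of the user experience, a GridWay job is described in a small template file; the file name and executable below are illustrative:

```
# hello.jt - GridWay job template (illustrative)
EXECUTABLE  = /bin/echo
ARGUMENTS   = "hello from the grid"
STDOUT_FILE = stdout.${JOB_ID}
STDERR_FILE = stderr.${JOB_ID}
```

A template like this would be submitted with gwsubmit -t hello.jt, and the job’s state followed with gwps, without the user ever selecting a grid resource by hand.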