Agenda_

08:15 – 09:00 Registration

09:00 Leveraging HPC Infrastructure for AI Workloads – The Role of Kubernetes in AI and HPC

Jordi Blasco

CTO

HPCNow! – Do IT Now Alliance

As the demand for AI workloads surges, there is a pressing need to adapt existing High-Performance Computing infrastructures to accommodate the requirements of this emerging user community. While many HPC users already run AI workloads, the influx of new AI users with little exposure to HPC demands a more accessible and interactive platform. Kubernetes is emerging as a solution for AI workloads, but significant challenges arise as soon as those workloads need to scale.

Addressing the needs of this growing user base goes beyond standard Kubernetes solutions. This talk delves into the challenges of integrating AI workloads with HPC infrastructures and the solutions available. It discusses leveraging existing HPC tooling, such as using Warewulf to provision Kubernetes alongside Slurm clusters, so that Slurm can balance resources effectively between Kubernetes and traditional HPC workloads according to load.

This talk will also introduce the work done by HPCNow! to enhance performance and efficiency for AI workloads on Kubernetes.

Looking towards the future, the talk also examines the development of more mature and suitable solutions for AI and HPC on Kubernetes.

09:30 Unlocking the Power of AI: Exploring Innovations with Lenovo and AMD

Joerg Roskowetz

Director Solution Architect AI and Web3 Technology

AMD

Discover how AMD’s cutting-edge technologies are revolutionizing AI applications, from deep learning to natural language processing. Through engaging discussions and real-world case studies, learn how businesses can leverage AMD’s solutions to accelerate AI development, enhance performance, and drive unprecedented innovation.

10:00 – 10:30 Coffee Break

HPC_


10:30 HPC in weather and climate simulations

Mario Acosta

Engineer & Leading Researcher

BSC

Predicting, mitigating, and adapting to climate change and its effects on humans and natural systems is one of the biggest challenges of the 21st century. A new generation of efficient, optimised weather and climate models running on supercomputers is needed for reliable predictions of climate change, and of weather and extreme weather events in a changing climate.

The Centre of Excellence in Simulation of Weather and Climate in Europe, ESiWACE, was created to provide HPC expertise and to serve the weather and climate modelling community with innovative technologies and tools to improve computational model performance, support adaptation to new architectures such as MareNostrum 5 or LUMI, and provide enhanced support and training for the community.

Nowadays, the path towards exascale computing holds enormous challenges regarding portability, scalability, and data management, which individual institutes can hardly face alone. In this direction, the third phase of the Centre of Excellence, ESiWACE3, will link, organise, and enhance Europe’s excellence in weather and climate modelling, enabling higher-resolution weather and climate simulations on the upcoming exascale supercomputers and, in the future, providing the technology required for better and more detailed climate-related risk assessments at a local level.

11:00 The Evolution of HPC for a Sustainable World

Michael Rudgyard

CEO

Alces Flight

The conventional approach to procuring High-Performance Computing (HPC) systems has historically centred around a narrow definition of hardware specifications. However, there is a growing awareness within the HPC community regarding the broader implications of these solutions, encompassing not only their impact on research but also their social and ecological consequences.

This talk delves into the carbon footprint of HPC systems, considering both the ‘embedded carbon’ and the carbon emissions associated with their operation. In the context of increasing pressure on organisations to disclose their carbon footprint, some have introduced a ‘carbon price’ to quantify the economic ramifications of emissions. Despite these efforts, navigating the complexities of carbon measurement remains challenging given the multifaceted nature of carbon usage in HPC systems. The focus of this exploration is to identify measurable parameters and initiate steps toward mitigating the overall environmental impact of HPC systems.
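
As a purely illustrative sketch of the kind of book-keeping involved (all figures below are made-up placeholders, not data from the talk), a lifetime estimate can be built from the embedded carbon plus the operational carbon, with a notional carbon price applied to the total:

// Illustrative toy model only: splits 'embedded' (manufacturing) carbon from
// operational emissions and applies a notional carbon price. Every number here
// is a placeholder chosen for readability, not a real measurement.
#include <iostream>

int main()
{
    const double embedded_kgCO2e     = 750000.0;        // assumed manufacturing carbon of the system
    const double avg_power_kW        = 400.0;            // assumed average electrical draw
    const double lifetime_hours      = 5.0 * 365 * 24;   // assumed 5-year service life
    const double grid_kgCO2e_per_kWh = 0.20;             // assumed carbon intensity of the electricity supply
    const double carbon_price_per_t  = 90.0;             // notional carbon price per tonne CO2e

    const double operational_kgCO2e = avg_power_kW * lifetime_hours * grid_kgCO2e_per_kWh;
    const double total_tCO2e        = (embedded_kgCO2e + operational_kgCO2e) / 1000.0;

    std::cout << "Embedded:    " << embedded_kgCO2e / 1000.0    << " tCO2e\n"
              << "Operational: " << operational_kgCO2e / 1000.0 << " tCO2e\n"
              << "Carbon cost: " << total_tCO2e * carbon_price_per_t << " (currency units)\n";
    return 0;
}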

The discussion extends to examining methodologies for organisations to assess their carbon footprint, exploring the potential reuse of waste heat, leveraging power from low-carbon or renewable sources, and implementing strategies to prolong the lifespan of computing equipment or recycle outdated hardware to minimise embedded carbon.

11:30 Why I like Lenovo Confluent – A user perspective

Gilad Berman

HPC Architect

Lenovo

Confluent is Lenovo’s open-source scale-out management software, providing significant security improvements, enhanced interoperability, and scalability. Confluent includes diskless and diskful OS deployment, firmware configuration and updates, console management, and related capabilities.

As a ‘heavy’ Confluent user, I will use this session to highlight some of my favorite Confluent features, the ones that have made my life easier over the years.

12:00 Handling C++ Exceptions in MPI Applications

Jiri Jaros

Associate Professor

Brno University of Technology

Error states in C++ applications are managed through exceptions. In distributed applications, it becomes essential to communicate to other processes when an error occurs, giving the application the option to either recover from the faulty state or gracefully report the error and terminate.

Regrettably, the MPI standard does not offer any built-in mechanisms for handling errors in a distributed environment. This paper presents a new approach to exception handling in MPI applications. The goals are to (1) report any faulty state to the user in a nicely formatted way from just a single rank, (2) ensure the application can never deadlock, and (3) propose a simple interface and ensure interoperability with other C/C++ libraries. The proposed method adopts a minimalistic interface and offers several advantages: no dedicated rank is required for error handling, a single reduce operation is sufficient to confirm that the application passed through a checkpoint, a deadlock in the application cannot interrupt the error handling, and the application always terminates gracefully with an appropriate error message. The code was tested with various MPI implementations at scales of up to 1,536 ranks. External libraries, specifically the distributed versions of the Fast Fourier Transform (FFTW) and the HDF5 I/O libraries, were selected for their extensive use of collective communications. The testing involved injecting errors into multiple ranks, such as a non-existent input file, an exceeded disk quota, an incorrect rank in an MPI call, and standard system exceptions.

Remarkably, the code functioned correctly in all tested scenarios. It can be downloaded from https://github.com/jarosjir/MPIErrorChecker
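
To make the checkpoint idea concrete, the following is a minimal C++/MPI sketch of how a single collective per checkpoint can propagate a per-rank error flag. It is not the MPIErrorChecker API, and the reporting is simplified: each faulty rank prints its own message rather than a single aggregated report.

// Sketch of the checkpoint-plus-reduce idea: one collective per checkpoint lets
// every rank learn whether any rank raised an exception, so all ranks shut down
// together instead of deadlocking in a later collective.
#include <mpi.h>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>

void checkpoint(const std::string& label, int localError, const std::string& message)
{
    int globalError = 0;
    // Single reduce: the maximum of the per-rank error flags.
    MPI_Allreduce(&localError, &globalError, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    if (globalError != 0) {
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (localError != 0)  // only ranks that caught an exception know the message
            std::cerr << "Rank " << rank << " failed at checkpoint '" << label
                      << "': " << message << std::endl;
        MPI_Finalize();
        std::exit(EXIT_FAILURE);  // every rank reaches this point, so no deadlock
    }
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int localError = 0;
    std::string message;
    try {
        // ... work that may throw, e.g. opening an input file that does not exist ...
    } catch (const std::exception& e) {
        localError = 1;
        message = e.what();
    }
    checkpoint("read input", localError, message);

    // ... further work, with a checkpoint after each phase ...

    MPI_Finalize();
    return 0;
}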

12:30 Panel discussion

13:00 – 14:00 Lunch break

Emerging Technologies_


14:00 pDOCA: Offloading Emulation for BlueField DPUs

Muhammad Usman

Research Engineer

BSC

Data Processing Units (DPUs) are a new class of programmable processor: a system on a chip (SoC) that combines industry-standard, high-performance, software-programmable processing elements (such as CPUs and GPUs), tightly coupled to the other SoC components.

DPUs are oriented to infrastructure computing platforms for processing software-defined networking, storage, and cybersecurity operations. DPUs can combine powerful computing, high-speed networking, and extensive programmability to deliver hardware-accelerated solutions for the most demanding workloads.

The main manufacturer of DPUs is NVIDIA, which provides a dedicated software development kit (SDK), DOCA, for its BlueField DPU devices. DOCA is designed to unlock data center innovation by enabling the rapid creation and deployment of applications and services. It provides industry-standard open application programming interfaces (APIs) and frameworks for networking, security, storage, high-performance computing (HPC), and artificial intelligence (AI) applications, among others. The frameworks simplify application offload with integrated NVIDIA acceleration packages.

Despite the provided APIs, DOCA still presents an entry barrier for new users, especially HPC programmers, who are unlikely to be familiar with this novel programming paradigm. The OpenMP DPU Offloading Support (ODOS) framework is presented to overcome this barrier by providing DOCA-based offloading through standard OpenMP syntax.

Unfortunately, DPU devices are not widely available in production supercomputing clusters today. We therefore implemented pDOCA, an emulation layer for programming with the DOCA libraries without access to DPU hardware. pDOCA uses the LD_PRELOAD mechanism of Linux to trap DOCA calls, which can then be translated to another backend, such as TCP or UCX, that provides the actual implementation. pDOCA has been specifically implemented to enable ODOS emulation for users without DPU devices. The infrastructure is composable and can be extended to support more projects in the future.
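
For readers unfamiliar with the interposition mechanism, the sketch below illustrates how LD_PRELOAD trapping works in general. To avoid guessing at DOCA's actual symbols, it wraps the libc open() call instead; pDOCA applies the same idea to DOCA entry points and forwards them to an emulation backend.

// Generic LD_PRELOAD interposer (not pDOCA code): any call to open() made by the
// target program is routed through this wrapper, which can log or redirect it
// before forwarding to the real implementation looked up with RTLD_NEXT.
//
// Build: g++ -shared -fPIC -o libtrap.so trap.cpp -ldl
// Use:   LD_PRELOAD=$PWD/libtrap.so ./some_program
#include <cstdarg>
#include <cstdio>
#include <dlfcn.h>
#include <fcntl.h>
#include <sys/types.h>

extern "C" int open(const char* path, int flags, ...)
{
    using open_fn = int (*)(const char*, int, ...);
    static open_fn real_open = reinterpret_cast<open_fn>(dlsym(RTLD_NEXT, "open"));

    std::fprintf(stderr, "[trap] open(%s)\n", path);  // an emulation layer would redirect here

    if (flags & O_CREAT) {           // preserve the optional mode argument
        va_list args;
        va_start(args, flags);
        mode_t mode = va_arg(args, mode_t);
        va_end(args);
        return real_open(path, flags, mode);
    }
    return real_open(path, flags);
}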

14:30 Cloud Native technologies for the AI era

Carlos Arango

Senior Systems software engineer

NVIDIA

As AI becomes more widespread, users and system administrators are striving to develop new platforms to accommodate upcoming workloads. Cloud-native technologies have been preparing for this, but there is still more work to be done. Leveraging resources effectively in Kubernetes can pose challenges around efficiency, configuration, extensibility, and scalability. The cloud-native community is fully invested in enabling users to run AI workloads at scale: from going multicluster to integrating Slurm into Kubernetes, it is working to unlock new levels of scalability and efficiency for AI workloads.

In this talk, we will explore some of the latest technologies being developed by the cloud-native communities and discuss how they aim to address the challenges of the GenAI era by enhancing Kubernetes.

15:00 Enhancing Research Through Advanced Scientific Literature Analysis

Aleksandra Kowalczuk

Cyber Security Senior Analyst

Accenture

The development of natural language analysis, understanding, and generation techniques, along with the ever-growing scientific literature, opens up new opportunities to support research achievements. As an example, the medical literature on the breakthrough CAR-T cancer therapy will be discussed, presented as a topic that requires computational support due to the multitude of unknowns surrounding it.

The aim is to test whether selected methods, such as a scientific discovery engine built on HPC infrastructure and cloud solutions, as well as machine learning algorithms, including GenAI, can improve the identification of novel insights.

15:30 Revolutionize large-scale data transport. Unleash AI and HPC

Chin Fang

Founder & CEO

Zettar

The November 2022 launch of ChatGPT by OpenAI sparked a firestorm of interest in Large Language Models (LLMs). In turn, the need to move data at scale and speed is receiving ever-growing attention. Nevertheless, the solutions and approaches for meeting this need remain mostly the same. At HPCKP21, the author presented a new software data mover and a logical but fresh way to look at addressing this need. In this talk, we will continue to examine the current situation people face, and an updated way to address the need will be recommended as well.

To move data at scale and speed effectively, one should approach the problem with an integrated consideration of all the infrastructure stacks and software, so that they work in concert. Nevertheless, it is still very common to see people treat data transport as a network-only or software-only task, or, even worse, as just another compute-centric task.

The rest of the talk will cover a few areas we are working on to future-proof our solution and reduce its carbon footprint. We will describe a unique way to streamline large-scale data movement in cloud environments, including sovereign AI clouds.

16:00 Compatibility for HPC, a Story of Rainbows and Schedulers

Vanessa Sochat

Principal Computer Scientist

LLNL

The growing expanse of environments across high performance computing and cloud is causing an equivalent expansion of applications suited to those environments. While traditional approaches for matching work to resources can suffice for simple use cases, the increasing complexity of resource subsystems that range from power, to accelerators, to I/O warrants more intelligent algorithms and automation. In this talk, I will share our work in the Open Container Initiative Compatibility working group to define standard specifications for compatibility metadata. I will then talk about compatibility in the context of image selection, building, and scheduling, and software and infrastructure prototypes that empower us to study compatibility in these contexts. I will finish with early experimental results that show interesting trade-offs between matching and run or build success, and how it’s not necessarily the case that more information is always better. This talk is fun, engaging, and literally and figuratively colorful, and I hope that you enjoy it.

16:30 Panel discussion

17:00 – 17:15 Coffee Break


17:15 – 18:00 Trip by bus to the city centre

18:00 – 19:30 Guided visit

19:30 Dinner at Ocaña Restaurant (Plaza Real 13-15)

21:30 Trip by bus to the hotel (from Plaza Drassanes)

09:15 Kubernetes for HPC Services

Davide Pastorino

Do IT Systems – Do IT Now Alliance

Diego Bacchin

Do IT Systems – Do IT Now Alliance

This presentation explores the use of Kubernetes, a powerful container orchestration platform, to deliver high-performance computing (HPC) management and monitoring services. It discusses how Kubernetes can enhance the provisioning, management, and scalability of such HPC services. Key topics include the design decisions regarding the Kubernetes infrastructure and related services, as well as the deployment of HPC management and monitoring services. Attendees will gain insights into leveraging Kubernetes to modernize and optimize HPC management services.

10:00 Optimising HPC Utilisation at EBI with HPCNow! Metrics

Janne Elo

HPC Engineer

EBI

10:30 – 11:00 Coffee Break

Quantum_


11:00 Comparative Analysis of Fidelity and Execution Times in Qaptiva for Quantum Noise Modeling

Miriam Bastante

Technical Engineer

EVIDEN

Currently, we are in the NISQ era, where intermediate-scale quantum systems are available, although they are affected by errors induced by quantum noise that cannot be neglected. In this context, quantum emulators offer controlled and reproducible environments in which to investigate and mitigate these noise-related errors.

Moreover, they allow quantum algorithms to be tested before their implementation on real QPUs, avoiding long queues and high costs and thereby accelerating the development and optimization of computational resources in the NISQ era. Qaptiva, a quantum emulator offered by Eviden, provides a platform where quantum noise can be modeled thanks to dedicated libraries for this purpose. These libraries include various noise models and support two simulation methods: deterministic and stochastic. By evaluating fidelity and execution-time metrics, our goal is to assess the performance of Qaptiva as an effective tool for studying quantum noise, comparing its results with those obtained on a real QPU.
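
As a reminder of what the fidelity metric compares, here is a generic sketch (not Qaptiva code) using the simplified pure-state form F = |<ideal|noisy>|^2; noise studies on mixed states use the more general density-matrix fidelity, but the idea of comparing an ideal output against a noisy one is the same.

// Pure-state fidelity between an ideal state vector and a perturbed one.
// The "noisy" amplitudes below are made-up placeholders.
#include <cmath>
#include <complex>
#include <cstddef>
#include <iostream>
#include <vector>

using cd = std::complex<double>;

double fidelity(const std::vector<cd>& ideal, const std::vector<cd>& noisy)
{
    cd overlap{0.0, 0.0};
    for (std::size_t i = 0; i < ideal.size(); ++i)
        overlap += std::conj(ideal[i]) * noisy[i];  // <ideal|noisy>
    return std::norm(overlap);                      // |<ideal|noisy>|^2
}

int main()
{
    const double s = 1.0 / std::sqrt(2.0);
    const std::vector<cd> bell      = {s, 0.0, 0.0, s};          // ideal (|00> + |11>)/sqrt(2)
    const std::vector<cd> perturbed = {0.72, 0.05, 0.05, 0.69};  // placeholder noisy state

    std::cout << "Fidelity = " << fidelity(bell, perturbed) << "\n";
    return 0;
}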

11:30 Qilimanjaro’s quantum full-stack approach: an HPC use case

Qilimanjaro is an Analog Quantum Computing company that strives to provide effortless cloud access to its proprietary Coherent Quantum Analog Computer, based on a unique flux-qubit architecture.

Furthermore, by giving access to several Application Specific Integrated Circuit (ASIC)-like quantum chips through our Quantum as a Service (QaaS) product, we will create the first Quantum Data Center in Europe to tackle critical problems in drug discovery, supply management, climate modelling, and finance.

In this talk, we are going to present the different iterations that have led to our current prototype of a Slurm-based queuing system. This has increased the availability of the service dramatically, allowing for seamless submission of local and remote quantum experiments.

To conclude, we will present and run a practical demonstration showcasing the Clauser-Horne-Shimony-Holt (CHSH) experiment, which is typically used to demonstrate entanglement in a quantum computer. We will use Qilimanjaro’s QaaS to run the experiment, providing an example of a full-stack quantum solution.
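
As a rough illustration of what such a demonstration evaluates (a generic sketch, not Qilimanjaro's QaaS code, with made-up coincidence counts standing in for real measurement results), the CHSH value S is built from four correlation estimates and compared against the classical bound of 2:

// CHSH evaluation from coincidence counts. Any local-hidden-variable model obeys
// |S| <= 2, while entangled qubits can reach up to 2*sqrt(2) ~ 2.83.
#include <cmath>
#include <iostream>

// Correlation E = (N_equal - N_unequal) / N_total for one pair of measurement settings.
double correlation(int n_pp, int n_pm, int n_mp, int n_mm)
{
    const double total = n_pp + n_pm + n_mp + n_mm;
    return (n_pp + n_mm - n_pm - n_mp) / total;
}

int main()
{
    // Placeholder counts (++, +-, -+, --) for the settings (a,b), (a,b'), (a',b), (a',b').
    const double E_ab   = correlation(427,  73,  71, 429);  // ~ +0.71
    const double E_abp  = correlation( 72, 428, 426,  74);  // ~ -0.71
    const double E_apb  = correlation(430,  70,  72, 428);  // ~ +0.71
    const double E_apbp = correlation(425,  75,  73, 427);  // ~ +0.71

    const double S = E_ab - E_abp + E_apb + E_apbp;
    std::cout << "S = " << S
              << (std::abs(S) > 2.0 ? "  (violates the classical bound of 2)\n"
                                    : "  (within the classical bound)\n");
    return 0;
}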

Javier Sabariego

Backend Engineer

Qilimanjaro

Adrià Blanco

Backend Engineer

Qilimanjaro

Victor Sanchez

Quantum Engineer

Qilimanjaro

12:00 Distributed Hybrid Computing in the Industry

Javier Hernanz

Quantum Advisory Team

Repsol

In a world where computing workloads run entirely on cloud and multi-cloud environments, there is no room for manual operation: everything must be ready to be automated, with security, build, deployment, and operation mechanisms of every kind, all the more so when HPC or HPDA systems are involved. But what if this system must be integrated with a full ecosystem of quantum technologies for computing, sensing, communications, and cybersecurity, deployed both on the cloud and at the edge? In this session we will look at the QCDI (Quantum Cognitive Digital Industry) project, in which digital twins for industry are being generated and deployed with special attention to quantum technologies for data acquisition, enhancement, and use.

12:30 New Trends in Quantum Machine Learning

Ginés Carrascal

Computational Scientist & Architect

IBM Quantum

High-Performance Computing has reached a pivotal moment with the advent of quantum computing. This talk delves into the quantum realm, exploring the foundational elements and applications of quantum computing to machine learning.

Qubits, States, Operators, and Other Secret Monsters

At the heart of quantum computing lies the qubit, a unit of quantum information that defies classical analogs. We unravel the mysteries of qubits, their states, and the operators that manipulate them, revealing the ‘secret monsters’ of quantum mechanics that empower quantum computers to perform complex calculations.

Parameter-Dependent Quantum Circuits

The adaptability of quantum circuits is paramount for their practical application. We examine parameter-dependent quantum circuits, which are essential for running hybrid quantum-classical algorithms. These circuits’ capacity and trainability are assessed, highlighting their potential in the noisy intermediate-scale quantum (NISQ) era.
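
As a toy, framework-agnostic illustration of this hybrid pattern (not drawn from the talk itself), a one-parameter circuit RY(theta) applied to |0> can be tuned by a classical loop that estimates gradients with the parameter-shift rule:

// Hybrid quantum-classical toy loop: the "quantum" part is the analytic expectation
// <Z> = cos(theta) of RY(theta)|0>; the classical part is plain gradient descent.
#include <cmath>
#include <iostream>

// State after RY(theta)|0> is (cos(theta/2), sin(theta/2)), so <Z> = cos(theta).
double expectation_z(double theta)
{
    const double a = std::cos(theta / 2.0);
    const double b = std::sin(theta / 2.0);
    return a * a - b * b;
}

int main()
{
    const double pi = std::acos(-1.0);
    const double lr = 0.4;   // classical learning rate
    double theta = 0.3;      // initial parameter guess

    for (int step = 0; step < 25; ++step) {
        // Parameter-shift rule: d<Z>/dtheta = (<Z>(theta + pi/2) - <Z>(theta - pi/2)) / 2
        const double grad = (expectation_z(theta + pi / 2.0) - expectation_z(theta - pi / 2.0)) / 2.0;
        theta -= lr * grad;  // step towards the minimum of <Z>
    }

    std::cout << "theta ~ " << theta << ", <Z> ~ " << expectation_z(theta)
              << "  (the minimum <Z> = -1 sits at theta = pi)\n";
    return 0;
}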

Learning from Scratch: How to Train Your Quantum Computer

Quantum computing is not just for experts. We present a beginner-friendly approach to quantum machine learning, demonstrating how enthusiasts can get started and understand the basics of a quantum learning system by programming it from scratch.

Getting the Most Out of It: Quantum Support Vector Machines 

Quantum Support Vector Machines (QSVMs) represent a significant leap in machine learning. By leveraging the principles of quantum mechanics, QSVMs offer a new paradigm for data classification and pattern recognition. We explore the complexities and advantages of QSVMs, showcasing their superiority in certain applications over their classical counterparts.

13:00 – 14:00 Lunch break

User experience_


14:00 Streaming scientific software has never been so EESSI

Alan O’Cais

Research Fellow, University of Barcelona; Research software engineer

CECAM

Have you ever wished that all the scientific software you use was available on all the resources you have access to, without having to go through the pain of getting it installed the way you want or need?

The European Environment for Scientific Software Installations (EESSI – pronounced “easy”) is a common stack of scientific software installations for HPC systems and beyond, including laptops, personal workstations and cloud infrastructure. In many ways it works like a streaming service for scientific software, instantly giving you the software you need, when you need it, and compiled to work efficiently for the architecture you have access to.

In this talk, we’ll explain what EESSI is, how it is being designed, how to get access to it, and how to use it. We’ll include a number of demonstrations and review significant developments of the last 12 months (including support for NVIDIA GPUs, and active development for RISC-V systems).

14:30 Open OnDemand: Connecting Computing Power with Powerful Minds

Developed by the Ohio Supercomputer Center and funded by the U.S. National Science Foundation, Open OnDemand (openondemand.org) is an open-source portal that empowers students, researchers, and industry professionals with web-based access to high-performance computing (HPC) services. Open OnDemand frees end-clients from the need to understand the operating environment and complexities of HPC systems, letting them focus instead on accelerating their research. Simultaneously, the platform enables computer center staff to support a wide range of clients by simplifying the user interface and experience. In this talk, we briefly describe the Open OnDemand platform, current deployments, integrated applications, and opportunities to engage with the developers and the community that has evolved since the platform was introduced in 2013.

Julie Ma

Co-PI

Open OnDemand Project

Emily Moffat-Sadeghi

Developer Relations Program Manager

Open OnDemand Project

Hazel Randquist

Developer

Open OnDemand Project

15:15 Web Access to HPC via Open OnDemand

Erica Bianco

Computational Scientist

HPCNow! – Do IT Now Alliance

Open OnDemand (OOD) is becoming increasingly popular as a user-friendly web interface for accessing High-Performance Computing (HPC) resources and data.

With OOD, the end-user can access the login nodes from a web page, use the file explorer, and spin up applications on the compute nodes with a simple click.

In addition to the OOD default features, we have developed support for adding different kinds of applications. For example, documentation or dashboards can be embedded directly into the OOD interface, and CLI application jobs can be submitted via a web form, with the output retrieved through the file explorer. Finally, we are working on launching GUI applications from the file explorer.

Open OnDemand thus allows end-users with minimal expertise in scheduling and job submission to load complex applications, move data in and out of the file system, and execute jobs independently with ease.

16:00 Workflow Provenance with RO-Crate in Autosubmit

Manuel G. Marciani

First Stage Researcher

BSC

In this talk we will present the current implementation of workflow provenance using the community-maintained open standard RO-Crate within Autosubmit, the BSC in-house developed experiment and workflow manager. This tool is designed from the ground up to conduct climate, weather, and air quality experiments on different platform types (local, HPC, cloud), and it is the backbone of multiple leading flagship European projects.

Workflow managers play a central role: they receive user input and process it with local and remote jobs that run on different platforms and generate output data. RO-Crate enables tracking of both prospective provenance (what should happen, e.g. the workflow configuration and Slurm job settings) and retrospective provenance (what happened, e.g. log files and performance indicators), capturing information that is already present within the workflow manager.
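
For readers who have not seen an RO-Crate before, the sketch below writes a hand-made, minimal ro-crate-metadata.json with placeholder entries for one prospective item (a configuration file) and one retrospective item (a log). It follows the generic RO-Crate 1.1 layout and is not the structure that Autosubmit itself produces; all file names and descriptions are assumptions for illustration.

// Writes a minimal, hand-made RO-Crate metadata file. The entries marked
// "placeholder" are illustrative only.
#include <fstream>

int main()
{
    std::ofstream out("ro-crate-metadata.json");
    out << R"json({
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    { "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
      "about": { "@id": "./" } },
    { "@id": "./",
      "@type": "Dataset",
      "name": "Placeholder experiment crate",
      "hasPart": [ { "@id": "experiment_config.yml" }, { "@id": "job.log" } ] },
    { "@id": "experiment_config.yml",
      "@type": "File",
      "description": "Prospective provenance placeholder: workflow configuration, job settings" },
    { "@id": "job.log",
      "@type": "File",
      "description": "Retrospective provenance placeholder: run log, performance indicators" }
  ]
}
)json";
    return 0;
}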

16:30 – 17:00 Wrap up

Adamantium sponsorship_


Platinum sponsorship_

