The Swiss National Supercomputing Centre (CSCS) is significantly expanding its computational infrastructure through the scale-up of the Alps architecture, an HPE Cray EX system featuring approximately 10,752 NVIDIA Grace-Hopper GH200 superchips. This new capacity complements an existing ecosystem of over 4,000 heterogeneous compute nodes, including AMD Rome CPUs, AMD MI250X and MI300 GPUs, and NVIDIA A100 GPUs. Integrating such a diverse set of architectures presents major challenges in system observability, performance monitoring, and energy optimization at scale.
To address these challenges, CSCS has developed a scalable and flexible observability platform that runs on Kubernetes, is managed with GitOps practices, and is tailored to the demands of large, heterogeneous HPC systems. Designed for seamless integration with HPE’s native monitoring solutions, the platform adheres to HPE operational guidelines while extending observability across the compute, storage, network, and facility domains. Its modular architecture supports the ingestion, correlation, and visualization of telemetry data, offering deep insight into system performance and health.
In addition to traditional telemetry, CSCS is actively exploring novel approaches to application profiling based on extended Berkeley Packet Filter (eBPF) technologies. This effort aims to provide low-overhead, dynamic instrumentation in kernel space, enabling real-time visibility into user-space application behavior without requiring code modifications.
This presentation will showcase the design and implementation of CSCS’s observability platform, which offers a scalable and future-ready foundation for monitoring and managing the complexity of next-generation HPC systems.

