As demand for AI workloads surges, there is a pressing need to adapt existing High-Performance Computing (HPC) infrastructures to the requirements of this emerging user community. Many HPC users already run AI workloads, but the influx of new AI users with little exposure to HPC demands a more accessible and interactive platform. Kubernetes is emerging as a solution for AI workloads, yet significant challenges arise as soon as those workloads need to scale.
Addressing the needs of this growing user base goes beyond standard Kubernetes solutions. This talk delves into the challenges of integrating AI workloads with HPC infrastructures and the solutions to them. It discusses leveraging existing HPC tools, such as Warewulf, to provision Kubernetes alongside Slurm clusters, so that Slurm can balance resources between Kubernetes and traditional HPC workloads based on load. The talk will also introduce the work done by HPCNow! to enhance performance and efficiency for AI workloads on Kubernetes.
Looking towards the future, the talk also examines the development of more mature and suitable solutions for AI and HPC on Kubernetes.