AI training workloads require infrastructure that mirrors the demands traditional high-performance computing (HPC) has faced for decades: powerful compute nodes, high-speed, low-latency interconnects, and fast access to a unified storage namespace.
As a result, AI training tasks are today still largely run using orchestration, compute, and storage infrastructure paradigms similar to those found in traditional supercomputing environments.
Cloud computing makes it possible, out of the box, to execute these AI training workflows with the standard tools, schedulers, and methodologies consolidated in on-premises systems.
At the same time, it offers opportunities to modernize and enhance these workflows, enabling more efficient resource utilization across all infrastructure layers, including compute and storage.
This can be achieved through technologies such as containerization, platform-as-a-service (PaaS) offerings, and object storage.
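As an illustrative sketch (not taken from the session itself), a classic batch-scheduler training launch might map to a containerized Kubernetes Job along these lines; the image name, GPU count, and volume claim are hypothetical placeholders:

```yaml
# Hypothetical Kubernetes Job standing in for a traditional Slurm batch script.
# Image, resource limits, and storage claim are illustrative, not prescriptive.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model                # analogous to a batch job name
spec:
  template:
    spec:
      containers:
        - name: trainer
          image: registry.example.com/ai/train:latest  # containerized training environment
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: 4    # GPUs per pod, akin to an HPC node allocation
          volumeMounts:
            - name: dataset
              mountPath: /data     # unified namespace, e.g. backed by object storage
      restartPolicy: Never
      volumes:
        - name: dataset
          persistentVolumeClaim:
            claimName: training-data
```

The point of the mapping is that scheduling, environment, and data access each gain a cloud-native counterpart without changing the training code itself.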
In this session, the audience will receive an overview of how to transition AI workloads from a traditional HPC infrastructure to a cloud-native approach through a phased methodology.
The discussion will highlight the benefits of running AI workloads in the cloud, showing how a non-disruptive, incremental approach can add value and modernization to classical training workflows.

