August 1, 2023 · Jordi Blasco

How to use the Slurm simulator as a development and testing environment

The Slurm simulator is an extremely powerful tool for evaluating different configuration options and changes in policies or limits before implementing them in a production cluster. In this article, I'm going to explain what I have been doing with the Slurm simulator over the last 10 years, the benefits of adopting this technology, and how you can use it in your daily operations.

The Slurm simulator consumes very little CPU and memory. It is maintained by HPCNow! and distributed free of charge as a Docker image on Docker Hub.

Figure 1: The memory footprint and CPU load of the Slurm Simulator are small enough to fit in a Raspberry Pi or a small VM. This image shows a Raspberry Pi Zero W with a Slurm simulator in a Docker image, able to emulate the current Top 1 system in the Top500 list. The original 3D-printed Cray Y-MP case is available on Thingiverse.

Motivation

For an HPC consulting company like HPCNow!, the Slurm simulator allows us to validate a client's workload manager configuration before the hardware is even ready. Achieving the highest level of efficiency in a cluster starts with the workload manager. Based on previous usage metrics, the ability to use cloud resources in a cloud-bursting strategy, and the input from discovery workshops with the end users, we evaluate multiple options. Finally, we provide a suggested configuration and a demonstration of the expected behavior under different scenarios, after implementing mechanisms to mitigate cluster congestion, starvation, domination, hogging, etc.

The use of the Slurm Simulator inside our team goes beyond this initial motivation. The following list provides a quick overview of the main use cases inside our company.

  1. Development and testing environment for custom Slurm plugins. We develop a set of plugins, including job submit plugins, burst buffer plugins, job efficiency plugins, and more. The Slurm simulator allows us to build the environment required for developing, testing, and implementing CI/CD pipelines.
  2. Testing new functionalities. We automated the process of building a Slurm simulator using Bitbucket Pipelines and local resources. Thanks to CI/CD, we can test new functionalities available in new stable releases, or even in release candidates, before we install them in a production environment (a sketch of such a smoke test follows this list).
  3. Validating the current configuration in a newer Slurm version. The simulator also allows us to validate our very mature configuration and adjust it, if required, in order to use a new feature or to deprecate some old parameters.
  4. Perfect environment for Slurm administration training courses. Each student has access to a dedicated Docker container instance where they can follow the hands-on training course. This setup is perfect because you can simulate an environment exactly like the one you have, or an even bigger system than the current number one on the Top500 list.
  5. Perfect environment for running technical tests as part of hiring processes. We use a Slurm simulator along with a parallel development environment for testing the skills of potential candidates. This tool has been vital for assessing the knowledge of computational scientists, HPC DevOps engineers, and HPC system administrators applying for a job at HPCNow!
  6. Development and testing environment for third-party software packages. We use the Slurm simulator as a development and testing environment for implementing tools like Nextflow, JupyterHub, RStudio, our job efficiency monitoring tools, or our cloud-bursting tools without the need for real cluster resources.
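To give a flavor of use case 2, here is a minimal smoke test of the kind such a pipeline could run. The image tag and the docker run options are the ones used later in this article; the script itself, its name, and the exact checks are illustrative assumptions rather than our actual pipeline.

#!/usr/bin/env bash
# smoke_test.sh - illustrative CI smoke test for a Slurm simulator image.
# Everything beyond the image tag and docker options shown in this article
# is an assumption about how such a check could be wired up.
set -euo pipefail

IMAGE="hpcnow/slurm_simulator:20.11.9"
NAME="ci_simulator_$$"

docker pull "${IMAGE}"
docker run --rm --detach \
           --name "${NAME}" \
           -h "slurm-simulator" \
           --security-opt seccomp=unconfined \
           --privileged -e container=docker \
           -v /run -v /sys/fs/cgroup:/sys/fs/cgroup \
           --cgroupns=host \
           "${IMAGE}" /usr/sbin/init

# Give slurmctld a few seconds to start, then submit a trivial sleep job.
sleep 10
docker exec "${NAME}" sbatch --wrap "sleep 5"

# The job should finish and leave the queue well within this window.
sleep 30
if docker exec "${NAME}" squeue --noheader | grep -q .; then
    echo "smoke test failed: queue is not empty" >&2
    docker stop "${NAME}"
    exit 1
fi
docker stop "${NAME}"
echo "smoke test passed"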

Features and expectations

Before you start using our simulator, you should know that there are other implementations of the Slurm simulator. The one originally developed at BSC required a lot of effort in patching the code. Unfortunately, the person originally behind that project (Alejandro Lucero) is no longer involved in HPC, and it became outdated years ago. Stephen Trofinoff and Massimo Benini, from CSCS, invested a lot of effort in updating the simulator code, which was introduced at SLUG'15. Finally, Marco D'Amico, Ana Jokanovic, and Julita Corbalan, from BSC, updated the original simulator code and provided an overview of the improvements at SLUG'18. The main goal of these implementations is to simulate a large number of workload executions in a short time.

The version that we maintain is not meant to simulate workload execution but to provide a complete Slurm environment in which to test close-to-production configurations, plugins, and new features as you would in a real environment. This simulator is one of our many contributions to the open-source HPC community.

These are the key features of our Slurm simulator:

  • Supports job preemption based on memory suspension and requeue mechanisms.
  • Supports multiple users, so you can simulate competition between different users, accounts, etc.
  • Supports all kinds of reservations (see the example after this list).
  • Runs real jobs, since it is a complete Slurm system. Obviously, we recommend using sleep commands or very light commands to simulate real jobs.
  • Supports accounting.
  • Supports common plugins, including job_submit plugins.
  • Supports prologs and epilogs.
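For instance, reservations behave just as they do on a real cluster. Once the container described in the next section is running, something like the following creates, inspects, and deletes a one-hour reservation (the reservation name and node names are illustrative and depend on your slurm.conf):

scontrol create reservation ReservationName=maint StartTime=now Duration=01:00:00 Users=root Nodes=node[001-002]
scontrol show reservation maint
scontrol delete ReservationName=maint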

How to use it

Some time ago, I used to maintain a VM with SuSE Studio. After that service was shut down, I migrated it to a Docker container. That opened up a whole new level of possibilities and happiness.

I hope that at this stage, you are excited to get started with this Docker container. The following instructions will help you create a Slurm simulator and modify it so that it represents your cluster, or a cluster of whatever size you want.

At the time this article was published, Slurm had some unresolved bugs in front-end mode, as described on the SchedMD website, affecting versions 21.08.8-2 and 22.05.5. For that reason, this article focuses on the latest version with a working front-end mode (20.11.9).

1. Pull the image

docker pull hpcnow/slurm_simulator:20.11.9

2. Run the Docker container with the following options:

docker run --rm --detach \
           --name "${USER}_simulator" \
           -h "slurm-simulator" \
           --security-opt seccomp=unconfined \
           --privileged -e container=docker \
           -v /run -v /sys/fs/cgroup:/sys/fs/cgroup \
           --cgroupns=host \
           hpcnow/slurm_simulator:20.11.9 /usr/sbin/init

3. Access the container with an interactive session

docker exec -ti ${USER}_simulator /bin/bash
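Once inside, you can quickly verify that the simulated cluster is alive with the usual client commands (the output will reflect whatever slurm.conf ships in the image):

scontrol ping
sinfo
squeue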

4. Incorporate changes into the Slurm configuration

If you want to keep and reuse the Slurm configuration of your simulator, I suggest copying the contents of /etc/slurm outside the container with the following command:

docker cp ${USER}_simulator:/etc/slurm .

Then you can apply your changes to the files in the slurm folder just copied from the container, and run the Docker container again, exposing that folder in the right place with the option -v "$(pwd)/slurm:/etc/slurm" (bind mounts generally require an absolute path, hence $(pwd)).

docker stop "${USER}_simulator"
docker run --rm --detach \
           --name "${USER}_simulator" \
           -h "slurm-simulator" \
           --security-opt seccomp=unconfined \
           --privileged -e container=docker \
           -v "$(pwd)/slurm:/etc/slurm" \
           -v /run -v /sys/fs/cgroup:/sys/fs/cgroup \
           --cgroupns=host \
           hpcnow/slurm_simulator:20.11.9 /usr/sbin/init

Now you can play with your brand-new virtual cluster as if it were a regular one. I suggest using sleep commands to simulate real jobs. You can take advantage of the --wrap option or use standard submit scripts as usual.
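For example, both of the following submit a fake five-minute workload (the job name and time limit are arbitrary):

sbatch --wrap "sleep 300"

cat > fake_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=fake-workload
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
sleep 300
EOF
sbatch fake_job.sh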

5. If you want to use the Slurm simulator to review the configuration of your production cluster, you can use the port_to_simulator.sh script available in the container. This script updates a standard Slurm configuration so that it runs in a simulated environment. Obviously, I haven't explored all the options. If the porting fails, consider increasing the verbosity level for slurmctld and slurmd in slurm.conf and check the logs.
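For reference, raising the verbosity means setting something like the following in slurm.conf. The debug levels are standard Slurm parameters; the log file paths are common defaults that may differ in your configuration:

SlurmctldDebug=debug3
SlurmdDebug=debug3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

You can then follow those log files inside the container with tail -f while reproducing the failure.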

Figure 2: Output of the sinfo command in a Slurm simulator based on the configuration proposed to one of our customers.

What are the mandatory configuration parameters in the Slurm configuration?

  • SlurmctldHost must be set to slurm-simulator
  • FrontendName must be set to slurm-simulator
  • The nodes definition must contain the following two parameters:
    • NodeAddr=slurm-simulator
    • NodeHostName=slurm-simulator

If you want to simulate an extremely large system, consider disabling the following parameters: AccountingStorageType and TaskPlugin.
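Putting this together, the relevant fragment of a simulated slurm.conf could look like the excerpt below. The host and node-address values are the mandatory ones listed above; the node names, counts, CPUs, and memory are made-up values for illustration:

SlurmctldHost=slurm-simulator
FrontendName=slurm-simulator

# Every simulated node resolves to the single container.
NodeName=node[001-500] NodeAddr=slurm-simulator NodeHostName=slurm-simulator CPUs=128 RealMemory=256000 State=UNKNOWN
PartitionName=compute Nodes=node[001-500] Default=YES MaxTime=INFINITE State=UP

# For extremely large simulations, consider disabling these:
#AccountingStorageType=accounting_storage/slurmdbd
#TaskPlugin=task/affinity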



