About Abaqus

The Abaqus Unified FEA product suite offers powerful and complete solutions for both routine and sophisticated engineering problems covering a vast spectrum of industrial applications. In the automotive industry, engineering work groups are able to consider full vehicle loads, dynamic vibration, multibody systems, impact/crash, nonlinear static, thermal coupling, and acoustic-structural coupling using a common model data structure and integrated solver technology. Best-in-class companies are taking advantage of Abaqus Unified FEA to consolidate their processes and tools, reduce costs and inefficiencies, and gain a competitive advantage.

Official website: http://www.3ds.com/products-services/simulia/products/abaqus/

Disclaimer: This article uses an optimisation configuration that is specific to one particular computational environment. Producing an optimised package for a different combination of hardware, software and parallelisation environments may require some changes.

 

Abstract

This document explains how to enable Abaqus Checkpointing & Restart (C&R) with the Slurm Workload Manager. This feature can be used to minimise the impact of a hardware failure by restarting the job from the last checkpoint. Native C&R can also be used to take advantage of Slurm job preemption based on the re-queueing mechanism. In general, C&R relies on a fast shared filesystem, since extensive use of C&R on a slow filesystem could introduce an unreasonable overhead. The following software packages were used on a RHEL 6.3 operating system:

  • Slurm Workload Manager 14.11
  • Abaqus 6.13-2
  • IBM Platform MPI 8.3

Implement restart options in the input file

When running an analysis, Abaqus can write the model definition and state to the files required for restart. The restart capability covers the following scenarios:

  • Changing an analysis: After viewing results from the previous analysis, you might need to change the load history data from an intermediate point. In such a case Abaqus allows you to restart the analysis from that point.
  • Continuing with additional steps: Sometimes, having viewed the results of a successful analysis, you might decide to append steps to the load history.
  • Continuing an interrupted run: The Abaqus restart analysis capability allows the interrupted analysis to be completed as originally defined.

This article covers the last scenario: continuing an interrupted run. Abaqus can use restart files to continue an analysis from a specified step of a previous analysis. By default, Abaqus does not write any restart information for an Abaqus/Standard or an Abaqus/CFD analysis. For an Abaqus/Explicit analysis, Abaqus writes restart information only at the beginning and end of each step.

Abaqus lets you specify the frequency at which data is written to the restart files, but the available options depend on the analysis product:

  • For an Abaqus/Standard step, it’s possible to choose whether the output is written at the exact time interval (or at the closest value). Restart at exact time intervals is available only for steps with an automatic time incrementation.
  • For Abaqus/Standard and Abaqus/CFD, it’s possible to request the frequency in increments or in time intervals.
  • For an Abaqus/Explicit analysis, it’s possible to specify the number of time intervals at which Abaqus writes data to the restart files.
  • For an Abaqus/Explicit step it’s possible to choose whether the output is written at the exact time interval (or at the closest value).
  • For an Abaqus/Standard or an Abaqus/Explicit analysis, it’s possible to request that data written to the restart files overlay data from the previous increment. This option retains only the information from the last increment, which avoids keeping unnecessary files and minimises shared filesystem usage. Note that by default, Abaqus does not overlay data.

The Abaqus restartjoin option extracts data from the output database created by a restart analysis and appends it to a second output database. The cost of this operation may depend on the size of your model; in the initial tests it did not appear to be an expensive filesystem operation. The operation is quite sensitive, however, and a corrupted database sometimes makes it impossible to generate the second output database. For that reason, it is good practice to keep a backup of the last two increments.
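
The backup suggested above can be sketched as a small shell function. This is a minimal sketch, not part of the original workflow: the function name backup_restart_files, the backup.1 and backup.2 directory names, and the choice of file extensions (taken from the submit script later in this article, plus odb) are all assumptions made for illustration.

```shell
# Hypothetical helper: keep the two most recent copies of a job's files.
# Usage: backup_restart_files <job-name>   (run inside the job directory)
backup_restart_files() {
    local job=$1 ext
    # Rotate: the previous backup becomes the older generation.
    rm -rf backup.2
    if [ -d backup.1 ]; then
        mv backup.1 backup.2
    fi
    mkdir -p backup.1
    # Assumed set of files worth keeping; adjust to your analysis.
    for ext in odb res mdl stt prt sim sta; do
        if [ -f "$job.$ext" ]; then
            cp -p "$job.$ext" backup.1/
        fi
    done
    return 0
}
```

Calling the function just before the restartjoin step would leave the two latest generations available if the join fails. Note that each generation is a full copy, so the extra filesystem usage grows with the model size.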

Again, the way to restart an analysis depends on the analysis product:

  • Recover option: only available for Abaqus/Explicit.
  • Restart option: used to start the analysis using data from a previous analysis of a specified model.

Case Abaqus/Standard

The following setup is valid for workflows focused on reliability. In addition, thanks to the Slurm re-queueing mechanism, it can take advantage of job preemption, which has clear benefits over the suspension mechanism. For example:

  • Idle (suspended) jobs no longer hold (shared) Abaqus tokens unnecessarily.
  • Long-running jobs can run with minimal impact from a potential hardware failure.
  • A long-running job can be divided into several incremental runs of the same analysis.
  • Jobs can be migrated to more suitable resources.

To request that restart data be written for an analysis, your Abaqus input file must contain the following line after the line **OUTPUT REQUESTS. Be aware that “frequency=1” means “write at every increment”. If this is too frequent (i.e. excessive file output is causing too much delay), increase the number.

 *Restart, write, overlay, frequency=1 
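
When many input files need this line, the edit can be scripted. The following is a minimal sketch under stated assumptions: the function name add_restart_request is hypothetical, and the input file is assumed to contain a comment line starting with "** OUTPUT REQUESTS" (with or without the space after the asterisks). Any existing "*Restart, write" request is dropped so the line is not duplicated.

```shell
# Hypothetical helper: make sure an Abaqus input file requests restart output.
# Usage: add_restart_request <input.inp> <frequency>
add_restart_request() {
    local inp=$1 freq=$2 tmp status
    # Refuse to touch files without the expected marker line.
    if ! grep -q '^\*\*[ ]*OUTPUT REQUESTS' "$inp"; then
        echo "no OUTPUT REQUESTS line found in $inp" >&2
        return 1
    fi
    tmp=$(mktemp) || return 1
    awk -v line="*Restart, write, overlay, frequency=${freq}" '
        # Drop any existing restart write request.
        tolower($0) ~ /^\*restart,[ ]*write/ { next }
        { print }
        # Add the request right after each OUTPUT REQUESTS comment line.
        /^\*\*[ ]*OUTPUT REQUESTS/ { print line }
    ' "$inp" > "$tmp" && cat "$tmp" > "$inp"
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, `add_restart_request job-checkpoint.inp 1` would rewrite the file in place. If the input file defines several steps, each with its own OUTPUT REQUESTS comment, the line is added to every one of them.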

Workflow based on the re-queueing mechanism

The following submit script submits a job to the Slurm Workload Manager. It lets the job run for at least one hour per run and restarts the analysis when needed. Note that you only need to adjust the standard job definition parameters, the job name, the input file and the frequency that you have set in your input file. The script assumes that you have already defined a temporary cluster file system able to cope with intensive file system operations. In this case, $CHK_DIR is created in the job prolog with the proper ACLs, and the environment variable is defined in the task prolog.

The Slurm option --time-min is key to ensuring progress in the analysis. It defines the minimum time limit on the job allocation. In this case, the value should be an upper bound on the time required to compute the checkpointable cycle(s) plus the time required to dump the information to the shared filesystem. An underestimated value could allow Slurm to kill the job prematurely, before a new checkpoint has been written.
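
The arithmetic behind a --time-min estimate can be sketched as follows. This is an illustration only: the function name time_min, the measured timings passed to it, and the 20% safety margin are assumptions, not values from the benchmark in this article.

```shell
# Hypothetical helper: derive a --time-min value from measured timings.
# Usage: time_min <compute_seconds_per_cycle> <dump_seconds> [cycles]
time_min() {
    local compute=$1 dump=$2 cycles=${3:-1} total
    # Time to compute the cycle(s) plus the time to dump the restart data.
    total=$(( (compute + dump) * cycles ))
    # Add a 20% safety margin (an arbitrary choice), rounding up.
    total=$(( (total * 12 + 9) / 10 ))
    # Print as hours:minutes:seconds, a format accepted by sbatch.
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# Example: 2700 s of compute and 300 s of dump per cycle.
# sbatch --time-min="$(time_min 2700 300)" submit_script.sl
```

With these example timings the function prints 01:00:00, which matches the value used in the submit script below. The script name submit_script.sl in the commented line is a placeholder.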

 
#!/bin/bash
#SBATCH -J Abaqus_JOB-CHECKPOINT
#SBATCH -A nesi99999
#SBATCH --time=10:00:00
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=2048
#SBATCH --open-mode=append
#SBATCH -p requeue
#SBATCH --time-min=01:00:00
###  Load the Environment
module load ABAQUS/6.13.2-linux-x86_64
source /share/SubmitScripts/slurm/slurm_setup_abaqus-env.sh
### In many cases you only need to worry about the following three lines
JOBNAME=job-checkpoint
INPUT=job-checkpoint
FREQ=1
###  Copying files to CHK folder (global scratch file system)
if ! [ -d "$CHK_DIR/$JOBNAME" ]; then
    mkdir -p "$CHK_DIR/$JOBNAME"
    cp "$INPUT.inp" "$CHK_DIR/$JOBNAME/"
fi

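###  Stop here if the CHK folder is not reachable, so nothing is removed elsewhere
cd "$CHK_DIR/$JOBNAME" || exit 1
###  Join the output of a previous restart run and rename its files
if [[ -f Res_$INPUT.sta ]]; then
    rm -f *.lck
    abaqus restartjoin originalodb=$INPUT restartodb=Res_$INPUT history
    for i in res mdl stt prt sim sta com cid 023 dat msg
    do
       mv "Res_$INPUT.$i" "$INPUT.$i"
    done
fi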
###  Run the Parallel Program
if [[ -f $INPUT.sta ]]  || [[ -f Res_$INPUT.sta ]]; then
    ###  Restart from the last entry in the status file whose attempt is not flagged "U"
    echo "*Heading" > "Res_$INPUT.inp"
    gawk -v freq="$FREQ" '{if($3 !~/U/){print "*Restart, read, step="$1",inc="$2", write, overlay, frequency="freq}}' "$INPUT.sta" | tail -1 >> "Res_$INPUT.inp"
    abaqus job=Res_$JOBNAME input=Res_$INPUT.inp oldjob=$JOBNAME cpus=$SLURM_NTASKS -verbose 3 standard_parallel=all mp_mode=mpi interactive
else
    ###  First run: no status file yet, so start the analysis from the beginning
    abaqus job=$JOBNAME input=$INPUT.inp cpus=$SLURM_NTASKS -verbose 3 standard_parallel=all mp_mode=mpi interactive
fi
###  Transfer output files back to the project folder
cp *.dat "$SLURM_SUBMIT_DIR/"
cp *.msg "$SLURM_SUBMIT_DIR/"
cp *.sta "$SLURM_SUBMIT_DIR/"
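
The restart-point selection used in the script above can be isolated and tried on a status file without running Abaqus. The sketch below is a variation, not the script's exact filter: it uses POSIX awk instead of gawk and additionally requires the step, increment and attempt columns to be numeric, so header lines and completion messages are never picked up. The function name restart_line is hypothetical, and the sketch assumes restart data was actually written for the increment it selects.

```shell
# Hypothetical helper: print the *Restart line for the last converged increment.
# Usage: restart_line <file.sta> <frequency>
restart_line() {
    awk -v freq="$2" '
        # Keep rows whose first three columns are step, increment, attempt.
        # An attempt such as "1U" is flagged as unconverged and is skipped.
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            last = "*Restart, read, step=" $1 ", inc=" $2 ", write, overlay, frequency=" freq
        }
        # Print only the last matching row, if there was one.
        END { if (last != "") print last }
    ' "$1"
}
```

For example, `restart_line job-checkpoint.sta 1` prints a single *Restart line that can be appended to the restart input file, or nothing if the status file holds no converged increment yet.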

Evaluate the impact on the cluster file system and analysis runtime

The benchmark results, based on an Abaqus/Standard Dynamic Implicit analysis, show a low impact on the cluster file system, even with the highest frequency (see note 1).

Since Abaqus cannot run in close interaction with srun, the native Slurm profiling tools are not useful. For that reason, the usage metrics were collected with a custom version of dstat and plugins for the cluster file system in use (GPFS). In this case, the job was re-queued every 300 seconds, except for the first re-queue, which took place 320 seconds after the analysis started to run. The following figures show the peaks of cluster file system usage and the CPU load.

In this particular case, the runtime impact of the restartjoin operation is a minor drawback (in many cases less than five seconds). In terms of I/O, however, the restartjoin operation involves approximately 280 MB of data, which makes it the most expensive file system operation.

Figure 1 : effective throughput involved in the checkpoint and restart of Abaqus/Standard Dynamic Implicit Analysis.

 

Figure 2 : CPU usage involved in the checkpoint and restart of Abaqus/Standard Dynamic Implicit Analysis.

 

Please, help us to improve the quality and the objectivity of these articles. We encourage you to send feedback to the authors and reviewers and suggest new ways to get more performance.

Authors: Jordi Blasco (Landcare Research @ NeSI / HPCNow!)

Reviewers: Gene Soudlenkov (Center for eResearch - University of Auckland)

Bart Verleye (Vrije Universiteit Brussel, Belgium)

Acknowledgment : We gratefully and sincerely thank Angel Ashikov, a PhD student in Civil Engineering at The University of Auckland, for his valuable contribution. He was decisive in developing this workflow.

References : Simulia Abaqus 6.13 “Abaqus Analysis User’s Manual”.

 

Note 1: Remember that for Abaqus “frequency=1” means “write at every increment”.