About PhyML

PhyML is software that estimates maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences. Its main strength lies in the large number of substitution models coupled with various options for searching the space of phylogenetic tree topologies, ranging from very fast and efficient methods to slower but generally more accurate approaches. PhyML was designed to process moderate to large data sets. In theory, alignments of up to 4,000 sequences, each up to 2,000,000 characters long, can be processed.

Official website: https://code.google.com/p/phyml/

Disclaimer: This article uses an optimization configuration specific to a particular computational environment. Producing an optimised package for a different combination of hardware, software and parallelisation environments may require some changes.

Abstract

This document explains how to build the PhyML-20120412 program (without any patch) using Intel Cluster Studio XE 2013 with Intel MPI and MKL. The compile process explained below was performed on an Intel SandyBridge (E5-2680) node. A number of compilers, libraries and compiling options were evaluated to achieve performance gains of up to 5 times over the default configuration.

Environment Setup

The Intel Cluster Studio packages require configuration options to be set before use. A set of modules that configure the default environment for these packages has been prepared. Note that the Intel Cluster Studio XE 2013 module (intel/ics-2013) sets up the MKL environment, including the $MKLROOT variable, which points to the MKL installation.

module load intel/ics-2013

Loading the module is broadly equivalent to sourcing the following scripts, which set all the environment variables needed to compile, debug and profile this application:

source /share/apps/intel/icsxe/2013.0.028/composer_xe/mkl/bin/mklvars.sh intel64
source /share/apps/intel/icsxe/2013.0.028/composer_xe/bin/compilervars.sh intel64
source /share/apps/intel/icsxe/2013.0.028/mpi/bin64/mpivars.sh
source /share/apps/intel/icsxe/2013.0.028/composer_xe/bin/idbvars.sh intel64
source /share/apps/intel/icsxe/2013.0.028/composer_xe/tbb/bin/tbbvars.sh intel64
source /share/apps/intel/itac/8.1.0.024/intel64/bin/itacvars.sh
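Before configuring, it can help to confirm that the environment is actually in place. The following is a minimal sketch, not part of the original procedure: missing_tools is a hypothetical helper name, and the list of tools is simply the set of commands used later in this document.

```shell
# missing_tools NAME...: print each command name that is not found on PATH.
# (Hypothetical helper, introduced only for this sketch.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands used later in this document; no output means all were found.
missing_tools icc mpiicc mpiexec.hydra

# The intel/ics-2013 module is expected to export MKLROOT.
if [ -z "${MKLROOT:-}" ]; then
    echo "MKLROOT is not set" >&2
fi
```

If any name is printed, or the MKLROOT warning appears, the module or the scripts above were probably not loaded in the current shell.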

Configuration and code patching

We provide both the serial and the parallel (MPI) versions of PhyML to researchers. This document covers both versions and, in particular, examines the scalability and performance of the MPI version. Some changes were made to the Makefile after the configuration step in order to integrate with ICS 2013.

Serial version

Configure

./configure CC=icc CFLAGS="-xhost -O2" LDFLAGS="-L/share/apps/intel/composer_xe_2013.1.117/mkl/lib/intel64" LIBS="-mkl" --prefix=/share/apps/PHYML/sandybridge/20120412/ics-2013

Editing the makefile

Only a few changes are needed. The key variables to modify are:

CC = icc
CFLAGS = -xhost -Wall -O2 -fomit-frame-pointer -unroll0 -I${MKLROOT}/include -mkl
CPP = icc -E
LDFLAGS =
LIBS = -lm
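These edits can be made by hand, but they can also be scripted so that the build is repeatable. The sketch below is one possible way to do it and is not the procedure the authors used: set_make_var is a hypothetical helper, and the src/Makefile path is an assumption taken from the file names in the diff shown later. For the MPI version, the same helper would be called with mpiicc in place of icc.

```shell
# set_make_var FILE VAR VALUE: rewrite the "VAR = ..." line of FILE in place.
# (Hypothetical helper. Values go through awk -v, so they should not
# contain backslashes.)
set_make_var() {
    file=$1 var=$2 val=$3
    tmp=$(mktemp) || return 1
    awk -v var="$var" -v val="$val" '
        $1 == var && $2 == "=" { print var " = " val; next }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Apply the serial settings listed above, if the Makefile is present.
# Single quotes keep ${MKLROOT} literal so that make expands it later.
if [ -f src/Makefile ]; then
    set_make_var src/Makefile CC icc
    set_make_var src/Makefile CFLAGS '-xhost -Wall -O2 -fomit-frame-pointer -unroll0 -I${MKLROOT}/include -mkl'
    set_make_var src/Makefile CPP 'icc -E'
    set_make_var src/Makefile LIBS '-lm'
fi
```

Commented-out assignments such as "#CFLAGS = ..." are left untouched, because the helper matches the variable name exactly.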

To build the serial version, go to the Building the code section. If you then want to compile the MPI version, come back to the Configuration and code patching section and continue from the MPI version subsection.

MPI version

Configure

./configure CC=mpiicc CFLAGS="-xhost -O2 -vec-report5 -opt-report3 -opt-report-phase=all" LDFLAGS="-L/share/apps/intel/composer_xe_2013.1.117/mkl/lib/intel64" LIBS="-mkl" --prefix=/share/apps/PHYML/sandybridge/20120412/ics-2013 --enable-mpi

Editing the makefile

The key variables to modify are:

CC = mpiicc
CFLAGS = -xhost -Wall -O2 -fomit-frame-pointer -unroll0 -I${MKLROOT}/include -mkl
CPP = mpiicc -E
LDFLAGS =
LIBS = -lm

More aggressive optimization flags were tested, but all of them either decreased performance or caused failures such as segmentation faults. The differences between the optimised Makefile (lines marked with -) and the original one (lines marked with +) are shown below:

--- src/Makefile.ics-2013-optimised	2013-04-09 12:13:37.630518000 +1200
+++ src/Makefile.ics-2013-orig	2013-04-09 10:59:14.025376000 +1200
@@ -365,12 +365,11 @@
 AUTOHEADER = ${SHELL} /share/src/phyml-20120412-ics2013/missing --run autoheader
 AUTOMAKE = ${SHELL} /share/src/phyml-20120412-ics2013/missing --run automake-1.11
 AWK = gawk
-CC = mpiicc
+CC = mpicc
 CCDEPMODE = depmode=gcc3
-#CFLAGS = -xhost -Wall -O2 -fomit-frame-pointer -unroll0 -vec-report5 -opt-report3 -opt-report-phase=all -I${MKLROOT}/include -mkl
-CFLAGS = -xhost -Wall -O2 -fomit-frame-pointer -unroll0 -I${MKLROOT}/include -mkl
+CFLAGS = -Wall -O2 -msse -fomit-frame-pointer -funroll-loops
 CPP = mpiicc -E
 CYGPATH_W = echo
 DEFS = $(REVISION)
 DEPDIR = .deps
@@ -385,9 +384,9 @@
 INSTALL_PROGRAM = ${INSTALL}
 INSTALL_SCRIPT = ${INSTALL}
 INSTALL_STRIP_PROGRAM = $(install_sh) -c -s
-LDFLAGS =
+LDFLAGS =
 LIBOBJS =
-LIBS = -lm
+LIBS = -lm -mkl
 LTLIBOBJS =
 MAKEINFO = ${SHELL} /share/src/phyml-20120412-ics2013/missing --run makeinfo
 MKDIR_P = /bin/mkdir -p

Building the code

The following commands were used as specified in the installation guide in the PhyML documentation:

make | tee -a make_phyml-20120412-ics2013.log
make check
make install
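One caveat with piping make into tee is that the shell reports the exit status of tee, so a failed build can go unnoticed in a script. The sketch below shows one portable way to keep both the log and the real exit status; run_logged is a hypothetical helper introduced here, not something provided by PhyML or the Intel tools.

```shell
# run_logged LOGFILE COMMAND [ARGS...]: run COMMAND, append its output
# (stdout and stderr) to LOGFILE while still showing it, and return
# COMMAND's own exit status instead of tee's.
# (Hypothetical helper, introduced only for this sketch.)
run_logged() {
    log=$1; shift
    status_file=$(mktemp) || return 1
    {
        rc=0
        "$@" 2>&1 || rc=$?
        echo "$rc" > "$status_file"
    } | tee -a "$log"
    status=$(cat "$status_file")
    rm -f "$status_file"
    return "$status"
}

# Example (not executed here):
#   run_logged make_phyml-20120412-ics2013.log make && make check && make install
```

With this wrapper, make check and make install only run if the build itself succeeded.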

Testing the code

We have tested the results and scalability of PhyML using the two short tests provided by the project (nucleic & proteic).

Change to the examples directory and run:

cd examples
export OMP_NUM_THREADS=1
mpiexec.hydra -machinefile hosts -wdir $PWD -np 32 ../src/phyml-mpi -i nucleic -b 100

Once the application finishes, verify that the results are correct.
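A first, very basic check is that the run actually produced its output files. The sketch below assumes that PhyML writes nucleic_phyml_tree.txt and nucleic_phyml_stats.txt next to the input file; those names are an assumption about this release and should be adjusted if your build names them differently. missing_or_empty is a hypothetical helper.

```shell
# missing_or_empty FILE...: print each file that does not exist or is empty.
# (Hypothetical helper, introduced only for this sketch.)
missing_or_empty() {
    for f in "$@"; do
        [ -s "$f" ] || echo "$f"
    done
}

# Assumed output names for the nucleic example; no output means both
# files exist and are non-empty.
missing_or_empty nucleic_phyml_tree.txt nucleic_phyml_stats.txt
```

This only shows that the run completed and wrote something. Comparing the resulting tree and likelihood against a run of the serial build is still needed to judge correctness.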

Setting up the modulefile

#%Module1.0
module-whatis "PhyML: Maximum likelihood phylogenies (version 20120412) Optimized for Intel SandyBridge"
module load intel/ics-2013
set root /share/apps/PHYML/sandybridge/20120412/ics-2013
prepend-path PATH $root/bin

Benchmark results and application scalability

Benchmarks are the only way to ensure that we are doing our job well and that we obtain the expected performance on a given architecture. In addition, for efficiency reasons, we need to characterise how the application scales on a particular hardware design. This can save CPU time when users submit jobs for the first time, since they might otherwise request more cores than the application can scale to.
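Scalability is usually summarised with two numbers: speedup (reference time divided by parallel time) and parallel efficiency (speedup divided by the number of cores). The sketch below computes both from wall-clock times that you measure yourself, for example with the time command. The function names are hypothetical and no timings from this study are implied.

```shell
# speedup T1 TN: reference wall time T1 divided by parallel wall time TN.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# efficiency T1 TN N: speedup divided by the N cores used (1.00 is ideal).
efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN { printf "%.2f\n", t1 / (tn * n) }'
}
```

An efficiency that drops well below 1.00 as cores are added suggests that requesting more cores for that input mostly wastes CPU time.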

Intel Cluster Studio XE 2013 delivers a significant performance boost to PhyML on Intel SandyBridge: it is around 5 times faster (with 36 cores on 2 nodes) than a build with the GNU compilers (gcc 4.4.5), OpenMPI-1.6 and the default configuration.

Please help us improve the quality and objectivity of these articles. We encourage you to send feedback to the authors and reviewers and to suggest new ways to get more performance.

Author: Jordi Blasco (Landcare Research / NeSI)

Reviewers: Jaime Huerta-Cepas (Comparative Genomics Group - CRG)

Sina Masoud-Ansari (Center for eResearch - University of Auckland)