As the number of transistors on a chip increases with every generation, the old processor design recipes are becoming less effective at keeping power consumption at reasonable levels. Today’s processors focus less on clock frequency, ILP (instruction-level parallelism), and single-thread performance, and more on other types of parallelism such as DLP (data-level parallelism) and TLP (thread-level parallelism). As a result, HPC applications can no longer rely on the compiler and the micro-architecture alone; it is the programmer’s responsibility to express parallelism explicitly in order to exploit the full performance of the underlying system, even on a single node.
In this work we will learn tips, advice, and best-known methods for parallelizing and optimizing existing HPC applications on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. We will see how to apply different programming models to get the best performance out of these platforms, and how to identify when a particular application would perform better on the processor, on the coprocessor, or in a hybrid scheme.