Dell - Internal Use - Confidential
Accelerating HPL using the Intel Xeon Phi 7120P Coprocessors
Authors: Saeed Iqbal and Deepthi Cherlopalle
The Intel Xeon Phi series can be used to accelerate HPC applications on the PowerEdge C4130. The highly parallel
architecture of the Phi coprocessors can speed up parallel applications. These coprocessors work seamlessly with the
standard Xeon E5 processor series, providing additional parallel hardware. A key benefit of the Xeon Phi series is
that applications do not need to be redesigned; only compiler directives are required to use the Xeon Phi
coprocessor.
Fundamentally, the Intel Xeon Phi coprocessors are many-core parallel processors in which each core has a
dedicated L2 cache. The cores are connected through a bidirectional ring interconnect. Intel offers a complete set
of development, performance monitoring and tuning tools through Parallel Studio and VTune. The goal is to let HPC
users take advantage of the parallel hardware with minimal changes to their code.
The Xeon Phi has two modes of operation: offload mode and native mode. In offload mode, designated parts of the
application are “offloaded” to the Xeon Phi, if one is available in the server. The required code and data are
copied from the host to the coprocessor, the processing is done in parallel on the coprocessor, and the results
are moved back to the host. There are two kinds of offload mode: non-shared memory and virtual-shared memory. Each
offers a different level of user control over data movement to and from the coprocessor and incurs different types
of overhead. In native mode, the application runs on both the host and the Xeon Phi simultaneously, communicating
the required data between them as needed. A good reference on the Xeon Phi and its modes can be found here.
The Intel Xeon Phi 7120P coprocessor has the highest performance in the Phi series. It has 61 cores, is rated at
1.2 TFLOPS and can run 244 hardware threads. The 7120P also supports Intel Turbo Boost Technology. The bulk of the
compute-intensive calculations are done on the coprocessors.
The PowerEdge C4130 offers five configurations, “A” through “E”, two of which are balanced. These two balanced
configurations, “C” and “D”, are considered for acceleration in this blog. Configuration “C” is the balanced
four-coprocessor option, with two coprocessors attached to each host processor, and configuration “D” has a single
Xeon Phi attached to each host processor. The details of the two configurations are shown in Table 1, and block
diagrams of configurations “C” and “D” are shown in Figure 1.
This blog shows the acceleration observed on the C4130 with Intel Xeon Phi 7120P coprocessors in configurations
“C” and “D”.