High-Performance Cluster for Biomedical Research Using 10 Gigabit Ethernet iWARP Fabric

The cluster is used to run a variety of
workloads, such as image analysis, various
bioinformatics software and tools, CFD
modeling, computational chemistry
software, and many other open source,
commercial, and in-house applications.
The cluster is designed to meet all of the
current scientic computational demands
as well as provide a platform that will be
able to handle other kinds of workloads
over the cluster’s lifespan.
The cluster topology, which is shown in
Figure 1, consists of 14 server racks with
36 servers per rack, for a total of 504
servers. At the rack level, each server has
two connections to one of two 48-port,
1U Arista 7148SX switches: one 10GbE
link (using direct-attach Twinax cable) for
RDMA trafc and one GbE link for all other
trafc. Each Arista 7148SX switch has
eight 10GbE uplinks (16 per rack) to
a group of Arista 7xxx switches.
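
As a rough, illustrative sizing check (a back-of-the-envelope figure that assumes every link could be driven at full line rate, which real workloads rarely do): each rack presents 36 × 10 Gb/s = 360 Gb/s of server-facing RDMA bandwidth against 16 × 10 Gb/s = 160 Gb/s of uplink bandwidth to the core, an oversubscription ratio of 2.25:1 for traffic leaving the rack.
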
Software running on the cluster includes
Red Hat Enterprise Linux* 5.3, OFED
(OpenFabrics Enterprise Distribution)
1.4.1, and Intel® MPI (Message Passing
Interface) 3.2.1.
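
To give a concrete sense of how an application exercises this stack, the short C program below is a minimal MPI ping-pong between two ranks. It is an illustrative sketch rather than one of the cluster's production codes; the compiler wrapper (mpicc) and any fabric-selection settings (for example, the I_MPI_DEVICE environment variable in Intel MPI 3.x releases) are assumptions about a typical Intel MPI/OFED setup, not details taken from this deployment.

    /* Minimal MPI ping-pong sketch (illustrative only; not from the deployment).
     * Build with an MPI compiler wrapper such as mpicc and launch with mpirun. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char buf[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size < 2) {
            if (rank == 0)
                fprintf(stderr, "Run with at least two ranks.\n");
            MPI_Finalize();
            return 1;
        }

        if (rank == 0) {
            /* Rank 0 sends "ping" to rank 1 and waits for the reply. */
            strcpy(buf, "ping");
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received \"%s\"\n", buf);
        } else if (rank == 1) {
            /* Rank 1 echoes back "pong". */
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            strcpy(buf, "pong");
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Launched across two nodes (for example, with mpirun -np 2), the exchange travels over whichever fabric the MPI library has been configured to use; on this class of cluster that would be the 10GbE iWARP links, with OFED providing the underlying RDMA transport.
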

Using iWARP to Lower Overhead and Latency in Multi-Gigabit Networks

Ethernet's sales volume makes it extremely cost-effective for general-purpose local area network traffic, but its suitability as the underlying fabric for high-performance compute clusters posed a series of challenges that had to be met. The first of these was for line rate to reach a sufficiently high level, which has been achieved with the mainstream availability of 10GbE networking equipment.
To take full advantage of 10GbE line rate, however, the latency associated with Ethernet networking also had to be overcome. iWARP specifies a standard set of extensions to TCP/IP that define a transport mechanism for RDMA. As such, iWARP provides a low-latency means of carrying RDMA traffic over Ethernet, as depicted in Figure 2, through three main mechanisms (a code-level sketch follows the list):
• Delivering a Kernel-Bypass Solution. Placing data directly in user space avoids kernel-to-user context switches, reducing latency and processor load.
• Eliminating Intermediate Buffer Copies. Data is placed directly in application buffers rather than being copied multiple times into driver and network stack buffers, reducing latency as well as memory and processor usage.
• Accelerating TCP/IP (Transport) Processing. TCP/IP processing is done in hardware instead of in the operating system's network stack software, enabling reliable connection processing at speed and scale.
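
For readers who want to see what these mechanisms look like at the API level, the fragment below is a minimal sketch against the libibverbs interface that OFED provides: it registers an application buffer and posts a one-sided RDMA write. It is illustrative only; connection establishment (typically handled through librdmacm on iWARP adapters), completion polling, and full error handling are omitted, and the protection domain, queue pair, remote address, and remote key are assumed to have been set up and exchanged out of band.

    /* Illustrative libibverbs fragment: register a buffer and post an RDMA write.
     * Connection setup (librdmacm), completion polling, and cleanup are omitted;
     * pd, qp, remote_addr, and remote_rkey are assumed to exist already. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    int post_rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                        uint64_t remote_addr, uint32_t remote_rkey)
    {
        const size_t len = 4096;
        void *buf = malloc(len);              /* ordinary application buffer */
        memset(buf, 0, len);

        /* Pin and register the buffer so the adapter can DMA it directly,
         * eliminating intermediate copies through kernel socket buffers. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr) {
            free(buf);
            return -1;
        }

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };

        /* One-sided RDMA write: data moves from this buffer straight into the
         * remote application's registered buffer, without involving the remote
         * host's kernel or CPU in the data path. */
        struct ibv_send_wr wr = {
            .wr_id               = 1,
            .sg_list             = &sge,
            .num_sge             = 1,
            .opcode              = IBV_WR_RDMA_WRITE,
            .send_flags          = IBV_SEND_SIGNALED,
            .wr.rdma.remote_addr = remote_addr,
            .wr.rdma.rkey        = remote_rkey,
        };
        struct ibv_send_wr *bad_wr = NULL;

        /* The work request is handed to the adapter from user space (kernel
         * bypass); TCP/IP processing for the connection runs on the RNIC. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }

The fragment mirrors the list above: the buffer is registered once, the work request is posted from user space, and the payload moves directly between application buffers without per-message kernel involvement or intermediate copies.
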
Figure 1. The cluster consists of 504 Dell* PowerEdge* R610 servers (36 per rack, each with two quad-core processors), each uplinked over one 10GbE link (for RDMA) and one GbE link (for all other traffic) to one of two 48-port Arista* 7148SX rack switches; each rack switch has eight 10GbE fibre-optic uplinks (16 per rack) to the central Arista 7xxx network fabric.

Figure 2. iWARP improves throughput by reducing the overhead associated with kernel-to-user context switches, intermediate buffer copies, and TCP/IP processing.