High-Performance Cluster for Biomedical Research Using 10 Gigabit Ethernet iWARP Fabric

The cluster is used to run a variety of
workloads, such as image analysis, various
bioinformatics software and tools, CFD
modeling, computational chemistry
software, and many other open source,
commercial, and in-house applications.
The cluster is designed to meet all of the
current scientic computational demands
as well as provide a platform that will be
able to handle other kinds of workloads
over the cluster’s lifespan.
The cluster topology, which is shown in
Figure 1, consists of 14 server racks with
36 servers per rack, for a total of 504
servers. At the rack level, each server has
two connections to one of two 48-port,
1U Arista 7148SX switches: one 10GbE
link (using direct-attach Twinax cable) for
RDMA trafc and one GbE link for all other
trafc. Each Arista 7148SX switch has
eight 10GbE uplinks (16 per rack) to
a group of Arista 7xxx switches.
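
As a rough, illustrative sizing check (a back-of-the-envelope figure that assumes every link could be driven at full line rate, which real workloads rarely do): each rack presents 36 × 10 Gb/s = 360 Gb/s of server-facing RDMA bandwidth against 16 × 10 Gb/s = 160 Gb/s of uplink bandwidth to the core, an oversubscription ratio of 2.25:1 for traffic leaving the rack.
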
Software running on the cluster includes
Red Hat Enterprise Linux* 5.3, OFED
(OpenFabrics Enterprise Distribution)
1.4.1, and Intel® MPI (Message Passing
Interface) 3.2.1.
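
To give a concrete sense of how an application exercises this stack, the short C program below is a minimal MPI ping-pong between two ranks. It is an illustrative sketch rather than one of the cluster's production codes; the compiler wrapper (mpicc) and any fabric-selection settings (for example, the I_MPI_DEVICE environment variable in Intel MPI 3.x releases) are assumptions about a typical Intel MPI/OFED setup, not details taken from this deployment.

    /* Minimal MPI ping-pong sketch (illustrative only; not from the deployment).
     * Build with an MPI compiler wrapper such as mpicc and launch with mpirun. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char buf[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size < 2) {
            if (rank == 0)
                fprintf(stderr, "Run with at least two ranks.\n");
            MPI_Finalize();
            return 1;
        }

        if (rank == 0) {
            /* Rank 0 sends "ping" to rank 1 and waits for the reply. */
            strcpy(buf, "ping");
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received \"%s\"\n", buf);
        } else if (rank == 1) {
            /* Rank 1 echoes back "pong". */
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            strcpy(buf, "pong");
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Launched across two nodes (for example, with mpirun -np 2), the exchange travels over whichever fabric the MPI library has been configured to use; on this class of cluster that would be the 10GbE iWARP links, with OFED providing the underlying RDMA transport.
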

Using iWARP to Lower Overhead and Latency in Multi-Gigabit Networks

Ethernet's sales volume makes it extremely cost-effective for general-purpose local area network traffic, but its suitability as the underlying fabric for high-performance compute clusters posed a series of challenges that had to be met. The first of these was for line rate to reach a sufficiently high level, which has been achieved with the mainstream availability of 10GbE networking equipment.
To take full advantage of 10GbE line rate, however, the latency associated with Ethernet networking also had to be overcome. iWARP specifies a standard set of extensions to TCP/IP that define a transport mechanism for RDMA. As such, iWARP provides a low-latency means of carrying RDMA traffic over Ethernet, as depicted in Figure 2, through three main mechanisms (a code-level sketch follows the list):
• Delivering a Kernel-Bypass Solution. Placing data directly in user space avoids kernel-to-user context switches, reducing latency and processor load.
• Eliminating Intermediate Buffer Copies. Data is placed directly in application buffers rather than being copied multiple times into driver and network stack buffers, reducing latency as well as memory and processor usage.
• Accelerating TCP/IP (Transport) Processing. TCP/IP processing is done in hardware instead of in the operating system's network stack software, enabling reliable connection processing at speed and scale.
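
For readers who want to see what these mechanisms look like at the API level, the fragment below is a minimal sketch against the libibverbs interface that OFED provides: it registers an application buffer and posts a one-sided RDMA write. It is illustrative only; connection establishment (typically handled through librdmacm on iWARP adapters), completion polling, and full error handling are omitted, and the protection domain, queue pair, remote address, and remote key are assumed to have been set up and exchanged out of band.

    /* Illustrative libibverbs fragment: register a buffer and post an RDMA write.
     * Connection setup (librdmacm), completion polling, and cleanup are omitted;
     * pd, qp, remote_addr, and remote_rkey are assumed to exist already. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    int post_rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                        uint64_t remote_addr, uint32_t remote_rkey)
    {
        const size_t len = 4096;
        void *buf = malloc(len);              /* ordinary application buffer */
        memset(buf, 0, len);

        /* Pin and register the buffer so the adapter can DMA it directly,
         * eliminating intermediate copies through kernel socket buffers. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr) {
            free(buf);
            return -1;
        }

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };

        /* One-sided RDMA write: data moves from this buffer straight into the
         * remote application's registered buffer, without involving the remote
         * host's kernel or CPU in the data path. */
        struct ibv_send_wr wr = {
            .wr_id               = 1,
            .sg_list             = &sge,
            .num_sge             = 1,
            .opcode              = IBV_WR_RDMA_WRITE,
            .send_flags          = IBV_SEND_SIGNALED,
            .wr.rdma.remote_addr = remote_addr,
            .wr.rdma.rkey        = remote_rkey,
        };
        struct ibv_send_wr *bad_wr = NULL;

        /* The work request is handed to the adapter from user space (kernel
         * bypass); TCP/IP processing for the connection runs on the RNIC. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }

The fragment mirrors the list above: the buffer is registered once, the work request is posted from user space, and the payload moves directly between application buffers without per-message kernel involvement or intermediate copies.
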
Figure 1. The cluster consists of 504 Dell* PowerEdge* R610 servers (36 per rack, each with two quad-core processors), each uplinked over one 10GbE link (for RDMA) and one GbE link (for all other traffic) to one of two 48-port Arista* 7148SX rack switches; each rack switch has eight 10GbE fibre-optic uplinks (16 per rack) to the central Arista 7xxx network fabric.

Figure 2. iWARP improves throughput by reducing the overhead associated with kernel-to-user context switches, intermediate buffer copies, and TCP/IP processing.