HP Fabric Clustering System for InfiniBand™ Topologies

Contents
Document overview
Document structure
General InfiniBand Concepts and Definitions
Methodology
Document overview
This document describes the results of testing various HP Fabric Clustering System for InfiniBand system and switch configurations. Its intent is to give customers the information they need to determine which switch topology best meets their budget and performance goals.

Document structure
This document is organized as follows: Section 1 contains the overall purpose and organization of this paper.
process per node and two processes per node. The key representative tests we have examined from the PMB suite are as follows:

• Single Transfer Benchmarks
  o PingPong
  o PingPing
• Parallel Transfer Benchmarks
  o SendRecv
  o Exchange
• Collective Benchmarks
  o Reduce (N-to-1)
  o Alltoall (N-to-N)
  o Bcast (1-to-N)
  o Barrier

Our intent has been to yield recommendations for cluster size (i.e., number of compute nodes) and switch configurations, specifically when using AB291A 12-port switches as building blocks.
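For reference, the figures PMB reports for the single-transfer tests can be derived from first principles: the benchmark times a full round trip, reports half of it as the latency, and divides the message size by that one-way time to get bandwidth. The sketch below is a hypothetical helper illustrating that arithmetic, not PMB source code; the 2**20-byte Mbyte convention is an assumption.

```python
def pingpong_metrics(msg_bytes: int, round_trip_usec: float):
    """Derive the latency and bandwidth figures a PingPong-style
    benchmark reports from one timed round trip.

    Latency is half the round-trip time; bandwidth is the message
    size divided by that one-way time, in Mbytes/sec (assuming
    1 Mbyte = 2**20 bytes).
    """
    latency_usec = round_trip_usec / 2.0
    seconds = latency_usec * 1e-6
    bandwidth_mb_s = (msg_bytes / seconds) / 2**20 if seconds > 0 else 0.0
    return latency_usec, bandwidth_mb_s

# Example: a 1-Mbyte message with a 2800 usec round trip
lat, bw = pingpong_metrics(2**20, 2800.0)
```

This is why the latency and bandwidth curves in the graphs that follow are two views of the same measurement.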
Fabric Clustering System Configurations and Results

12-node cluster

TS170 configuration
The following configuration is used to baseline the TS170 performance versus the AB291A performance.
AB291A configuration
The following configuration is used to build the 12-node comparison cluster using only a single AB291A switch:

[Diagram: twelve rx2600 servers, each connected through a PCI-X InfiniBand adapter to a single AB291A 12-port switch with console and management LAN connections]
Results
Each of the following graphs shows data from a single-process PMB test run on both of the above configurations.

PingPong
[Graph: PMB PingPong; latency (usec) and bandwidth (Mbytes/sec) vs. message size (bytes); AB291A vs. TS170]
PingPing
[Graph: PMB PingPing; latency (usec) and bandwidth (Mbytes/sec) vs. message size (bytes); AB291A vs. TS170]
Exchange
[Graph: PMB Exchange (12 procs); latency (usec) and bandwidth (Mbytes/sec) vs. message size (bytes); AB291A vs. TS170]

Reduce
[Graph: PMB Reduce (12 procs); latency (usec) vs. message size (bytes); AB291A vs. TS170]
Alltoall
[Graph: PMB Alltoall (12 procs); latency (usec) vs. message size (bytes); AB291A vs. TS170]

Bcast
[Graph: PMB Bcast (12 procs); latency (usec) vs. message size (bytes); AB291A vs. TS170]
Barrier
[Graph: PMB Barrier; latency (usec) vs. number of processes (up to 12); AB291A vs. TS170]

Summary
Performance of the 12-port AB291A is equivalent to or better than that of a single TS170 12-port line card. This is expected, since the Clos-based TS170 requires more switch hops per message than the non-Clos AB291A architecture.
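The hop-count argument above can be illustrated with a toy latency model. The numbers below are placeholders, not measured values; only the hop counts (one crossbar hop through the AB291A, three hops for a worst-case leaf-spine-leaf path through a Clos fabric) follow from the two architectures.

```python
def path_latency(base_usec: float, hops: int, per_hop_usec: float) -> float:
    """Toy model: end-to-end latency = endpoint overhead + per-hop switch cost.
    All numeric inputs here are illustrative assumptions."""
    return base_usec + hops * per_hop_usec

# A single AB291A is one switch hop between any two of its 12 ports.
# A Clos fabric built from small switch elements needs a
# leaf-spine-leaf path (3 hops) when traffic crosses leaf elements.
single_switch = path_latency(base_usec=5.0, hops=1, per_hop_usec=0.2)
clos_worst = path_latency(base_usec=5.0, hops=3, per_hop_usec=0.2)
```

Whatever the actual per-hop cost, the single-crossbar path can never be slower than the multi-stage path in this model, which matches the measured results.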
[Diagram: AB291A switch (console, mgmt-eth, ports 1 through 12) connected to rx2600 servers via their PCI-X InfiniBand adapters]
AB291A configuration
The following configuration is used to construct an 18-node cluster from two cascaded AB291A 12-port switches:

[Diagram: eighteen rx2600 servers split across two cascaded AB291A 12-port switches]
Detail of switch interconnection
The two cascaded AB291A switches should be interconnected as detailed below:

[Diagram: two AB291A switches (console, mgmt-eth, ports 1 through 12) joined by three cables: Port 1 to Port 7, Port 3 to Port 9, and Port 5 to Port 11]
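With three cascade cables in place, each switch retains nine ports for hosts, which is where the 18-node figure comes from. The sketch below is that arithmetic only; the "oversubscription" ratio is a derived metric for traffic crossing the cascade, not a switch parameter.

```python
def cascade_capacity(ports_per_switch: int, cascade_links: int, switches: int = 2):
    """Node capacity and host-to-cascade ratio for crossbar switches
    joined by `cascade_links` inter-switch cables."""
    host_ports = ports_per_switch - cascade_links   # host ports per switch
    nodes = host_ports * switches                   # total compute nodes
    oversubscription = host_ports / cascade_links   # hosts contending per cascade link
    return nodes, oversubscription

# Two AB291A 12-port switches with three cascade cables
nodes, ratio = cascade_capacity(ports_per_switch=12, cascade_links=3)
```

Note the trade-off: using more ports for cascade cables lowers the oversubscription ratio but reduces the number of hosts the pair of switches can serve.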
Results
Each of the following graphs shows data from a single-process PMB test run on both of the above configurations.

PingPong
[Graph: PMB PingPong; latency (usec) and bandwidth (Mbytes/sec) vs. message size (bytes); AB291A vs. TS170]
SendRecv
[Graph: PMB SendRecv (18 procs); latency (usec) and bandwidth (Mbytes/sec) vs. message size (bytes); AB291A vs. TS170]
Reduce
[Graph: PMB Reduce (18 procs); latency (usec) vs. message size (bytes); AB291A vs. TS170]

Alltoall
[Graph: PMB Alltoall (18 procs); latency (usec) vs. message size (bytes); AB291A vs. TS170]
Bcast
[Graph: PMB Bcast (18 procs); latency (usec) vs. message size (bytes); AB291A vs. TS170]
Summary
Performance of the 18-node cluster built from two cascaded AB291A 12-port switches is equivalent to or slightly better than that of the TS170. HP has run numerous other High Performance Technical Computing applications on the same two architectures to look for flaws in the assumptions of this methodology. In all cases tested, the cascaded AB291A switches performed comparably to the TS170 for 18-node clusters.
For more information
Refer to the whitepapers on InfiniBand found at docs.hp.com.
www.hp.com/go/mpi
www.infinibandta.