the Isilon with applications installed on the local NFS. The performance comparison between Isilon and other
storage solutions is shown in Section 3.1.6. The specifications of the Isilon F800 are listed in Table 4.
Table 4: Specification of Isilon F800 (attributes: Storage, External storage, Bandwidth, IOPS, Chassis Capacity (4 RU), Cluster Capacity, Network)
Before training deep learning models, a user who needs to move a very large dataset from outside the cluster
described in Section 2 onto Isilon can connect the server that stores the data to the FDR-40GigE gateway shown
in Figure 2, so the data can be copied onto Isilon without routing it through the head node.
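As an illustration of this data-staging step, the following Python sketch copies a dataset tree from a staging
server onto an Isilon NFS export mounted on that server. The source path, mount point (/mnt/isilon/datasets),
and number of parallel copy streams are assumptions for illustration only, not part of the validated configuration.

# Minimal sketch: stage a dataset onto an Isilon NFS export mounted on the
# staging server. Paths and worker count below are hypothetical.
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SRC = Path("/data/imagenet")           # hypothetical dataset on the staging server
DST = Path("/mnt/isilon/datasets")     # hypothetical Isilon NFS mount point

def copy_one(src_file: Path) -> None:
    # Copy a single file, preserving the directory layout under DST.
    rel = src_file.relative_to(SRC)
    target = DST / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_file, target)

def main() -> None:
    files = [p for p in SRC.rglob("*") if p.is_file()]
    # A few concurrent streams help keep the 40GigE link to Isilon busy.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(copy_one, files))

if __name__ == "__main__":
    main()

Using several copy streams in parallel is simply one way to keep the gateway link utilized; a single large stream
or a dedicated copy tool would work just as well.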
The performance and file system activity of Isilon storage can be monitored and analyzed with the InsightIQ tool.
InsightIQ allows a user to monitor and analyze Isilon storage cluster activity using standard reports in the
InsightIQ web-based application, and these reports can be customized to provide information about storage
cluster hardware, software, and protocol operations. InsightIQ transforms data into visual information that
highlights performance outliers and helps users diagnose bottlenecks and optimize workflows. In Section 3.1.5,
InsightIQ was used to collect the average disk operation size, disk read IOPS, and file system throughput when
running deep learning models. For more details about InsightIQ, refer to the Isilon InsightIQ User Guide.
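As a simple illustration of how such measurements could be post-processed, the Python sketch below averages
disk read IOPS and file system throughput from an InsightIQ performance report exported as CSV. The file name
and the column headers used here are hypothetical and would need to be adjusted to the actual export format.

# Minimal sketch: summarize a hypothetical InsightIQ CSV export.
import csv
from statistics import mean

def summarize(report_path: str) -> None:
    read_iops, throughput = [], []
    with open(report_path, newline="") as f:
        for row in csv.DictReader(f):
            # Column names are assumptions; adjust to the real report headers.
            read_iops.append(float(row["disk_read_iops"]))
            throughput.append(float(row["fs_throughput_bytes"]))
    print(f"average disk read IOPS: {mean(read_iops):,.0f}")
    print(f"average file system throughput: {mean(throughput) / 1e9:,.2f} GB/s")

if __name__ == "__main__":
    summarize("insightiq_report.csv")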
2.6 Network
The solution comprises three network fabrics. The head node and all compute nodes are connected with a
1 Gigabit Ethernet fabric. The recommended Ethernet switch for this fabric is the Dell Networking S3048-ON,
which has 48 ports. This connection is used primarily by Bright Cluster Manager for deployment, maintenance,
and monitoring of the solution.
The second fabric connects the head node and all compute nodes through 100 Gb/s EDR InfiniBand. The
EDR InfiniBand switch is the Mellanox SB7800, which has 36 ports. This fabric is used for IPC by the applications
as well as to serve NFS from the head node (IPoIB) and Isilon. GPU-to-GPU communication across servers
can use a technique called GPUDirect Remote Direct Memory Access (RDMA), which is enabled by InfiniBand.
This enables GPUs to communicate directly without the involvement of CPUs. Without GPUDirect, when GPUs
across servers need to communicate, the GPU in one node has to copy data from its GPU memory to system
memory, then that data is sent to the system memory of another node over the network, and finally the data is
copied from the system memory of the second node to the receiving GPU memory. With GPUDirect however,
the GPU on one node can send the data directly from its GPU memory to the GPU memory in another node,
without going through the system memory of either node. Therefore, GPUDirect RDMA significantly decreases
GPU-to-GPU communication latency.
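As a hedged illustration of cross-node GPU communication over this fabric, the Python sketch below performs
a multi-node all-reduce with PyTorch's NCCL backend, which can take advantage of GPUDirect RDMA over
InfiniBand when it is available. The script name, launch command, and tensor size are assumptions for
illustration; setting NCCL_DEBUG=INFO is an optional way to check in the NCCL logs which transport was
selected.

# Minimal sketch: cross-node GPU-to-GPU all-reduce using the NCCL backend.
# Launch one copy per node with torchrun, for example:
#   torchrun --nnodes=2 --nproc_per_node=4 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 allreduce_check.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")      # NCCL uses the InfiniBand fabric
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor that lives in GPU memory; with GPUDirect
    # RDMA the transfer can bypass host memory on both ends.
    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    expected = sum(range(dist.get_world_size()))
    assert torch.allclose(x, torch.full_like(x, float(expected)))
    if dist.get_rank() == 0:
        print(f"all-reduce across {dist.get_world_size()} GPUs OK")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()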