the Isilon with applications installed on the local NFS. The performance comparison between Isilon and other
storage solutions is shown in Section 3.1.6. The specifications of the Isilon F800 are listed in Table 4.
Table 4: Specification of Isilon F800 (attributes: Storage, External storage, Bandwidth, IOPS, Chassis Capacity (4 RU), Cluster Capacity, Network)
Before training deep learning models, a user who needs to move a very large dataset from outside the cluster
described in Section 2 onto Isilon can connect the server that stores the data to the FDR-40GigE gateway shown
in Figure 2, so the data can be copied onto Isilon without routing it through the head node.
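As an illustration of this data-staging step, the following Python sketch copies a dataset tree from a staging
server onto an Isilon NFS export mounted on that server. The source path, mount point (/mnt/isilon/datasets),
and number of parallel copy streams are assumptions for illustration only, not part of the validated configuration.

# Minimal sketch: stage a dataset onto an Isilon NFS export mounted on the
# staging server. Paths and worker count below are hypothetical.
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SRC = Path("/data/imagenet")           # hypothetical dataset on the staging server
DST = Path("/mnt/isilon/datasets")     # hypothetical Isilon NFS mount point

def copy_one(src_file: Path) -> None:
    # Copy a single file, preserving the directory layout under DST.
    rel = src_file.relative_to(SRC)
    target = DST / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_file, target)

def main() -> None:
    files = [p for p in SRC.rglob("*") if p.is_file()]
    # A few concurrent streams help keep the 40GigE link to Isilon busy.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(copy_one, files))

if __name__ == "__main__":
    main()

Using several copy streams in parallel is simply one way to keep the gateway link utilized; a single large stream
or a dedicated copy tool would work just as well.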
The performance and file system activity of Isilon storage can be monitored and analyzed with the InsightIQ tool.
InsightIQ allows a user to monitor and analyze Isilon storage cluster activity using standard reports in the
InsightIQ web-based application, and these reports can be customized to provide information about storage
cluster hardware, software, and protocol operations. InsightIQ transforms data into visual information that
highlights performance outliers and helps users diagnose bottlenecks and optimize workflows. In Section 3.1.5,
InsightIQ was used to collect the average disk operation size, disk read IOPS, and file system throughput when
running deep learning models. For more details about InsightIQ, refer to the Isilon InsightIQ User Guide.
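As a simple illustration of how such measurements could be post-processed, the Python sketch below averages
disk read IOPS and file system throughput from an InsightIQ performance report exported as CSV. The file name
and the column headers used here are hypothetical and would need to be adjusted to the actual export format.

# Minimal sketch: summarize a hypothetical InsightIQ CSV export.
import csv
from statistics import mean

def summarize(report_path: str) -> None:
    read_iops, throughput = [], []
    with open(report_path, newline="") as f:
        for row in csv.DictReader(f):
            # Column names are assumptions; adjust to the real report headers.
            read_iops.append(float(row["disk_read_iops"]))
            throughput.append(float(row["fs_throughput_bytes"]))
    print(f"average disk read IOPS: {mean(read_iops):,.0f}")
    print(f"average file system throughput: {mean(throughput) / 1e9:,.2f} GB/s")

if __name__ == "__main__":
    summarize("insightiq_report.csv")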
2.6 Network
The solution comprises three network fabrics. The head node and all compute nodes are connected with a
1 Gigabit Ethernet fabric. The recommended Ethernet switch for this fabric is the Dell Networking S3048-ON,
which has 48 ports. This connection is used primarily by Bright Cluster Manager for deployment, maintenance,
and monitoring of the solution.
The second fabric connects the head node and all compute nodes through 100 Gb/s EDR InfiniBand. The
EDR InfiniBand switch is the Mellanox SB7800, which has 36 ports. This fabric is used for IPC by the applications
as well as to serve NFS from the head node (IPoIB) and Isilon. GPU-to-GPU communication across servers
can use a technique called GPUDirect Remote Direct Memory Access (RDMA), which is enabled by InfiniBand.
This enables GPUs to communicate directly without the involvement of CPUs. Without GPUDirect, when GPUs
across servers need to communicate, the GPU in one node has to copy data from its GPU memory to system
memory, then that data is sent to the system memory of another node over the network, and finally the data is
copied from the system memory of the second node to the receiving GPU memory. With GPUDirect however,
the GPU on one node can send the data directly from its GPU memory to the GPU memory in another node,
without going through the system memory of either node. Therefore, GPUDirect RDMA significantly decreases
GPU-to-GPU communication latency.
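As a hedged illustration of cross-node GPU communication over this fabric, the Python sketch below performs
a multi-node all-reduce with PyTorch's NCCL backend, which can take advantage of GPUDirect RDMA over
InfiniBand when it is available. The script name, launch command, and tensor size are assumptions for
illustration; setting NCCL_DEBUG=INFO is an optional way to check in the NCCL logs which transport was
selected.

# Minimal sketch: cross-node GPU-to-GPU all-reduce using the NCCL backend.
# Launch one copy per node with torchrun, for example:
#   torchrun --nnodes=2 --nproc_per_node=4 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 allreduce_check.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")      # NCCL uses the InfiniBand fabric
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor that lives in GPU memory; with GPUDirect
    # RDMA the transfer can bypass host memory on both ends.
    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    expected = sum(range(dist.get_world_size()))
    assert torch.allclose(x, torch.full_like(x, float(expected)))
    if dist.get_rank() == 0:
        print(f"all-reduce across {dist.get_world_size()} GPUs OK")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()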