Table 5: The hardware and software in the testbed

Hardware - Head node
Cluster head node: PowerEdge R740xd
CPU: 2 x Intel Xeon 6148 @ 2.4GHz
Memory: 384GB DDR4 @ 2667MT/s
Disks on head node: 12 x 12TB Near-line SAS drives in a RAID 50 volume; 120TB volume formatted as XFS, exported via NFS

Hardware - Compute node
Cluster compute node: PowerEdge C4140
Number of compute nodes: 8 nodes with V100-PCIe and 2 nodes with V100-SXM2
CPU: 2 x Intel Xeon 6148 @ 2.4GHz
Memory: 384GB DDR4 @ 2667MT/s
Disks: 2 x M.2 240GB in RAID 1
GPU: V100-SXM2, V100-PCIe

Software and Firmware
Operating System: Red Hat Enterprise Linux 7.4
Linux Kernel: 3.10.0-693.el7.x86_64
BIOS: 1.1.6
CUDA compiler and GPU driver: CUDA 9.1.85 (driver 390.46)
Python: 2.7.5

Deep Learning Datasets
Dataset for training: ILSVRC2012 training dataset, 1,281,167 images
Dataset for inference: ILSVRC2012 validation dataset, 50,000 images

Deep Learning Libraries and Frameworks
cuDNN: 7.0
NCCL: 2.1.15
Horovod: 0.12.1
TensorFlow: 1.8
MXNet: 0.11.1
Caffe2: 0.8.1+
TensorRT: 4.0.0.3
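
Before running benchmarks, the software stack on each compute node can be checked against Table 5. The short Python sketch below is only an illustration (it is not part of the reference configuration); it queries TensorFlow for its version and CUDA build and lists the GPUs it can see, with the values expected for this testbed noted in the comments.

# Illustrative check of the deep learning stack on a compute node.
# Assumes TensorFlow 1.8 is installed as listed in Table 5.
from __future__ import print_function  # keeps the script valid on Python 2.7
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow version:", tf.VERSION)                     # expected: 1.8.x
print("Built with CUDA support:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())          # True on a V100 node

# List the GPU devices TensorFlow can see (should show the V100 cards).
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print("Visible GPUs:", gpus)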
3.1.1 FP16 vs FP32
The V100 GPU contains a new type of processing core called Tensor Cores, which support mixed precision
training. Although many High Performance Computing (HPC) applications require high precision computation
with FP32 (32-bit floating point) or FP64 (64-bit floating point), Deep Learning researchers have found that they
can achieve the same inference accuracy with FP16 (16-bit floating point) as with FP32. In this