Reference Guide

Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA Architecture Guide | v1.0
Table of Contents
Revisions
Table of Contents
Executive summary
1 Solution Overview
2 Solution Architecture
2.1 Head Node Configuration
2.1.1 Shared Storage via NFS over InfiniBand
2.2 Compute Node Configuration
2.2.1 GPU
2.3 Processor recommendation for Head Node and Compute Nodes
2.4 Memory recommendation for Head Node and Compute Nodes
2.5 Isilon Storage
2.6 Network
2.7 Software
3 Deep Learning Training and Inference Performance and Analysis
3.1 Deep Learning Training
3.1.1 FP16 vs FP32
3.1.2 V100 vs P100
3.1.3 V100-SXM2 vs V100-PCIe
3.1.4 Scaling Performance with Multi-GPU
3.1.5 Storage Performance
3.2 Deep Learning Inference
3.3 NVIDIA DIGITS Tool and the Deep Learning Solution
4 Containers for Deep Learning
4.1 Singularity Containers
4.2 Running NVIDIA GPU Cloud with the Ready Solutions for AI - Deep Learning
5 The Data Scientist Portal
5.1 Creating and Running a Notebook
5.2 Tensorboard Integration
5.3 Slurm Scheduler
6 Conclusions and Future Work