8 Conclusion and Future Work
The PowerEdge C4140 with Nvidia's 4x NVLink architecture scales relatively well when using the Uber Horovod distributed training library and Mellanox InfiniBand RDMA as the high-speed link between nodes.
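For reference, the sketch below illustrates the general Horovod data-parallel pattern behind this kind of multi-node training with TensorFlow: one process per GPU, the learning rate scaled by the number of workers, and gradients averaged with an allreduce that NCCL can carry over InfiniBand RDMA between nodes. It is a minimal illustration of the approach, not the benchmark script or hyperparameters used for the results in this paper; the model, dataset, and values shown are placeholders.

# Minimal Horovod + TensorFlow (Keras) sketch of data-parallel training.
# Illustrative only; model, dataset, and hyperparameters are placeholders.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU

# Pin each process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Scale the learning rate by the number of workers (common Horovod practice).
opt = tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9)

# Wrap the optimizer so gradients are averaged with NCCL allreduce
# (NVLink/PCIe within a node, InfiniBand RDMA between nodes).
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Tiny synthetic batch to keep the sketch self-contained; a real run would
# use a sharded ImageNet input pipeline.
images = tf.random.uniform((32, 224, 224, 3))
labels = tf.random.uniform((32,), minval=0, maxval=1000, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)

model.fit(dataset, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

Such a script is typically launched with one process per GPU across all nodes, for example with horovodrun or mpirun; NCCL detects and uses the InfiniBand transport automatically when RDMA-capable adapters are present.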
Table 5 shows that the PowerEdge C4140 in a multi-node configuration comes within 7.8% of a single-node, non-Dell EMC 8x-NVLink system on the most widely used model, ResNet-50. Moreover, the C4140-M in a multi-node configuration outperforms the single-node 8x-NVLink system by at least 18% on ResNet-50. The only caveat is that the C4140-M results were obtained with the latest versions of the NCCL and TensorFlow containers.
Performance improvements are continuously being added at the GPU, library, and framework levels. We are also continuously looking at how we can improve our performance results by experimenting with different hyperparameters.
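As an illustration of the kinds of knobs involved, the sketch below shows a few settings commonly adjusted in Horovod-based distributed training: the per-GPU batch size, linear learning-rate scaling with the number of workers, and fp16 gradient compression to reduce allreduce traffic on the interconnect. These are standard Horovod options shown as examples, not necessarily the specific settings we will evaluate.

# Illustrative tuning knobs for Horovod data-parallel training; values are
# examples only, not recommendations from this paper's experiments.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Per-GPU batch size: larger batches generally improve GPU utilization but
# require more careful learning-rate tuning.
per_gpu_batch_size = 256

# Linear learning-rate scaling with the number of workers is a common
# starting point for large-batch distributed training.
base_lr = 0.1 * (per_gpu_batch_size / 256)
opt = tf.keras.optimizers.SGD(learning_rate=base_lr * hvd.size(), momentum=0.9)

# fp16 gradient compression roughly halves the allreduce traffic that crosses
# the inter-node link.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)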
Some of our future work in this area will explore the latest software optimizations released by Nvidia, as well as the fast.ai library, with which Jeremy Howard and the researchers at fast.ai achieved a ResNet-50 training time of 3 hours on 8x V100 GPUs.