8 Conclusion and Future Work
The PowerEdge C4140 with Nvidia's 4x NVLink architecture scales relatively well when using the Uber Horovod distributed training library and Mellanox InfiniBand RDMA as the high-speed link between nodes.
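For reference, the sketch below illustrates the general Horovod data-parallel pattern behind this kind of multi-node training with TensorFlow: one process per GPU, the learning rate scaled by the number of workers, and gradients averaged with an allreduce that NCCL can carry over InfiniBand RDMA between nodes. It is a minimal illustration of the approach, not the benchmark script or hyperparameters used for the results in this paper; the model, dataset, and values shown are placeholders.

# Minimal Horovod + TensorFlow (Keras) sketch of data-parallel training.
# Illustrative only; model, dataset, and hyperparameters are placeholders.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU

# Pin each process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Scale the learning rate by the number of workers (common Horovod practice).
opt = tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9)

# Wrap the optimizer so gradients are averaged with NCCL allreduce
# (NVLink/PCIe within a node, InfiniBand RDMA between nodes).
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Tiny synthetic batch to keep the sketch self-contained; a real run would
# use a sharded ImageNet input pipeline.
images = tf.random.uniform((32, 224, 224, 3))
labels = tf.random.uniform((32,), minval=0, maxval=1000, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)

model.fit(dataset, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

Such a script is typically launched with one process per GPU across all nodes, for example with horovodrun or mpirun; NCCL detects and uses the InfiniBand transport automatically when RDMA-capable adapters are present.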
Table 5 shows that the PowerEdge C4140 in a multi-node configuration comes within 7.8% of a single-node, non-Dell EMC 8x-NVLink system on the most widely used model, ResNet-50. Moreover, the C4140-M in a multi-node configuration outperforms the single-node 8x-NVLink system by at least 18% on ResNet-50. The only caveat is that the C4140-M results were obtained with the latest versions of the NCCL and TensorFlow containers.
Performance improvements are continuously being added at the GPU, library, and framework levels. We are also continuously looking at how we can improve our performance results by experimenting with different hyperparameters.
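As an illustration of the kinds of knobs involved, the sketch below shows a few settings commonly adjusted in Horovod-based distributed training: the per-GPU batch size, linear learning-rate scaling with the number of workers, and fp16 gradient compression to reduce allreduce traffic on the interconnect. These are standard Horovod options shown as examples, not necessarily the specific settings we will evaluate.

# Illustrative tuning knobs for Horovod data-parallel training; values are
# examples only, not recommendations from this paper's experiments.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Per-GPU batch size: larger batches generally improve GPU utilization but
# require more careful learning-rate tuning.
per_gpu_batch_size = 256

# Linear learning-rate scaling with the number of workers is a common
# starting point for large-batch distributed training.
base_lr = 0.1 * (per_gpu_batch_size / 256)
opt = tf.keras.optimizers.SGD(learning_rate=base_lr * hvd.size(), momentum=0.9)

# fp16 gradient compression roughly halves the allreduce traffic that crosses
# the inter-node link.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)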
Some of our future work in this area will explore the latest software optimizations released by Nvidia, as well as the fast.ai library, with which Jeremy Howard and the researchers at fast.ai achieved a ResNet-50 training time of 3 hours on 8x V100 GPUs.