White Papers

Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies Dell EMC | Infrastructure Solutions Group
4.1.2 Long Test
The long tests were run to measure throughput and the training time needed to reach
accuracy convergence. Each training run used 90 epochs. These tests were run using the
maximum number of GPUs supported by each server.
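To make the relationship between epochs, steps, and elapsed time concrete, the following is a minimal sketch. It assumes the ILSVRC2012 training set of 1,281,167 images; the global batch size and the sustained throughput passed in are hypothetical inputs, not values taken from these tests.

```python
import math

# Size of the ILSVRC2012 training set (number of images).
ILSVRC2012_TRAIN_IMAGES = 1_281_167


def total_steps(epochs: int, global_batch_size: int,
                dataset_size: int = ILSVRC2012_TRAIN_IMAGES) -> int:
    """Training steps needed to make `epochs` passes over the dataset."""
    # A partial final batch still costs one step, hence the ceiling.
    steps_per_epoch = math.ceil(dataset_size / global_batch_size)
    return epochs * steps_per_epoch


def estimated_training_hours(epochs: int, images_per_second: float,
                             dataset_size: int = ILSVRC2012_TRAIN_IMAGES) -> float:
    """Rough wall-clock estimate, assuming throughput stays constant."""
    total_images = epochs * dataset_size
    return total_images / images_per_second / 3600.0


if __name__ == "__main__":
    # Hypothetical inputs: global batch size 256, 1000 images/second.
    print(total_steps(epochs=90, global_batch_size=256))
    print(round(estimated_training_hours(epochs=90, images_per_second=1000.0), 2))
```

The estimate ignores evaluation, checkpointing, and input-pipeline stalls, so a measured long test will normally take somewhat longer than this figure.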
The setup is described below, and Table 1 gives an overview of the test configuration.
Use Case - The benchmark tests target image classification with convolutional neural network (CNN) models.
Benchmark code - TensorFlow Benchmarks scripts.
Hardware Configuration - Each server is configured based on the maximum number of GPUs it supports.
Servers - The servers tested are the PowerEdge R740, PowerEdge C4130, PowerEdge C4140, and a non-Dell EMC 8x NVLink GPU server.
Frameworks - TensorFlow for single-node training, and TensorFlow with the Horovod library for distributed training.
Performance - The performance metrics used for comparison across servers are throughput (images per second) and the training time to reach top-5 and top-1 accuracy.
Training tests - We conducted two types of tests. (1) Short tests: for each test, 10 warmup steps were run and the next 100 steps were averaged. (2) Long tests: run to obtain the training accuracy convergence and the elapsed training time.
Dataset - ILSVRC2012.
Software stack configuration - The benchmarks were run in a Docker container environment. See Table 1 for details.
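The top-1 and top-5 accuracy metrics named above can be illustrated with a short sketch. This is a generic, framework-independent definition written in plain Python, not the code used by the benchmark scripts; the function name and the sample scores are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Three samples over three classes (illustrative values only).
    sample_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    sample_labels = [1, 1, 0]
    print(top_k_accuracy(sample_scores, sample_labels, k=1))
    print(top_k_accuracy(sample_scores, sample_labels, k=2))
```

With k=1 this is the usual classification accuracy; with k=5 a prediction counts as correct when the true class is anywhere in the five highest-scoring classes.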
4.2 Throughput Testing
Workload application and model    Image classification with convolutional neural network models
Benchmarks code                   TensorFlow Benchmarks scripts
Servers, Single Node              PowerEdge R740
                                  PowerEdge C4140
                                  PowerEdge C4140
                                  Non-Dell EMC 8x NVLink server
Servers, Multi Node               PowerEdge C4140-K
(2 nodes, 4 GPUs each)            PowerEdge C4140-K
                                  PowerEdge C4140-M
Frameworks                        TensorFlow for Single Mode
                                  TensorFlow with Horovod library for Distributed Mode
Performance Metrics               Throughput (images/second)
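As an illustration of how a throughput figure in images/second can be derived from a short test (10 warmup steps discarded, the next 100 steps averaged), the following is a minimal sketch. It assumes per-step wall-clock times have been recorded and computes the average as total images divided by total time over the measured window; the benchmark scripts may aggregate differently. The function name and the per-step batch size are hypothetical.

```python
from typing import Sequence


def average_throughput(step_seconds: Sequence[float], batch_size: int,
                       warmup_steps: int = 10,
                       measured_steps: int = 100) -> float:
    """Images/second over the measured window, ignoring the warmup steps."""
    # Skip the warmup steps, then keep exactly `measured_steps` timings.
    window = step_seconds[warmup_steps:warmup_steps + measured_steps]
    if len(window) < measured_steps:
        raise ValueError("not enough timed steps after warmup")
    # batch_size is the number of images processed per step across all GPUs.
    return batch_size * len(window) / sum(window)


if __name__ == "__main__":
    # Illustrative timings: slow warmup steps followed by steady-state steps.
    timings = [1.0] * 10 + [0.5] * 100
    print(average_throughput(timings, batch_size=64))
```

Discarding the warmup steps keeps one-time costs such as graph construction and memory allocation out of the reported average.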