
Ready Solutions Engineering Test Results
Deep Learning Performance with Intel® Caffe
Training, CPU model choice and Scalability
Authors: Alex Filby and Nishanth Dandapanthula.
HPC Engineering, HPC Innovation Lab, March 2018
Getting the most out of deep learning technologies requires careful attention to both hardware and software. There are a myriad of choices for compute, storage and networking. The software component does not stop at choosing a framework; a particular model has many parameters that can be tuned to alter performance. The Dell EMC Deep Learning Ready Bundle with Intel provides a complete solution with tuned hardware and software. This blog covers some of the benchmarks and results that influenced the design. Specifically, we studied training performance across different generations of servers and CPUs, and the scalability of Intel Caffe to hundreds of servers.
Introduction to Intel® Caffe and Testing Methodology
Intel Caffe is a fork of BVLC (Berkeley Vision and Learning Center) Caffe maintained by Intel. The goal of the fork is to provide architecture-specific optimizations for Intel CPUs (Broadwell, Skylake, Knights Landing, etc.). In addition to the framework optimizations, "Intel optimized" models are included with the code. These take popular models such as Alexnet, Googlenet and Resnet-50 and tweak their hyperparameters to provide increased performance and accuracy on Intel systems for both single-node and multi-node runs. These models are frequently updated as the state of the art advances.
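As a point of reference, training with one of the bundled model definitions follows the standard Caffe workflow. The sketch below uses the pycaffe interface in CPU mode; the solver path shown is illustrative, since the exact location of the Intel optimized Resnet-50 solver may differ between Intel Caffe releases.

```python
import caffe

# CPU-only training, which is the mode these optimizations target
caffe.set_mode_cpu()

# Illustrative path: point this at the Intel optimized Resnet-50 solver
# shipped in the Intel Caffe source tree.
solver = caffe.SGDSolver('models/intel_optimized_models/resnet_50/solver.prototxt')

# Run the full training schedule defined in the solver prototxt
solver.solve()
```

Multi-node runs are launched differently (Intel Caffe uses an MPI-based launcher across nodes), but the model and solver definitions are the same.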
For these tests we chose the Resnet-50 model because it is widely available across frameworks, which makes comparisons easy, and because it is more computationally intensive than other common models. Resnet is short for Residual Network, which strives to make deeper networks easier to train and more accurate by having layers learn a residual function with respect to their inputs rather than the full underlying mapping. This is accomplished by adding "skip connections" that pass the output of earlier layers directly to later layers, skipping over some number of intervening layers. The two outputs are then added together element-wise and passed through a nonlinearity (activation) function.
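As a rough illustration of the skip-connection idea described above (this is not Intel Caffe code; the dimensions and weights are made up for the sketch), a basic residual block can be written as follows:

```python
import numpy as np

def relu(x):
    # ReLU nonlinearity (activation), applied element-wise
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Minimal sketch of a residual block.

    The stacked layers learn a residual F(x); the skip connection
    carries x around them, and the two are added element-wise
    before the final activation.
    """
    f = relu(x @ w1)    # first intervening layer
    f = f @ w2          # second intervening layer: residual branch F(x)
    return relu(f + x)  # element-wise add with the skip connection, then activation

# Purely illustrative usage with arbitrary sizes
x = np.random.randn(8, 64)             # batch of 8 activations of width 64
w1 = np.random.randn(64, 64) * 0.01
w2 = np.random.randn(64, 64) * 0.01
y = residual_block(x, w1, w2)
```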
Table 1. Hardware configuration for Skylake, Knights Landing and Broadwell nodes
| | SKL | KNL | BDW |
|---|---|---|---|
| Platform | Single node tests on PowerEdge C6420, R740, R640; cluster tests on PowerEdge C6420 | PowerEdge C6320p | PowerEdge C6320 |
| CPU | Multiple CPU models (see results) | Intel Xeon Phi 7230 | Intel Xeon E5-2697 v4 |
| RAM | 192 GB DDR4 @ 2666 MT/s | 96 GB DDR4 @ 2400 MT/s | 128 GB DDR4 @ 2400 MT/s |
| Interconnect | Intel® Omni-Path | Intel® Omni-Path | Intel® Omni-Path |
| Memory Mode (KNL only) | N/A | Cache | N/A |
