Deep Learning Performance with P100 GPUs
Authors: Rengan Xu and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. October 2016
Introduction to Deep Learning and P100 GPU
Deep Learning (DL), an area of Machine Learning, has made significant progress in recent years. Its
application areas include pattern recognition, image classification, Natural Language Processing (NLP),
autonomous driving and so on. Deep learning attempts to learn multiple levels of features from large
input data sets with multi-layer neural networks, and then to make predictive decisions on new data. This
implies two phases in deep learning: first, the neural network is trained with a large number of input
samples; second, the trained neural network is used for testing, inference or prediction on new data. Due
to the large number of parameters (the weight matrices connecting neurons in different layers, the bias in
each layer, etc.) and the size of the training set, the training phase requires a tremendous amount of
computational power.
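
As a rough illustration of these two phases, the following NumPy sketch (toy synthetic data and a
single-layer classifier, not one of the frameworks benchmarked in this blog) trains a set of parameters
and then runs inference on new samples:

```python
# Minimal sketch of the two deep learning phases: (1) training, which repeatedly
# updates the weight matrix W and bias b from labeled data, and (2) inference,
# which applies the trained parameters to new, unseen data.
# Toy synthetic data only; the real workloads in this blog use much larger models.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 64)                      # 1000 training samples, 64 features
true_w = rng.randn(64, 1)
y = (X @ true_w > 0).astype(np.float64)      # synthetic binary labels

W = np.zeros((64, 1))                        # parameters learned during training
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Phase 1: training -- the computationally expensive part that GPUs accelerate
for epoch in range(100):
    p = sigmoid(X @ W + b)                   # forward pass
    grad_w = X.T @ (p - y) / len(X)          # gradient of the loss w.r.t. W
    grad_b = np.mean(p - y)                  # gradient of the loss w.r.t. b
    W -= lr * grad_w                         # parameter update
    b -= lr * grad_b

# Phase 2: inference -- apply the trained parameters to new data
X_new = rng.randn(5, 64)
predictions = sigmoid(X_new @ W + b) > 0.5
print(predictions.ravel())
```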
To approach this problem, we utilize accelerators such as GPUs, FPGAs and DSPs. This blog focuses on
GPU accelerators. The GPU is a massively parallel architecture that employs thousands of small but
efficient cores to accelerate computationally intensive tasks. In particular, the NVIDIA® Tesla® P100 GPU
uses the new Pascal architecture to deliver very high performance for HPC and hyperscale workloads. In
PCIe-based servers, the P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision
performance, respectively; in NVLink-optimized servers, it delivers around 5.3 and 10.6 TeraFLOPS,
respectively. This blog focuses on the P100 for PCIe-based servers. The P100 is also equipped with High
Bandwidth Memory 2 (HBM2), which offers higher bandwidth than traditional GDDR5 memory. The high
compute capability and high memory bandwidth together make the GPU an ideal candidate for
accelerating deep learning applications.
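
These peak figures follow directly from the card's core count and clock rate. As a back-of-the-envelope
check, the sketch below reproduces the arithmetic, assuming the publicly listed specifications (3584 FP32
CUDA cores, boost clocks of roughly 1303 MHz for the PCIe card and 1480 MHz for the NVLink card, and
FP64 throughput at half the FP32 rate on GP100); these clock figures are assumptions not stated in this
blog:

```python
# Back-of-the-envelope peak throughput for the P100, assuming the published
# specifications: 3584 FP32 CUDA cores, 2 FLOPs per core per cycle (one fused
# multiply-add), and FP64 throughput at half the FP32 rate on Pascal GP100.
cuda_cores = 3584
flops_per_cycle = 2

for name, boost_ghz in [("PCIe", 1.303), ("NVLink", 1.480)]:
    fp32 = cuda_cores * flops_per_cycle * boost_ghz / 1e3   # TFLOPS
    fp64 = fp32 / 2
    print(f"{name}: {fp64:.1f} TFLOPS FP64, {fp32:.1f} TFLOPS FP32")

# Approximate output: PCIe: 4.7 / 9.3 TFLOPS; NVLink: 5.3 / 10.6 TFLOPS
```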
Deep Learning Frameworks and Dataset
In this blog, we will present the performance and scalability of P100 GPUs with different deep learning
frameworks on a cluster. Three deep learning frameworks were chosen: NVIDIA’s fork of Caffe (NV-Caffe),
MXNet and TensorFlow. Caffe is a well-known and widely used deep learning framework developed by
the Berkeley Vision and Learning Center (BVLC) and community contributors. It focuses primarily on
image classification, and it supports multiple GPUs within a node but not across nodes. MXNet, jointly
developed by collaborators from multiple universities and companies, is a lightweight, portable deep
learning framework designed for both efficiency and flexibility. This framework scales to multiple GPUs
both within a node and across nodes. TensorFlow, developed by Google's Brain team, is a library for
numerical computation using data flow graphs. TensorFlow also supports multiple GPUs and can scale to
multiple nodes.
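
As a minimal sketch of the within-node data parallelism these frameworks expose, the example below uses
MXNet's Module API to split each mini-batch across two GPUs; the toy network, data and hyperparameters
are placeholders, not the benchmark configuration used in this study:

```python
# Minimal sketch of MXNet data-parallel training on multiple GPUs within a node.
# Requires a machine with at least two visible GPUs. The network and data here
# are toy placeholders for illustration only.
import mxnet as mx
import numpy as np

# Toy data and iterator (placeholder for a real ImageNet data pipeline).
X = np.random.rand(1000, 784).astype(np.float32)
y = np.random.randint(0, 10, size=1000)
train_iter = mx.io.NDArrayIter(X, y, batch_size=128, shuffle=True)

# A small fully connected classifier defined with the symbolic API.
data = mx.sym.Variable('data')
fc1 = mx.sym.FullyConnected(data=data, num_hidden=256, name='fc1')
act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
fc2 = mx.sym.FullyConnected(data=act1, num_hidden=10, name='fc2')
net = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

# Passing a list of GPU contexts splits each mini-batch across the devices.
mod = mx.mod.Module(symbol=net, context=[mx.gpu(0), mx.gpu(1)])
mod.fit(train_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        num_epoch=5)
```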
All three of the deep learning frameworks we chose are able to perform the image classification task. With
this in mind, we chose the well-known ImageNet Large Scale Visual Recognition Competition (ILSVRC)
2012 dataset. This dataset contains 1,281,167 training images and 50,000 validation images. All images
are grouped into 1000 categories or classes. Another reason we chose the ILSVRC 2012 dataset is that
