White Papers

Deep Learning Inference on P40 vs P4 with Skylake
Authors: Rengan Xu, Frank Han, and Nishanth Dandapanthula. Dell EMC HPC Innovation Lab. July 2017
This blog evaluates the performance, scalability and efficiency of deep learning inference on P40 and P4
GPUs on Dell EMC’s PowerEdge R740 server. The purpose is to compare the P40 with the P4 in terms of
performance and efficiency. It also measures the accuracy difference between high-precision and
reduced-precision arithmetic in deep learning inference.
Introduction to R740 Server
The PowerEdge™ R740 is Dell EMC’s latest-generation 2-socket, 2U rack server designed to run complex
workloads using highly scalable memory, I/O, and network options. The system features the Intel Xeon
Processor Scalable Family (architecture codenamed Skylake-SP), up to 24 DIMMs, PCI Express (PCIe) 3.0
enabled expansion slots, and a choice of network interface technologies to cover NIC and rNDC. The
PowerEdge R740 is a general-purpose platform capable of handling demanding workloads and
applications, such as data warehouses, ecommerce, databases, and high performance computing (HPC).
It supports up to 3 Tesla P40 GPUs or 4 Tesla P4 GPUs.
Introduction to P40 and P4 GPUs
NVIDIA® launched the Tesla® P40 and P4 GPUs for the inference phase of deep learning. Both GPU
models are powered by the NVIDIA Pascal™ architecture and designed for deep learning deployment, but
they serve different purposes: the P40 is designed to deliver maximum throughput, while the P4 aims to provide better
energy efficiency. Aside from high floating point throughput and efficiency, both GPU models introduce
two new optimized instructions designed specifically for inference computations. The two new
instructions are 8-bit integer (INT8) 4-element vector dot product (DP4A) and 16-bit 2-element vector dot
product (DP2A) instructions. Although many HPC applications require high-precision computation with
FP32 (32-bit floating point) or FP64 (64-bit floating point), deep learning researchers have found that
FP16 (16-bit floating point) can achieve the same inference accuracy as FP32, and that many applications
need only INT8 (8-bit integer) or lower precision to maintain acceptable inference accuracy. The Tesla P4
delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while P40 delivers a peak of
47.0 INT8 TIOP/s. Other differences between these two GPU models are shown in Table 1. This blog uses
both types of GPUs in the benchmarking.
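As an illustration of what a single DP4A instruction computes (a dot product of two 4-element INT8 vectors accumulated into a 32-bit integer), the arithmetic can be sketched in Python. The function name below is hypothetical; on the GPU this is the CUDA `__dp4a` intrinsic executed in one instruction, not a loop.

```python
# Sketch of the arithmetic performed by one DP4A instruction:
# a 4-element INT8 dot product accumulated into a 32-bit integer.
# Illustrative only; the hardware executes this in a single instruction.

def dp4a(a, b, c):
    """Return c + dot(a, b) for two 4-element INT8 vectors a and b.

    Each element must fit in a signed 8-bit integer (-128..127);
    the accumulation happens at 32-bit integer precision.
    """
    assert len(a) == len(b) == 4
    assert all(-128 <= x <= 127 for x in list(a) + list(b))
    return c + sum(x * y for x, y in zip(a, b))

# Example: accumulate one INT8 dot product into a running sum.
acc = dp4a([1, 2, 3, 4], [10, 20, 30, 40], 0)  # 1*10 + 2*20 + 3*30 + 4*40
print(acc)  # 300
```

DP2A is analogous but mixes 16-bit and 8-bit operands, trading some throughput for higher-precision inputs.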
Table 1: Comparison between Tesla P40 and P4

                      Tesla P40       Tesla P4
CUDA Cores            3840            2560
Core Clock            1531 MHz        1063 MHz
Memory Bandwidth      346 GB/s        192 GB/s
Memory Size           24 GB GDDR5     8 GB GDDR5
FP32 Compute          12.0 TFLOPS     5.5 TFLOPS
INT8 Compute          47 TIOPS        22 TIOPS
TDP                   250 W           75 W
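The peak INT8 figures in Table 1 follow from core count and clock: each CUDA core can issue one DP4A per clock, and each DP4A counts as 8 integer operations (4 multiplies plus 4 adds). A quick back-of-the-envelope check, assuming the boost clocks listed in the table:

```python
# Back-of-the-envelope check of the peak INT8 throughput in Table 1.
# Assumes one DP4A per CUDA core per clock, with each DP4A counted
# as 8 integer operations (4 multiplies + 4 adds).

def peak_int8_tiops(cuda_cores, clock_mhz, ops_per_core_per_clock=8):
    """Peak INT8 throughput in TIOP/s (tera integer operations per second)."""
    return cuda_cores * clock_mhz * 1e6 * ops_per_core_per_clock / 1e12

print(round(peak_int8_tiops(3840, 1531), 1))  # P40: 47.0 TIOP/s
print(round(peak_int8_tiops(2560, 1063), 1))  # P4:  21.8 TIOP/s
```

These match the 47.0 and 21.8 INT8 TIOP/s peaks quoted above.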
