White Papers

Deep Learning Inference on P40 vs P4 with Skylake
Authors: Rengan Xu, Frank Han, and Nishanth Dandapanthula. Dell EMC HPC Innovation Lab. July 2017
This blog evaluates the performance, scalability and efficiency of deep learning inference on P40 and P4
GPUs on Dell EMC’s PowerEdge R740 server. The purpose is to compare the P40 with the P4 in terms of
performance and efficiency. It also measures the accuracy difference between high-precision and
reduced-precision arithmetic in deep learning inference.
Introduction to R740 Server
The PowerEdge™ R740 is Dell EMC’s latest-generation 2-socket, 2U rack server designed to run complex
workloads using highly scalable memory, I/O, and network options. The system features the Intel Xeon
Processor Scalable Family (architecture codenamed Skylake-SP), up to 24 DIMMs, PCI Express (PCIe) 3.0
enabled expansion slots, and a choice of network interface technologies to cover NIC and rNDC. The
PowerEdge R740 is a general-purpose platform capable of handling demanding workloads and
applications, such as data warehouses, ecommerce, databases, and high performance computing (HPC).
It supports up to 3 Tesla P40 GPUs or 4 Tesla P4 GPUs.
Introduction to P40 and P4 GPUs
NVIDIA® launched the Tesla® P40 and P4 GPUs for the inference phase of deep learning. Both GPU
models are powered by the NVIDIA Pascal™ architecture and designed for deep learning deployment, but
they serve different purposes: the P40 is designed to deliver maximum throughput, while the P4 aims to provide better
energy efficiency. Aside from high floating point throughput and efficiency, both GPU models introduce
two new optimized instructions designed specifically for inference computations. The two new
instructions are 8-bit integer (INT8) 4-element vector dot product (DP4A) and 16-bit 2-element vector dot
product (DP2A) instructions. Although many HPC applications require high-precision computation with
FP32 (32-bit floating point) or FP64 (64-bit floating point), deep learning researchers have found that
FP16 (16-bit floating point) can achieve the same inference accuracy as FP32, and that many applications
need only INT8 (8-bit integer) or lower precision to maintain acceptable inference accuracy. The Tesla P4
delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per Second), while P40 delivers a peak of
47.0 INT8 TIOP/s. Other differences between these two GPU models are shown in Table 1. This blog uses
both types of GPUs in the benchmarking.
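As an illustration of what a single DP4A instruction computes (a dot product of two 4-element INT8 vectors accumulated into a 32-bit integer), the arithmetic can be sketched in Python. The function name below is hypothetical; on the GPU this is the CUDA `__dp4a` intrinsic executed in one instruction, not a loop.

```python
# Sketch of the arithmetic performed by one DP4A instruction:
# a 4-element INT8 dot product accumulated into a 32-bit integer.
# Illustrative only; the hardware executes this in a single instruction.

def dp4a(a, b, c):
    """Return c + dot(a, b) for two 4-element INT8 vectors a and b.

    Each element must fit in a signed 8-bit integer (-128..127);
    the accumulation happens at 32-bit integer precision.
    """
    assert len(a) == len(b) == 4
    assert all(-128 <= x <= 127 for x in list(a) + list(b))
    return c + sum(x * y for x, y in zip(a, b))

# Example: accumulate one INT8 dot product into a running sum.
acc = dp4a([1, 2, 3, 4], [10, 20, 30, 40], 0)  # 1*10 + 2*20 + 3*30 + 4*40
print(acc)  # 300
```

DP2A is analogous but mixes 16-bit and 8-bit operands, trading some throughput for higher-precision inputs.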
Table 1: Comparison between Tesla P40 and P4

                      Tesla P40       Tesla P4
CUDA Cores            3840            2560
Core Clock            1531 MHz        1063 MHz
Memory Bandwidth      346 GB/s        192 GB/s
Memory Size           24 GB GDDR5     8 GB GDDR5
FP32 Compute          12.0 TFLOPS     5.5 TFLOPS
INT8 Compute          47 TIOPS        22 TIOPS
TDP                   250 W           75 W
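The peak INT8 figures in Table 1 follow from core count and clock: each CUDA core can issue one DP4A per clock, and each DP4A counts as 8 integer operations (4 multiplies plus 4 adds). A quick back-of-the-envelope check, assuming the boost clocks listed in the table:

```python
# Back-of-the-envelope check of the peak INT8 throughput in Table 1.
# Assumes one DP4A per CUDA core per clock, with each DP4A counted
# as 8 integer operations (4 multiplies + 4 adds).

def peak_int8_tiops(cuda_cores, clock_mhz, ops_per_core_per_clock=8):
    """Peak INT8 throughput in TIOP/s (tera integer operations per second)."""
    return cuda_cores * clock_mhz * 1e6 * ops_per_core_per_clock / 1e12

print(round(peak_int8_tiops(3840, 1531), 1))  # P40: 47.0 TIOP/s
print(round(peak_int8_tiops(2560, 1063), 1))  # P4:  21.8 TIOP/s
```

These match the 47.0 and 21.8 INT8 TIOP/s peaks quoted above.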
