White Papers

ManualsBrandsDell ManualsConverged InfrastructureHigh Performance Computing Solution Resources

Deep Learning Inference on P40 GPUs

Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Mar. 2017

Introduction to P40 GPU and TensorRT

Deep Learning (DL) has two major phases: training and inference/testing/scoring. The training phase

builds a deep neural network (DNN) model with the existing large amount of data. And the inference

phase uses the trained model to make prediction from new data. The inference can be done in the data

center, embedded system, auto and mobile devices, etc. Usually inference must respond to user request

as quickly as possible (often in real time). To meet the low-latency requirement of inference, NVIDIA®

launched Tesla® P4 and P40 GPUs. Aside from high floating point throughput and efficiency, both GPUs

introduce two new optimized instructions designed specifically for inference computations. The two new

instructions are 8-bit integer (INT8) 4-element vector dot product (DP4A) and 16-bit 2-element vector dot

product (DP2A) instructions. Deep learning researchers have found using FP16 is able to achieve the same

inference accuracy as FP32 and many applications only require INT8 or lower precision to keep an

acceptable inference accuracy. Tesla P4 delivers a peak of 21.8 INT8 TIOP/s (Tera Integer Operations per

Second), while P40 delivers a peak of 47.0 INT8 TIOP/s. This blog only focuses on P40 GPU.

TensorRT

, previously called GIE (GPU Inference Engine), is a high performance deep learning inference

engine for production deployment of deep learning applications that maximizes inference throughput and

efficiency. TensorRT provides users the ability to take advantage of fast reduced precision instructions

provided in the Pascal GPUs. TensorRT v2 supports the INT8 reduced precision operations that are

available on the P40.

Testing Methodology

This blog quantifies the performance of deep learning inference using TensorRT on Dell’s PowerEdge

C4130 server which is equipped with 4 Tesla P40 GPUs. Since TensorRT is only available for Ubuntu OS, all

the experiments were done on Ubuntu. Table 1 shows the hardware and software details. The inference

benchmark we used was giexec in TensorRT sample codes. The synthetic images which were filled with

random non-zero numbers to simulate real images were used in this sample code. Two classic neural

networks were tested: AlexNet (2012 ImageNet winner) and GoogLeNet (2014 ImageNet winner) which

is much deeper and complicated than AlexNet.

We measured the inference performance in images/sec which means the number of images that can be

processed per second. To measure the performance improvement of the current generation GPU P40, we

also compared its performance with the previous generation GPU M40. The most important goal of this

testing is to measure the inference performance in INT8 mode, compared to FP32 mode. P40 uses the

new Pascal architecture and supports the new INT8 instructions. The previous generation GPU M40 uses

Maxwell architecture and does not support INT8 instructions. The theoretical performance of INT8, FP32

in both M40 and P40 is shown in Table 2. We measured the performance FP32 on both devices and both

FP32 and INT8 on the P40.

Summary of content (6 pages)

PAGE 1
Deep Learning Inference on P40 GPUs Authors: Rengan Xu, Frank Han and Nishanth Dandapanthu. Dell EMC HPC Innovation Lab. Mar. 2017 Introduction to P40 GPU and TensorRT Deep Learning (DL) has two major phases: training and inference/testing/scoring. The training phase builds a deep neural network (DNN) model with the existing large amount of data. And the inference phase uses the trained model to make prediction from new data.
PAGE 2
Table 1: Hardware configuration and software details Platform PowerEdge C4130 (configuration G) Processor 2 x Intel Xeon CPU E5-2690 v4 @2.6GHz (Broadwell) Memory 256GB DDR4 @ 2400MHz Disk 400GB SSD GPU 4x Tesla P40 with 24GB GPU memory Software and Firmware Operating System Ubuntu 14.04 BIOS 2.3.3 CUDA and driver version 8.0.44 (375.20) TensorRT Version 2.0 EA Table 2: Comparison between Tesla M40 and P40 Tesla M40 Tesla P40 INT8 (TIOP/s) N/A 47.0 FP32 (TFLOP/s) 6.8 11.
PAGE 3
Images/sec (higher is better) TensorRT Inference Performance (batch size=128) 18000 16000 14000 12000 10000 8000 6000 4000 2000 0 16292 6410 5198 2215 GoogLeNet AlexNet FP32 INT8 Figure 1: Inference performance with TensorRT library Dell’s PowerEdge C4130 supports up to 4 GPUs in a server. To make use of all GPUs, we implemented the inference benchmark using MPI so that each MPI process runs on each GPU.
PAGE 4
TensorRT AlexNet on Multi-GPU (INT8, batch size=128) 64691 4.5 3.98 60000 3.5 50000 3.0 40000 2.5 32386 1.99 30000 20000 4.0 2.0 1.5 16270 1 Speedup Images/sec (higher is better) 70000 1.0 10000 0.5 0 0.0 1 P40 2 P40 4 P40 Number of GPUs Figure 3: Multi-GPU inference performance with TensorRT AlexNet To highlight the performance advantage of P40 GPU and its native support for INT8, we compared the inference performance between P40 with the previous generation GPU M40.
PAGE 5
Images/sec (higher is better) P40 vs M40 for AlexNet with TensorRT (batch size=128) 20000 16292 15000 10000 5198 3200 5000 0 FP32 INT8 Operations mode M40 P40 Figure 5: Inference performance comparison between P40 and M40 Deep learning inference can be applied in different scenarios. Some scenarios require large batch size and some scenarios even requires no batching at all (i.e. batch size is 1).
PAGE 6
Conclusions and Future Work In this blog, we presented the inference performance in deep learning with NVIDIA® TensorRT library on P40 and M40 GPUs. As a result, the INT8 support in P40 is about 3x faster than FP32 mode in P40 and 4.4x faster than FP32 mode in the previous generation GPU M40. Multiple GPUs can increase the inferencing performance linearly because of no communications and synchronizations.