There are ongoing studies to further stress the storage subsystems with other models (such as model-parallel architectures like seq2seq) and datasets to understand the performance implications of deep learning workloads at scale.
3.2 Deep Learning Inference
Inference is the end goal of Deep Learning. Inference performance tends to be either latency-focused or
throughput-focused. On one hand, latency-focused scenarios are time sensitive (e.g. face recognition), with
time to solution taking priority over efficient hardware utilization; in such cases, the batch size can be as small
as one. On the other hand, in cases where delayed batch processing is acceptable, large batch sizes can be
used to increase throughput.
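As a simple illustration of this trade-off, the Python sketch below times a placeholder run_inference() call at several batch sizes and reports per-batch latency alongside overall throughput. The function and its timings are illustrative stand-ins for a real deployed model, not part of this study's benchmark.

import time
import numpy as np

def run_inference(batch):
    """Placeholder for a deployed model; any framework's inference call could go here."""
    # Simulate work with a fixed per-call overhead plus a per-image cost.
    time.sleep(0.002 + 0.0005 * len(batch))
    return np.zeros(len(batch))

def benchmark(batch_size, num_batches=50):
    """Report per-batch latency and overall throughput for a given batch size."""
    data = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)
    start = time.time()
    for _ in range(num_batches):
        run_inference(data)
    elapsed = time.time() - start
    latency_ms = 1000.0 * elapsed / num_batches
    throughput = batch_size * num_batches / elapsed
    return latency_ms, throughput

for bs in (1, 8, 64):
    lat, thr = benchmark(bs)
    print(f"batch={bs:3d}  latency={lat:7.2f} ms/batch  throughput={thr:8.1f} images/sec")

With a fixed per-call overhead, larger batches amortize that overhead and raise images/sec, at the cost of a longer wait for any individual result.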
This section quantifies the performance of Deep Learning inference using NVIDIA's TensorRT library.
TensorRT, previously called GIE (GPU Inference Engine), is a Deep Learning inference engine for production
deployment of Deep Learning applications that aims to maximize inference throughput and efficiency.
TensorRT lets users take advantage of the fast reduced-precision instructions provided by the P100 and
V100 GPUs.
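As a rough sketch of how these reduced-precision modes are requested, the Python fragment below uses the builder-configuration flags of the current TensorRT Python API (8.x-style, with an ONNX parser). TensorRT 4.0.0.3, the version used in this study, exposed the same capability through an older interface, and the model file name is a placeholder.

import tensorrt as trt  # requires a TensorRT installation with the Python bindings

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("resnet50.onnx", "rb") as f:       # placeholder model file
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # allow FP16 kernels on P100/V100
# config.set_flag(trt.BuilderFlag.INT8)      # INT8 additionally requires a calibrator
engine_bytes = builder.build_serialized_network(network, config)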
In the experiments conducted in this study, TensorRT 4.0.0.3 was used. As described in Section 3.1.1, deep
learning researchers have found that they can achieve the same inference accuracy with FP16 (16-bit floating
point) as with FP32. Many applications require only INT8 (8-bit integer) or lower precision to maintain
acceptable inference accuracy. TensorRT has included support for INT8 operations since version 2 of the software.
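Low precision can work because inference only needs each tensor to be represented to within a tolerance the network can absorb. The NumPy sketch below shows the basic idea behind symmetric INT8 quantization using a simple max-based scale; TensorRT's own INT8 calibration is more sophisticated (entropy-based), so this is only a conceptual illustration.

import numpy as np

def quantize_int8(x, scale):
    """Map FP32 values to INT8 with a symmetric scale: x_q = round(x / scale)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

# Example FP32 activations; in practice the scale comes from a calibration step.
activations = np.random.randn(1000).astype(np.float32)
scale = np.abs(activations).max() / 127.0   # simple max calibration, not TensorRT's entropy method

q = quantize_int8(activations, scale)
restored = dequantize(q, scale)
print("max absolute quantization error:", np.abs(activations - restored).max())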
All inference experiments were performed on a single V100 GPU. Multi-GPU results were not included since
most inference jobs are large, embarrassingly parallel batch jobs with little communication between
GPUs. Therefore, for most use cases, linear scalability can be expected when using multiple GPUs for
inference.
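Because each input can be processed independently, scaling out is typically a matter of sharding the input set across GPUs, with one inference process per device. The sketch below illustrates that pattern; infer_on_gpu() and the file list are placeholders, and the actual engine loading is omitted.

from multiprocessing import Process

def infer_on_gpu(gpu_id, shard):
    """Hypothetical worker: pin one GPU, load an engine, and run its shard of inputs.
    No cross-GPU communication is required."""
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # engine = load_engine(...)              # assumed helper, not shown
    # for path in shard: engine.infer(path)
    print(f"GPU {gpu_id}: processing {len(shard)} inputs")

if __name__ == "__main__":
    inputs = [f"image_{i:05d}.jpg" for i in range(10000)]    # placeholder file list
    num_gpus = 4
    shards = [inputs[i::num_gpus] for i in range(num_gpus)]  # round-robin split
    workers = [Process(target=infer_on_gpu, args=(i, s)) for i, s in enumerate(shards)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()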
Figure 15 shows the inference performance with TensorRT on the Resnet50 model with different batch sizes. At
the time of writing, there is a known issue on V100 GPUs where running a model with INT8 works only if the
batch size is evenly divisible by 4, so no INT8 result is reported for batch size 1. The
results show that INT8 mode is 2.5x to 3.5x faster than FP32 when the batch size is less than 64, and
~3.7x faster when the batch size is greater than 64. This is expected since the theoretical speedup of INT8
over FP32 is 4x if only multiplications are performed and no other overhead is incurred. However, there
are kernel launches, occupancy limits, data movement and mathematical operations other than multiplications,
so the observed speedup is reduced to about 3x.
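For reference, the speedup figures quoted above are simply ratios of measured throughput. The short calculation below shows the arithmetic with placeholder numbers, not the Figure 15 measurements.

# Illustrative placeholder throughputs (images/sec), not the Figure 15 measurements.
fp32_throughput = 1000.0
int8_throughput = 3300.0

observed_speedup = int8_throughput / fp32_throughput     # ~3.3x in this example
theoretical_speedup = 4.0                                # INT8 vs FP32 peak math throughput
efficiency = observed_speedup / theoretical_speedup      # fraction of the ideal 4x achieved
print(f"observed {observed_speedup:.1f}x, {100 * efficiency:.0f}% of the theoretical 4x")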