There are ongoing studies to further stress the storage subsystems with other models (such as model-parallel architectures like seq2seq) and datasets to understand the performance implications of deep learning workloads at scale.
3.2 Deep Learning Inference
Inference is the end goal of Deep Learning. Inference performance tends to be either latency-focused or
throughput-focused. On one hand, latency-focused scenarios are time sensitive (e.g. face recognition), with
time to solution taking priority over efficient hardware utilization; in such cases, the batch size can be as small
as one. On the other hand, in cases where delayed batch processing is acceptable, large batch sizes can be
used to increase throughput.
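As a simple illustration of this trade-off, the Python sketch below times a placeholder run_inference() call at several batch sizes and reports per-batch latency alongside overall throughput. The function and its timings are illustrative stand-ins for a real deployed model, not part of this study's benchmark.

import time
import numpy as np

def run_inference(batch):
    """Placeholder for a deployed model; any framework's inference call could go here."""
    # Simulate work with a fixed per-call overhead plus a per-image cost.
    time.sleep(0.002 + 0.0005 * len(batch))
    return np.zeros(len(batch))

def benchmark(batch_size, num_batches=50):
    """Report per-batch latency and overall throughput for a given batch size."""
    data = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)
    start = time.time()
    for _ in range(num_batches):
        run_inference(data)
    elapsed = time.time() - start
    latency_ms = 1000.0 * elapsed / num_batches
    throughput = batch_size * num_batches / elapsed
    return latency_ms, throughput

for bs in (1, 8, 64):
    lat, thr = benchmark(bs)
    print(f"batch={bs:3d}  latency={lat:7.2f} ms/batch  throughput={thr:8.1f} images/sec")

With a fixed per-call overhead, larger batches amortize that overhead and raise images/sec, at the cost of a longer wait for any individual result.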
This section quantifies the performance of Deep Learning inference using NVIDIA's TensorRT library.
TensorRT, previously called GIE (GPU Inference Engine), is a Deep Learning inference engine for production
deployment of Deep Learning applications that aims to maximize inference throughput and efficiency.
TensorRT lets users take advantage of the fast reduced-precision instructions provided by the P100 and
V100 GPUs.
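As a rough sketch of how these reduced-precision modes are requested, the Python fragment below uses the builder-configuration flags of the current TensorRT Python API (8.x-style, with an ONNX parser). TensorRT 4.0.0.3, the version used in this study, exposed the same capability through an older interface, and the model file name is a placeholder.

import tensorrt as trt  # requires a TensorRT installation with the Python bindings

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("resnet50.onnx", "rb") as f:       # placeholder model file
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # allow FP16 kernels on P100/V100
# config.set_flag(trt.BuilderFlag.INT8)      # INT8 additionally requires a calibrator
engine_bytes = builder.build_serialized_network(network, config)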
In the experiments conducted in this study, TensorRT 4.0.0.3 was used. As described in Section 3.1.1, deep
learning researchers have found that they can achieve the same inference accuracy with FP16 (16-bit floating
point) as with FP32. Many applications require only INT8 (8-bit integer) or lower precision to maintain
acceptable inference accuracy. TensorRT has included support for INT8 operations since version 2 of the software.
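Low precision can work because inference only needs each tensor to be represented to within a tolerance the network can absorb. The NumPy sketch below shows the basic idea behind symmetric INT8 quantization using a simple max-based scale; TensorRT's own INT8 calibration is more sophisticated (entropy-based), so this is only a conceptual illustration.

import numpy as np

def quantize_int8(x, scale):
    """Map FP32 values to INT8 with a symmetric scale: x_q = round(x / scale)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

# Example FP32 activations; in practice the scale comes from a calibration step.
activations = np.random.randn(1000).astype(np.float32)
scale = np.abs(activations).max() / 127.0   # simple max calibration, not TensorRT's entropy method

q = quantize_int8(activations, scale)
restored = dequantize(q, scale)
print("max absolute quantization error:", np.abs(activations - restored).max())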
All inference experiments were performed on a single V100 GPU. Multi-GPU results were not included since
most inference jobs are large, embarrassingly parallel batch jobs with little communication between
GPUs. Therefore, for most use cases, linear scalability can be expected when using multiple GPUs for
inference.
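Because each input can be processed independently, scaling out is typically a matter of sharding the input set across GPUs, with one inference process per device. The sketch below illustrates that pattern; infer_on_gpu() and the file list are placeholders, and the actual engine loading is omitted.

from multiprocessing import Process

def infer_on_gpu(gpu_id, shard):
    """Hypothetical worker: pin one GPU, load an engine, and run its shard of inputs.
    No cross-GPU communication is required."""
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # engine = load_engine(...)              # assumed helper, not shown
    # for path in shard: engine.infer(path)
    print(f"GPU {gpu_id}: processing {len(shard)} inputs")

if __name__ == "__main__":
    inputs = [f"image_{i:05d}.jpg" for i in range(10000)]    # placeholder file list
    num_gpus = 4
    shards = [inputs[i::num_gpus] for i in range(num_gpus)]  # round-robin split
    workers = [Process(target=infer_on_gpu, args=(i, s)) for i, s in enumerate(shards)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()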
Figure 15 shows the inference performance with TensorRT on the Resnet50 model with different batch sizes. At
the time of writing, there is a known issue on V100 GPUs where running a model with INT8 works only if the
batch size is evenly divisible by 4, so no INT8 result is reported for batch size 1. The
results show that INT8 mode is 2.5x to 3.5x faster than FP32 when the batch size is less than 64, and
~3.7x faster when the batch size is greater than 64. This is expected since the theoretical speedup of INT8
over FP32 is 4x if only multiplications are performed and no other overhead is incurred. However, there
are kernel launches, occupancy limits, data movement and mathematical operations other than multiplications,
so the observed speedup is reduced to about 3x.
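For reference, the speedup figures quoted above are simply ratios of measured throughput. The short calculation below shows the arithmetic with placeholder numbers, not the Figure 15 measurements.

# Illustrative placeholder throughputs (images/sec), not the Figure 15 measurements.
fp32_throughput = 1000.0
int8_throughput = 3300.0

observed_speedup = int8_throughput / fp32_throughput     # ~3.3x in this example
theoretical_speedup = 4.0                                # INT8 vs FP32 peak math throughput
efficiency = observed_speedup / theoretical_speedup      # fraction of the ideal 4x achieved
print(f"observed {observed_speedup:.1f}x, {100 * efficiency:.0f}% of the theoretical 4x")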