
Direct from Development
Server and Infrastructure
Engineering
Copyright © 2019 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries
Deep Learning Inference with Intel PAC on
Dell EMC Infrastructure Part II
Recap
Before we dive into the details, let’s briefly go over what deep
learning inference is and why we might use a field-programmable
gate array (FPGA) accelerator like the Intel Programmable
Accelerator Card (PAC) to speed up the process.
Deep learning is a class of machine learning that learns a neural
network model from sample data sets over a series of training
iterations guided by a loss function [1]. The output of this phase,
the learned model, is then used to make predictions on new data.
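To make the training/inference split concrete, here is a minimal, self-contained sketch of the inference step: a forward pass through a single-layer classifier. The weights, biases, and input values are made up purely for illustration; in practice they would come from the training phase described above.

```python
import math

# Toy "learned model": weights and biases for a single linear layer
# with two output classes. These values are illustrative only; real
# models learn them during training.
WEIGHTS = [[0.8, -0.3, 0.5],
           [-0.2, 0.9, 0.1]]
BIASES = [0.1, -0.1]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def infer(features):
    """Forward pass: weighted sum per class, then softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(WEIGHTS, BIASES)]
    return softmax(scores)

# New data point the model has never seen; output is a probability
# per class.
print(infer([1.0, 0.5, 2.0]))
```

Real inference workloads run millions of such forward passes, each with many layers; it is this repetitive, latency-sensitive computation that accelerators target.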
While this model-learning process lends itself well to
single-instruction, multiple-data (SIMD) computing on
coarse-grained architectures like many-core CPUs and GPUs, the
inferencing process is much more amenable to irregular,
fine-grained architectures like FPGAs, which offer greater
architectural flexibility to meet specific application requirements
such as latency, throughput, and power. Inferencing is also the
stage where most enterprises realize the business value of their
AI investments.
To accelerate inferencing in resource-constrained servers, the
PCIe-based Intel PAC with Arria 10 GX FPGA delivers up to 1.5
teraFLOPS (1.5 trillion floating-point operations per second)
within a thermal design power of only 60 watts.
This makes the Intel PAC particularly well suited for datacenter
and edge computing environments. Combined with the Intel
Acceleration Stack and the Intel Distribution of OpenVINO,
developers can deploy models on the Intel PAC while leveraging
its unique, built-in hardware features, including direct I/O and
networking.
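A quick back-of-envelope calculation from the two figures quoted above (up to 1.5 teraFLOPS within a 60-watt envelope) gives the card's peak power efficiency:

```python
# Efficiency figure derived from the quoted specs: peak throughput
# divided by thermal design power. Both inputs come from the text
# above; the result is a theoretical peak, not a measured value.
peak_flops = 1.5e12   # 1.5 trillion floating-point operations/second
tdp_watts = 60.0      # thermal design power in watts

gflops_per_watt = peak_flops / tdp_watts / 1e9
print(f"{gflops_per_watt:.0f} GFLOPS per watt")  # 25 GFLOPS per watt
```

This performance-per-watt figure is what makes the card attractive where power and cooling budgets, rather than raw throughput, are the binding constraint.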
Figure 1a: Intel PAC with Arria 10 GX FPGA
Tech Note by
David Ojika
(University of Florida)
Bhavesh Patel
(Dell EMC)
Shawn Slockers
(Intel)
Summary
This is the second part of a
three-part series on deep
learning inferencing on
FPGAs. In part 1, we
presented the Intel PAC
with the Intel Acceleration
Stack integrated with a Dell
EMC PowerEdge server
running image classification.
Here, in part 2, we
demonstrate the
performance of the newly
improved ResNet-50 image
classification model.
