Whitepaper
Deep Learning Inferencing with Mipsology using Xilinx ALVEO™ on Dell EMC Infrastructure
Revision: 1.1
Issue Date: 11/5/2019

Abstract
This paper evaluates the throughput, efficiency, and ease of use of Deep Learning inference performed by Mipsology's Zebra software stack running on the FPGA-based Xilinx ALVEO™ U200 installed in a Dell EMC PowerEdge R740/R740xd server.
Revisions

Date             Description
24 October 2019  Initial release

Acknowledgements
This paper was produced by the following people:

Name            Role
Bhavesh Patel   Server Advanced Engineering, Dell EMC
Ludovic Larzul  CEO, Mipsology
Overview of Deep Learning
The deployment of a Deep Learning (DL) algorithm proceeds in two stages: training and inference. As illustrated in Figure 1, training sets the parameters of a neural network model implementing an algorithm through a learning process that iterates over a large dataset while minimizing a loss function [1]; in general, the larger and more representative the dataset, the higher the accuracy of the model. The output of this stage, the learned model, is then used in the inference stage to make predictions on new data.
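The two stages can be illustrated with a minimal sketch (a toy linear model on synthetic data, not taken from the paper): a training loop adjusts the model parameters to reduce a loss over the dataset, and the learned model is then used for inference on new data.

```python
import numpy as np

# Training: fit w, b for y = w*x + b by gradient descent on a squared loss.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 0.5 + rng.normal(0, 0.01, 200)   # synthetic dataset, true w=3, b=0.5

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):                            # training iterations
    err = (w * x + b) - y                       # drives the loss function
    w -= lr * 2 * np.mean(err * x)              # gradient of mean squared error w.r.t. w
    b -= lr * 2 * np.mean(err)                  # gradient w.r.t. b

# Inference: the learned model makes predictions on new, unseen data.
new_x = np.array([0.25, -0.5])
new_pred = w * new_x + b
```

The same separation holds for deep networks: the expensive, iterative stage is training; inference is a single forward pass through the learned model.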
Figure 2. Inference Flow.

The hardware accelerator must meet a set of basic requirements:
• Deliver high throughput of computations with low latency.
• Support a neural network as defined by AI scientists without changes, to avoid a time-consuming re-design or never-ending retraining.
• Be adaptable to accommodate different loads without long delays when executing different models.
• Be a proven hardware solution that can run 24/7 without interruptions.
Why FPGA?
FPGAs achieve high computation throughput via a robust set of resources: a large number of reprogrammable lookup tables (LUTs) to implement millions of equivalent Boolean-logic functions, a large array of multipliers/adders (MACs), and numerous embedded memories to accommodate a broad variety of logic circuitry. They can also support a high number of off-chip memories if necessary. Auxiliary logic, such as I/O interfaces, completes the device.
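The flexibility of a LUT comes from the fact that a k-input LUT is simply a 2^k-entry truth table: storing a function's output column in the table "implements" any Boolean function of k inputs. A minimal software sketch of this idea (illustrative only, not an FPGA toolflow):

```python
# A k-input FPGA LUT is a 2^k-entry truth table: any Boolean function of
# k inputs is implemented by storing its output column in the table.
def make_lut(truth_table):
    """Return a function that looks up the output bit for the given input bits."""
    def lut(*bits):
        index = 0
        for bit in bits:                 # pack the input bits into a table address
            index = (index << 1) | bit
        return truth_table[index]
    return lut

# Example: a 3-input majority function, stored as its 8-entry truth table.
majority = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
```

Reprogramming the FPGA amounts to rewriting these tables (and the routing between them), which is why the same silicon can implement very different circuits.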
Figure 4. Xilinx ALVEO™ U200 data center accelerator card.
Figure 5. U200 Block Diagram.

Zebra Acceleration Stack from Mipsology
Mipsology, an innovative AI/DL startup, conceived a software stack called Zebra for ultra-fast inference acceleration of CNNs. Zebra runs on top of FPGAs and conceals them entirely from the user. The Mipsology Zebra stack outshines its competitors with several advantages: exceptional throughput, very low latency, very high efficiency, and remarkable ease of use.
Zebra offloads the CNN computation to the FPGA, improving performance and freeing the CPU. To accommodate the progress of ML technology, Mipsology R&D regularly extends the acceleration to new layer types. When a CNN evolves during the course of a project, it can be processed on Zebra on the spot once trained, drastically simplifying the deployment of new CNN versions in data centers, at the edge, on desktops, or in embedded applications.
Figure 6. Mipsology's Zebra stack.

Zebra Is Easy to Use
Deploying Zebra is a "plug & play" process. Plug an ALVEO™ board into a PC running Linux, issue a single Linux command to configure it, and you are ready to go. See Figure 6. There is no R&D cost in using Zebra: no extra work is required to make Zebra compute a neural network, and no proprietary tool is needed to migrate the neural network.
There is no additional R&D effort to run the same trained neural network on various sizes of Zebra accelerators or in different locations, from the data center to embedded devices. If a neural network grows during the lifetime of the product, a simple hardware upgrade, keeping the same Zebra stack, will provide the increased computational power the larger network requires.
Here are a few examples of applications:

Video surveillance
Video surveillance can be implemented with multiple cameras and a single ALVEO/Zebra combo installed in a Dell EMC PowerEdge R740 server. All cameras would run concurrently in real time, not sequentially as on GPU boards. With GPUs, the same scenario would require multiple GPU boards or significant R&D effort, dramatically increasing the cost and complicating the deployment.
Super resolution
An ALVEO/Zebra combo installed in a Dell EMC PowerEdge R740 server can generate high-quality, high-resolution images from low-resolution images by mapping a very deep super-resolution (VDSR) algorithm onto Zebra. Super-resolution algorithms are particularly demanding in processing power because they must generate high-resolution images. On a Dell EMC PowerEdge R740, Zebra running on ALVEO™ can deliver VDSR video in real time.
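The general VDSR idea can be sketched as follows (a conceptual illustration, not Mipsology's implementation): the low-resolution image is first upscaled by interpolation, then a deep CNN predicts a residual that restores high-frequency detail. Here a placeholder function stands in for the CNN, and nearest-neighbor upscaling stands in for bicubic interpolation.

```python
import numpy as np

def upscale_nearest(img, factor):
    """Nearest-neighbor upscale, a stand-in for bicubic interpolation."""
    return np.kron(img, np.ones((factor, factor)))

def vdsr_like(img, factor, predict_residual):
    """VDSR concept: coarse upscale plus a learned high-frequency residual."""
    coarse = upscale_nearest(img, factor)
    return coarse + predict_residual(coarse)    # the CNN output in a real VDSR

low_res = np.array([[0.2, 0.8], [0.6, 0.4]])
zero_net = lambda x: np.zeros_like(x)           # placeholder for the trained network
high_res = vdsr_like(low_res, 2, zero_net)
```

The residual formulation is what makes the network "very deep" yet trainable: it only has to learn the missing detail, not the whole image, and the per-frame cost of that deep forward pass is what the FPGA accelerates.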
Evaluation Results

Ease of use
Switching from CPU/GPU to Zebra running on an ALVEO™ board deployed for inference was surprisingly simple, carried out via a single Linux command. No FPGA tools or FPGA knowledge were necessary. During the evaluation, eight neural networks were executed without any changes, demonstrating the versatility of the Zebra/ALVEO U200 solution across neural networks. The application was based on a TensorFlow workflow.
Table I summarizes the accuracy achieved by Zebra using int8 computations (obtained from FP32 training followed by Zebra quantization) compared to the accuracy obtained on a GPU/CPU platform using FP32 computations. The results were obtained with the same networks and the same application, without modifications.

Table I.