Zynq-7000 All Programmable SoC

Zynq-7000 AP SoC Technical Reference Manual www.xilinx.com 649

UG585 (v1.11) September 27, 2016

Chapter 22: Programmable Logic Design Guide

22.2.3 PL Acceleration Limits

The achievable speedup of an accelerator can be limited by I/O, resource, and latency requirements.

I/O Rate Limits

A key observation is that processing cannot proceed faster than the speed of the data transfers to and from the functional unit. For functions with a high ratio of operations to I/O, the data rates are not a limiting factor. For functions with a low ratio of operations to I/O, however, dataflow limits the maximum performance attainable.

For example, assume that 12 bytes of input data are read from DDR and 4 bytes of results are written back to DDR, so each operation moves 16 bytes. A 32-bit DDR3 interface running at 1,066 Mb/s per pin with 75% utilization is limited to roughly 3,200 MB/s. At 16 bytes per operation, the dataflow limits the performance to 3,200/16, or 200M functions/sec. Note that this is independent of the complexity of the function. Even a simple 3-input adder is limited by the DDR bandwidth to 200M operations/sec and is not likely to be faster than an ARM A9 CPU. If, however, the function consists of several thousand operations, all of which can proceed in parallel or in a pipelined fashion, then the PL can often achieve speedups of 10-100x.
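The arithmetic in this example can be reproduced with a short calculation. The following Python sketch is illustrative only; the function names are hypothetical, and it assumes that the 1,066 Mb/s figure is a per-pin data rate and that bandwidth is expressed in megabytes per second.

```python
def ddr_bandwidth_mbytes_per_s(bus_bits, mbits_per_pin, utilization):
    """Usable DDR bandwidth in MB/s (assumes mbits_per_pin is a per-pin rate)."""
    peak_mbytes_per_s = bus_bits * mbits_per_pin / 8  # bits to bytes
    return peak_mbytes_per_s * utilization


def io_bound_rate_mops(bandwidth_mbytes_per_s, bytes_per_op):
    """Upper bound on operations per second (in millions) set by dataflow alone."""
    return bandwidth_mbytes_per_s / bytes_per_op


# Values taken from the example: 32-bit DDR3, 1,066 Mb/s, 75% utilization,
# 12 bytes read plus 4 bytes written per operation.
bandwidth = ddr_bandwidth_mbytes_per_s(32, 1066, 0.75)
rate = io_bound_rate_mops(bandwidth, 12 + 4)

print(f"usable bandwidth: {bandwidth:.0f} MB/s")
print(f"I/O-bound rate:   {rate:.1f} M operations/sec")
```

The result is just under 3,200 MB/s and just under 200M operations/sec, which matches the rounded figures above. The bound does not depend on how much computation each operation performs.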

Resource Limits

While the potential speedup can be quite high, the amount of logic in the PL can limit the achievable speedup. For instance, an application that requires 100 DSP slices to achieve a speedup of 24x might be limited to a 12x speedup if only 50 DSP slices are available.
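The DSP slice example implies that speedup scales roughly in proportion to the fraction of the required resources that is available. The following Python sketch makes that assumption explicit. Linear scaling is a simplification inferred from the example, not a guaranteed property of any design, and the function name is hypothetical.

```python
def resource_limited_speedup(full_speedup, required_slices, available_slices):
    """Estimate speedup when fewer resources are available than required.

    Assumes speedup scales linearly with the fraction of required DSP
    slices that is available, and never exceeds the full speedup.
    """
    fraction = min(1.0, available_slices / required_slices)
    return full_speedup * fraction


# Values from the example: 24x needs 100 DSP slices, only 50 are available.
print(resource_limited_speedup(24, 100, 50))
```

With the values from the example, the estimate is a 12x speedup. Supplying more slices than required does not raise the estimate above 24x.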

Latency Limits

The examples above assume that the PL can effectively proceed without intervention by the ARM processor. This is the case when the PL implements a predetermined algorithm and dataflow using pre-allocated buffers, and the data is not resident in caches. In cases where the processor is creating data for the PL accelerator, additional CPU tasks might be required before the PL can begin working on the data. The CPU might need to allocate buffers and pass physical buffer addresses to the PL, flush data from cache to DDR or OCM, or signal the PL to start processing. These additional steps add delays (called latency) to the total processing time. If these delays are significant, the potential acceleration is reduced. Typically, it takes 100-200 clocks for the ARM processor to write a few words of data to a PL function. In general, CPU-to-PL calling latency does not have a significant impact for applications that process more than 4 KB of data.
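The effect of calling latency can be illustrated with a simple model in which a fixed latency is added to the accelerated processing time. The following Python sketch is a rough illustration only. The 200-clock latency is taken from the upper end of the range above; the ideal speedup of 10x and the CPU cost of 10 clocks per byte are hypothetical values chosen for the example, and all times are assumed to be expressed in the same clock units.

```python
def effective_speedup(cpu_clocks, ideal_speedup, latency_clocks):
    """Speedup after adding a fixed CPU-to-PL calling latency.

    cpu_clocks:     time to run the function on the CPU alone
    ideal_speedup:  PL speedup if the call were free (hypothetical input)
    latency_clocks: fixed overhead paid once per call
    """
    accelerated_clocks = cpu_clocks / ideal_speedup + latency_clocks
    return cpu_clocks / accelerated_clocks


CPU_CLOCKS_PER_BYTE = 10   # hypothetical CPU cost, for illustration only
IDEAL_SPEEDUP = 10         # hypothetical PL speedup with zero latency
LATENCY_CLOCKS = 200       # upper end of the 100-200 clock range

for size_bytes in (64, 4096):
    cpu_clocks = size_bytes * CPU_CLOCKS_PER_BYTE
    speedup = effective_speedup(cpu_clocks, IDEAL_SPEEDUP, LATENCY_CLOCKS)
    print(f"{size_bytes:5d} bytes: {speedup:.2f}x")
```

Under these assumptions, a 64-byte call achieves only about 2.4x of the ideal 10x, while a 4 KB call achieves about 9.5x. The fixed latency is amortized as the amount of data per call grows.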

22.2.4 Power Offload

The PL can be used to implement individual functions at a lower energy cost than when they are executed on the ARM A9 application processors. Less energy per operation is required because, when a function is implemented in the PL, data is transferred from operator to operator in a local, assembly-line fashion using short, low-capacitance local connections.

The same function implemented on a processor requires an instruction and data fetch from local caches or external memory and a result to be written back to registers or the memory system over longer, higher-capacitance interfaces. When functions require data to be stored in memory, block