Zynq-7000 AP SoC Technical Reference Manual www.xilinx.com 649
UG585 (v1.11) September 27, 2016
Chapter 22: Programmable Logic Design Guide
22.2.3 PL Acceleration Limits
The achievable speedup of an accelerator can be limited by I/O, resource, and latency requirements.
I/O Rate Limits
A key observation is that processing cannot proceed faster than the speed of the data transfers to
and from the functional unit. For functions with a high ratio of operations to I/O, the data rates are
not a limiting factor. For functions with a low ratio of operations to I/O, however, dataflow limits
the maximum attainable performance.
For example, assume 12 bytes of input data are read from DDR and 4 bytes of results are written
back to DDR. A 32-bit DDR3 interface at 1,066 Mb/s per pin and 75% utilization is limited to roughly
3,200 MB/s (32 bits x 1,066 Mb/s x 0.75 / 8 bits per byte). Because 16 bytes are transferred per
operation, the dataflow limits the performance to 3,200/16, or 200M functions/sec. Note that this is
independent of the complexity of the function. Even a simple 3-input adder is limited by the DDR
bandwidth to 200M operations/sec and is not likely to be faster than an ARM A9 CPU. If, however, the
function consists of several thousand operations, all of which can proceed in parallel or in a
pipelined fashion, then the PL can often achieve speedups of 10-100x.
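The arithmetic in the example above can be reproduced with a short sketch. The function and parameter names are illustrative and do not come from this manual. The sketch treats 1,066 Mb/s as a per-pin rate and the resulting bandwidth as megabytes per second, which is what 32 bits x 1,066 Mb/s x 75% / 8 yields.

```python
def ddr_bandwidth_mbytes_per_s(bus_width_bits, rate_mbit_per_pin, utilization):
    """Usable DDR bandwidth in MB/s for a given bus width, per-pin rate, and utilization."""
    return bus_width_bits * rate_mbit_per_pin * utilization / 8


def max_functions_per_s(bandwidth_mbytes_per_s, bytes_per_function):
    """Upper bound on function invocations per second set by dataflow alone."""
    return bandwidth_mbytes_per_s * 1e6 / bytes_per_function


# Values from the example: 32-bit DDR3, 1,066 Mb/s per pin, 75% utilization,
# 12 bytes read plus 4 bytes written per function.
bandwidth = ddr_bandwidth_mbytes_per_s(32, 1066, 0.75)
limit = max_functions_per_s(bandwidth, 12 + 4)
print(f"{bandwidth:.0f} MB/s, {limit / 1e6:.0f}M functions/sec")
```

With these inputs the bound works out to just under 200M functions/sec, regardless of how much computation each function performs.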
Resource Limits
While the potential speedup can be quite high, the amount of logic in the PL can limit the achievable
speedup. For instance, an application that requires 100 DSP slices to achieve a speedup of 24x
might be limited to a 12x speedup if only 50 DSP slices are available.
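The DSP slice example implies that speedup scales roughly in proportion to the fraction of the required resources that is available. A minimal sketch of that reading follows. The linear model and the function name are assumptions made for illustration; real designs may scale differently.

```python
def resource_limited_speedup(full_speedup, slices_required, slices_available):
    """Estimate speedup when fewer slices are available than required.

    Assumes (hypothetically) that speedup scales linearly with the fraction
    of required slices available, capped at the full speedup.
    """
    fraction = min(1.0, slices_available / slices_required)
    return full_speedup * fraction


# Values from the example: 24x needs 100 DSP slices, but only 50 are available.
print(resource_limited_speedup(24, 100, 50))
```

Under this assumed model, extra slices beyond the requirement do not increase the estimate, because the cap holds it at the full speedup.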
Latency Limits
The examples above assume that the PL can effectively proceed without intervention by the ARM
processor. This is the case when the PL implements a predetermined algorithm and dataflow using
pre-allocated buffers, and the data is not resident in caches. In cases where the processor
is creating data for the PL accelerator, additional CPU tasks might be required before the PL can
begin working on the data. The CPU might need to allocate buffers and pass physical buffer
addresses to the PL, flush data from cache to DDR or OCM, or signal the PL to start
processing. These additional steps add delays (called latency) to the total processing time. If these
delays are significant, the potential acceleration is reduced. Typically it takes 100-200 clocks for the
ARM processor to write a few words of data to a PL function. In general, CPU-to-PL calling latency
does not have a significant impact on applications that process more than 4 KB of data.
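The way calling latency is amortized over larger payloads can be sketched as follows. Only the 100-200 clock figure comes from the text; the 667 MHz CPU clock, the 3,200 MB/s accelerator throughput (borrowed from the I/O rate example as an upper bound), and all names are assumptions made for illustration. The text does not say which clock domain the 100-200 clocks refer to.

```python
def call_overhead_fraction(payload_bytes, latency_clocks, cpu_hz, throughput_bytes_per_s):
    """Fraction of total time spent on CPU-to-PL calling latency."""
    latency_s = latency_clocks / cpu_hz
    processing_s = payload_bytes / throughput_bytes_per_s
    return latency_s / (latency_s + processing_s)


CPU_HZ = 667e6        # assumed CPU clock; not stated in the text
THROUGHPUT = 3.2e9    # assumed bytes/sec, taken from the DDR-limited example
LATENCY_CLOCKS = 200  # upper end of the 100-200 clock range in the text

for payload in (64, 1024, 4096, 65536):
    fraction = call_overhead_fraction(payload, LATENCY_CLOCKS, CPU_HZ, THROUGHPUT)
    print(f"{payload:6d} bytes: {fraction:.1%} of time spent on call latency")
```

The overhead fraction falls as the payload grows. An accelerator that processes data more slowly than the assumed DDR-limited rate spends a smaller share of its time on call latency.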
22.2.4 Power Offload
The PL can be used to implement individual functions at a lower energy cost than when they are
executed on the ARM A9 application processors. Less energy per operation is required because, when a
function is implemented in the PL, data is transferred from operator to operator in assembly-line
fashion over short, low-capacitance local connections.
The same function implemented on a processor requires an instruction and data fetch from local
caches or external memory and a result to be written back to registers or the memory system over
longer, higher capacitance interfaces. When functions require data to be stored in memory, block