By Adam Taylor
With the FIR filter all up and running in software as desired on the PS (Processor System) side of our Zynq SoC, we will now proceed to accelerate the function using the PL (Programmable Logic) side of the device.
To get the best performance we need to ensure that:
- We use DMA data movers between the PS and the PL – To achieve this we often need contiguous memory allocations. We can achieve this using the sds_alloc() function available within sds_lib.h. We can also use pragmas to define the interface type, provided any prerequisites are met. If we do not use pragmas, the SDSoC compiler will select the most appropriate data mover.
- We pipeline / unroll loops as appropriate – Pipelining allows operations from successive loop iterations to overlap, so a new iteration can start before the previous one has finished. We define pipelining using a pragma with a parameter called the initiation interval (II), which sets the target number of clock cycles between the start of one loop iteration and the start of the next. Loop unrolling creates multiple copies of the contents of the loop. Choosing whether to pipeline (and selecting an initiation interval), and whether or not to unroll a loop, depends upon data interdependencies within the loop.
- We have correctly partitioned (segmented) any memory arrays within the implementation – Selecting the correct partitioning allows us to maximize available memory bandwidth, which increases the performance of our accelerated function. As with most SDSoC optimizations, we do this using a pragma.
- We select the best clock rates for the Data Mover Network and the accelerated function itself – These clock rates were designated in our platform definition in Vivado.
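A minimal sketch of the first point might look like the following. The function name `fir_hw`, the helper `alloc_samples`, and the buffer length are hypothetical and chosen only for illustration. The `data_mover` value shown is an assumption, as its spelling has differed between SDSoC releases, so check the user guide for your version. The `#else` branch is only there so the sketch also compiles on a host without SDSoC, where a standard C compiler ignores the `SDS` pragmas.

```c
#include <stdint.h>
#include <stdlib.h>

#ifdef __SDSCC__
#include "sds_lib.h"              /* provides sds_alloc() and sds_free() */
#else
/* Host-only fallback (assumption): lets the sketch build without SDSoC. */
#define sds_alloc(size) malloc(size)
#define sds_free(ptr)   free(ptr)
#endif

#define BUF_SAMPLES 1024          /* hypothetical buffer length */

/* Hypothetical prototype for the function selected for hardware.
 * The pragmas request physically contiguous buffers, sequential
 * (streaming) access, and a simple DMA data mover. Without them the
 * SDSoC compiler picks a data mover itself. */
#pragma SDS data mem_attribute(in:PHYSICAL_CONTIGUOUS, out:PHYSICAL_CONTIGUOUS)
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data data_mover(in:AXIDMA_SIMPLE, out:AXIDMA_SIMPLE)
void fir_hw(const int16_t in[BUF_SAMPLES], int32_t out[BUF_SAMPLES]);

/* Allocate a sample buffer; under SDSoC this is physically contiguous. */
int16_t *alloc_samples(size_t count)
{
    return (int16_t *)sds_alloc(count * sizeof(int16_t));
}
```

Buffers obtained this way should be released with `sds_free()` rather than `free()` when building with SDSoC.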
Before I implemented any optimizations within the filter, I wanted to determine its baseline performance. Running the accelerated function with no optimizations applied resulted in a 36.7% performance increase. That’s not bad. However, we can do better.
The next step was implementing the optimization. To minimise the build time, I used the SDR Estimate build to chart the improvements as I fine-tuned my pragmas. Using the above four points to get the best performance, I ensured that the memory allocation for the samples being transferred into the accelerated function was contiguous. The FIR filter is implemented as two loops: an inner loop that applies the filter and an outer loop that cycles through the sample buffer. There is an obvious data dependency between these loops, but we can still pipeline them to reduce the initiation interval. I partitioned the samples and coefficients completely to achieve maximum memory bandwidth.
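The two-loop structure and the pragmas described above can be sketched roughly as follows. This is not the code from the project: the function name `fir_accel`, the tap count, the buffer length, and the data types are all assumptions made for illustration. The `HLS` pragmas are Vivado HLS directives, and a standard C compiler ignores them, so the sketch also runs on a host.

```c
#include <stdint.h>

#define FIR_TAPS    8     /* hypothetical filter length */
#define FIR_SAMPLES 64    /* hypothetical sample buffer length */

void fir_accel(const int16_t in[FIR_SAMPLES],
               const int16_t coeff[FIR_TAPS],
               int32_t out[FIR_SAMPLES])
{
    int16_t taps[FIR_TAPS];          /* local copy of the coefficients */
    int16_t delay[FIR_TAPS] = {0};   /* delay line holding recent samples */
    /* Complete partitioning turns each array into individual registers,
     * so every element can be read in the same clock cycle. */
#pragma HLS ARRAY_PARTITION variable=taps complete dim=1
#pragma HLS ARRAY_PARTITION variable=delay complete dim=1

    for (int k = 0; k < FIR_TAPS; k++) {
        taps[k] = coeff[k];
    }

    /* Outer loop: cycles through the sample buffer. */
    for (int n = 0; n < FIR_SAMPLES; n++) {
        /* Target one new sample per clock. Pipelining this loop makes
         * Vivado HLS unroll the inner loop nested inside it. */
#pragma HLS PIPELINE II=1
        int32_t acc = 0;

        /* Inner loop: shifts the delay line and applies the filter. */
        for (int k = FIR_TAPS - 1; k > 0; k--) {
            delay[k] = delay[k - 1];
            acc += (int32_t)delay[k] * taps[k];
        }
        delay[0] = in[n];
        acc += (int32_t)delay[0] * taps[0];

        out[n] = acc;
    }
}
```

The delay line carries state from one outer iteration to the next, which is the data dependency mentioned above. With the arrays fully partitioned, the shift and all multiply-accumulates for one sample can still be scheduled together.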
The final step was to define the clocks for the data mover and the accelerated function. Putting all of this together results in a significant improvement. The total execution time was 54696 clock cycles, an 89.78% improvement. This should come as no surprise, as FPGA fabric is especially good at implementing FIR filters using hard macros like the DSP48E1 DSP slice.
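To put that percentage in perspective, it can be converted to a speed-up factor. This assumes the improvement is measured as the fractional reduction in execution time, which the figures above do not state explicitly:

```latex
\text{improvement} = \frac{T_{\text{before}} - T_{\text{after}}}{T_{\text{before}}} = 0.8978
\quad\Longrightarrow\quad
\frac{T_{\text{before}}}{T_{\text{after}}} = \frac{1}{1 - 0.8978} \approx 9.8
```

Under that assumption, the optimized function runs roughly 9.8 times faster than the version it is compared against.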
When I ran the accelerated function on the ZedBoard, I again captured the filter input and output for a signal within the passband and the same for a signal within the stopband. You can see the results below:
Due to the holiday period, this is the last MicroZed Chronicles blog of 2015. They will resume in 2016, so please check back after your New Year celebration. Until then, have a Merry Christmas and a Happy New Year.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
You also can find links to all the previous MicroZed Chronicles blogs on my own Web site, here.