Use C/C++ to Offload Image Processing to Programmable Logic

December 11, 2015


By Olivier Tremois, DSP Specialist FAE, Xilinx

The “standard” image processing systems found today in medical, industrial and a growing number of other applications are becoming ever more advanced, and many existing platforms struggle to meet their complex requirements. Luckily, design teams can leverage Xilinx Zynq-7000 All Programmable SoCs and the new Xilinx SDSoC development environment to create compact, low-power, feature-rich imaging products using C/C++. Let’s examine how by using the SDSoC environment to accelerate an image pipeline processing system. I completed this project in less than a week and was able to accelerate the system example by orders of magnitude.

Our example system acquires images using a specific camera and then processes the images in batch mode. The image size can be up to 3,000 x 2,000 pixels (6 megapixels). Although the processed image is not live video, the intent is to send the images through the image pipeline as quickly as possible. The pipeline here is pretty simple: transform an RGB image into grayscale; add salt-and-pepper noise; and filter the noisy image with three filters (dilate, median and erode).
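The first two pipeline stages might look like the following sketch. The function names and the integer BT.601 luma weights are illustrative assumptions, not taken from the article:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch of the first two pipeline stages: RGB-to-grayscale
// conversion followed by salt-and-pepper noise injection.

// Convert interleaved 8-bit RGB to grayscale using integer BT.601 weights
// (0.299 R + 0.587 G + 0.114 B, scaled by 1024 for fixed-point math).
void rgb_to_gray(const uint8_t* rgb, uint8_t* gray, int width, int height) {
    for (int i = 0; i < width * height; ++i) {
        int r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        gray[i] = static_cast<uint8_t>((306 * r + 601 * g + 117 * b) >> 10);
    }
}

// Flip roughly noise_percent% of pixels to pure black (pepper) or white (salt).
void add_salt_pepper(uint8_t* img, int n_pixels, int noise_percent) {
    for (int i = 0; i < n_pixels; ++i) {
        if (std::rand() % 100 < noise_percent)
            img[i] = (std::rand() % 2) ? 255 : 0;
    }
}
```

The remaining stages (dilate, median, erode) operate on the resulting single-channel image.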

As a starting point, we write the complete application in C++ so that we can estimate the performance of the computations on the Cortex-A9. The application contains a number of functions to read and write BMP images on the SD card, compute luminance, add noise and perform the various filter functions. Working within the SDSoC development environment’s SDDebug configuration will enable rapid implementation on the Xilinx ZC702 evaluation platform under the Linux operating system.
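To estimate the baseline performance on the Cortex-A9, each pipeline stage can be timed individually. A minimal harness (my own illustration, not from the article) might be:

```cpp
#include <chrono>

// Hypothetical timing wrapper for the software-only baseline: run a stage
// and return its elapsed wall-clock time in milliseconds. std::chrono works
// the same on the Zynq's Cortex-A9 under Linux as on a desktop host.
template <typename F>
double time_ms(F&& stage) {
    auto t0 = std::chrono::steady_clock::now();
    stage();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Usage would be along the lines of `double ms = time_ms([&]{ median_filter(in, out, w, h); });`, where `median_filter` stands in for any of the pipeline functions.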

Our system example will show that programmable logic [used for acceleration] is good not only at brute-force computations, but also at more standard data processing. The first goal for this acceleration is to be able to process one new sample every clock cycle. Some code rewriting and a rethinking of the interface can yield greater acceleration. Even if the clock rate of the on-chip programmable logic (PL) is much lower than that of the processing system (PS), being able to process one input pixel per clock should provide great acceleration.
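As a concrete reference, here is a 3x3 erosion (minimum over the window) with clamped borders, one of the three filters in the pipeline. This is the plain C form; to reach one pixel per clock, an HLS implementation would restructure it with two line buffers and a 3x3 window register so that each input pixel is read from memory only once, with a PIPELINE II=1 directive on the inner loop. The function name and border policy are my assumptions:

```cpp
#include <cstdint>
#include <algorithm>

// Reference 3x3 erosion: each output pixel is the minimum of its 3x3
// neighborhood. Borders are handled by clamping coordinates to the image.
// An HLS version would replace the random accesses with line buffers.
void erode3x3(const uint8_t* in, uint8_t* out, int w, int h) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            uint8_t m = 255;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = std::min(std::max(y + dy, 0), h - 1);
                    int xx = std::min(std::max(x + dx, 0), w - 1);
                    m = std::min(m, in[yy * w + xx]);
                }
            }
            out[y * w + x] = m;
        }
    }
}
```

Dilation is the same structure with a maximum instead of a minimum; the median filter sorts the nine window values.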

When I built the sample system, I first verified that the source code was Vivado HLS compliant and then added the VHLS directives. Using specific SDSoC directives, I specified that the data would be stored contiguously in the physical space (with memory allocated using the function sds_alloc) and that I wanted a DMA to access it. I then switched the build configuration to SDEstimate in order to have a first rough estimate of the acceleration that was achievable. I did not have to wait a long time for this step, because at this point no hardware had been built.
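The contiguous-allocation and DMA steps might look like the sketch below. The `sds_alloc` function, the SDS pragmas and the `__SDSCC__` guard come from the SDSoC toolchain; the `invert` function is a trivial hypothetical stand-in for a real accelerator so the fragment stays self-contained and runnable on an ordinary compiler:

```cpp
#include <cstdint>
#include <cstdlib>

// When built with sds++, sds_alloc() returns physically contiguous memory
// that a DMA can stream; under a plain compiler we fall back to malloc()
// so the same code still runs in pure software.
#ifdef __SDSCC__
#include "sds_lib.h"
#define IMG_ALLOC(n) sds_alloc(n)
#define IMG_FREE(p)  sds_free(p)
#else
#define IMG_ALLOC(n) std::malloc(n)
#define IMG_FREE(p)  std::free(p)
#endif

// Tell SDSoC the accelerator reads and writes the buffers sequentially and
// that they live in physically contiguous memory, so a simple AXI DMA can
// be inferred. A plain compiler ignores these pragmas.
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data mem_attribute(in:PHYSICAL_CONTIGUOUS, out:PHYSICAL_CONTIGUOUS)
void invert(const uint8_t in[16], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = 255 - in[i];   // trivial stand-in for a real filter stage
}
```

The fallback branch is what lets the same source be debugged in the SDDebug configuration before any hardware exists.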

The SDSoC environment computes the speedup estimate from the processor runtime (computed using the hardware-adapted code, which is slower than the original, processor-adapted code, and with compiler optimization set to –O0) and the number of clock cycles (computed by VHLS as the latency of the hardware accelerator). Because that figure is the maximum latency of the hardware accelerator, the estimate should be taken for what it is: a rough approximation. For the hardware accelerator itself, the estimated speedup is almost 700x. The many file accesses at the “main” level take time, which is why the overall acceleration is “only” 5x. In practice, we can choose the top-level function at which the global acceleration is computed so that we obtain a more meaningful acceleration value.

The final step of the flow is to build the entire system. In this phase, all the accelerators are built and connected to the processor. The C++ source code is then modified in order to start and control these accelerators (instead of calling the original C function). At this stage, we are able to have an exact value of the acceleration obtained using the hardware accelerators, taking into account all the data transfers to and from DDR.

The time taken by the hardware accelerator is proportional to the size of the image and not to the size of the structuring element. That’s why the higher the number of active pixels in the structuring element, the higher the acceleration ratio will be. The figure below shows the amount of acceleration achieved. The latency referred to in the figure is that of the full image pipeline, containing the software and hardware elements. When I undertook this project, building the software application proved to be the longest phase.

[Figure: SDSoC acceleration for image processing]

From there, it took less than 2 hours to modify the code so that I had fully compliant Vivado HLS code with the right directives in place to optimize the throughput. Given the size of the hardware part of this design (half the chip’s lookup tables), the last stage—synthesis, place and route, bitstream, SD card—took more than 2 hours to complete.

The SDSoC environment’s integrated tools for system-level profiling, automated software acceleration in programmable logic and full-system-optimizing compilation—automatically generating the right connectivity to minimize memory access bottlenecks—allowed me to go through this example project in less than a week.

Note: This blog was adapted from an article by the same name that appeared in the recently published issue of Xcell Software Journal. To read the full article online, click here. To download a PDF of the entire issue, click here.

