In a paper published at the recent SC15 conference (accompanying poster here), Ashish Sirasao, Elliott Delaye, Ravi Sunkavalli, and Stephen Neuendorffer of Xilinx describe how they used the OpenCL language and the Xilinx SDAccel Design Environment to accelerate the Smith-Waterman alignment algorithm, which is used in genome sequencing. Smith-Waterman performance is measured in GCUPS (billions of cell updates per second). Cutting straight to the reported results: the systolic-array architecture the authors implemented for the algorithm, instantiated in a Xilinx Virtex-7 690T FPGA on an off-the-shelf Alpha Data ADM-PCIE-7V3 PCIe card, runs:
- 9x faster with nearly 19x better performance/W than it does on a 12-core Intel x86 server CPU
- 6x faster with more than 21x better performance/W than it does on a 60-core Intel Xeon Phi MIC (Many Integrated Core Architecture) coprocessor
- 30% faster with nearly 12x better performance/W than it does on an nVidia Tesla K40 GPU with 2880 stream processors
Alpha Data ADM-PCIE-7V3 PCIe card based on a Xilinx Virtex-7 690T FPGA
Here are the Smith-Waterman performance results, taken from the SC15 poster:
Saying that these performance and performance/W results are significant is putting it mildly.
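For readers new to the metric, the arithmetic behind GCUPS and performance/W can be sketched in a few lines of Python. The function names and the sample numbers here are hypothetical, chosen only to make the arithmetic easy to follow; they are not figures from the paper or the poster.

```python
def gcups(query_len: int, db_len: int, seconds: float) -> float:
    """Billions of cell updates per second for one alignment run.

    Smith-Waterman fills one scoring-matrix cell for every
    (query residue, database residue) pair, so the number of cell
    updates is simply query_len * db_len.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    cell_updates = query_len * db_len
    return cell_updates / seconds / 1e9


def gcups_per_watt(gcups_value: float, watts: float) -> float:
    """Performance/W: GCUPS divided by the power drawn while running."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return gcups_value / watts


if __name__ == "__main__":
    # Hypothetical run: a 1,000-residue query against a 2,000,000-residue
    # database that finishes in 0.5 s on a device drawing 25 W.
    g = gcups(query_len=1_000, db_len=2_000_000, seconds=0.5)
    print(f"{g:.2f} GCUPS, {gcups_per_watt(g, 25.0):.3f} GCUPS/W")
```

Comparing two platforms then comes down to taking the ratio of their GCUPS values (the "x faster" figure) and the ratio of their GCUPS/W values (the "x better performance/W" figure).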
The diagram below from the SC15 poster shows why the Smith-Waterman algorithm is well-suited to a highly parallel systolic-processing approach:
Large FPGAs like the Xilinx Virtex-7 690T have abundant parallel computing resources, so they are adept at implementing highly parallel compute engines such as the systolic array needed to execute the Smith-Waterman algorithm efficiently.
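To see where that parallelism comes from, here is a minimal Python sketch. It is not the authors' OpenCL code, and it assumes a simple linear gap penalty with made-up scoring values (`match`, `mismatch`, `gap`). It fills the Smith-Waterman scoring matrix once in the usual row-by-row order and once along anti-diagonals, and the two results are identical. Every cell on an anti-diagonal depends only on cells from the two previous anti-diagonals, so in hardware all of those cells could be updated in the same clock cycle, each by its own systolic cell.

```python
def sw_row_major(a, b, match=2, mismatch=-1, gap=-1):
    """Reference Smith-Waterman scoring matrix, filled row by row."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
    return H


def sw_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Same matrix, filled one anti-diagonal (i + j == d) at a time."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        # The iterations of this inner loop are independent of each other:
        # they read only anti-diagonals d-1 and d-2, never diagonal d.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
    return H


def best_score(H):
    """Best local-alignment score is the largest value in the matrix."""
    return max(max(row) for row in H)


if __name__ == "__main__":
    q, t = "GGTTGACTA", "TGTTACGG"
    assert sw_row_major(q, t) == sw_wavefront(q, t)
    print("best local alignment score:", best_score(sw_wavefront(q, t)))
```

In software the wavefront order buys nothing by itself, since the inner loop still runs sequentially. The point is only that the data dependencies allow it to be unrolled across parallel hardware.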
The authors’ experiments with FPGA-based Smith-Waterman implementations were multi-dimensional. In one dimension, the experiments traded off the number of systolic cells per OpenCL kernel against the number of kernel instances to find the combination that delivers maximum algorithmic performance. For this implementation, numerical analysis of the results puts the optimum at 32 systolic cells per OpenCL kernel, as shown in the diagram below (taken from the poster).
Several more experimental dimensions are represented by performance and performance/W comparisons with the Smith-Waterman algorithm running on the 12-core Intel Xeon CPU, the 60-core Intel Xeon Phi MIC coprocessor, and the nVidia Tesla K40 GPU (as reviewed in the results table appearing a few paragraphs above).
Perhaps the most significant result, however, is not the FPGA implementation’s better performance or even its vastly superior performance/W but its ease of use. The paper demonstrates that you can compile OpenCL code with SDAccel to implement high-performance, low-power systolic arrays on FPGAs, something that was previously possible only by writing RTL code. That sort of result will put FPGA acceleration into more data centers more quickly than anything else.
Here’s a thumbnail image of the SC15 poster, which summarizes the information from the paper:
If this real-world example has piqued your curiosity about algorithmic FPGA-acceleration or SDAccel, you might want to read: