A new, yet-to-be-published 12-page paper submitted to the IEEE Transactions on VLSI Systems titled “Efficient FPGA Mapping of Pipeline SDF FFT Cores” (available on IEEE Xplore) thoroughly discusses ways to map SDF (single-path delay feedback) FFT implementations into the DSP48 slices, programmable logic, and memory resources available on Xilinx All Programmable devices. The paper deals with Virtex-4 and Virtex-6 FPGAs but the authors note: 7 series “and UltraScale/UltraScale+ FPGAs from Xilinx use virtually the same slice architecture as Virtex-6, so… the results should be very easy to generalize.”
There’s been a steady evolution of the DSP48 slice in the multiple generations of Xilinx All Programmable devices starting with the Virtex-4 FPGAs. The Virtex-4 FPGA series included XtremeDSP (DSP48) slices with 18×18-bit MACs and 48-bit accumulators; the Virtex-6 FPGAs included DSP48E1 slices with 25×18-bit MACs and 48-bit accumulators; 7 series FPGAs and the Zynq-7000 SoCs include DSP48E1 slices with 25×18-bit MACs and 48-bit accumulators; and UltraScale/UltraScale+ devices include DSP48E2 slices with 27×18-bit MACs and 48-bit accumulators. There have been many additional improvements to Xilinx’s DSP48 slice along the way including steady clock-rate improvements with each new process generation, making the current DSP48E2 slice quite capable.
The IEEE paper discusses transformations to map butterflies to fewer LUTs, transformations that efficiently enable the use of DSP48 preadders for implementing butterfly adders, efficient mapping of data and twiddle-factor storage to BRAMs and distributed resources, efficient sharing of twiddle-factor memories for radix-2^k algorithms, and ways to improve timing through retiming and pipelining.
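If the SDF architecture itself is unfamiliar, a small behavioral model may help show what is being mapped. The Python sketch below is a generic, textbook-style radix-2 decimation-in-frequency SDF pipeline in floating point. It is not taken from the paper and does not reproduce any of its transformations; the function names (`sdf_stage`, `sdf_fft`, `bit_reverse`) are invented for illustration. Each stage holds one delay line (the storage that ends up in BRAM or distributed memory), one butterfly adder/subtractor pair, and one twiddle-factor multiply (the part that would occupy a DSP48 slice).

```python
import cmath

def sdf_stage(stream, delay):
    """Behavioral model of one radix-2 DIF SDF stage (illustrative only).

    `delay` is the feedback delay-line length. Output lags input by `delay` samples.
    """
    fifo = [0j] * delay              # circular buffer standing in for the delay line
    out = []
    for n, x in enumerate(stream):
        k = n % delay                # position inside the current half-block
        old = fifo[k]                # value written `delay` cycles earlier
        if (n // delay) % 2 == 0:
            # First half of a block: store the input and drain the differences
            # left by the previous block, applying the twiddle factor W_(2*delay)^k.
            fifo[k] = x
            out.append(old * cmath.exp(-2j * cmath.pi * k / (2 * delay)))
        else:
            # Second half: butterfly. The sum goes downstream immediately,
            # the difference is fed back into the delay line.
            fifo[k] = old - x
            out.append(old + x)
    return out

def sdf_fft(samples):
    """Chain log2(N) stages with delays N/2, N/4, ..., 1.

    Returns the N spectrum values in bit-reversed order.
    """
    n = len(samples)
    assert n >= 2 and n & (n - 1) == 0, "length must be a power of two"
    # Append N-1 zeros to flush the total pipeline latency of N-1 samples.
    stream = [complex(s) for s in samples] + [0j] * (n - 1)
    delay = n // 2
    while delay >= 1:
        stream = sdf_stage(stream, delay)
        delay //= 2
    return stream[n - 1:]

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    return int(format(i, f"0{bits}b")[::-1], 2)
```

For an 8-point input `x`, the value at position `i` of `sdf_fft(x)` corresponds to frequency bin `bit_reverse(i, 3)`. A real core would use fixed-point arithmetic sized to the DSP48 multiplier and a twiddle ROM instead of calling `cmath.exp`.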
It’s unfair of me to reveal any of the paper’s techniques in this blog (you need to get the IEEE paper for that), but I’m not shy about reporting some of the conclusions to tempt you into reading the paper: “The reported implementation results show an increase of throughput per slice of up to 350% and 400% compared with the best previously published work, for Virtex-4 and Virtex-6, respectively. In addition, a higher maximal clock frequency is obtained and fewer memory resources are needed. As the previously best reported results are using exactly the same architecture, and for Virtex-4 exactly the same algorithm, this clearly shows the benefit of the transformations proposed to improve the mapping from architecture to FPGA hardware structure in this paper.”
via Xcell Daily Blog articles http://ift.tt/2fBJIws
July 28, 2017 at 08:13AM