Two new papers, one about hardware and one about software, describe the Snowflake CNN accelerator and accompanying Torch7 compiler developed by several researchers at Purdue University. The papers are titled “Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks” (the hardware paper) and “Compiling Deep Learning Models for Custom Hardware Accelerators” (the software paper). The authors of both papers are Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello from Purdue’s School of Electrical and Computer Engineering and the Weldon School of Biomedical Engineering.
In the abstract, the hardware paper states:
“Snowflake, implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 Gops/s on the GoogLeNet CNN model and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”
The primary goal of the Snowflake accelerator design was computational efficiency. Efficiency and bandwidth are the two primary factors influencing accelerator throughput. The hardware paper says that the Snowflake accelerator achieves 95% computational efficiency and that it can process networks in real time. Because it is implemented on a Xilinx Zynq Z-7045, power consumption is a miserly 5W, according to the software paper, well within the power budget of many embedded systems.
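As a rough cross-check, computational efficiency can be read as measured throughput divided by peak throughput. The sketch below applies that ratio to the rounded figures quoted from the abstract. The function and variable names are illustrative only, and the papers may derive their efficiency numbers differently (for example, from unrounded measurements or on a per-layer basis).

```python
# Assumption: efficiency = measured throughput / peak throughput.
# All figures are the rounded values quoted from the hardware paper's abstract.
PEAK_GOPS = 128.0

# Measured throughput in G-ops/s for each benchmark model.
MEASURED_GOPS = {"AlexNet": 120.0, "GoogLeNet": 116.0, "ResNet-50": 122.0}


def efficiency(measured_gops: float, peak_gops: float = PEAK_GOPS) -> float:
    """Return the fraction of peak throughput actually achieved."""
    return measured_gops / peak_gops


for model, gops in MEASURED_GOPS.items():
    print(f"{model}: {efficiency(gops):.1%}")
```

With these rounded inputs the ratios come out at roughly 94%, 91%, and 95% for AlexNet, GoogLeNet, and ResNet-50 respectively, which is in line with the efficiency range the papers report.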
The hardware paper also states:
“Snowflake with 256 processing units was synthesized on Xilinx’s Zynq XC7Z045 FPGA. At 250MHz, AlexNet achieved in 93.6 frames/s and 1.2GB/s of off-chip memory bandwidth, and 21.4 frames/s and 2.2GB/s for ResNet18.”
Here’s a block diagram of the Snowflake machine architecture from the software paper, from the micro level on the left to the macro level on the right:
There’s room for future performance improvement, notes the hardware paper:
“The Zynq XC7Z045 device has 900 MAC units. Scaling Snowflake up by using three compute clusters, we will be able to utilize 768 MAC units. Assuming an accelerator frequency of 250 MHz, Snowflake will be able to achieve a peak performance of 384 G-ops/s. Snowflake can be scaled further on larger FPGAs by increasing the number of clusters.”
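The peak figures in these quotes follow from simple arithmetic if each MAC unit is counted as two operations per clock cycle (one multiply and one add). That counting convention is an assumption here, though it is a common one and it reproduces the quoted numbers. A minimal sketch, using a hypothetical helper name:

```python
def peak_gops(mac_units: int, clock_mhz: float, ops_per_mac: int = 2) -> float:
    """Peak throughput in G-ops/s, assuming every MAC unit is busy every cycle.

    ops_per_mac=2 is an assumption: one multiply plus one accumulate per MAC.
    """
    # MACs * ops/MAC * (million cycles/s) gives M-ops/s; divide by 1000 for G-ops/s.
    return mac_units * ops_per_mac * clock_mhz / 1000.0


# Current implementation: 256 processing units at 250 MHz.
print(peak_gops(256, 250))  # 128.0

# Proposed scale-up: three compute clusters, 768 MAC units at 250 MHz.
print(peak_gops(768, 250))  # 384.0
```

The same function gives an upper bound for any other MAC count and clock rate, but it says nothing about whether a larger design would close timing at 250 MHz or have the memory bandwidth to stay busy.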
This is where I point out that a Zynq Z-7100 SoC has 2020 “MAC units” (actually, DSP48E1 slices), more than twice the 900 on the Zynq Z-7045 SoC, and the Zynq UltraScale+ ZU15EG MPSoC has 3528 DSP48E2 slices, which is far more still. If speed and throughput are what you desire in a CNN accelerator, then either of these parts would be worthy of consideration for further development.
via Xcell Daily Blog articles http://ift.tt/2fBJIws
August 11, 2017 at 04:14AM