An asynchronous pipeline formed by Click elements enables the circuits to operate in pipelined mode without sacrificing speed, owing to the self-timed nature of asynchronous circuits. Each computing core contains a 5x5 register array that is fully connected by an asynchronous mesh network, through which the input data can be fully reused. A novel computing pattern called convolution-and-pooling-integrated computing, which combines convolution and pooling, is proposed to reduce accesses to intermediate data. Together, these techniques yield an 88% decrease in accesses to off-chip memory, which significantly reduces energy consumption. A CNN model, LeNet-5, is implemented on our accelerator using a Xilinx VC707 FPGA. The asynchronous computing core consumes 84% less dynamic power than its synchronous counterpart. The energy efficiency reaches 30.03 GOPS/W, which is 2.1 times better than that of previous works.

Keywords—CNN; energy-efficient; accelerator; asynchronous

I. Introduction

Convolutional Neural Networks (CNNs) have been widely used in the field of computer vision and show great advantages in image classification, object detection, and video surveillance [1]. The inference of CNNs is usually realized by CPUs and GPUs. However, the CPU has limited computing resources and parallelism. Although the GPU outperforms the CPU in CNN inference because it is designed for parallel computing on large-scale data, it consumes too much power (for example, 33 W for the NVIDIA GTX 840M and 235 W for the NVIDIA Tesla K40 [2, 3]). Hence, CNN accelerators require a trade-off between flexibility and energy efficiency. ASIC designs obtain the best power efficiency, but only a specific CNN model can be implemented in an ASIC circuit because of its poor flexibility. FPGAs show acceptable performance, but their fine-grained computing and routing resources limit power efficiency and runtime reconfiguration for different CNNs. To obtain better flexibility and energy efficiency, some CNN accelerators adopt a coarse-grained dynamically reconfigurable architecture (CGRA), such as Eyeriss from MIT [4] and Thinker [1], achieving high performance and flexibility. On the other hand, asynchronous circuits are characterized by their local data- or control-driven flow of operations, which differs from the global clock-driven flow of synchronous designs. This characteristic enables the different portions of an asynchronous circuit to operate at their individual ideal “frequencies”, or rather to operate and idle as needed, consuming energy only when and where needed. Clock gating has a similar goal of enabling registers only when needed, but it does not address the power drawn by the centralized control and clock-tree buffers [5]. As a result, asynchronous logic has been advocated as a means of reducing power consumption in a number of applications [6, 7]. IBM's TrueNorth, which successfully implements Spiking Neural Networks (SNNs), consumes only 65 mW [8], and the dataflow processing unit (DPU) from Wave Computing, built with asynchronous processing elements, achieves 181 TOPS. In this paper, we propose an asynchronous accelerator with a dynamically reconfigurable architecture to achieve great flexibility, low power, and high energy efficiency.
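To make the self-timed behavior described above more concrete, the following Python sketch gives a simplified behavioral model of a pipeline built from Click-style handshake stages. It assumes the usual two-phase, bundled-data Click handshake, in which a single phase flip-flop drives both the input acknowledge and the output request, and a stage generates its local clock pulse only when a new request is pending on its input and the downstream stage has acknowledged the previous transfer. The class and signal names are illustrative only; this is not the accelerator's actual circuit, but it shows how each stage switches only when it has a token to process and otherwise sits idle.

class ClickStage:
    """Behavioral model of one pipeline stage controlled by a Click element.
    One phase flip-flop drives both in_ack and out_req; the stage fires
    (generates its local clock pulse) only when a new two-phase request is
    pending (in_req != phase) and the downstream stage has acknowledged the
    previous transfer (out_ack == phase)."""

    def __init__(self):
        self.phase = 0        # drives both in_ack and out_req
        self.data = None      # bundled-data register
        self.fires = 0        # number of local clock pulses (switching activity)

    def step(self, in_req, in_data, out_ack):
        if in_req != self.phase and out_ack == self.phase:
            self.data = in_data     # capture data on the local pulse
            self.phase ^= 1         # toggle in_ack / out_req together
            self.fires += 1         # the stage is active only in this case


def simulate(tokens, n_stages=3, max_steps=50):
    stages = [ClickStage() for _ in range(n_stages)]
    src_req, src_data = 0, None
    sink_ack, received = 0, []
    pending = list(tokens)
    for _ in range(max_steps):
        # the source offers a new token once stage 0 has taken the previous one
        if pending and src_req == stages[0].phase:
            src_data, src_req = pending.pop(0), src_req ^ 1
        # evaluate stages from the sink side so a token advances one stage per step
        for i in reversed(range(n_stages)):
            in_req = src_req if i == 0 else stages[i - 1].phase
            in_data = src_data if i == 0 else stages[i - 1].data
            out_ack = sink_ack if i == n_stages - 1 else stages[i + 1].phase
            stages[i].step(in_req, in_data, out_ack)
        # the sink acknowledges whatever the last stage presents
        if stages[-1].phase != sink_ack:
            received.append(stages[-1].data)
            sink_ack ^= 1
    return received, [s.fires for s in stages]


print(simulate([1, 2, 3, 4, 5]))   # ([1, 2, 3, 4, 5], [5, 5, 5]): five pulses per stage, none while idle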
II. Design of the CNN Accelerator

A. Architecture of the accelerator

The top-level architecture is shown in Fig. 1, in which the input data is stored in off-chip DRAM. The configuration information from the controller is fed into the computation array, which comprises six cores together with the registers array. According to the configuration information, the activation function of the processing elements (PEs) in each core, the pooling mode and size, and the direction of data flow in the registers array for input-data reuse are determined. The computation array is responsible for convolution and pooling computing layer by layer. The computation results are stored into DRAM through the output buffer.
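As a software illustration of the configuration options and the convolution-and-pooling-integrated computing pattern described above, the sketch below models one core's behavior in Python. The CoreConfig fields and the conv_pool_integrated function are hypothetical stand-ins for the configuration information and the fused computing pattern, not the accelerator's actual hardware interface; the point is that each pooling window of convolution results is reduced as soon as it is computed, so the full intermediate convolution feature map never has to be written out.

from dataclasses import dataclass
import numpy as np


@dataclass
class CoreConfig:                 # hypothetical per-core configuration record
    activation: str = "relu"      # activation function of the PEs
    pool_mode: str = "max"        # pooling mode: "max" or "avg"
    pool_size: int = 2            # pooling window size (and stride)
    dataflow_dir: str = "row"     # data-flow direction in the registers array (not modeled here)


def conv_pool_integrated(ifmap, kernel, cfg):
    """Fuse a stride-1 valid convolution with pooling: each pooling window of
    convolution outputs is reduced as soon as it is computed, so only one pooled
    value per window is kept as intermediate data."""
    kh, kw = kernel.shape
    oh, ow = ifmap.shape[0] - kh + 1, ifmap.shape[1] - kw + 1
    p = cfg.pool_size
    act = (lambda x: max(x, 0.0)) if cfg.activation == "relu" else (lambda x: x)
    pooled = np.zeros((oh // p, ow // p))
    for py in range(oh // p):
        for px in range(ow // p):
            window = []
            for dy in range(p):              # convolution results inside one pooling
                for dx in range(p):          # window, produced and consumed on the fly
                    y, x = py * p + dy, px * p + dx
                    conv = float(np.sum(ifmap[y:y + kh, x:x + kw] * kernel))
                    window.append(act(conv))
            pooled[py, px] = max(window) if cfg.pool_mode == "max" else sum(window) / len(window)
    return pooled


# toy usage: a 6x6 input map, a 3x3 kernel and 2x2 max pooling give a 2x2 pooled output
ifmap = np.arange(36, dtype=float).reshape(6, 6)
out = conv_pool_integrated(ifmap, np.ones((3, 3)), CoreConfig())
print(out.shape)   # (2, 2)

In hardware terms, this corresponds roughly to forwarding convolution results directly to the pooling logic rather than buffering a complete output feature map before pooling.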