基于现场可编程门阵列的高能效轻量化残差脉冲神经网络处理器实现

侯悦, 项水英, 邹涛, 黄志权, 石尚轩, 郭星星, 张雅慧, 郑凌, 郝跃

Implementation of high-efficiency, lightweight residual spiking neural network processor based on field-programmable gate arrays

HOU Yue, XIANG Shuiying, ZOU Tao, HUANG Zhiquan, SHI Shangxuan, GUO Xingxing, ZHANG Yahui, ZHENG Ling, HAO Yue
  • 随着脉冲神经网络(spiking neural network, SNN)在硬件部署优化方面的发展, 基于现场可编程门阵列(field-programmable gate array, FPGA)的SNN处理器因其高效性与灵活性成为研究热点. 然而, 现有方法依赖多时间步训练和可重配置计算架构, 增大了计算与存储压力, 降低了部署效率. 本文设计并实现了一种高能效、轻量化的残差SNN硬件加速器, 采用算法与硬件协同设计策略, 以优化SNN推理过程中的能效表现. 在算法上, 采用单时间步训练方法, 并引入分组卷积和批归一化(batch normalization, BN)层融合技术, 有效压缩网络规模至0.69M. 此外, 采用量化感知训练(quantization-aware training, QAT), 将网络参数精度限制为8 bit. 在硬件设计上, 本文通过层内资源复用提高FPGA资源利用率, 采用全流水层间架构提升计算吞吐率, 并利用块随机存取存储器(block random access memory, BRAM)存储网络参数和计算结果, 以提高存储效率. 实验表明, 该处理器在CIFAR-10数据集上分类准确率达到87.11%, 单张图片推理时间为3.98 ms, 能效为183.5 frames/(s·W), 较主流图形处理单元(graphics processing unit, GPU)平台能效提升至2倍以上, 与其他SNN处理器相比, 推理速度至少提升了4倍, 能效至少提升了5倍.
    With the development of hardware-optimized deployment of spiking neural networks (SNNs), SNN processors based on field-programmable gate arrays (FPGAs) have become a research hotspot due to their efficiency and flexibility. However, existing methods rely on multi-timestep training and reconfigurable computing architectures, which increase computational and memory overhead and reduce deployment efficiency. This work presents an energy-efficient, lightweight residual SNN accelerator that combines algorithm and hardware co-design to optimize inference energy efficiency. On the algorithm side, we employ single-timestep training, introduce grouped convolutions, and fuse batch normalization (BN) layers, compressing the network to only 0.69 M parameters. Quantization-aware training (QAT) further constrains the network parameters to 8-bit precision. On the hardware side, intra-layer resource reuse improves FPGA utilization, a fully pipelined inter-layer architecture increases throughput, and on-chip block random access memory (BRAM) stores network parameters and intermediate results to improve memory efficiency. Experimental results show that the proposed processor achieves a classification accuracy of 87.11% on the CIFAR-10 dataset, with an inference time of 3.98 ms per image and an energy efficiency of 183.5 frames/(s·W). Compared with mainstream graphics processing unit (GPU) platforms, it more than doubles the energy efficiency; compared with other SNN processors, it achieves at least a fourfold increase in inference speed and a fivefold improvement in energy efficiency.
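    A minimal sketch of the two compression steps named in the abstract, assuming PyTorch-style layers: folding a batch-normalization layer into the preceding (possibly grouped) convolution, and QAT-style 8-bit fake quantization. The function names, layer shapes, and quantization scheme below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN scale/shift into the preceding convolution (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      kernel_size=conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, dilation=conv.dilation,
                      groups=conv.groups, bias=True)
    # Per-output-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

def fake_quant(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization; QAT would add a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

# Quick equivalence check for the fusion (grouped convolution, g = 4):
conv = nn.Conv2d(64, 64, 3, padding=1, groups=4, bias=False)
bn = nn.BatchNorm2d(64).eval()
x = torch.randn(1, 64, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```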
  • 图 1  ResNet-10脉冲神经网络结构

    Fig. 1.  ResNet-10 spiking neural network structure.
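    As a rough illustration of the spiking activation used in networks like the ResNet-10 SNN of Fig. 1, the sketch below implements a leaky integrate-and-fire (LIF) update in PyTorch; with a single time step, inference reduces to one membrane update and one threshold comparison per neuron. The decay factor, threshold, and reset rule here are illustrative assumptions, not the parameters used in the paper.

```python
import torch

def lif_step(current, mem, beta=0.9, threshold=1.0):
    """One LIF update: leak, integrate, fire, and reset by subtraction."""
    mem = beta * mem + current          # leaky integration of the input current
    spike = (mem >= threshold).float()  # fire where the threshold is crossed
    mem = mem - spike * threshold       # soft reset of firing neurons
    return spike, mem

current = torch.randn(1, 64, 32, 32)               # e.g. output of a conv layer
spike, mem = lif_step(current, torch.zeros_like(current))
```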

    图 2  标准卷积与分组卷积下ResNet-10各层参数量对比

    Fig. 2.  Comparison of parameter counts for each layer of ResNet-10 under standard convolution and group convolution.
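    To make the comparison in Fig. 2 concrete, the weight count of a k×k convolution scales as C_out · (C_in / g) · k², so grouping into g groups cuts it by a factor of g. A short illustrative calculation (the channel sizes are examples, not the actual ResNet-10 layer shapes):

```python
# Parameter count of a k x k convolution with and without grouping (bias omitted).
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    # each output channel only sees c_in / groups input channels
    return c_out * (c_in // groups) * k * k

print(conv_params(128, 128, 3))             # standard conv: 147456
print(conv_params(128, 128, 3, groups=4))   # grouped conv (g = 4): 36864
```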

    图 3  不同条件下测试准确率对比

    Fig. 3.  Comparison of test accuracy under different conditions.

    图 4  残差SNN处理器硬件总体架构图

    Fig. 4.  Overall hardware architecture of the residual SNN processor.

    图 5  卷积操作示意图

    Fig. 5.  Schematic diagram of the convolution operation.

    图 6  流水线设计 (a) 卷积数据处理流水线结构; (b)层间全流水架构

    Fig. 6.  Pipeline design: (a) Convolution data processing pipeline structure; (b) fully pipelined inter-layer architecture.

    表 1  残差SNN处理器资源利用率

    Table 1.  Resource utilization of the residual SNN processor.

    名称      消耗资源     可用资源     百分比/%
    LUTs     134859      425280      31.71
    FF       341722      850560      40.18
    BRAM     674.5       1080        62.45
    DSP      3008        4272        70.41

    表 2  处理器和GPU平台在CIFAR-10数据集上的性能表现

    Table 2.  Performance of the processor and GPU platform on the CIFAR-10 dataset.

    硬件平台               ZCU216 FPGA    GeForce RTX 4060 Ti
    准确率/%               88.11          88.33
    功耗/W                 1.369          51
    单张图片推理时间/ms      3.98           0.243
    FPS                   251            4115
    FPS/W                 183.5          80.7
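    The throughput and energy-efficiency rows of Table 2 follow directly from the reported per-image latency and power; a quick arithmetic check:

```python
# FPS = 1000 / latency_ms; energy efficiency = FPS / power_W (values from Table 2).
for name, latency_ms, power_w in [("ZCU216 FPGA", 3.98, 1.369),
                                  ("GeForce RTX 4060 Ti", 0.243, 51.0)]:
    fps = 1000.0 / latency_ms
    print(f"{name}: {fps:.0f} FPS, {fps / power_w:.1f} FPS/W")
# -> ZCU216 FPGA: 251 FPS, 183.5 FPS/W
# -> GeForce RTX 4060 Ti: 4115 FPS, 80.7 FPS/W
```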

    表 3  在CIFAR-10数据集上与其他SNN处理器的性能比较

    Table 3.  Performance comparison with other SNN processors on the CIFAR-10 dataset.

    平台            E3NE[21]    SCPU[22]     SiBrain[23]              Aliyev et al.[24]    本文
    FPGA型号        XCVU13P     Virtex-7     Virtex-7                 XCVU13P              ZCU216
    频率/MHz        150         200          200                      100                  100
    SNN模型         AlexNet     ResNet-11    CONVNet(VGG-11)          VGG-9                ResNet-10
    模型深度        8           11           6(11)                    9                    10
    精度/bit        6           8            8(8)                     4                    8
    参数量/M        –           –            0.3(9.2)                 –                    0.69
    LUTs/FFs        48k/50k     178k/127k    167k/136k(140k/122k)     –                    135k/342k
    准确率/%        80.6        90.60        82.93(90.25)             86.6                 87.11
    功率/W          4.7         1.738        1.628(1.555)             0.73                 1.369
    时延/ms         70          25.4         1.4(18.9)                59                   3.98
    FPS             14.3        39.43        696(53)                  16.95                251
    FPS/W           3.0         22.65        438.8(34.1)              23.21                183.5
  • [1]

    Shelhamer E, Long J, Darrell T 2016 IEEE T. Pattern Anal. 39 640

    [2]

    Redmon J, Divvala S, Girshick R, Farhadi A 2016 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Las Vegas, June 26—July 1, 2016 p779

    [3]

    施岳, 欧攀, 郑明, 邰含旭, 王玉红, 段若楠, 吴坚 2024 物理学报 73 104202Google Scholar

    Shi Y, Ou P, Zheng M, Tai H X, Wang Y H, Duan R N, Wu J 2024 Acta Phys. Sin. 73 104202Google Scholar

    [4]

    应大卫, 张思慧, 邓书金, 武海斌 2023 物理学报 72 144201Google Scholar

    Ying D W, Zhang S H, Deng S J, Wu H B 2023 Acta Phys. Sin. 72 144201Google Scholar

    [5]

    曹自强, 赛斌, 吕欣 2020 物理学报 69 084203Google Scholar

    Cao Z Q, Sai B, Lv X 2020 Acta Phys. Sin. 69 084203Google Scholar

    [6]

    Maass W 1997 Neural Networks 10 1659Google Scholar

    [7]

    Nunes J D, Carvalho M, Carneiro D, Cardoso J S 2022 IEEE Access, 10 60738Google Scholar

    [8]

    武长春, 周莆钧, 王俊杰, 李国, 胡绍刚, 于奇, 刘洋 2022 物理学报 71 148401Google Scholar

    Wu C C, Zhou P J, Wang J J, Li G, Hu S G, Yu Q, Liu Y 2022 Acta Phys. Sin. 71 148401Google Scholar

    [9]

    Aliyev I, Svoboda K, Adegbija T, Fellous J M 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) Kuala Lumpur, December 16-19, 2024 p413

    [10]

    Merolla P A, Arthur J V, Alvarez-Icaza R, Cassidy A S, Sawada J, Akopyan F, Jackson B L, Imam N, Guo C, Nakamura Y, Brezzo B, Vo I, Esser S K, Appuswamy R, Taba B, Amir A, Flickner M D, Risk W P, Manohar R, Modha D S 2014 Science 345 668Google Scholar

    [11]

    Davies M, Srinivasa N, Lin T H, Chinya G, Cao Y, Choday S H, Dimou G, Joshi P, Imam N, Jain S, Liao Y, Lin C K, Lines A, Liu R, Mathaikutty D, McCoy S, Paul A, Tse J, Venkataramanan G, Weng Y H, Wild A, Yang Y, Wang H 2018 IEEE Micro 38 82Google Scholar

    [12]

    何磊, 王堃, 吴晨, 陶卓夫, 时霄, 苗斯元, 陆少强 2025 中国科学: 信息科学 55 796Google Scholar

    He L, Wang K, Wu C, Tao Z F, Shi X, Miao S Y, Lu S Q 2025 Sci. Sin. Inf. 55 796Google Scholar

    [13]

    Gdaim S, Mtibaa A 2025 J. Real-Time Image Pr. 22 67Google Scholar

    [14]

    严飞, 郑绪文, 孟川, 李楚, 刘银萍 2025 现代电子技术 48 151

    Yan F, Zheng X W, Meng C, Li C, Liu Y P 2025 Modern Electron. Techn. 48 151

    [15]

    Liu Y J, Chen Y H, Ye W J, Gui Y 2022 IEEE T. Circuits I 69 2553

    [16]

    Ye W J, Chen Y H, Liu Y J 2022 IEEE T. Comput. Aid. D. 42 448

    [17]

    Panchapakesan S, Fang Z M, Li J 2022 ACM T. Reconfig. Techn. 15 48

    [18]

    Chen Q Y, Gao C, Fu Y X 2022 IEEE T. VLSI Syst. 30 1425Google Scholar

    [19]

    Wang S Q, Wang L, Deng Y, Yang Z J, Guo S S, Kang Z Y, Guo Y F, Xu W X 2020 J. Comput. Sci. Tech. 35 475Google Scholar

    [20]

    Biswal M R, Delwar T S, Siddique A, Behera P, Choi Y, Ryu J Y 2022 Sensors 22 8694Google Scholar

    [21]

    Gerlinghoff D, Wang Z, Gu X, Goh R S M, Luo T 2021 IEEE T. Parall. Distr. 33 3207

    [22]

    Chen Y H, Liu Y J, Ye W J, Chang C C 2023 IEEE T. Circuits II 70 3634

    [23]

    Chen Y H, Ye W J, Liu Y J, Zhou H H 2024 IEEE T. Circuits I 71 6482

    [24]

    Aliyev I, Lopez J, Adegbija T 2024 arXiv: 2411.15409[CS-Ar]

    [25]

    Stein R B, Hodgkin A L 1967 Proceedings of the Royal Society of London. Series B. Biological Sciences 167 64

    [26]

    Eshraghian J K, Ward M, Neftci E O, Wang X, Lenz G, Dwivedi G 2023 Proc. IEEE 111 1016Google Scholar

    [27]

    刘浩, 柴洪峰, 孙权, 云昕, 李鑫 2023 中国工程科学 25 61

    Liu H, Chai H F, Sun Q, Yun X, Li X 2023 Engineering 25 61

    [28]

    Krizhevsky A, Sutskever I, Hinton G E 2012 Adv. Neural Inf. Pro. Syst. 25 1097

    [29]

    Huang G, Liu S, van der Maaten L, Weinberger K 2018 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Salt Lake City, US, June 18-22, 2018 p2752

    [30]

    Krizhevsky A, Hinton G https://www.cs.toronto.edu/~kriz/cifar.html [2025-3-22]

    [31]

    Ioffe S, Szegedy C 2015 arXiv: 1502.03167[CS-LG]

    [32]

    Zheng J W 2021 M. S. Thesis (Xi’an: Xidian University)(in Chinese)[郑俊伟 2021 硕士学位论文 (西安: 西安电子科技大学)]

    [33]

    Jacob B, Kligys S, Chen B, Tang M, Howard A, Adam H, Kalenichenko D 2018 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Sale Lake City, June 18-22, 2018 p2704

    [34]

    Zhou S C, Wu Y X, Ni Z K, Zhou X Y, Wen H, Zou Y H 2016 arXiv: 1606.06160[cs. NE] https://doi.org/10.48550/arXiv.1606.06160

    [35]

    Liu Z, Cheng K T, Huang D, Xing E P, Shen Z 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition New Orleans, June 19-24, 2022 p4942

    [36]

    Chen Y H, Krishna T, Emer J S, Sze V 2016 IEEE J. Solid-St. Circ. 52 127

出版历程
  • 收稿日期:  2025-03-26
  • 修回日期:  2025-04-24
  • 上网日期:  2025-05-27
