Chinese Physical Society Journals

An FPGA-based high-energy-efficiency hardware accelerator for spiking transformer

CSTR: 32037.14.aps.75.20260085
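  • Spiking neural networks (SNNs), with their low power consumption, event-driven operation, and sparse computation, show significant potential in tasks such as dynamic vision, but in practical deployment their algorithmic advantages remain constrained by conventional computing architectures. To overcome the hardware bottlenecks that limit the energy efficiency and latency of event-driven computation, this work carries out algorithm-hardware co-optimization for the Spikformer model and proposes a general-purpose spiking Transformer accelerator architecture based on a field-programmable gate array (FPGA). At the algorithm level, fusing the convolution and batch normalization (BN) layers and applying quantization-aware training compresses the parameter size of the Spikformer-1-384 model from 15.92 MB to one quarter of its original size, with accuracy loss kept within 1%. At the hardware level, a configurable accelerator for spiking dataflow is designed in Verilog; it supports multi-timestep parallel computation and flexible combination of convolution, fully connected, residual, and attention operators, and improves parallelism and memory-bandwidth utilization. Experimental results show that on the Xilinx Zynq UltraScale+ MPSoC (xczu7ev-ffvc1156-2-i) platform, the accelerator achieves an end-to-end inference latency of about 53 ms on the CIFAR-10 dataset with 4 timesteps, of which convolutional feature extraction and the attention module take 48 ms and 4.634 ms, respectively; the end-to-end system power is 7.181 W, corresponding to an energy efficiency of 2.63 FPS/W, and both overall performance and energy efficiency exceed those of an Intel i9 CPU; for the self-attention and feed-forward multilayer perceptron (MLP) computation, the accelerator is 1.70× and 5.73× faster than the GPU and CPU, respectively. Open-source repository: https://github.com/tooddler/FPGA_SpikingTransformer.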
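    The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the reported numbers, not part of the authors' evaluation flow; the variable names are illustrative, and it assumes throughput is simply the reciprocal of the end-to-end latency (one image in flight at a time).

```python
# Figures quoted in the abstract.
conv_ms = 48.0       # convolutional feature extraction time
attn_ms = 4.634      # attention module time
latency_ms = 53.0    # reported end-to-end latency (approximate)
power_w = 7.181      # reported end-to-end system power
size_mb = 15.92      # parameter size before compression

# Assumption: throughput = 1 / latency (no overlap between images).
stage_sum_ms = conv_ms + attn_ms
fps = 1000.0 / latency_ms
fps_per_watt = fps / power_w
compressed_mb = size_mb / 4   # "one quarter of its original size"

print(f"stage sum:        {stage_sum_ms:.3f} ms")
print(f"throughput:       {fps:.2f} FPS")
print(f"energy efficiency: {fps_per_watt:.2f} FPS/W")
print(f"compressed size:  {compressed_mb:.2f} MB")
```

    Under that assumption the two stage times sum to 52.634 ms, consistent with the rounded 53 ms latency, and the derived efficiency rounds to the reported 2.63 FPS/W.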

     

    Spiking neural networks (SNNs) feature event-driven processing, sparse activation, and low-bit data representation, and therefore exhibit strong potential for energy-efficient intelligent computing, especially for edge-side deployment. Recently proposed spiking Transformer models combine temporal spike dynamics with global attention mechanisms, but their practical deployment efficiency is still constrained by conventional computing architectures due to mismatched dataflow patterns, intensive memory access, and insufficient support for temporal parallelism. To address the latency and energy-efficiency bottlenecks in spiking Transformer inference, this work presents an algorithm–hardware co-designed FPGA accelerator targeting the Spikformer model. At the algorithm level, a deployment-oriented lightweight optimization strategy is adopted by fusing convolution and batch normalization (BN) layers and applying quantization-aware training (QAT). The model parameters are quantized to INT8 while preserving spike-driven characteristics, reducing storage and computation complexity. For the Spikformer-1-384 network, the parameter size is compressed from 15.92 MB to 3.98 MB with accuracy degradation controlled within 1%. At the hardware level, a configurable accelerator architecture tailored for spiking dataflow is designed on a field-programmable gate array (FPGA), consisting of spike encoding, spiking convolution, and self-attention-MLP compute engines with modular organization.
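    The convolution and BN fusion mentioned above relies on the fact that, at inference time, batch normalization is a fixed per-channel affine transform and can be folded into the preceding convolution's weights and bias. The following is a minimal pure-Python sketch of that standard folding, evaluated at a single output pixel; it is not taken from the authors' repository, and the function names (`fuse_conv_bn`, `conv_at_pixel`, `batch_norm`) and the toy values are hypothetical.

```python
import math

def fuse_conv_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BN parameters into conv weights and bias.

    weights[c] is the flattened kernel of output channel c.
    """
    fused_w, fused_b = [], []
    for w_c, b_c, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-channel BN scale
        fused_w.append([w * scale for w in w_c])
        fused_b.append((b_c - mu) * scale + bt)
    return fused_w, fused_b

def conv_at_pixel(weights, bias, patch):
    """Convolution output for one flattened input patch, per channel."""
    return [sum(w * x for w, x in zip(w_c, patch)) + b_c
            for w_c, b_c in zip(weights, bias)]

def batch_norm(values, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN applied per channel."""
    return [g * (y - mu) / math.sqrt(v + eps) + bt
            for y, g, bt, mu, v in zip(values, gamma, beta, mean, var)]

# Toy example: 2 output channels, 4-element patch (illustrative values only).
weights = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.5, 1.0]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
patch = [1.0, 0.0, 1.0, 1.0]   # binary spike inputs

reference = batch_norm(conv_at_pixel(weights, bias, patch),
                       gamma, beta, mean, var)
fused_w, fused_b = fuse_conv_bn(weights, bias, gamma, beta, mean, var)
fused = conv_at_pixel(fused_w, fused_b, patch)

max_err = max(abs(a - b) for a, b in zip(reference, fused))
print("max difference between conv+BN and fused conv:", max_err)
```

    After folding, the BN layer disappears from the inference graph, so only the fused weights and biases need to be quantized and stored. Storing 32-bit floating-point parameters as INT8 is consistent with the reported reduction from 15.92 MB to 3.98 MB.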
    Multi-timestep parallel leaky integrate-and-fire (LIF) neuron processing is supported to exploit temporal parallelism, and operator fusion is applied to attention and feed-forward blocks to reduce intermediate off-chip memory traffic. In the attention path, spike-based matrix operations are implemented using bitwise logic and spike-count accumulation instead of conventional multipliers, significantly lowering DSP usage and improving compute density. A hierarchical memory and dataflow scheme combining DDR burst transfer, on-chip BRAM buffering, and ping-pong scheduling is further employed to enhance bandwidth utilization and pipeline continuity. The accelerator is implemented on a Xilinx Zynq UltraScale+ MPSoC platform and evaluated with the CIFAR-10 dataset. With four timesteps, the system achieves an end-to-end inference latency of 53 ms and a throughput of 18.9 FPS. The measured total power consumption is 7.181 W, corresponding to an energy efficiency of 2.63 FPS/W. For the attention and MLP block with input size (4, 1, 64, 384), the proposed design achieves 1.70× and 5.73× speedup over GPU and CPU implementations, respectively. The results demonstrate that the proposed co-optimized architecture provides an effective, scalable, and hardware-friendly solution for high-efficiency deployment of spiking Transformer models on resource-constrained edge platforms. The open-source link for this project is: https://github.com/tooddler/FPGA_SpikingTransformer.
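    To illustrate why spike-based matrix operations need no multipliers: when both operands are binary spike vectors, each product is a logical AND and each dot product is a count of set bits. The sketch below shows this idea in software for a Q·Kᵀ score matrix. It is a hedged illustration of the general technique, not the authors' Verilog datapath, and the helper names (`pack_bits`, `spike_dot`, `spike_qkt`) and the toy matrices are hypothetical.

```python
def pack_bits(spikes):
    """Pack a list of 0/1 spikes into one integer, bit i = spikes[i]."""
    word = 0
    for i, s in enumerate(spikes):
        if s:
            word |= 1 << i
    return word

def spike_dot(a_word, b_word):
    """Dot product of two binary vectors: AND, then count the set bits."""
    return bin(a_word & b_word).count("1")

def spike_qkt(q_rows, k_rows):
    """Attention score matrix Q @ K^T for binary spike matrices."""
    q_words = [pack_bits(r) for r in q_rows]
    k_words = [pack_bits(r) for r in k_rows]
    return [[spike_dot(q, k) for k in k_words] for q in q_words]

# Toy binary spike matrices (illustrative values only).
q = [[1, 0, 1, 1],
     [0, 1, 0, 0]]
k = [[1, 1, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 1, 1]]

scores = spike_qkt(q, k)

# Reference result using ordinary multiply-accumulate.
reference = [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
             for q_row in q]

print(scores)
print(scores == reference)
```

    In hardware, the AND and bit-count map onto LUT logic and small adder trees rather than DSP multipliers, which matches the abstract's statement about lower DSP usage. Note that the scores are small integers rather than spikes, so the subsequent product with V accumulates integer-weighted binary rows and still needs only additions.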

     
