Spiking neural networks (SNNs) feature event-driven processing, sparse activation, and low-bit data representation, and therefore exhibit strong potential for energy-efficient intelligent computing, especially for edge-side deployment. Recently proposed spiking Transformer models combine temporal spike dynamics with global attention mechanisms, but their practical deployment efficiency is still constrained by conventional computing architectures due to mismatched dataflow patterns, intensive memory access, and insufficient support for temporal parallelism. To address the latency and energy-efficiency bottlenecks in spiking Transformer inference, this work presents an algorithm–hardware co-designed FPGA accelerator targeting the Spikformer model. At the algorithm level, a deployment-oriented lightweight optimization strategy is adopted by fusing convolution and batch normalization (BN) layers and applying quantization-aware training (QAT). The model parameters are quantized to INT8 while preserving spike-driven characteristics, reducing storage and computation complexity. For the Spikformer-1-384 network, the parameter size is compressed from 15.92 MB to 3.98 MB with accuracy degradation controlled within 1%. At the hardware level, a configurable accelerator architecture tailored for spiking dataflow is designed on a field-programmable gate array (FPGA), consisting of spike encoding, spiking convolution, and self-attention-MLP compute engines with modular organization.
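The convolution and BN fusion mentioned above can be illustrated with a minimal sketch. It assumes the standard folding rule, in which each output channel's weights are multiplied by gamma / sqrt(var + eps) and the bias is shifted accordingly. The function names `fuse_conv_bn`, `conv_out`, and `batch_norm` are hypothetical and are not taken from the project's code.

```python
import math

def fuse_conv_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BN parameters into convolution weights and bias.

    weights: one flat list of kernel values per output channel.
    All other arguments: one value per output channel.
    """
    fused_w, fused_b = [], []
    for w_c, b_c, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)            # per-channel BN scale
        fused_w.append([w * scale for w in w_c])  # scaled kernel
        fused_b.append((b_c - mu) * scale + bt)   # shifted bias
    return fused_w, fused_b

def conv_out(w_c, b_c, patch):
    """Convolution response of one output channel on one input patch."""
    return sum(w * x for w, x in zip(w_c, patch)) + b_c

def batch_norm(y, g, bt, mu, v, eps=1e-5):
    """Inference-time batch normalization of a single value."""
    return g * (y - mu) / math.sqrt(v + eps) + bt

# Illustrative values only: two output channels, three kernel values each.
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
patch = [1.0, 0.0, 1.0]  # a binary spike patch

fused_w, fused_b = fuse_conv_bn(weights, bias, gamma, beta, mean, var)
for c in range(2):
    separate = batch_norm(conv_out(weights[c], bias[c], patch),
                          gamma[c], beta[c], mean[c], var[c])
    fused = conv_out(fused_w[c], fused_b[c], patch)
    print(c, separate, fused)  # the two values agree up to rounding
```

After folding, inference needs a single convolution per layer, so the BN parameters no longer have to be stored or applied as a separate step.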
Multi-timestep parallel leaky integrate-and-fire (LIF) neuron processing is supported to exploit temporal parallelism, and operator fusion is applied to attention and feed-forward blocks to reduce intermediate off-chip memory traffic. In the attention path, spike-based matrix operations are implemented using bitwise logic and spike-count accumulation instead of conventional multipliers, significantly lowering DSP usage and improving compute density. A hierarchical memory and dataflow scheme combining DDR burst transfer, on-chip BRAM buffering, and ping-pong scheduling is further employed to enhance bandwidth utilization and pipeline continuity. The accelerator is implemented on a Xilinx Zynq UltraScale+ MPSoC platform and evaluated with the CIFAR-10 dataset. With four timesteps, the system achieves an end-to-end inference latency of 53 ms and a throughput of 18.9 FPS. The measured total power consumption is 7.181 W, corresponding to an energy efficiency of 2.63 FPS/W. For the attention and MLP block with input size (4, 1, 64, 384), the proposed design achieves 1.70× and 5.73× speedup over GPU and CPU implementations, respectively. The results demonstrate that the proposed co-optimized architecture provides an effective, scalable, and hardware-friendly solution for high-efficiency deployment of spiking Transformer models on resource-constrained edge platforms. The source code of this project is available at https://github.com/tooddler/FPGA_SpikingTransformer.
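The multiplier-free attention computation described above can be illustrated in software. The following is a minimal sketch, assuming that the query and key matrices are binary spike matrices, so that each dot product reduces to a bitwise AND followed by a count of set bits. The helper names `pack_bits`, `spike_dot`, and `spike_qk` are hypothetical and do not come from the project's repository. The sketch models only the arithmetic, not the FPGA datapath.

```python
def pack_bits(spikes):
    """Pack a list of 0/1 spikes into a single integer bit mask."""
    word = 0
    for i, s in enumerate(spikes):
        if s:
            word |= 1 << i
    return word

def spike_dot(a_word, b_word):
    """Dot product of two binary vectors: AND, then count the set bits."""
    return bin(a_word & b_word).count("1")

def spike_qk(q_rows, k_rows):
    """Compute Q @ K^T for binary spike matrices without multiplications."""
    q_words = [pack_bits(r) for r in q_rows]
    k_words = [pack_bits(r) for r in k_rows]
    return [[spike_dot(q, k) for k in k_words] for q in q_words]

# Illustrative spike matrices: 2 tokens, 4 feature channels each.
Q = [[1, 0, 1, 1],
     [0, 1, 0, 0]]
K = [[1, 1, 0, 1],
     [0, 0, 1, 1]]

print(spike_qk(Q, K))  # [[2, 2], [1, 0]]
```

Because every operand is a single bit, the products need only AND gates and counters, which is consistent with the reduced DSP usage reported above.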