

A compute-in-memory architecture and system-technology codesign simulator based on 3D NAND flash

CSTR: 32037.14.aps.74.20250891

The rapid advancement of large language models (LLMs) such as ChatGPT has imposed unprecedented demands on hardware in terms of computational power, memory capacity, and energy efficiency. Compute-in-memory (CIM) technology, which integrates computation directly into memory arrays, has emerged as a promising solution: it overcomes the data-movement bottleneck of traditional von Neumann architectures, significantly reduces power consumption, and enables large-scale parallel processing. Among the various non-volatile memory candidates, 3D NAND flash stands out for its mature manufacturing process, ultrahigh density, and cost-effectiveness, making it a strong contender for commercial CIM deployment and local inference of large models.
Despite these advantages, most existing research on 3D NAND-based CIM remains at the academic stage, focusing on theoretical designs or small-scale prototypes, with little attention paid to system-level architecture design and functional validation of LLM workloads on product-grade 3D NAND chips. To address this gap, we propose a novel CIM architecture based on 3D NAND flash that uses a source-line (SL) slicing technique to partition the array and perform parallel computation at minimal manufacturing cost. The architecture is complemented by an efficient weight-mapping algorithm and a pipelined dataflow, enabling system-level simulation and rapid industrial iteration.
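The abstract does not detail the mapping step itself, so the following is a minimal sketch of how a weight matrix might be tiled across SL slices, assuming each slice exposes a fixed number of wordlines (inputs) and bitlines (outputs); the function map_weights_to_slices, the tile geometry, and the round-robin slice assignment are illustrative assumptions, not the authors' actual algorithm.

import torch

def map_weights_to_slices(weight, wordlines=64, bitlines=1024, n_slices=4):
    """Tile a (rows, cols) weight matrix onto SL slices of a 3D NAND block.

    Hypothetical layout: matrix columns map to wordlines (inputs) and rows
    map to bitlines (outputs); edge tiles are zero-padded, and tiles are
    assigned round-robin across the n_slices source-line slices.
    Returns {(slice_id, tile_row, tile_col): tile_tensor}.
    """
    rows, cols = weight.shape
    mapping, tile_id = {}, 0
    for r0 in range(0, rows, bitlines):
        for c0 in range(0, cols, wordlines):
            tile = torch.zeros(bitlines, wordlines, dtype=weight.dtype)
            block = weight[r0:r0 + bitlines, c0:c0 + wordlines]
            tile[: block.shape[0], : block.shape[1]] = block
            mapping[(tile_id % n_slices, r0 // bitlines, c0 // wordlines)] = tile
            tile_id += 1
    return mapping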
We develop a PyTorch-based behavioral simulator for LLM inference on the proposed hardware and evaluate the influence of cell-current distribution and quantization on system performance. The design supports INT4/INT8 quantization, employs dynamic weight-storage logic to minimize voltage-switching overhead, and is further optimized through hierarchical pipelining to maximize throughput under hardware constraints.
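As one illustration of what such a behavioral model can look like, the sketch below fake-quantizes weights to symmetric INT8 and perturbs the analog accumulation with Gaussian read noise before the ADC rounds the result; the class name CIMLinear, the quantizer, and the sigma_lsb noise parameter are assumptions for illustration and are not taken from the released simulator.

import torch

class CIMLinear(torch.nn.Module):
    """Behavioral sketch of one CIM matrix-vector multiply.

    Weights are fake-quantized to symmetric INT8; the analog bit-line
    summation is perturbed with Gaussian read noise (modeling the
    open-state current distribution) before the ADC rounds the result.
    """
    def __init__(self, weight, sigma_lsb=0.1):
        super().__init__()
        scale = weight.abs().max() / 127.0
        self.register_buffer("w_q", torch.round(weight / scale).clamp(-127, 127))
        self.scale = float(scale)
        self.sigma_lsb = sigma_lsb  # read noise expressed in ADC LSBs

    def forward(self, x):
        acc = x @ self.w_q.t()                               # ideal integer MAC
        acc = acc + self.sigma_lsb * torch.randn_like(acc)   # cell-current noise
        return torch.round(acc) * self.scale                 # ADC output, rescaled

Wrapping each linear layer of a GPT-2 block with such a module lets a single forward pass report both clean and noise-perturbed accuracy, which is the kind of comparison the current-distribution study requires.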
Simulation results show that the proposed product-grade 3D NAND compute-in-memory chip reaches a generation speed of 20 tokens/s with an energy efficiency of 5.93 TOPS/W on GPT-2-124M, and 8.5 tokens/s with 7.17 TOPS/W on GPT-2-355M. System-level reliability is maintained for open-state current distributions with σ < 2.5 nA; in INT8 mode, quantization error rather than analog noise becomes the dominant accuracy bottleneck.
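As a back-of-envelope consistency check of these figures, assume roughly two operations per parameter per generated token (one multiply and one add per weight, decode-only); the implied power below is only indicative, since the quoted TOPS/W is a macro-level efficiency rather than an end-to-end measurement.

# Assumption: ~2 ops per parameter per generated token (decode-only).
params = 124e6                # GPT-2-124M
tokens_per_s = 20.0
throughput_tops = tokens_per_s * 2 * params / 1e12
print(f"required compute throughput ≈ {throughput_tops:.4f} TOPS")   # ~0.005 TOPS

# If the 5.93 TOPS/W macro efficiency applied to every MAC, the arrays
# alone would draw on the order of:
print(f"implied MAC power ≈ {throughput_tops / 5.93 * 1e3:.2f} mW")  # ~0.84 mW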
Compared with previous CIM solutions, our architecture supports larger models, higher computational precision, and significantly lower power consumption, as evidenced by comprehensive benchmarking. The SL slicing technique keeps array-area waste below 3%, while hybrid wafer bonding integrates high-density ADC/TIA circuits to improve hardware resource utilization.
This work presents the first system-level simulation of LLM inference on product-grade 3D NAND CIM hardware, providing a standardized and scalable reference for future commercialization. The complete simulation framework is released on GitHub to facilitate further research and development. Future work will focus on device-level optimization of 3D NAND and iterative improvement of the simulator's algorithms.
