
## Hybrid parallel optimization of density matrix renormalization group method

Chen Fu-Zhou, Cheng Chen, Luo Hong-Gang
#### Abstract

The density matrix renormalization group (DMRG) is a numerical method that solves the ground state of one-dimensional strongly correlated lattice models with very high accuracy, but it becomes expensive in both computation and memory when applied to two- and quasi-two-dimensional problems. For such applications, the number of DMRG kept states must generally be very large to reach reliable accuracy, which leads to an enormous number of matrix and vector operations and, without proper parallelization, prohibitively long run times. Owing to its sequential nature, however, the DMRG algorithm is not straightforward to parallelize. In this work, we propose a new hybrid parallelization strategy for the DMRG method that exploits the computing capability of both the central processing unit (CPU) and the graphics processing unit (GPU). To accommodate as many DMRG kept states as possible within the limited GPU memory, we adopt the four-block formulation of the Hamiltonian rather than the two-block formulation. The latter, used in a pioneering work on hybrid parallelization of the DMRG algorithm, consumes far more memory, so that only a small number of kept states is feasible. Our parallel strategy focuses on the diagonalization of the Hamiltonian, the most time-consuming part of the whole DMRG procedure. In our hybrid implementation of the diagonalization, the required data are distributed over both host and GPU memory, and the data exchange between them is negligible under our data-partitioning scheme. When the Hamiltonian acts on a wave function, the matrix operations are likewise shared between CPU and GPU, with their distribution determined by a load-balancing strategy. Taking the fermionic Hubbard model as an example, we examine the running performance of the hybrid strategy for different numbers of DMRG kept states and provide the corresponding benchmarks. On a 4-leg ladder, we exploit the conserved quantities associated with the U(1) symmetry of the model, together with good-quantum-number based task scheduling, to further reduce the GPU memory cost. We obtain a moderate speedup of the hybrid parallelization over a wide range of DMRG kept states. In our example, a high-accuracy ground-state energy is obtained by extrapolating the results with different numbers of kept states, and we observe the charge stripes commonly seen experimentally in high-temperature superconductors. In this case, we keep $10^4$ DMRG states with a GPU memory cost below 12 gigabytes.
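The load-balancing idea sketched in the abstract can be illustrated with a small greedy scheduler. This is a minimal sketch, not the authors' implementation: the task costs, throughput values, and the function name `partition_tasks` are all hypothetical.

```python
def partition_tasks(costs, cpu_speed, gpu_speed):
    """Greedily split matrix-multiply tasks between CPU and GPU.

    costs: estimated cost of each multiplication (e.g. GFLOP).
    cpu_speed, gpu_speed: measured throughput of each device.
    Returns (cpu_tasks, gpu_tasks) as lists of task indices.
    """
    cpu, gpu = [], []
    t_cpu = t_gpu = 0.0
    # Assign the heaviest task first, to whichever device would
    # finish its queue (including this task) earlier.
    for i in sorted(range(len(costs)), key=lambda i: -costs[i]):
        if t_gpu + costs[i] / gpu_speed <= t_cpu + costs[i] / cpu_speed:
            gpu.append(i)
            t_gpu += costs[i] / gpu_speed
        else:
            cpu.append(i)
            t_cpu += costs[i] / cpu_speed
    return cpu, gpu

# Hypothetical costs (GFLOP) and throughputs (GFLOP/s):
cpu_tasks, gpu_tasks = partition_tasks([8, 7, 5, 3, 2, 1],
                                       cpu_speed=1.0, gpu_speed=4.0)
```

With the GPU four times faster, the heaviest multiplications end up on the GPU while the CPU absorbs mid-sized tasks, keeping both devices busy.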


#### Figures

Figure 1.  The four sub-blocks of the super-block.
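Why keeping the four blocks separate saves memory can be seen from the standard trick of applying a Kronecker product without ever forming it: for row-major vectorization, $(A \otimes B)\,\mathrm{vec}(\Psi) = \mathrm{vec}(A \Psi B^{\mathrm{T}})$. A minimal NumPy sketch with toy dimensions (illustrative only, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 6, 4                         # block dimension, site dimension (toy sizes)
A = rng.standard_normal((m, m))     # operator acting on a block
B = rng.standard_normal((d, d))     # operator acting on a site
Psi = rng.standard_normal((m, d))   # wave function reshaped as a matrix

# Naive: build the (m*d) x (m*d) Kronecker product explicitly.
y_naive = np.kron(A, B) @ Psi.ravel()

# Memory-friendly: two small multiplications, no (m*d) x (m*d) matrix.
y_smart = (A @ Psi @ B.T).ravel()

assert np.allclose(y_naive, y_smart)
```

In the four-block formulation only the small per-block operators are stored, so the superblock Hamiltonian is never materialized; this is what allows many more kept states within a fixed GPU memory budget.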

Figure 2.  Performance of applying the Hamiltonian to the wave function on the CPU: (a) floating-point performance of the matrix multiplications; (b) performance of applying the Hamiltonian to the wave function, together with the maximum matrix size appearing in the multiplications.

Figure 3.  Fraction of the total computation time spent on diagonalizing the Hamiltonian and on applying the Hamiltonian to the wave function.

Figure 4.  GPU memory cost of the temporary data and the sub-block operators.

Figure 5.  Performance of the hybrid parallel strategy: (a) speedup; (b) GPU memory occupied by the vectors in the Davidson method; (c) performance of the $H\left|{\psi}\right\rangle$ operation.
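The Davidson method referenced in panel (b) finds the lowest eigenpair from matrix-vector products alone, which is why only a modest number of subspace vectors must reside in GPU memory. A minimal dense sketch (illustrative only; in a real DMRG code `matvec` is assembled from the block operators rather than from a stored dense matrix):

```python
import numpy as np

def davidson_lowest(matvec, diag, tol=1e-9, max_iter=200):
    """Davidson iteration for the lowest eigenpair of a symmetric operator,
    using only matrix-vector products (matvec) and the diagonal (diag)."""
    n = diag.size
    t = np.random.default_rng(0).standard_normal(n)   # initial guess
    V = np.empty((n, 0))    # orthonormal subspace basis
    W = np.empty((n, 0))    # W = H @ V, built column by column
    theta, u = 0.0, t
    for _ in range(max_iter):
        t -= V @ (V.T @ t)            # orthogonalize against the subspace
        t -= V @ (V.T @ t)            # repeated for numerical stability
        nrm = np.linalg.norm(t)
        if nrm < 1e-12:               # no new direction left
            break
        v = t / nrm
        V = np.hstack([V, v[:, None]])
        W = np.hstack([W, matvec(v)[:, None]])
        T = V.T @ W                   # Rayleigh-Ritz projection
        vals, vecs = np.linalg.eigh((T + T.T) / 2)
        theta, s = vals[0], vecs[:, 0]
        u = V @ s                     # current eigenvector estimate
        r = W @ s - theta * u         # residual
        if np.linalg.norm(r) < tol:
            break
        denom = diag - theta          # diagonal (Davidson) preconditioner
        denom = np.where(np.abs(denom) < 1e-8, 1e-8, denom)
        t = r / denom
    return theta, u
```

In the hybrid scheme it is precisely the matrix multiplications inside `matvec` that are split between CPU and GPU by the load-balancing strategy.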

Figure 6.  Ground-state energy as a function of the truncation error. The straight line is a linear extrapolation of the ground-state energy to zero truncation error.
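The extrapolation in Fig. 6 is a straight-line fit of energy against truncation error, evaluated at zero. A sketch with made-up sample numbers (hypothetical values for illustration, not the paper's data):

```python
import numpy as np

# Hypothetical (truncation error, ground-state energy) pairs from runs
# with an increasing number of kept states -- not the paper's data.
trunc_err = np.array([4e-5, 2e-5, 1e-5, 5e-6])
energy = np.array([-58.21, -58.26, -58.285, -58.2975])

slope, intercept = np.polyfit(trunc_err, energy, 1)
e_extrapolated = intercept          # fitted energy at zero truncation error
```

Because the DMRG energy is, to leading order, linear in the truncation error, the intercept of the fit gives the estimate of the exact ground-state energy.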

Figure 7.  Ground-state charge-density profile of the 16 × 4 Hubbard ladder at U = 8.0; clear charge-density stripes are observed.

##### Publishing process
• Received Date:  22 April 2019
• Accepted Date:  16 May 2019
• Available Online:  16 August 2019
• Published Online:  01 June 2019

## Hybrid parallel optimization of density matrix renormalization group method

###### Corresponding author: Luo Hong-Gang, luohg@lzu.edu.cn
• 1. School of Physical Science and Technology, Lanzhou University, Lanzhou 730000, China
• 2. Beijing Computational Science Research Center, Beijing 100084, China


