Modeling RNA tertiary structure is one of the basic problems in molecular biophysics, and it is important both for understanding the biological function of RNA and for designing new structures. RNA tertiary structure is mainly determined by seven torsion angles of the main chain and side chain, so accurate prediction of these torsion angles is the basis for modeling RNA tertiary structure. At present only a few methods use deep learning to predict RNA torsion angles, and their accuracy must improve further before the predictions can be used to model RNA tertiary structure. In this study we develop a deep learning method, 1dRNA, to predict RNA backbone torsion angles and pseudotorsion angles. It comprises two different deep learning models: a convolution model (DRCNN), which considers the features of adjacent nucleotides, and a hyper long short-term memory model (DHLSTM), which considers the features of all nucleotides in the chain. On the same datasets, both models outperform existing state-of-the-art methods for most torsion angles: DRCNN improves the prediction accuracy by 5% to 28% for the β, δ, ζ, χ, η and θ angles, and DHLSTM improves it by 6% to 15% for the β, δ, ζ, χ, η and θ angles.
The DRCNN model predicts the δ, ζ, χ, η and θ angles better than both the DHLSTM model and the existing models; the DHLSTM model predicts the β and ε angles better than both the DRCNN model and the existing models; and the existing models predict the α and γ angles better than both the DRCNN and DHLSTM models. The DRCNN model and the existing models predict a richer distribution of angles than the DHLSTM model. In terms of stability, the DHLSTM model is much more stable than the DRCNN model and the existing models, with fewer outliers. The results also show that the α and γ angles are the most difficult to predict, that torsion angles in loop regions are more difficult to predict than those in helical regions, that the models are not sensitive to the length of the target sequence, and that the deviation between the predicted angles and the angles of decoys can be used to assess the quality of RNA tertiary structure models.
Keywords:
- RNA structure
- torsion angle prediction
- deep learning
[1] Jiao K, Hao Y Y, Wang F, et al. 2021 Biophys. Rep. 7 21
[2] Sun S, Chen X Z, Chen J, et al. 2021 Biophys. Rep. 7 8
[3] You Y L, Tang Z M, Lin H, Shi J L 2021 Biophys. Rep. 7 159
[4] Zhang Y, Wang J, Xiao Y 2022 J. Mol. Biol. 434 167452
[5] Zhang Y, Wang J, Xiao Y 2020 Comput. Struct. Biotechnol. J. 18 2416
[6] Wang J, Wang J, Huang Y Z, Xiao Y 2019 Int. J. Mol. Sci. 20 4116
[7] Wang J, Xiao Y 2017 Curr. Protoc. Bioinf. 57 5.9.1
[8] Wang J, Zhao Y J, Zhu C Y, Xiao Y 2015 Nucleic Acids Res. 43 e63
[9] Zhao Y J, Huang Y Y, Gong Z, et al. 2012 Sci. Rep. 2 734
[10] Wang J, Mao K K, Zhao Y J, Zeng C, Xiang J J, Zhang Y, Xiao Y 2017 Nucleic Acids Res. 45 6299
[11] Olson W K 1982 Topics in Nucleic Acid Structures (Part 2) (London: Macmillan Press) pp1–79
[12] Dor O, Zhou Y Q 2007 Proteins 68 76
[13] Xue B, Dor O, Faraggi E, Zhou Y Q 2008 Proteins 72 427
[14] Faraggi E, Xue B, Zhou Y Q 2009 Proteins 74 847
[15] Faraggi E, Yang Y D, Zhang S H, Zhou Y Q 2009 Structure 17 1515
[16] Faraggi E, Zhang T, Yang Y D, Kurgan L, Zhou Y Q 2012 J. Comput. Chem. 33 259
[17] Heffernan R, Paliwal K, Lyons J, et al. 2015 Sci. Rep. 5 11476
[18] Heffernan R, Yang Y D, Paliwal K, Zhou Y Q 2017 Bioinformatics 33 2842
[19] Hanson J, Paliwal K, Litfin T, Yang Y D, Zhou Y Q, Valencia A 2019 Bioinformatics 35 2403
[20] Mataeimoghadam F, Newton M A H, Dehzangi A, Karim A, Jayaram B, Ranganathan S, Sattar A 2020 Sci. Rep. 10 19430
[21] Singh J, Paliwal K, Singh J, Zhou Y Q 2021 J. Chem. Inf. Model. 61 2610
[22] Kiranyaz S, Avci O, Abdeljaber O, Ince T, Gabbouj M, Inman D J 2021 Mech. Sys. Signal Proc. 151 107398
[23] He K M, Zhang X Y, Ren S Q, Sun J 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Las Vegas, NV, USA, June 27–30, 2016 p770
[24] Nam H, Kim H E 2018 arXiv: 1805.07925v3 [cs.CV]
[25] Clevert D A, Unterthiner T, Hochreiter S 2015 arXiv: 1511.07289v5 [cs.LG]
[26] Jayasiri V, Wijerathne N 2020 https://nn.labml.ai/ [2023-04-02]
[27] Hochreiter S, Schmidhuber J 1997 Neural Comput. 9 1735
[28] Tieleman T, Hinton G 2012 Lecture 6.5-RMSProp: Divide the Gradient by a Running Average of its Recent Magnitude (COURSERA: Neural Networks for Machine Learning)
[29] Paszke A, Gross S, Massa F, et al. 2019 33rd Conference on Neural Information Processing Systems Vancouver, Canada, December 8, 2019 pp8026–8037
[30] Burley S K, Bhikadiya C, Bi C, Bittrich S, Chen L, Crichlow G V, et al. 2021 Nucleic Acids Res. 49 D437
[31] Fu L M, Niu B F, Zhu Z W, Wu S T, Li W Z 2012 Bioinformatics 28 3150
[32] Altschul S F, Gish W, Miller W, Myers E W, Lipman D J 1990 J. Mol. Biol. 215 403
[33] Rohatgi A 2022 Software available at https://automeris.io/ WebPlotDigitizer Version 4.6 [software]
[34] Lu X J, Bussemaker H J, Olson W K 2015 Nucleic Acids Res. 43 e142
[35] Vaswani A, Shazeer N, Parmar N, et al. 2017 arXiv: 1706.03762v7 [cs.CL]
Fig. 2. DRCNN: (a) model architecture; (b) principle of the one-dimensional convolution (Conv1d) layer; (c) output layer. B, L, N, KS and Filters denote, respectively, the batch size (the number of sequences used for one update of the model parameters during training), the sequence length, the input feature dimension, the kernel size (the number of adjacent nucleotides that the convolution window sees at one time) and the number of filters (the output dimension of the convolution layer).
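The sliding-window idea behind the Conv1d layer in Fig. 2(b) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the one-hot input encoding, the hand-set kernel weights, the zero padding and the function name `conv1d_same` are assumptions made for illustration, and the batch dimension B is omitted.

```python
def conv1d_same(seq, kernels, bias):
    """1D convolution along the sequence with zero padding at both ends.

    seq:     L rows of N input features (one row per nucleotide)
    kernels: Filters kernels, each with KS rows of N weights (KS assumed odd)
    bias:    Filters bias values
    returns: L rows of Filters output features
    """
    length, n_feat = len(seq), len(seq[0])
    ks = len(kernels[0])
    half = ks // 2
    zero = [0.0] * n_feat
    padded = [zero] * half + list(seq) + [zero] * half
    out = []
    for i in range(length):
        window = padded[i:i + ks]  # KS adjacent nucleotides centred on i
        row = []
        for kernel, b in zip(kernels, bias):
            acc = b
            for w_row, x_row in zip(kernel, window):
                acc += sum(w * x for w, x in zip(w_row, x_row))
            row.append(acc)
        out.append(row)
    return out


# Toy input (assumption): L = 5 nucleotides, N = 4 one-hot features (A, C, G, U).
one_hot = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "U": [0, 0, 0, 1]}
seq = [one_hot[base] for base in "GGCAU"]

# Filters = 2, KS = 3. Hand-set weights for illustration only:
# the first kernel counts G in the window, the second counts all bases.
kernels = [
    [[0, 0, 1, 0]] * 3,
    [[1, 1, 1, 1]] * 3,
]
features = conv1d_same(seq, kernels, bias=[0.0, 0.0])
```

Each output row depends only on the KS nucleotides around one position, which is why a model of this kind captures local context, whereas a recurrent model can in principle use the whole chain.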
Fig. 3. DHLSTM: (a) model architecture; (b) HyperLSTM layer; (c) HyperLSTMcell, the unit that processes each nucleotide, where $h_t$, $c_t$ and $h_{t-1}$, $c_{t-1}$ are the states of the larger outer LSTM at times $t$ and $t-1$, and $h'_t$, $c'_t$ and $h'_{t-1}$, $c'_{t-1}$ are the states of the smaller LSTM at times $t$ and $t-1$; (d) Hypercell. L, B, N and Hidden are the sequence length, the batch size, the input feature dimension and the output dimension of the larger LSTM layer; Hyper is the output dimension of the smaller inner LSTM that alters the weights of the larger outer LSTM; n_z is the dimension of the linear projection in the Hypercell used to alter the weights of the larger LSTM. $P_x$ and $P_h$ are dynamically trainable parameters bound to the inner hypernetwork; they act on the input state $x_{t-1}$ and the hidden state and are initialized as all-zero tensors.
Fig. 5. Distribution of the MAE for individual RNA chains on (a) test set I, (b) test set II and (c) test set III for DRCNN (yellow), DHLSTM (green) and SPOT-RNA-1D (purple). Each box shows the minimum, maximum, median, first and third quartiles, and outliers of one group of data.
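The quantities shown by each box in Fig. 5 can be computed as in the sketch below. The 1.5 × IQR outlier rule and the "inclusive" quartile method are common plotting defaults assumed here; the source does not say which conventions its plots use, and the MAE values in the example are hypothetical.

```python
import statistics


def box_summary(values, whisker=1.5):
    """Return the statistics a box plot typically displays.

    Points farther than whisker * IQR from the quartiles are reported
    as outliers (assumed rule); "min" and "max" are the smallest and
    largest values that are not outliers.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    inliers = [v for v in data if low <= v <= high]
    return {
        "min": inliers[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": inliers[-1],
        "outliers": [v for v in data if v < low or v > high],
    }


# Hypothetical per-chain MAE values in degrees, not taken from the paper.
per_chain_mae = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 60.0]
summary = box_summary(per_chain_mae)
```

Under these assumptions, a model with fewer entries in `outliers` across chains would be the more stable one in the sense used in the abstract.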
Table 1. Sequence-length and secondary-structure information for the training, validation and three test sets. The length columns give the number of sequences in each length range; the secondary-structure columns give the percentage of nucleotides of each pairing type in the data set.

| Data set | 20–50 | 50–100 | 100–200 | 200–300 | 300–400 | 400–512 | Bracket | Pseudoknot | Unpaired |
|---|---|---|---|---|---|---|---|---|---|
| Training set | 50 | 179 | 46 | 1 | 7 | 1 | 55.10% | 5.63% | 39.36% |
| Validation set | 20 | 10 | 0 | 0 | 0 | 0 | 52.19% | 9.8% | 38.01% |
| Test set I | 11 | 41 | 10 | 0 | 0 | 0 | 57.58% | 2.81% | 39.61% |
| Test set II | 8 | 16 | 6 | 0 | 0 | 0 | 58.42% | 5.25% | 36.33% |
| Test set III | 40 | 13 | 1 | 0 | 0 | 0 | 65.02% | 2.67% | 32.31% |
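Table 2. MAE (in degrees) of DHLSTM, DRCNN and SPOT-RNA-1D on the validation set and the three test sets. α, β, γ, δ, ε, ζ and χ are the seven standard torsion angles; η and θ are pseudotorsion angles.

| Model | Data set | α | β | γ | δ | ε | ζ | χ | η | θ |
|---|---|---|---|---|---|---|---|---|---|---|
| DHLSTM | Validation set | 47.91 | 20.22 | 37.18 | 16.57 | 18.23 | 35.02 | 19.85 | 28.09 | 32.85 |
| DHLSTM | Test set I | 48.20 | 20.66 | 37.13 | 13.08 | 18.82 | 30.27 | 17.33 | 25.74 | 29.22 |
| DHLSTM | Test set II | 47.95 | 19.89 | 35.30 | 15.19 | 17.87 | 30.99 | 17.67 | 27.20 | 31.49 |
| DHLSTM | Test set III | 45.45 | 22.30 | 40.80 | 13.51 | 21.43 | 30.69 | 16.96 | 23.87 | 29.84 |
| DRCNN | Validation set | 44.67 | 19.96 | 35.31 | 13.86 | 22.20 | 31.62 | 19.49 | 24.77 | 30.22 |
| DRCNN | Test set I | 44.84 | 20.74 | 36.27 | 10.51 | 21.48 | 27.53 | 16.39 | 23.12 | 26.34 |
| DRCNN | Test set II | 43.41 | 19.55 | 35.45 | 12.19 | 22.71 | 28.13 | 17.16 | 24.28 | 28.12 |
| DRCNN | Test set III | 27.14 | 15.81 | 25.20 | 9.73 | 14.51 | 17.98 | 11.58 | 13.67 | 17.77 |
| SPOT-RNA-1D [21] | Validation set | 45.18 | 20.58 | 33.88 | 17.99 | 20.72 | 37.50 | 23.01 | 33.55 | 37.02 |
| SPOT-RNA-1D [21] | Test set I | 43.94 | 21.94 | 32.98 | 14.61 | 20.69 | 33.27 | 19.59 | 30.25 | 32.91 |
| SPOT-RNA-1D [21] | Test set II | 39.50 | 18.92 | 29.47 | 16.01 | 17.46 | 28.91 | 18.20 | 28.14 | 30.25 |
| SPOT-RNA-1D [21] | Test set III | 37.89 | 21.04 | 34.68 | 13.83 | 22.32 | 27.87 | 17.01 | 25.31 | 27.22 |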
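One way to read the percentage improvements quoted in the abstract is as the relative reduction in MAE with respect to SPOT-RNA-1D. The sketch below applies that reading to the test set I rows of Table 2. The formula and the choice of test set I are assumptions, since the source does not state how the percentages were derived.

```python
def relative_improvement(baseline_mae, model_mae):
    """Percentage reduction in MAE relative to the baseline model."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# MAE values in degrees on test set I, copied from Table 2.
angles = ["alpha", "beta", "gamma", "delta", "epsilon",
          "zeta", "chi", "eta", "theta"]
spot_rna_1d = [43.94, 21.94, 32.98, 14.61, 20.69, 33.27, 19.59, 30.25, 32.91]
drcnn = [44.84, 20.74, 36.27, 10.51, 21.48, 27.53, 16.39, 23.12, 26.34]

# Positive values mean DRCNN has the lower MAE for that angle.
improvements = {
    name: relative_improvement(base, ours)
    for name, base, ours in zip(angles, spot_rna_1d, drcnn)
}

for name, value in improvements.items():
    print(f"{name:8s} {value:+6.1f}%")
```

Under this reading, the β and δ entries come out at roughly 5% and 28%, which is consistent with the range quoted for DRCNN in the abstract, while the α and γ entries are negative.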
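Table 3. MAE (in degrees) of DHLSTM and DRCNN on test set III for nucleotides of different pairing types. α, β, γ, δ, ε, ζ and χ are the seven standard torsion angles; η and θ are pseudotorsion angles.

| Model | Pairing type | α | β | γ | δ | ε | ζ | χ | η | θ |
|---|---|---|---|---|---|---|---|---|---|---|
| DHLSTM | Bracket | 34.08 | 16.48 | 30.21 | 9.76 | 17.98 | 21.38 | 11.23 | 18.03 | 21.91 |
| DHLSTM | Pseudoknot | 34.20 | 14.98 | 27.06 | 6.80 | 14.25 | 20.29 | 10.98 | 27.41 | 18.02 |
| DHLSTM | Loop | 66.77 | 32.60 | 60.72 | 21.05 | 27.54 | 47.85 | 28.52 | 35.41 | 46.16 |
| DRCNN | Bracket | 19.43 | 11.40 | 18.54 | 6.65 | 11.84 | 12.0 | 8.30 | 10.90 | 12.94 |
| DRCNN | Pseudoknot | 20.42 | 14.25 | 16.75 | 6.73 | 12.86 | 13.54 | 10.25 | 16.14 | 13.52 |
| DRCNN | Loop | 40.84 | 23.26 | 37.44 | 15.59 | 19.07 | 29.07 | 18.44 | 19.25 | 27.08 |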
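Because torsion angles are periodic, an MAE such as those reported in Tables 2 and 3 is usually computed on the shorter arc between the predicted and the observed angle. The sketch below shows one such definition. The source does not spell out its exact formula, so the periodic difference, the function name `angular_mae` and the example angles are assumptions.

```python
def angular_mae(predicted, observed):
    """Mean absolute error between two lists of angles in degrees,
    taking the 360-degree periodicity into account."""
    if len(predicted) != len(observed):
        raise ValueError("angle lists must have the same length")
    total = 0.0
    for p, o in zip(predicted, observed):
        diff = abs(p - o) % 360.0
        total += min(diff, 360.0 - diff)  # shorter way around the circle
    return total / len(predicted)


# Hypothetical angles in degrees, not taken from the paper.
predicted = [175.0, -60.0, 10.0]
observed = [-175.0, -70.0, 350.0]
mae = angular_mae(predicted, observed)
```

The same function could compare predicted angles with the angles measured in a decoy structure, in the spirit of the quality assessment mentioned in the abstract: a smaller deviation would suggest a decoy that agrees better with the prediction.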