Biophysics 2024
Predicting RNA Flexibility with a Machine Learning Method Based on Word Embedding
Abstract:
The dynamics of RNA molecules are closely related to their functions. Flexibility, one of the most fundamental dynamic properties of RNA, has been widely used to study folding, structural stability, ligand-binding ability, and other aspects. Experimental methods for measuring RNA flexibility are often time-consuming and labor-intensive, so a fast and accurate theoretical method for predicting RNA flexibility is urgently needed. To this end, we propose RNAfwe, a machine learning method that predicts RNA flexibility using word embedding to extract features from RNA sequences. We compare RNAfwe with RNAflex, a comparable sequence-based method. Compared with RNAflex (One-Hot), which uses one-hot encoding, RNAfwe achieves higher Pearson correlation coefficients (PCC) of 0.5017 and 0.4704 on the training and test sets, respectively, indicating that word embedding extracts features more relevant to flexibility from RNA sequences than one-hot encoding does. Compared with RNAflex (PSSM), which uses evolutionary information, RNAfwe performs slightly worse, but RNAflex (PSSM) requires a sufficient number of known homologous sequences. This work contributes to the study of RNA dynamic properties and supports the wider use of word embedding in bioinformatics research.
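To make the terms in the abstract concrete, the following is a minimal sketch, not the RNAfwe implementation. It shows one-hot encoding of a sequence, one plausible way to cut a sequence into "words" for an embedding model (overlapping k-mers, which is an assumption; the abstract does not say how words are defined), and the Pearson correlation coefficient used as the evaluation metric. The function names and the value `k=3` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGU"


def one_hot(seq):
    """Encode each nucleotide as a 4-dimensional indicator vector."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]


def kmer_words(seq, k=3):
    """Split a sequence into overlapping k-mer 'words' (stride 1).

    Assumption: k-mers stand in for words; an embedding model would
    then map each word to a dense vector (training is not shown here).
    """
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)
```

For a hypothetical sequence such as `"GGCUA"`, `kmer_words` yields the words `GGC`, `GCU`, and `CUA`, whereas `one_hot` yields one sparse vector per nucleotide with no information about neighboring bases. `pearson` would be applied to predicted and experimentally derived per-nucleotide flexibility values.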