Genome-wide epigenomic datasets allow us to validate the biological function of motifs and to understand regulatory mechanisms more comprehensively. How different motifs determine whether transcription factors (TFs) can bind to DNA at a specific position is a critical research question. In this project, we apply computational techniques from Natural Language Processing (NLP) to predict Transcription Factor Bound Regions (TFBRs) from motif instances. Most existing deep neural network methods for motif prediction take one-hot encoded base sequences as the input feature for TFBR identification, which yields low resolution and only an indirect view of the binding mechanism. Moreover, the collective effect of motifs on binding sites is difficult to work out. In our pipeline, we apply the Word2Vec algorithm with motif names as input and use a Convolutional Neural Network (CNN) to predict TFBRs as a binary classification task on the ENCODE dataset. We treat different types of motifs as separate “words” and their corresponding TFBRs as the meanings of “sentences”. Each “sentence” is simply a combination of motifs, and all “sentences” together compose the whole “passage”. For each binding site, we perform binary classification within different cell types to show how the model performs across binding sites and cell types. Each “word” has a corresponding high-dimensional vector, and the distances between these vectors can be computed, so we can extract both the similarity between motifs and an explicit binding mechanism from our model. The CNN extracts features through convolution (feature mapping) and pooling over the motif vectors produced by Word2Vec, and reaches a peak accuracy of 87%.
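To make the "motifs as words" idea above more concrete, the following Python sketch shows three ingredients in miniature: generating skip-gram (center, context) pairs from a "sentence" of motif names, measuring similarity between motif vectors with cosine similarity, and applying one convolution filter followed by max pooling over a sequence of motif vectors. This is only an illustration under stated assumptions, not the pipeline used in this work: the motif names, labels, 2-dimensional embeddings, and kernel weights are all hypothetical, and a real model would learn the embeddings with Word2Vec and the filters with a trained CNN.

```python
import math

# Hypothetical toy data: each region is a "sentence" of motif names with a
# binary bound/unbound label. Names and labels are illustrative, not ENCODE data.
regions = [
    (["CTCF", "SP1", "YY1"], 1),
    (["SP1", "KLF4"], 1),
    (["GATA1", "TAL1"], 0),
]


def skipgram_pairs(sentence, window=1):
    """Return (center, context) motif pairs, the training unit of skip-gram Word2Vec."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def conv1d_maxpool(vectors, kernel):
    """Slide one filter over consecutive motif vectors, apply ReLU, then global max-pool.

    `kernel` is a list of weight vectors, one per position in the window.
    """
    width = len(kernel)
    responses = []
    for start in range(len(vectors) - width + 1):
        total = 0.0
        for offset in range(width):
            total += sum(
                w * x for w, x in zip(kernel[offset], vectors[start + offset])
            )
        responses.append(max(0.0, total))  # ReLU activation
    return max(responses) if responses else 0.0


# Hypothetical 2-dimensional embeddings; real Word2Vec vectors would be learned
# from the motif "sentences" and would have many more dimensions.
embeddings = {
    "CTCF": [1.0, 0.0],
    "SP1": [0.8, 0.6],
    "YY1": [0.0, 1.0],
}

sentence, label = regions[0]
pairs = skipgram_pairs(sentence, window=1)
similarity = cosine_similarity(embeddings["CTCF"], embeddings["SP1"])

# Hypothetical filter spanning two adjacent motifs.
kernel = [[1.0, 0.0], [0.0, 1.0]]
feature = conv1d_maxpool([embeddings[m] for m in sentence], kernel)
```

In a full model, many such pooled features (one per learned filter) would feed a final layer that outputs the bound or unbound label for the region.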