Literature Retrieval and Mining in Bioinformatics: State of the Art and Challenges

DOI: 10.1155/2012/573846

Abstract:

The way people communicate, acquire, and store information has changed profoundly. Hundreds of millions of people perform information retrieval tasks every day, in particular when using a Web search engine or searching their e-mail, making information retrieval the dominant form of information access and overtaking traditional database-style searching. Handling this huge amount of information has become a challenging issue. In this paper, after recalling the main topics of information retrieval, we survey the main works on literature retrieval and mining in bioinformatics. While claiming that information retrieval approaches are useful in bioinformatics tasks, we discuss some challenges aimed at showing the effectiveness of these approaches in that field.

1. Introduction

Nowadays, most scientific publications are available electronically on the Web, which makes retrieving and mining documents and data a challenging task. To this end, automated document management systems have gained a central role in the field of intelligent information access [1]. Research and development in bioinformatics literature retrieval and mining therefore aims to provide intelligent and personalized services to biologists and bioinformaticians who search for useful information in scientific publications. In particular, the main goal of bioinformatics text analysis is to provide access to unstructured knowledge by improving searches, providing automatically generated summaries, linking publications with structured resources, visualizing contents for better understanding, and guiding researchers in formulating novel hypotheses and discovering knowledge. Several methods, systems, and tools to retrieve and mine bioinformatics publications have been proposed and adopted in the literature, and some of them are currently available on the Web.
In this paper, we provide a survey of existing end-user-oriented literature retrieval and/or mining solutions for bioinformatics, together with a short discussion of open challenges. The rest of the paper is organized as follows: Section 2 illustrates the main topics addressed in this paper, that is, information retrieval, text mining, and literature retrieval and mining. Section 3 presents the state of the art in literature retrieval and mining in bioinformatics. Section 4 discusses some relevant open problems and challenges. Section 5 concludes the paper.

2. Background

Supporting users in handling the huge and widespread amount of Web information is becoming a primary

References

[1]  G. Armano, M. de Gemmis, G. Semeraro, and E. Vargiu, Intelligent Information Access, vol. SCI 301 of Studies in Computational Intelligence, Springer, Heidelberg, Germany, 2010.
[2]  R. A. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman, Boston, Mass, USA, 1999.
[3]  M. Kobayashi and K. Takeda, “Information retrieval on the web,” ACM Computing Surveys, vol. 32, no. 2, pp. 165–173, 2000.
[4]  S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su, “Optimizing web search using social annotations,” in 16th International World Wide Web Conference (WWW '07), pp. 501–510, New York, NY, USA, May 2007.
[5]  C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, New York, NY, USA, 2008.
[6]  J. Mayfield and T. Finin, “Information retrieval on the semantic web: Integrating inference and retrieval,” in Proceedings of the SIGIR Workshop on the Semantic Web, August 2003.
[7]  M. W. Berry, Survey of Text Mining, Springer, New York, NY, USA, 2003.
[8]  F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[9]  I. Mani, Automatic Summarization, John Benjamins, Amsterdam, The Netherlands, 2001.
[10]  H. Luhn, “The automatic creation of literature abstracts,” IBM Journal of Research and Development, vol. 2, pp. 159–165, 1958.
[11]  P. Baxendale, “Machine-made index for technical literature—an experiment,” IBM Journal of Research and Development, vol. 2, pp. 354–361, 1958.
[12]  H. P. Edmundson, “New methods in automatic extracting,” Journal of the ACM, vol. 16, pp. 264–285, 1969.
[13]  A. Nenkova, “Automatic text summarization of newswire: lessons learned from the document understanding conference,” in Proceedings of the 20th National Conference on Artificial Intelligence, vol. 3, pp. 1436–1441, AAAI Press, 2005.
[14]  G. Armano, A. Giuliani, and E. Vargiu, “Studying the impact of text summarization on contextual advertising,” in Proceedings of the 8th International Workshop on Text-based Information Retrieval, 2011.
[15]  A. Kołcz, V. Prabakarmurthi, and J. Kalita, “Summarization as feature selection for text categorization,” in Proceedings of the 10th ACM International Conference on Information and Knowledge Management (CIKM '01), pp. 365–370, New York, NY, USA, November 2001.
[16]  J. Martin, “Clustering full text documents,” in Proceedings of the Workshop on Data Engineering for Inductive Learning at (IJCAI '95), 1995.
[17]  P. Willett, “Recent trends in hierarchic document clustering: a critical review,” Information Processing and Management, vol. 24, no. 5, pp. 577–597, 1988.
[18]  C. Aone, S. W. Bennett, and J. Gorlinsky, “Multi-media fusion through application of machine learning and NLP,” in AAAI Spring Symposium Working Notes on Machine Learning in Information Access, 1996.
[19]  D. H. Fisher, “Knowledge acquisition via incremental conceptual clustering,” Machine Learning, vol. 2, no. 2, pp. 139–172, 1987.
[20]  R. Liere and P. Tadepalli, “Active learning with committees for text categorization,” in Proceedings of the 14th National Conference on Artificial Intelligence (AAAI '97), pp. 591–596, July 1997.
[21]  P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman, Readings in Knowledge Acquisition and Learning, chap. AutoClass: A Bayesian Classification System, Morgan Kaufmann, San Francisco, Calif, USA, 1993.
[22]  C. Green and P. Edwards, “Using machine learning to enhance software tools for internet information management,” in Proceedings of the AAAI Workshop on Internet-based Information Systems, pp. 48–55, 1996.
[23]  D. E. Appelt, “Introduction to information extraction,” AI Communications, vol. 12, no. 3, pp. 161–172, 1999.
[24]  S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert, “CRYSTAL: inducing a conceptual dictionary,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence, vol. 2, pp. 1314–1319, Morgan Kaufmann, San Francisco, Calif, USA, 1995.
[25]  S. B. Huffman, “Learning information extraction patterns from examples,” in Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, pp. 246–260, Springer, London, UK, 1996.
[26]  M. E. Califf and R. J. Mooney, “Relational learning of pattern-match rules for information extraction,” in Proceedings of the 16th National Conference on Artificial Intelligence (AAAI '99), 11th Innovative Applications of Artificial Intelligence Conference (IAAI '99), pp. 328–334, July 1999.
[27]  D. Freitag, “Machine learning for information extraction in informal domains,” Machine Learning, vol. 39, pp. 169–202, 2000.
[28]  K. D. Bollacker, S. Lawrence, and C. L. Giles, “Discovering relevant scientific literature on the Web,” IEEE Intelligent Systems and Their Applications, vol. 15, no. 2, pp. 42–47, 2000.
[29]  S. M. McNee, I. Albert, D. Cosley et al., “On the recommending of citations for research papers,” in Proceedings of the 8th Conference on Computer Supported Cooperative Work (CSCW '02), pp. 116–125, New York, NY, USA, November 2002.
[30]  W. C. Janssen and K. Popat, “UpLib: a universal personal digital library system,” in Proceedings of the 2003 ACM Symposium on Document Engineering, pp. 234–242, France, November 2003.
[31]  F. Mahdavi, M. A. Ismail, and N. Abdullah, “Semi-automatic trend detection in scholarly repository using semantic approach,” in Proceedings of the World Academy of Science, Engineering and Technology, pp. 224–226, Amsterdam, The Netherlands, 2009.
[32]  N. Cannata, E. Merelli, and R. B. Altman, “Erratum: time to organize the bioinformatics resourceome,” PLoS Computational Biology, vol. 2, no. 2, p. 112, 2006.
[33]  A. K. Bajpai, S. Davuluri, H. Haridas et al., “In search of the right literature search engine(s),” Nature Precedings, 2011.
[34]  H. Yu, T. Kim, J. Oh, I. Ko, S. Kim, and W. S. Han, “Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS,” BMC Bioinformatics, vol. 11, supplement 2, p. S6, 2010.
[35]  J. F. Fontaine, A. Barbosa-Silva, M. Schaefer, M. R. Huska, E. M. Muro, and M. A. Andrade-Navarro, “MedlineRanker: flexible ranking of biomedical literature,” Nucleic Acids Research, vol. 37, no. 2, pp. W141–W146, 2009.
[36]  J. Wang, I. Cetindil, S. Ji et al., “Interactive and fuzzy search: a dynamic way to explore MEDLINE,” Bioinformatics, vol. 26, no. 18, Article ID btq414, pp. 2321–2327, 2010.
[37]  R. Herbrich, T. Graepel, and K. Obermayer, “Large margin rank boundaries for ordinal regression,” in Advances in Large Margin Classifiers, Smola B. and Schoelkopf S., Eds., MIT Press, Cambridge, Mass, USA, 2000.
[38]  J. Lewis, S. Ossowski, J. Hicks, M. Errami, and H. R. Garner, “Text similarity: an alternative way to search MEDLINE,” Bioinformatics, vol. 22, no. 18, pp. 2298–2304, 2006.
[39]  P. Coppernoll-Blach, “Quertle: the conceptual relationships alternative search engine for PubMed,” Journal of the Medical Library Association, vol. 99, no. 2, pp. 176–177, 2011.
[40]  A. Doms and M. Schroeder, “GoPubMed: exploring PubMed with the gene ontology,” Nucleic Acids Research, vol. 33, no. 2, pp. W783–W786, 2005.
[41]  C. Perez-Iratxeta, A. J. Pérez, P. Bork, and M. A. Andrade, “Update on XplorMed: a web server for exploring scientific literature,” Nucleic Acids Research, vol. 31, no. 13, pp. 3866–3868, 2003.
[42]  D. Rebholz-Schuhmann, H. Kirsch, M. Arregui, S. Gaudan, M. Riethoven, and P. Stoehr, “EBIMed—text crunching to gather facts for proteins from Medline,” Bioinformatics, vol. 23, no. 2, pp. e237–e244, 2007.
[43]  R. Hoffmann and A. Valencia, “A gene network for navigating the literature,” Nature Genetics, vol. 36, no. 7, p. 664, 2004.
[44]  L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter, and J. N. Weinstein, “MedMiner: an internet text-mining tool for biomedical information, with application to gene expression profiling,” BioTechniques, vol. 27, no. 6, pp. 1210–1217, 1999.
[45]  T. Greenhalgh, “How to read a paper. The Medline database,” BMJ, vol. 315, no. 7101, pp. 180–183, 1997.
[46]  A. S. Yeh, L. Hirschman, and A. A. Morgan, “Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup,” Bioinformatics, vol. 19, pp. i331–339, 2003.
[47]  J. Thomas, D. Milward, C. Ouzounis, S. Pulman, and M. Carroll, “Automatic extraction of protein interactions from scientific abstracts,” Pacific Symposium on Biocomputing, pp. 541–552, 2000.
[48]  M. He, Y. Wang, and W. Li, “PPI finder: a mining tool for human protein-protein interactions,” PLoS ONE, vol. 4, no. 2, Article ID e4554, 2009.
[49]  M. Berardi, D. Malerba, R. Piredda, M. Attimonelli, G. Scioscia, and P. Leo, 16 Biomedical Literature Mining for Biological Databases Annotation, 2008.
[50]  M. Craven and J. Kumlien, “Constructing biological knowledge bases by extracting information from text sources,” in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, pp. 77–86, 1999.
[51]  D. R. Swanson and N. R. Smalheiser, “An interactive system for finding complementary literatures: a stimulus to scientific discovery,” Artificial Intelligence, vol. 91, no. 2, pp. 183–203, 1997.
[52]  I. Donaldson, J. Martin, B. de Bruijn et al., “PreBIND and Textomy—mining the biomedical literature for protein-protein interactions using a support vector machine,” BMC Bioinformatics, vol. 4, no. 1, p. 11, 2003.
[53]  T. C. Wiegers, A. P. Davis, K. B. Cohen, L. Hirschman, and C. J. Mattingly, “Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD),” BMC Bioinformatics, vol. 10, article 326, 2009.
[54]  B. Kemper, T. Matsuzaki, Y. Matsuoka et al., “PathText: a text mining integrator for biological pathway visualizations,” Bioinformatics, vol. 26, no. 12, Article ID btq221, pp. i374–i381, 2010.
[55]  K. Oda, J. D. Kim, T. Ohta et al., “New challenges for text mining: mapping between text and manually curated pathways,” BMC Bioinformatics, vol. 9, supplement 3, p. S5, 2008.
[56]  M. J. Herrgård, N. Swainston, P. Dobson et al., “A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology,” Nature Biotechnology, vol. 26, no. 10, pp. 1155–1160, 2008.
[57]  I. Karamanis, R. Lewi, R. D. Seal, and B. E, “Integrating natural language processing with FlyBase curation,” in Proceedings of the Pacific Symposium on Biocomputing, pp. 245–256, Maui, Hawaii, USA, 2007.
[58]  S. Kiritchenko, S. Matwin, and A. F. Famili, “Hierarchical text categorization as a tool of associating genes with gene ontology codes,” in Proceedings of the 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, pp. 26–30, 2004.
[59]  D. R. Swanson, “Fish oil, Raynaud's syndrome, and undiscovered public knowledge,” Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7–18, 1986.
[60]  P. Bruza and M. Weeber, Literature-based Discovery, vol. 15, Springer, Heidelberg, Germany, 2008.
[61]  D. Hristovski, B. Peterlin, J. A. Mitchell, and S. M. Humphrey, “Using literature-based discovery to identify disease candidate genes,” International Journal of Medical Informatics, vol. 74, no. 2–4, pp. 289–298, 2005.
[62]  L. Chen and C. Friedman, “Extracting phenotypic information from the literature via natural language processing,” Medinfo, vol. 11, no. 2, pp. 758–762, 2004.
[63]  P. Srinivasan and T. Rindflesch, “Exploring text mining from MEDLINE,” Proceedings of the AMIA Symposium, pp. 722–726, 2002.
[64]  D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones, “BioRAT: extracting biological information from full-length papers,” Bioinformatics, vol. 20, no. 17, pp. 3206–3213, 2004.
[65]  D. Cheng, C. Knox, N. Young, P. Stothard, S. Damaraju, and D. S. Wishart, “PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites,” Nucleic Acids Research, vol. 36, pp. W399–W405, 2008.
[66]  S. Ananiadou, D. B. Kell, and J. I. Tsujii, “Text mining and its potential applications in systems biology,” Trends in Biotechnology, vol. 24, no. 12, pp. 571–579, 2006.
[67]  A. Nikitin, S. Egorov, N. Daraselia, and I. Mazo, “Pathway studio—the analysis and navigation of molecular networks,” Bioinformatics, vol. 19, no. 16, pp. 2155–2157, 2003.
[68]  J. Hur, A. D. Schuyler, D. J. States, and E. L. Feldman, “SciMiner: web-based literature mining tool for target identification and functional enrichment analysis,” Bioinformatics, vol. 25, no. 6, pp. 838–840, 2009.
[69]  R. Jelier, M. J. Schuemie, A. Veldhoven, L. C. J. Dorssers, G. Jenster, and J. A. Kors, “Anni 2.0: a multipurpose text-mining tool for the life sciences,” Genome Biology, vol. 9, no. 6, article R96, 2008.
[70]  M. Schuemie, R. Jelier, and J. K. J, “Peregrine: lightweight gene name normalization by dictionary lookup,” in Proceedings of the 2nd BioCreative Challenge Evaluation Workshop, pp. 131–140, 2007.
[71]  Y. Tsuruoka, J. Tsujii, and S. Ananiadou, “FACTA: a text search engine for finding associated biomedical concepts,” Bioinformatics, vol. 24, no. 21, pp. 2559–2560, 2008.
[72]  M. Weeber, H. Klein, A. R. Aronson, J. G. Mork, L. T. de Jong-van den Berg, and R. Vos, “Text-based discovery in biomedicine: the architecture of the DAD-system,” Proceedings of the AMIA, the Annual Conference of the American Medical Informatics Association, pp. 903–907, 2000.
[73]  S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, “Indexing by latent semantic analysis,” JASIS, vol. 41, no. 6, pp. 391–407, 1990.
[74]  G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[75]  M. Krauthammer and G. Nenadic, “Term identification in the biomedical literature,” Journal of Biomedical Informatics, vol. 37, no. 6, pp. 512–526, 2004.
[76]  P. K. Shah, C. Perez-Iratxeta, P. Bork, and M. A. Andrade, “Information extraction from full text scientific articles: where are the keywords?” BMC Bioinformatics, vol. 4, article no. 20, 2003.
[77]  L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia, “Overview of BioCreAtIvE: critical assessment of information extraction for biology,” BMC Bioinformatics, vol. 6, no. 1, article S1, 2005.
[78]  M. Krallinger, A. Morgan, L. Smith et al., “Evaluation of text-mining systems for biology: overview of the second BioCreative community challenge,” Genome Biology, vol. 9, no. 2, article no. S1, 2008.
