Centuries of biological knowledge are contained in the massive body of scientific literature, written for human readers but too large for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none has been applied life-wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript briefly discusses the key steps in applying information extraction tools to enhance biodiversity science.

1. Introduction

Biologists are expected to answer large-scale questions that address processes occurring across broad spatial and temporal scales, such as the effects of climate change on species [1, 2]. This motivates the development of a new type of data-driven discovery focusing on scientific insights and hypothesis generation through the novel management and analysis of preexisting data [3, 4]. Data-driven discovery presumes that a large, virtual pool of data will emerge across a wide spectrum of the life sciences, matching that already in place for the molecular sciences.
It is argued that the availability of such a pool will allow biodiversity science to join the other “Big” (i.e., data-centric) sciences such as astronomy and high-energy particle physics [5]. Managing large amounts of heterogeneous data for this Big New Biology will require a cyberinfrastructure that organizes an open pool of biological data [6]. To assess the resources needed to establish the cyberinfrastructure for biology, it is necessary to understand the nature of biological data [4]. To become a part of the cyberinfrastructure, data must be ready to enter a digital data pool. This means data must be digital, normalized, and standardized [4]. Biological data sets are heterogeneous in format, size, degree of digitization, and openness [4, 7, 8]. The distribution of
References
[1]
B. Wuethrich, “How climate change alters rhythms of the wild,” Science, vol. 287, no. 5454, pp. 793–795, 2000.
[2]
W. E. Bradshaw and C. M. Holzapfel, “Genetic shift in photoperiodic response correlated with global warming,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 25, pp. 14509–14511, 2001.
[3]
National Academy of Sciences, “New biology for the 21st Century,” Frontiers in Ecology and the Environment, vol. 7, no. 9, article 455, 2009.
[4]
A. E. Thessen and D. J. Patterson, “Data issues in life science,” ZooKeys, vol. 150, pp. 15–51, 2011.
[5]
A. Hey, The Fourth Paradigm: Data-Intensive Scientific Discovery, 2009, http://iw.fh-potsdam.de/fileadmin/FB5/Dokumente/forschung/tagungen/i-science/TonyHey_-__eScience_Potsdam__Mar2010____complete_.pdf.
[6]
L. D. Stein, “Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges,” Nature Reviews Genetics, vol. 9, pp. 678–688, 2008.
[7]
P. B. Heidorn, “Shedding light on the dark data in the long tail of science,” Library Trends, vol. 57, no. 2, pp. 280–299, 2008.
[8]
Key Perspectives Ltd, “Data dimensions: disciplinary differences in research data sharing, reuse and long term viability,” Digital Curation Centre, 2010, http://scholar.google.com/scholar?hl=en&q=Data+Dimensions:+disciplinary+differences+in+research+data-sharing,+reuse+and+long+term+viability.++&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0.
[9]
A. Vollmar, J. A. Macklin, and L. Ford, “Natural history specimen digitization: challenges and concerns,” Biodiversity Informatics, vol. 7, no. 2, 2010.
[10]
P. N. Schofield, J. Eppig, E. Huala, et al., “Sustaining the data and bioresource commons,” Science, vol. 330, no. 6004, pp. 592–593, 2010.
[11]
P. Groth, A. Gibson, and J. Velterop, “Anatomy of a Nanopublication,” Information Services & Use, vol. 30, no. 1-2, pp. 51–56, 2010.
[12]
M. Kalfatovic, “Building a global library of taxonomic literature,” in 28th Congresso Brasileiro de Zoologia Biodiversidade e Sustentabilidade, 2010, http://www.slideshare.net/Kalfatovic/building-a-global-library-of-taxonomic-literature.
[13]
X. Tang and P. Heidorn, “Using automatically extracted information in species page retrieval,” 2007, http://scholar.google.com/scholar?hl=en&q=Tang+Heidorn+2007+using+automatically+extracted&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0.
[14]
H. Cui, P. Selden, and D. Boufford, “Semantic annotation of biosystematics literature without training examples,” Journal of the American Society for Information Science and Technology, vol. 61, pp. 522–542, 2010.
[15]
A. Taylor, “Extracting knowledge from biological descriptions,” in Proceedings of 2nd International Conference on Building and Sharing Very Large-Scale Knowledge Bases, pp. 114–119, 1995.
[16]
H. Cui, “Competency evaluation of plant character ontologies against domain literature,” Journal of the American Society for Information Science and Technology, vol. 61, no. 6, pp. 1144–1165, 2010.
[17]
Y. Miyao, K. Sagae, R. Sætre, T. Matsuzaki, and J. Tsujii, “Evaluating contributions of natural language parsers to protein-protein interaction extraction,” Bioinformatics, vol. 25, no. 3, pp. 394–400, 2009.
[18]
K. Humphreys, G. Demetriou, and R. Gaizauskas, “Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '00), vol. 513, pp. 505–513, 2000.
[19]
R. Gaizauskas, G. Demetriou, P. J. Artymiuk, and P. Willett, “Protein structures and information extraction from biological texts: the PASTA system,” Bioinformatics, vol. 19, no. 1, pp. 135–143, 2003.
[20]
A. Divoli and T. K. Attwood, “BioIE: extracting informative sentences from the biomedical literature,” Bioinformatics, vol. 21, no. 9, pp. 2138–2139, 2005.
[21]
D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones, “BioRAT: extracting biological information from full-length papers,” Bioinformatics, vol. 20, no. 17, pp. 3206–3213, 2004.
[22]
H. Chen and B. M. Sharp, “Content-rich biological network constructed by mining PubMed abstracts,” BMC Bioinformatics, vol. 5, article 147, 2004.
[23]
X. Zhou, X. Zhang, and X. Hu, “Dragon toolkit: incorporating auto-learned semantic knowledge into large-scale text retrieval and mining,” in Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '07), pp. 197–201, October 2007.
[24]
D. Rebholz-Schuhmann, H. Kirsch, M. Arregui, S. Gaudan, M. Riethoven, and P. Stoehr, “EBIMed—text crunching to gather facts for proteins from Medline,” Bioinformatics, vol. 23, no. 2, pp. e237–e244, 2007.
[25]
Z. Z. Hu, I. Mani, V. Hermoso, H. Liu, and C. H. Wu, “iProLINK: an integrated protein resource for literature mining,” Computational Biology and Chemistry, vol. 28, no. 5-6, pp. 409–416, 2004.
[26]
J. Demaine, J. Martin, L. Wei, and B. De Bruijn, “LitMiner: integration of library services within a bio-informatics application,” Biomedical Digital Libraries, vol. 3, article 11, 2006.
[27]
M. Lease and E. Charniak, “Parsing biomedical literature,” in Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP '05), Jeju Island, Korea, 2005.
[28]
S. Pyysalo and T. Salakoski, “Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches,” BMC Bioinformatics, vol. 7, supplement 3, article S2, 2006.
[29]
L. Rimell and S. Clark, “Porting a lexicalized-grammar parser to the biomedical domain,” Journal of Biomedical Informatics, vol. 42, no. 5, pp. 852–865, 2009.
[30]
H. Cui, “Converting taxonomic descriptions to new digital formats,” Biodiversity Informatics, vol. 5, pp. 20–40, 2008.
[31]
D. Koning, I. N. Sarkar, and T. Moritz, “TaxonGrab: extracting taxonomic names from text,” Biodiversity Informatics, vol. 2, pp. 79–82, 2005.
[32]
L. M. Akella, C. N. Norton, and H. Miller, “NetiNeti: discovery of scientific names from text using machine learning methods,” 2011.
[33]
M. Gerner, G. Nenadic, and C. M. Bergman, “LINNAEUS: a species name identification system for biomedical literature,” BMC Bioinformatics, vol. 11, article 85, 2010.
[34]
N. Naderi and T. Kappler, “OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents,” Bioinformatics, vol. 27, no. 19, pp. 2721–2729, 2011.
[35]
R. Abascal and J. A. Sánchez, “X-tract: structure extraction from botanical textual descriptions,” in Proceedings of the String Processing & Information Retrieval Symposium & International Workshop on Groupware, pp. 2–7, IEEE Computer Society, Cancun, Mexico, September 1999.
[36]
H. Cui, “CharaParser for fine-grained semantic annotation of organism morphological descriptions,” Journal of the American Society for Information Science and Technology, vol. 63, no. 4, pp. 738–754, 2012.
[37]
M. Krauthammer, A. Rzhetsky, P. Morozov, and C. Friedman, “Using BLAST for identifying gene and protein names in journal articles,” Gene, vol. 259, no. 1-2, pp. 245–252, 2000.
[38]
L. Lenzi, F. Frabetti, F. Facchin, et al., “UniGene tabulator: a full parser for the UniGene format,” Bioinformatics, vol. 22, no. 20, pp. 2570–2571, 2006.
[39]
A. Nasr and O. Rambow, “Supertagging and full parsing,” in Proceedings of the 7th International Workshop on Tree Adjoining Grammar and Related Formalisms (TAG '04), 2004.
[40]
R. Leaman and G. Gonzalez, “BANNER: an executable survey of advances in biomedical named entity recognition,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '08), pp. 652–663, Kona, Hawaii, USA, January 2008.
[41]
M. Schröder, “Knowledge-based processing of medical language: a language engineering approach,” in Proceedings of the 16th German Conference on Artificial Intelligence (GWAI '92), vol. 671, pp. 221–234, Bonn, Germany, August-September 1992.
[42]
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 2nd edition, 2005.
[43]
C. Blaschke, L. Hirschman, and A. Valencia, “Information extraction in molecular biology,” Briefings in Bioinformatics, vol. 3, no. 2, pp. 154–165, 2002.
[44]
A. Jimeno-Yepes and A. R. Aronson, “Self-training and co-training in biomedical word sense disambiguation,” pp. 182–183.
[45]
C. Freeland, “An evaluation of taxonomic name finding & next steps in Biodiversity Heritage Library (BHL) developments,” Nature Precedings, 2009, http://precedings.nature.com/documents/3372/version/1.
[46]
A. Kornai, “Experimental HMM-based postal OCR system,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), vol. 4, pp. 3177–3180, April 1997.
[47]
A. Kornai, K. Mohiuddin, and S. D. Connell, “Recognition of cursive writing on personal checks,” in Proceedings of the 5th International Workshop on Frontiers in Handwriting Recognition, pp. 373–378, Citeseer, Essex, UK, 1996.
[48]
C. Freeland, “Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing,” in BioSystematics Berlin, 2011, http://www.slideshare.net/chrisfreeland/digitization-and-enhancement-of-biodiversity-literature-through-ocr-scientific-names-mapping-and-crowdsourcing.
[49]
A. Willis, D. King, D. Morse, A. Dil, C. Lyal, and D. Roberts, “From XML to XML: the why and how of making the biodiversity literature accessible to researchers,” in Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC '10), pp. 1237–1244, European Language Resources Association (ELRA), Valletta, Malta, May 2010.
[50]
F. Bapst and R. Ingold, “Using typography in document image analysis,” in Proceedings of Raster Imaging and Digital Typography (RIDT '98), pp. 240–251, Saint-Malo, France, March-April 1998.
[51]
A. L. Weitzman and C. H. C. Lyal, An XML Schema for Taxonomic Literature—TaXMLit, 2004, http://www.sil.si.edu/digitalcollections/bca/documentation/taXMLitv1-3Intro.pdf.
[52]
T. Rees, “TAXAMATCH, a ‘fuzzy’ matching algorithm for taxon names, and potential applications in taxonomic databases,” in Proceedings of TDWG, p. 35, 2008, http://www.tdwg.org/fileadmin/2008conference/documents/Proceedings2008.pdf#page=35.
[53]
G. Sautter, K. Böhm, and D. Agosti, “Semi-automated XML markup of biosystematic legacy literature with the GoldenGATE editor,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '07), pp. 391–402, World Scientific, 2007.
[54]
B. Settles, “ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text,” Bioinformatics, vol. 21, no. 14, pp. 3191–3192, 2005.
[55]
G. A. Pavlopoulos, E. Pafilis, M. Kuhn, S. D. Hooper, and R. Schneider, “OnTheFly: a tool for automated document-based text annotation, data linking and network generation,” Bioinformatics, vol. 25, no. 7, pp. 977–978, 2009.
[56]
E. Pafilis, S. I. O'Donoghue, L. J. Jensen et al., “Reflect: augmented browsing for the life scientist,” Nature Biotechnology, vol. 27, no. 6, pp. 508–510, 2009.
[57]
M. Kuhn, C. von Mering, M. Campillos, L. J. Jensen, and P. Bork, “STITCH: interaction networks of chemicals and proteins,” Nucleic Acids Research, vol. 36, no. 1, pp. D684–D688, 2008.
[58]
J. P. Balhoff, W. M. Dahdul, C. R. Kothari et al., “Phenex: ontological annotation of phenotypic diversity,” PLoS ONE, vol. 5, no. 5, article e10500, 2010.
[59]
W. M. Dahdul, J. P. Balhoff, J. Engeman et al., “Evolutionary characters, phenotypes and ontologies: curating data from the systematic biology literature,” PLoS ONE, vol. 5, no. 5, article e10708, 2010.
[60]
G. Sautter, K. Böhm, and D. Agosti, “A combining approach to find all taxon names (FAT) in legacy biosystematics literature,” Biodiversity Informatics, vol. 3, pp. 46–58, 2007.
[61]
P. R. Leary, D. P. Remsen, C. N. Norton, D. J. Patterson, and I. N. Sarkar, “UbioRSS: tracking taxonomic literature using RSS,” Bioinformatics, vol. 23, no. 11, pp. 1434–1436, 2007.
[62]
N. Okazaki and S. Ananiadou, “Building an abbreviation dictionary using a term recognition approach,” Bioinformatics, vol. 22, no. 24, pp. 3089–3095, 2006.
[63]
K. Bontcheva, V. Tablan, D. Maynard, and H. Cunningham, “Evolving GATE to meet new challenges in language engineering,” Natural Language Engineering, vol. 10, no. 3-4, pp. 349–373, 2004.
[64]
H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, C. Ursu, et al., Developing Language Processing Components with GATE (A User Guide), University of Sheffield, 2006.
[65]
E. Fitzpatrick, J. Bachenko, and D. Hindle, “The status of telegraphic sublanguages,” in Analyzing Language in Restricted Domains: Sublanguage Description and Processing, pp. 39–51, 1986.
[66]
M. Wood, S. Lydon, V. Tablan, D. Maynard, and H. Cunningham, “Populating a database from parallel texts using ontology-based information extraction,” in Natural Language Processing and Information Systems, vol. 3136, pp. 357–365, 2004.
[67]
L. Chen, H. Liu, and C. Friedman, “Gene name ambiguity of eukaryotic nomenclatures,” Bioinformatics, vol. 21, no. 2, pp. 248–256, 2005.
[68]
H. Yu, W. Kim, V. Hatzivassiloglou, and W. J. Wilbur, “Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles,” Journal of Biomedical Informatics, vol. 40, no. 2, pp. 150–159, 2007.
[69]
J. T. Chang and H. Schutze, “Abbreviations in biomedical text,” in Text Mining for Biology and Biomedicine, pp. 99–119, 2006.
[70]
J. D. Wren and H. R. Garner, “Heuristics for identification of acronym-definition patterns within text: towards an automated construction of comprehensive acronym-definition dictionaries,” Methods of Information in Medicine, vol. 41, no. 5, pp. 426–434, 2002.
[71]
S. Lydon and M. Wood, “Data patterns in multiple botanical descriptions: implications for automatic processing of legacy data,” Systematics and Biodiversity, vol. 1, no. 2, pp. 151–157, 2003.
[72]
A. Taylor, “Using Prolog for biological descriptions,” in Proceedings of the 3rd International Conference on the Practical Application of Prolog, pp. 587–597, 1995.
[73]
A. E. Radford, Fundamentals of Plant Systematics, Harper & Row, New York, NY, USA, 1986.
[74]
J. Diederich, R. Fortuner, and J. Milton, “Computer-assisted data extraction from the taxonomical literature,” 1999, http://math.ucdavis.edu/~milton/genisys.html.
[75]
M. Wood, S. Lydon, V. Tablan, D. Maynard, and H. Cunningham, “Using parallel texts to improve recall in IE,” in Proceedings of Recent Advances in Natural Language Processing (RANLP '03), pp. 505–512, Borovetz, Bulgaria, 2003.
[76]
H. Cui and P. B. Heidorn, “The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions,” Journal of the American Society for Information Science and Technology, vol. 58, no. 1, pp. 133–149, 2007.
[77]
Q. Wei, Information fusion in taxonomic descriptions, Ph.D. thesis, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2011.
[78]
S. Soderland, “Learning information extraction rules for semi-structured and free text,” Machine Learning, vol. 34, no. 1, pp. 233–272, 1999.
[79]
H. Cui, S. Singaram, and A. Janning, “Combine unsupervised learning and heuristic rules to annotate morphological characters,” Proceedings of the American Society for Information Science and Technology, vol. 48, no. 1, pp. 1–9, 2011.
[80]
P. M. Mabee, M. Ashburner, Q. Cronk et al., “Phenotype ontologies: the bridge between genomics and evolution,” Trends in Ecology and Evolution, vol. 22, no. 7, pp. 345–350, 2007.