SCBI_MapReduce, a New Ruby Task-Farm Skeleton for Automated Parallelisation and Distribution in Chunks of Sequences: The Implementation of a Boosted Blast+
Current genomic analyses often require the management and comparison of big data using desktop bioinformatic software that was not developed with multicore distribution in mind. The task-farm SCBI_MAPREDUCE is intended to simplify the trivial parallelisation and distribution of new and legacy software and scripts for biologists who are interested in using computers but are not skilled programmers. In the case of legacy applications, there is no need to modify or rewrite the source code. It can be used on anything from multicore workstations to heterogeneous grids. Tests have demonstrated that speed-up scales almost linearly and that distribution in small chunks increases it. It is also shown that SCBI_MAPREDUCE takes advantage of shared storage when necessary, is fault-tolerant, allows for resuming aborted jobs, does not need special hardware or virtual machine support, and provides the same results as parallelised legacy software. The same is true for interrupted and relaunched jobs. As a proof of concept, the distribution of a compiled version of BLAST+ in the SCBI_DISTRIBUTED_BLAST gem is given, indicating that other BLAST binaries can be used while maintaining the same SCBI_DISTRIBUTED_BLAST code. Therefore, SCBI_MAPREDUCE suits most parallelisation and distribution needs in, for example, gene and genome studies.
1. Introduction
The study of genomes is undergoing a revolution: the production of an ever-growing amount of sequences increases year by year at a rate that outpaces computing performance [1]. This huge amount of sequences needs to be processed with well-proven algorithms that will not run faster on new computer chips, since around 2003 chipmakers discovered that they were no longer able to sustain faster sequential execution and could only offer multicore chips instead [2, 3]. Therefore, the only current way to obtain results in a timely manner is to develop software that exploits multicore CPUs or clusters of multiprocessors.
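The chunked task-farm idea summarised in the abstract can be illustrated with a minimal sketch. It is not the SCBI_MAPREDUCE API: the method name `farm_in_chunks`, its parameters, and the GC-counting workload are hypothetical, and only Ruby's standard library is assumed. Sequences are split into chunks, idle workers pull the next chunk from a shared queue, and results are reassembled in input order.

```ruby
# Hypothetical sketch of a chunked task farm; NOT the SCBI_MapReduce API.
# Uses only Ruby's standard library (Queue and Thread).

# Split +sequences+ into chunks of +chunk_size+, let +workers+ threads pull
# chunks from a shared queue, apply the given block to every sequence, and
# return the results in the original input order.
def farm_in_chunks(sequences, chunk_size:, workers:, &work)
  chunks = sequences.each_slice(chunk_size).to_a

  tasks = Queue.new
  chunks.each_with_index { |chunk, index| tasks << [index, chunk] }
  tasks.close # once the queue is drained, pop returns nil and workers stop

  # Each worker writes only to the slot of the chunk it processed.
  results = Array.new(chunks.size)
  pool = Array.new(workers) do
    Thread.new do
      while (task = tasks.pop)
        index, chunk = task
        results[index] = chunk.map { |seq| work.call(seq) }
      end
    end
  end
  pool.each(&:join)

  results.flatten(1)
end

sequences = %w[ACGT GGCC ATAT CGCG TTAA]
# Counting G and C bases stands in for real per-sequence work.
gc_counts = farm_in_chunks(sequences, chunk_size: 2, workers: 3) do |seq|
  seq.count("GC")
end
p gc_counts
```

Because every idle worker takes the next available chunk, smaller chunks tend to balance the load more evenly among workers of unequal speed, at the cost of more scheduling overhead. Threads in the standard Ruby interpreter do not execute Ruby code in parallel, so the sketch shows the scheduling pattern only, not a real speed-up.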
In such a context, “cloud computing” is becoming a cost-effective and powerful source of multicore clusters for task distribution in bioinformatics [1, 2]. Sequence alignment and comparison are the most important topics in bioinformatic studies of genes and genomes. Alignment is a complex process that tries to optimise sequence homology by means of sequence similarity, using the Needleman-Wunsch algorithm for global alignment or the Smith-Waterman algorithm for local alignment. BLAST and FASTA [4] are the most widespread tools that have implemented them. Paired sequence comparison is inherently a parallel process in which many
References
[1]
C. Huttenhower and O. Hofmann, “A quick guide to large-scale genomic data mining,” PLoS Computational Biology, vol. 6, no. 5, Article ID e1000779, 2010.
[2]
M. C. Schatz, B. Langmead, and S. L. Salzberg, “Cloud computing and the DNA data race,” Nature Biotechnology, vol. 28, no. 7, pp. 691–693, 2010.
[3]
D. Patterson, “The trouble with multi-core,” IEEE Spectrum, vol. 47, no. 7, pp. 28–53, 2010.
[4]
C. Camacho, G. Coulouris, V. Avagyan et al., “BLAST+: architecture and applications,” BMC Bioinformatics, vol. 10, article 421, 2009.
[5]
S. Gálvez, D. Díaz, P. Hernández, F. J. Esteban, J. A. Caballero, and G. Dorado, “Next-generation bioinformatics: using many-core processor architecture to develop a web service for sequence alignment,” Bioinformatics, vol. 26, no. 5, pp. 683–686, 2010.
[6]
H. Lin, X. Ma, W. Feng, and N. F. Samatova, “Coordinating computation and I/O in massively parallel sequence search,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 4, pp. 529–543, 2011.
[7]
T. Nguyen, W. Shi, and D. Ruden, “CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping,” BMC Research Notes, vol. 4, article 171, 2011.
[8]
T. Rognes, “Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation,” BMC Bioinformatics, vol. 12, article 221, 2011.
[9]
X.-L. Yang, Y.-L. Liu, C.-F. Yuan, and Y.-H. Huang, “Parallelization of BLAST with MapReduce for long sequence alignment,” in Proceedings of the 4th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP '11), pp. 241–246, IEEE Computer Society, December 2011.
[10]
B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, “Searching for SNPs with cloud computing,” Genome Biology, vol. 10, no. 11, article R134, 2009.
[11]
M. Needham, R. Hu, S. Dwarkadas, and X. Qiu, “Hierarchical parallelization of gene differential association analysis,” BMC Bioinformatics, vol. 12, article 374, 2011.
[12]
M. K. Gardner, W.-C. Feng, J. Archuleta, H. Lin, and X. Ma, “Parallel genomic sequence-searching on an ad-hoc grid: experiences, lessons learned, and implications,” in Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, vol. 1, pp. 1–14, 2006.
[13]
L. Yu, C. Moretti, A. Thrasher, S. Emrich, K. Judd, and D. Thain, “Harnessing parallelism in multicore clusters with the All-Pairs, Wavefront, and Makeflow abstractions,” Cluster Computing, vol. 13, no. 3, pp. 243–256, 2010.
[14]
M. K. Chen and K. Olukotun, “The Jrpm system for dynamically parallelizing Java programs,” in Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA '03), pp. 434–445, San Diego, Calif, USA, June 2003.
[15]
P. Haller and M. Odersky, “Scala Actors: unifying thread-based and event-based programming,” Theoretical Computer Science, vol. 410, no. 2-3, pp. 202–220, 2009.
[16]
J. Armstrong, R. Virding, C. Wikström, and M. Williams, Concurrent Programming in ERLANG, Prentice Hall, 2nd edition, 1996.
[17]
W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, Mass, USA, 2nd edition, 1999.
[18]
L. Dagum and R. Menon, “OpenMP: an industry-standard API for shared-memory programming,” IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46–55, 1998.
[19]
Q. Zou, X.-B. Li, W.-R. Jiang, Z.-Y. Lin, G.-L. Li, and K. Chen, “Survey of MapReduce frame operation in bioinformatics,” Briefings in Bioinformatics. In press.
[20]
R. C. Taylor, “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics,” BMC Bioinformatics, vol. 11, supplement 12, p. S1, 2010.
[21]
J. Lin, “MapReduce is good enough?” Big Data, vol. 1, no. 1, pp. 28–37, 2013.
[22]
D. Thain, T. Tannenbaum, and M. Livny, “Distributed computing in practice: the Condor experience,” Concurrency and Computation: Practice and Experience, vol. 17, no. 2–4, pp. 323–356, 2005.
[23]
S. Pellicer, G. Chen, K. C. C. Chan, and Y. Pan, “Distributed sequence alignment applications for the public computing architecture,” IEEE Transactions on Nanobioscience, vol. 7, no. 1, pp. 35–43, 2008.
[24]
J. Hill, M. Hambley, T. Forster et al., “SPRINT: a new parallel framework for R,” BMC Bioinformatics, vol. 9, article 558, 2008.
[25]
J. Li, X. Ma, S. Yoginath, G. Kora, and N. F. Samatova, “Transparent runtime parallelization of the R scripting language,” Journal of Parallel and Distributed Computing, vol. 71, no. 2, pp. 157–168, 2011.
[26]
F. Berenger, C. Coti, and K. Y. J. Zhang, “PAR: a PARallel and distributed job crusher,” Bioinformatics, vol. 26, no. 22, pp. 2918–2919, 2010.
[27]
M. Aldinucci, M. Torquati, C. Spampinato et al., “Parallel stochastic systems biology in the cloud,” Briefings in Bioinformatics. In press.
[28]
A. Matsunaga, M. Tsugawa, and J. Fortes, “CloudBLAST: combining MapReduce and virtualization on distributed resources for bioinformatics applications,” in Proceedings of the 4th IEEE International Conference on eScience (eScience '08), pp. 222–229, IEEE Computer Society, Washington, DC, USA, December 2008.
[29]
W. Lu, J. Jackson, and R. Barga, “AzureBlast: a case study of developing science applications on the cloud,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), pp. 413–420, ACM, Chicago, Ill, USA, June 2010.
[30]
P. D. Vouzis and N. V. Sahinidis, “GPU-BLAST: using graphics processors to accelerate protein sequence alignment,” Bioinformatics, vol. 27, no. 2, pp. 182–188, 2011.
[31]
C. S. Oehmen and D. J. Baxter, “ScalaBLAST 2.0: rapid and robust BLAST calculations on multiprocessor systems,” Bioinformatics, vol. 29, no. 6, pp. 797–798, 2013.
[32]
J. Aerts and A. Law, “An introduction to scripting in Ruby for biologists,” BMC Bioinformatics, vol. 10, article 221, 2009.
[33]
S. Balakrishnan, R. Rajwar, M. Upton, and K. Lai, “The impact of performance asymmetry in emerging multicore architectures,” SIGARCH Computer Architecture News, vol. 33, no. 2, pp. 506–517, 2005.
[34]
L. Jostins and J. Jaeger, “Reverse engineering a gene network using an asynchronous parallel evolution strategy,” BMC Systems Biology, vol. 4, article 17, 2010.
[35]
O. Thorsen, B. Smith, C. P. Sosa et al., “Parallel genomic sequence-search on a massively parallel system,” in Proceedings of the 4th Conference on Computing Frontiers (CF '07), pp. 59–68, Ischia, Italy, May 2007.
[36]
M. Armbrust, A. Fox, R. Griffith et al., “A view of cloud computing,” Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.
[37]
C.-L. Hung and Y.-L. Lin, “Implementation of a parallel protein structure alignment service on cloud,” International Journal of Genomics, vol. 2013, Article ID 439681, 8 pages, 2013.