SCBI_MapReduce, a New Ruby Task-Farm Skeleton for Automated Parallelisation and Distribution in Chunks of Sequences: The Implementation of a Boosted Blast+

DOI: 10.1155/2013/707540

Current genomic analyses often require the management and comparison of big data using desktop bioinformatic software that was not developed with multicore distribution in mind. The task-farm SCBI_MAPREDUCE is intended to simplify the trivial parallelisation and distribution of new and legacy software and scripts for biologists who are interested in using computers but are not skilled programmers. In the case of legacy applications, there is no need to modify or rewrite the source code. It can be used on anything from multicore workstations to heterogeneous grids. Tests have demonstrated that speed-up scales almost linearly and that distribution in small chunks increases it. It is also shown that SCBI_MAPREDUCE takes advantage of shared storage when necessary, is fault-tolerant, allows for resuming aborted jobs, does not need special hardware or virtual machine support, and provides the same results as parallelised legacy software. The same is true for interrupted and relaunched jobs. As proof-of-concept, distribution of a compiled version of BLAST+ in the SCBI_DISTRIBUTED_BLAST gem is given, indicating that other BLAST binaries can be used while maintaining the same SCBI_DISTRIBUTED_BLAST code. Therefore, SCBI_MAPREDUCE suits most parallelisation and distribution needs in, for example, gene and genome studies.

1. Introduction

The study of genomes is undergoing a revolution: the production of sequences grows year by year at a rate that outpaces computing performance [1]. This huge amount of sequences needs to be processed with well-proven algorithms that will not run faster on new computer chips, since around 2003 chipmakers discovered that they were no longer able to sustain faster sequential execution and could improve performance only by producing multicore chips [2, 3]. Therefore, the only current way to obtain results in a timely manner is to develop software that deals with multicore CPUs or clusters of multiprocessors.
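To make the task-farm idea concrete, the following is a minimal sketch in plain Ruby of distributing sequences in chunks to a pool of workers. It is illustrative only and does not use or reproduce the SCBI_MAPREDUCE API: the names `task_farm` and `process_sequence` are hypothetical, and the reverse-complement step merely stands in for whatever per-sequence work a real tool would perform.

```ruby
# Minimal task-farm sketch (illustrative only; NOT the SCBI_MapReduce API).
# A manager splits the input into chunks, workers pull chunks from a shared
# queue, and the manager reassembles the results in the original input order.

# Hypothetical stand-in for the real per-sequence work
# (for example, invoking a legacy tool on one sequence).
def process_sequence(seq)
  seq.reverse.tr('ACGT', 'TGCA') # reverse complement of a DNA string
end

def task_farm(sequences, chunk_size:, workers:)
  jobs = Queue.new
  # Each job carries its chunk index so results can be reordered later.
  sequences.each_slice(chunk_size).with_index { |chunk, i| jobs << [i, chunk] }
  jobs.close # once drained, pop returns nil and the workers stop

  results = {}
  lock = Mutex.new

  pool = Array.new(workers) do
    Thread.new do
      while (job = jobs.pop)
        index, chunk = job
        processed = chunk.map { |seq| process_sequence(seq) }
        lock.synchronize { results[index] = processed }
      end
    end
  end
  pool.each(&:join)

  # Reassemble chunks by index so the output matches the input order.
  results.keys.sort.flat_map { |i| results[i] }
end

sequences = %w[ACGT AAAC GGTT TTGA CCCA]
output = task_farm(sequences, chunk_size: 2, workers: 3)
puts output.inspect
```

Threads are used here only to keep the sketch self-contained. In the standard Ruby interpreter they do not run CPU-bound Ruby code in parallel, so a real task farm would hand the chunks to separate processes or machines.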
In such a context, “cloud computing” is becoming a cost-effective and powerful source of multicore clusters for task distribution in bioinformatics [1, 2]. Sequence alignment and comparison are the most important topics in bioinformatic studies of genes and genomes. Alignment is a complex process that tries to optimise sequence homology by means of sequence similarity, using the Needleman-Wunsch algorithm for global alignment or the Smith-Waterman algorithm for local alignments. BLAST and FASTA [4] are the most widespread tools that have implemented them. Paired sequence comparison is inherently a parallel process in which many


