|
Data Parallelism for Large-scale Distributed Computing. Keywords: Parallelism, Large-scale, Distributed. Abstract: Large-scale computing systems are attractive for networked applications because they provide scalable infrastructure. When distributed data-intensive applications are launched on such infrastructure, communication cost, for example the cost of transferring data files to compute nodes, can be a critical challenge because point-to-point bandwidth is scarce. One way to improve communication performance is to employ parallelism in data retrieval. In this paper, we consider data parallelism for large-scale, data-intensive computing. Our approach is to retrieve data from multiple replica servers in parallel. To improve performance and fault tolerance, we present a new parallel data retrieval algorithm based on replicated retrieval of slowdown blocks. We then explore a broad set of resource selection techniques to identify compute nodes that have good download performance from the data servers for given jobs. Our experimental results, based on trace data collected from PlanetLab, show the benefits of our approach in large-scale, failure-prone environments.
|