[1]
Liu X,Smelyanskiy M,Chow E,et al.Efficient sparse matrix-vector multiplication on x86-based many-core processors[A].Proceedings of the 27th International Conference on Supercomputing[C].ACM,2013.273-282.
[2]
Saini S,Jin H,Jespersen D,et al.An early performance evaluation of many integrated core architecture based SGI rackable computing system[A].Proceedings of the 2013 International Conference for High Performance Computing,Networking,Storage and Analysis[C].ACM,2013.94.
[3]
Jeffers J,Reinders J.Intel Xeon Phi Coprocessor High Performance Programming[M].Newnes,2013.
[4]
Owens J D,Luebke D,Govindaraju N,et al.A survey of general purpose computation on graphics hardware[J].Computer Graphics Forum,2007,26(1):80-113.
[5]
王海峰,陈庆奎.图形处理器通用计算关键技术研究综述[J].计算机学报,2013,36(4):757-772. WANG Hai-Feng,CHEN Qing-Kui.General purpose computing of graphics processing unit:a survey[J].Chinese Journal of Computers,2013,36(4):757-772.(in Chinese)
[6]
王蕾,等.任务并行编程模型研究与进展[J].软件学报,2013,24(1):77-90. Wang L,et al.Research on task parallel programming model[J].Journal of Software,2013,24(1):77-90.(in Chinese)
[7]
Lee S,Min S J,Eigenmann R.OpenMP to GPGPU:a compiler framework for automatic translation and optimization[J].ACM Sigplan Notices,2009,44(4):101-110.
[8]
Lee S,Eigenmann R.OpenMPC:extended openMP programming and tuning for GPUs[A].Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing,Networking,Storage and Analysis[C].IEEE,2010.1-11.
[9]
张保,等.CPU-GPU 系统中基于剖分的全局性能优化方法[J].西安交通大学学报,2012,46(2):17-23. Zhang Bao,et al.Profiling based optimization method for CPU-GPU heterogeneous parallel processing system[J].Journal of Xi'an Jiaotong University,2012,46(2):17-23.(in Chinese)
[10]
Wang P H,Collins J D,Chinya G N,et al.EXOCHI:architecture and programming environment for a heterogeneous multi-core multithreaded system[A].Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation[C].ACM,2007.156-166.
[11]
Luk C K,Hong S,Kim H.Qilin:exploiting parallelism on heterogeneous multi-processors with adaptive mapping[A].Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture[C].IEEE,2009.45-55.
[12]
Jablin T B,Prabhu P,et al.Automatic CPU-GPU communication management and optimization[A].Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation[C].ACM,2011.142-151.
[13]
Lee J,Kim J,Seo S,et al.An OpenCL framework for heterogeneous multicores with local memory[A].Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques[C].ACM,2010.193-204.
[14]
Reyes R,López-Rodríguez I,Fumero J J,et al.accULL:An OpenACC implementation with CUDA and OpenCL support[J].Euro-Par Parallel Processing Lecture Notes in Computer Science,2012,7484:871-882.
[15]
Wienke S,Springer P,et al.OpenACC-first experiences with real-world applications[J].Euro-Par Parallel Processing Lecture Notes in Computer Science,2012,7484:859-870.
[16]
Maillard N.Hybrid parallel programming:evaluation of OpenACC[D].Universidade Federal do Rio Grande do Sul,2012.
[17]
Chen C,Yang C,Tang T,et al.OpenACC to Intel offload:automatic translation and optimization[J].Computer Engineering and Technology Communications in Computer and Information Science,2013,396:111-120.
[18]
Kumar Pusukuri K,Gupta R,et al.ADAPT:a framework for coscheduling multithreaded programs[J].ACM Transactions on Architecture and Code Optimization,2013,9(4):45.
[19]
Stratton J A,et al.Parboil:a revised benchmark suite for scientific and commercial throughput computing[R].Illinois,US:Center for Reliable and High-Performance Computing of University of Illinois at Urbana-Champaign,2012.
[20]
Podlozhnyuk V,Harris M.Monte Carlo option pricing[R].California,US:NVIDIA Corporation,2008.
[21]
Govindaraju N K,Lloyd B,Dotsenko Y,et al.High performance discrete Fourier transforms on graphics processors[A].Proceedings of the 2008 ACM/IEEE Conference on Supercomputing[C].IEEE,2008.2.
[22]
Nyland L,Harris M,Prins J.Fast n-body simulation with CUDA[J].GPU Gems,2007,3(1):677-696.
[23]
Che S,Boyer M,Meng J,et al.Rodinia:A benchmark suite for heterogeneous computing[A].Proceedings of the 2009 International Symposium on Workload Characterization[C].IEEE,2009.44-54.
[24]
Brodtkorb A R,Dyken C,Hagen T R,Hjelmervik J M,Storaasli O O.State-of-the-art in heterogeneous computing[J].Scientific Programming,2010,18(1):1-33.
[25]
Top 500 supercomputer sites [OL].http://www.top500.org/,2012-12.
[26]
Kothapalli K,Banerjee D S,et al.CPU and/or GPU:Revisiting the GPU Vs CPU Myth[J].arXiv,2013,1303(2171):1-20.
[27]
Saha B,Zhou X,Chen H,et al.Programming model for a heterogeneous x86 platform[A].Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation[C].ACM,2009.431-440.
[28]
Brodtkorb A R,et al.Graphics processing unit(GPU)programming strategies and trends in GPU computing[J].Journal of Parallel and Distributed Computing,2013,73(1):4-13.
Yang Y,Xiang P,Mantor M,et al.CPU-assisted GPGPU on fused CPU-GPU architectures[A].Proceedings of the 18th International Symposium on High Performance Computer Architecture[C].IEEE,2012.1-12.
[31]
Daga M,Aji A M,Feng W.On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing[A].Proceedings of the Symposium on Application Accelerators in High-Performance Computing[C].IEEE,2011.141-149.
[32]
Han T D,Abdelrahman T S.hiCUDA:High-level GPGPU programming[J].IEEE Transactions on Parallel and Distributed Systems,2011,22(1):78-90.
[33]
Baskaran M M,Ramanujam J,Sadayappan P.Automatic C-to-CUDA code generation for affine programs[J].Compiler Construction,2010,6011:244-263.
[34]
Linderman M D,Collins J D,Wang H,et al.Merge:a programming model for heterogeneous multi-core systems[A].Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems[C].ACM,2008.287-296.
[35]
Dubach C,Cheng P,Rabbah R,et al.Compiling a high-level language for GPUs:(via language support for architectures and compilers)[A].Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation[C].ACM,2012.1-12.
[36]
Liu W,Lewis B,Zhou X,et al.A balanced programming model for emerging heterogeneous multicore systems[A].Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism[C].USENIX Association,2010.3-3.
[37]
Gelado I,Stone J E,Cabezas J,et al.An asymmetric distributed shared memory model for heterogeneous parallel systems[A].Proceedings of the 15th edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems[C].ACM,2010.347-358.
[38]
Ryoo S,Rodrigues C I,Baghsorkhi S S,et al.Optimization principles and application performance evaluation of a multithreaded GPU using CUDA[A].Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming[C].ACM,2008.73-82.
[39]
Baskaran M M,Bondhugula U,Krishnamoorthy S,et al.A compiler framework for optimization of affine loop nests for GPGPUs[A].Proceedings of the 22nd Annual International Conference on Supercomputing[C].ACM,2008.225-234.
[40]
Jang B,Schaa D,Mistry P,et al.Exploiting memory access patterns to improve memory performance in data-parallel architectures[J].IEEE Transactions on Parallel and Distributed Systems,2011,22(1):105-118.
[41]
Sundaram N,et al.A framework for efficient and scalable execution of domain-specific templates on GPUs[A].Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing[C].IEEE,2009.1-12.
[42]
He B,Fang W,Luo Q,et al.Mars:a MapReduce framework on graphics processors[A].Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques[C].ACM,2008.260-269.
[43]
Liu Y,Zhang E Z,Shen X.A cross-input adaptive framework for GPU program optimizations[A].Proceedings of the 2009 International Symposium on Parallel & Distributed Processing[C].IEEE,2009.1-10.
[44]
Lee J,Lakshminarayana N B,et al.Many-thread aware prefetching mechanisms for GPGPU applications[A].Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture[C].IEEE,2010.213-224.
[45]
Yang Y,Xiang P,et al.A GPGPU compiler for memory optimization and parallelism management[A].Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation[C].ACM,2010.86-97.
[46]
Yang Y,Xiang P,et al.Shared memory multiplexing:a novel way to improve GPGPU throughput[A].Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques[C].ACM,2012.283-292.
[47]
张保,等.面向图形处理器重叠通信与计算的数据划分方法[J].西安交通大学学报,2011,45(4):1-6. Zhang Bao,et al.Novel GPU data partitioning method to overlap communication and computation[J].Journal of Xi'an Jiaotong University,2011,45(4):1-6.(in Chinese)
[48]
Volkov V,Demmel J W.Benchmarking GPUs to tune dense linear algebra[A].Proceedings of the 2008 ACM/IEEE Conference on Supercomputing[C].IEEE,2008.31.