Inter-Block GPU Communication via Fast Barrier Synchronization

Xiao, Shucai; Feng, Wu-chun

Files

TR_GPU_synchronization.pdf (2.37 MB)

TR Number

TR-09-19

Date

2009

Authors

Xiao, Shucai

Feng, Wu-chun

Publisher

Department of Computer Science, Virginia Polytechnic Institute & State University

Abstract

The graphics processing unit (GPU) has evolved from a fixed-function processor with programmable stages to a programmable processor with many fixed-function components that deliver massive parallelism. Consequently, this programmable processing power is increasingly used for general-purpose, non-graphics tasks, i.e., general-purpose computation on graphics processing units (GPGPU). However, while the GPU can massively accelerate data-parallel (or task-parallel) applications, the lack of explicit support for inter-block communication on the GPU hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires a barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which, in turn, incurs significant overhead. Thus, we propose more efficient methods for inter-block communication. To systematically address this problem, we first present a performance model for the execution of kernels on GPUs. This performance model partitions the kernel's execution time into three phases: (1) kernel launch to the GPU, (2) computation on the GPU, and (3) inter-block GPU communication via barrier synchronization. Using three well-known algorithms (FFT, dynamic programming, and bitonic sort), we show that the last phase, i.e., inter-block GPU communication, can consume more than 50% of the overall execution time. Therefore, we propose three new approaches to inter-block GPU communication via barrier synchronization, all of which run only on the GPU: GPU simple synchronization, GPU tree-based synchronization, and GPU lock-free synchronization. We then evaluate the efficacy of each of these approaches in isolation via a micro-benchmark as well as integrated with the three aforementioned algorithms.
For the micro-benchmark, the experimental results show that our GPU lock-free synchronization performs 7.8 times faster than CPU explicit synchronization and 3.7 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization improves the performance by 8%, 24%, and 39%, respectively, when compared to the more efficient CPU implicit synchronization.
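
The general idea of a barrier that needs no round trip to the host can be illustrated with a small sketch. The code below is not taken from the report and is not CUDA: it is a CPU-side Python analogy, assuming that threads stand in for GPU blocks, a Python list stands in for global memory, and a lock-protected increment stands in for an atomic add. The names CounterBarrier, run_blocks, and reference are hypothetical. It shows one plausible counter-based barrier, in which every participant increments a shared counter and spins until the counter reaches a goal that grows with each barrier call, so the counter never has to be reset.

```python
import threading
import time


class CounterBarrier:
    """Reusable barrier built on a single shared counter (illustrative only)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.count = 0                  # stands in for a counter in GPU global memory
        self._lock = threading.Lock()   # stands in for a hardware atomic add

    def wait(self, call_number):
        """Block until every participant has made its call_number-th call."""
        goal = self.num_blocks * call_number
        with self._lock:
            self.count += 1             # announce arrival
        while self.count < goal:        # spin until all participants have arrived
            time.sleep(0)               # yield so other threads can make progress


def run_blocks(num_blocks=4, num_rounds=3):
    """Each 'block' repeatedly adds its left neighbour's value to its own."""
    barrier = CounterBarrier(num_blocks)
    data = list(range(num_blocks))      # shared array, stands in for global memory

    def block(bid):
        calls = 0
        for _ in range(num_rounds):
            new_val = data[bid] + data[(bid - 1) % num_blocks]
            calls += 1
            barrier.wait(calls)         # all reads of the old values are done
            data[bid] = new_val
            calls += 1
            barrier.wait(calls)         # all writes of the new values are done

    threads = [threading.Thread(target=block, args=(b,)) for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data


def reference(num_blocks=4, num_rounds=3):
    """Sequential version of the same update, used to check the result."""
    data = list(range(num_blocks))
    for _ in range(num_rounds):
        data = [data[i] + data[(i - 1) % num_blocks] for i in range(num_blocks)]
    return data


if __name__ == "__main__":
    print(run_blocks(4, 3), reference(4, 3))
```

Each round uses two barrier calls: one so that no participant overwrites a value its neighbour still has to read, and one so that every new value is written before the next round reads it. On a real GPU, a scheme like this also depends on all blocks being resident at the same time and on appropriate memory fences, neither of which this analogy models.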

Keywords

Parallel computation

Persistent link

http://hdl.handle.net/10919/19873

Collections

Computer Science Technical Reports
