Inter-Block GPU Communication via Fast Barrier Synchronization
The graphics processing unit (GPU) has evolved from a ﬁxed-function processor with programmable stages to a programmable processor with many ﬁxed-function components that deliver massive parallelism. Consequently, GPUs increasingly take advantage of the programmable processing power for general-purpose, non-graphics tasks, i.e., general-purpose computation on graphics processing units (GPGPU). However, while the GPU can massively accelerate data parallel (or task parallel) applications, the lack of explicit support for inter-block communication on the GPU hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires a barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn, incurs signiﬁcant overhead. Thus, we seek to propose more efﬁcient methods for inter-block communication. To systematically address this problem, we ﬁrst present a performance model for the execution of kernels on GPUs. This performance model partitions the kernel’s execution time into three phases: (1) kernel launch to the GPU, (2) computation on the GPU, and (3) inter-block GPU communication via barrier synchronization. Using three well-known algorithms — FFT, dynamic programming, and bitonic sort — we show that the latter phase, i.e., inter-block GPU communication, can consume more than 50% of the overall execution time. Therefore, we propose three new approaches to inter-block GPU communication via barrier synchronization, all of which run only on the GPU: GPU simple synchronization, GPU tree-based synchronization, and GPU lock-free synchronization. We then evaluate the efficacy of each of these approaches in isolation via a micro-benchmark as well as integrated with the three aforementioned algorithms. For the micro-benchmark, the experimental results show that our GPU lock-free synchronization performs 7.8 times faster than CPU explicit synchronization and 3.7 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization improves the performance by 8%, 24%, and 39%, respectively, when compared to the more efﬁcient CPU implicit synchronization.