Inter-Block GPU Communication via Fast Barrier Synchronization

Xiao, Shucai; Feng, Wu-chun

dc.contributor.author	Xiao, Shucai	en
dc.contributor.author	Feng, Wu-chun	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2013-06-19T14:35:43Z	en
dc.date.available	2013-06-19T14:35:43Z	en
dc.date.issued	2009	en
dc.description.abstract	The graphics processing unit (GPU) has evolved from a fixed-function processor with programmable stages to a programmable processor with many fixed-function components that deliver massive parallelism. Consequently, GPUs increasingly take advantage of the programmable processing power for general-purpose, non-graphics tasks, i.e., general-purpose computation on graphics processing units (GPGPU). However, while the GPU can massively accelerate data parallel (or task parallel) applications, the lack of explicit support for inter-block communication on the GPU hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires a barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which, in turn, incurs significant overhead. Thus, we seek to propose more efficient methods for inter-block communication. To systematically address this problem, we first present a performance model for the execution of kernels on GPUs. This performance model partitions the kernel’s execution time into three phases: (1) kernel launch to the GPU, (2) computation on the GPU, and (3) inter-block GPU communication via barrier synchronization. Using three well-known algorithms (FFT, dynamic programming, and bitonic sort), we show that the last phase, i.e., inter-block GPU communication, can consume more than 50% of the overall execution time. Therefore, we propose three new approaches to inter-block GPU communication via barrier synchronization, all of which run only on the GPU: GPU simple synchronization, GPU tree-based synchronization, and GPU lock-free synchronization. We then evaluate the efficacy of each of these approaches in isolation via a micro-benchmark as well as integrated with the three aforementioned algorithms.
For the micro-benchmark, the experimental results show that our GPU lock-free synchronization performs 7.8 times faster than CPU explicit synchronization and 3.7 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization improves the performance by 8%, 24%, and 39%, respectively, when compared to the more efficient CPU implicit synchronization.	en
dc.format.mimetype	application/pdf	en
dc.identifier	http://eprints.cs.vt.edu/archive/00001087/	en
dc.identifier.sourceurl	http://eprints.cs.vt.edu/archive/00001087/01/TR_GPU_synchronization.pdf	en
dc.identifier.trnumber	TR-09-19	en
dc.identifier.uri	http://hdl.handle.net/10919/19873	en
dc.language.iso	en	en
dc.publisher	Department of Computer Science, Virginia Polytechnic Institute & State University	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Parallel computation	en
dc.title	Inter-Block GPU Communication via Fast Barrier Synchronization	en
dc.type	Technical report	en
dc.type.dcmitype	Text	en
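
Note on the performance model: the three-phase partition described in the abstract can be sketched as a simple sum. The symbols used below (T, t_O, t_C, t_S, f, s) are notation assumed here for illustration only; this record does not give the report's own notation, and the report may define the model differently. The second expression is the standard Amdahl's-law bound, included only to show why a large synchronization share matters.

```latex
% Assumed notation: kernel execution time split into the three phases
% named in the abstract.
%   t_O : kernel launch to the GPU
%   t_C : computation on the GPU
%   t_S : inter-block GPU communication via barrier synchronization
\[
  T = t_O + t_C + t_S
\]

% If synchronization takes a fraction f = t_S / T of the execution time
% and is made s times faster while the other phases are unchanged,
% the overall speedup follows Amdahl's law:
\[
  \text{speedup} = \frac{1}{(1 - f) + f / s}
\]
```

As a purely illustrative calculation (these values are not results from the report), f = 0.5 and s = 2 would give 1 / (0.5 + 0.25), or about 1.33.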

Files

Original bundle

Name: TR_GPU_synchronization.pdf
Size: 2.37 MB
Format: Adobe Portable Document Format

Collections

Computer Science Technical Reports