BLP: Block-Level Pipelining for GPUs
Abstract
Programming models like OpenMP offer expressive interfaces for programming graphics processing units (GPUs) via directive-based offload. By default, these models copy data to or from the device without overlapping the transfers with computation, which degrades performance. Rather than leave the onerous task of manually pipelining and tuning data communication and computation to the end user, we propose an OpenMP extension that supports block-level pipelining and, in turn, present our block-level pipelining (BLP) approach, which overlaps data communication and computation within a single kernel. BLP uses persistent thread blocks with cooperative thread groups to process sub-tasks on different streaming multiprocessors and uses GPU flag arrays to enforce task dependencies without CPU involvement.
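The kernel below is a minimal sketch of the kind of single-kernel, flag-based pipelining the abstract describes; it is not the authors' implementation. The kernel name, chunk sizes, and the even/odd pairing of copy and compute blocks are illustrative assumptions. It also assumes host_in points to host-pinned, device-mapped memory and that the grid is launched cooperatively so every persistent block is co-resident.

// Minimal sketch (assumptions): persistent blocks work in copy/compute pairs;
// a device-resident flag array enforces the per-chunk dependency without the CPU.
#include <cuda_runtime.h>

#define NUM_CHUNKS 64
#define CHUNK      (1 << 20)   // elements per sub-task (illustrative)

// ready[c] is raised by a copy block once chunk c resides in device memory;
// the paired compute block spins on it, so no CPU round-trip is needed.
__global__ void pipeline_kernel(const float *host_in,   // pinned, device-mapped
                                float *dev_buf, float *dev_out,
                                volatile int *ready)
{
    int  pair      = blockIdx.x / 2;      // copy/compute blocks form pairs
    int  num_pairs = gridDim.x / 2;       // gridDim.x assumed even
    bool is_copy   = (blockIdx.x % 2 == 0);

    // Persistent blocks: each pair loops over its share of the chunks.
    for (int c = pair; c < NUM_CHUNKS; c += num_pairs) {
        size_t base = (size_t)c * CHUNK;
        if (is_copy) {
            // Copy stage: stream chunk c from mapped host memory into the
            // device staging buffer, then publish it via the flag array.
            for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
                dev_buf[base + i] = host_in[base + i];
            __syncthreads();
            if (threadIdx.x == 0) {
                __threadfence();                  // make the copied data visible
                atomicExch((int *)&ready[c], 1);  // then raise the flag
            }
        } else {
            // Compute stage: wait for the chunk, then process it.
            if (threadIdx.x == 0)
                while (atomicAdd((int *)&ready[c], 0) == 0) { /* spin */ }
            __syncthreads();
            for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
                dev_out[base + i] = 2.0f * dev_buf[base + i];
        }
    }
}

On the host side, one would zero-initialize the flag array and launch the kernel with cudaLaunchCooperativeKernel, which guarantees that all blocks are resident at once, a requirement of the persistent-block scheme so that spinning compute blocks cannot starve their copy partners.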
To demonstrate the efficacy of BLP, we evaluate its performance using multiple benchmarks on NVIDIA V100 GPUs. Our experimental results show that BLP achieves 95% to 114% of the performance of hand-tuned kernel-level pipelining. In addition, using BLP with buffer mapping can reduce memory usage to support GPU memory oversubscription. We also show that BLP can reduce memory usage by 75% to 86% for data sets that exceed GPU memory while providing significantly better performance than CUDA Unified Memory (UM) with prefetching optimizations.