
BLP: Block-Level Pipelining for GPUs

Date

2024-05-07

Publisher

ACM

Abstract

Programming models like OpenMP offer expressive interfaces for programming graphics processing units (GPUs) via directive-based offload. By default, these models copy data to or from the device without overlapping computation, which hurts performance. Rather than leave the onerous task of manually pipelining and tuning data communication and computation to the end user, we propose an OpenMP extension that supports block-level pipelining and, in turn, present our block-level pipelining (BLP) approach, which overlaps data communication and computation in a single kernel. BLP uses persistent thread blocks with cooperative thread groups to process sub-tasks on different streaming multiprocessors and uses GPU flag arrays to enforce task dependencies without CPU involvement.

To demonstrate the efficacy of BLP, we evaluate its performance using multiple benchmarks on NVIDIA V100 GPUs. Our experimental results show that BLP achieves 95% to 114% of the performance of hand-tuned kernel-level pipelining. In addition, using BLP with buffer mapping can reduce memory usage to support GPU memory oversubscription. We also show that BLP can reduce memory usage by 75% to 86% for data sets that exceed GPU memory while providing significantly better performance than CUDA Unified Memory (UM) with prefetching optimizations.
