BLP: Block-Level Pipelining for GPUs
Abstract
Programming models like OpenMP offer expressive interfaces for programming graphics processing units (GPUs) via directive-based offload. By default, these models copy data to or from the device without overlapping the transfers with computation, which degrades performance. Rather than leave the onerous task of manually pipelining and tuning data communication and computation to the end user, we propose an OpenMP extension that supports block-level pipelining and, in turn, present our block-level pipelining (BLP) approach, which overlaps data communication and computation within a single kernel. BLP uses persistent thread blocks with cooperative thread groups to process sub-tasks on different streaming multiprocessors and uses GPU flag arrays to enforce task dependencies without CPU involvement.
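The kernel below is a minimal sketch of the kind of single-kernel, flag-based pipelining the abstract describes; it is not the authors' implementation. The kernel name, chunk sizes, and the even/odd pairing of copy and compute blocks are illustrative assumptions. It also assumes host_in points to host-pinned, device-mapped memory and that the grid is launched cooperatively so every persistent block is co-resident.

// Minimal sketch (assumptions): persistent blocks work in copy/compute pairs;
// a device-resident flag array enforces the per-chunk dependency without the CPU.
#include <cuda_runtime.h>

#define NUM_CHUNKS 64
#define CHUNK      (1 << 20)   // elements per sub-task (illustrative)

// ready[c] is raised by a copy block once chunk c resides in device memory;
// the paired compute block spins on it, so no CPU round-trip is needed.
__global__ void pipeline_kernel(const float *host_in,   // pinned, device-mapped
                                float *dev_buf, float *dev_out,
                                volatile int *ready)
{
    int  pair      = blockIdx.x / 2;      // copy/compute blocks form pairs
    int  num_pairs = gridDim.x / 2;       // gridDim.x assumed even
    bool is_copy   = (blockIdx.x % 2 == 0);

    // Persistent blocks: each pair loops over its share of the chunks.
    for (int c = pair; c < NUM_CHUNKS; c += num_pairs) {
        size_t base = (size_t)c * CHUNK;
        if (is_copy) {
            // Copy stage: stream chunk c from mapped host memory into the
            // device staging buffer, then publish it via the flag array.
            for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
                dev_buf[base + i] = host_in[base + i];
            __syncthreads();
            if (threadIdx.x == 0) {
                __threadfence();                  // make the copied data visible
                atomicExch((int *)&ready[c], 1);  // then raise the flag
            }
        } else {
            // Compute stage: wait for the chunk, then process it.
            if (threadIdx.x == 0)
                while (atomicAdd((int *)&ready[c], 0) == 0) { /* spin */ }
            __syncthreads();
            for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
                dev_out[base + i] = 2.0f * dev_buf[base + i];
        }
    }
}

On the host side, one would zero-initialize the flag array and launch the kernel with cudaLaunchCooperativeKernel, which guarantees that all blocks are resident at once, a requirement of the persistent-block scheme so that spinning compute blocks cannot starve their copy partners.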
To demonstrate the efficacy of BLP, we evaluate its performance using multiple benchmarks on NVIDIA V100 GPUs. Our experimental results show that BLP achieves 95% to 114% of the performance of hand-tuned kernel-level pipelining. In addition, using BLP with buffer mapping can reduce memory usage to support GPU memory oversubscription. We also show that BLP can reduce memory usage by 75% to 86% for data sets that exceed GPU memory while providing significantly better performance than CUDA Unified Memory (UM) with prefetching optimizations.