Performance Portability of CUDA Across NVIDIA GPU Architectures
Abstract
Graphics Processing Units (GPUs) provide impressive parallel performance that makes them invaluable to computational workloads such as machine learning and scientific simulation. NVIDIA GPUs currently outperform their competitors and thus make up the lion's share of today's market. Importantly, they are natively programmed using the proprietary framework Compute Unified Device Architecture (CUDA), which compiles to machine code only for NVIDIA hardware. Moreover, NVIDIA releases a new GPU with an updated architecture roughly every two to three years. Since CUDA is generally forward compatible with the next generation of GPUs, it is natural to reuse CUDA code built for a previous architecture on a newer one. Unfortunately, a CUDA application's performance does not necessarily improve on a newer generation of GPU. This work investigates a variety of CUDA workloads that fail to show a performance uplift when moving from the V100 to the A100 GPU. While some kernels perform as expected, others slow down by as much as 700% on the newer architecture. Each benchmark is analyzed, and where possible, a direct solution for improving its performance portability is highlighted. These issues are also examined collectively to identify broader, cross-cutting portability concerns. Finally, a set of recommendations is offered to help developers more easily maintain performance portability across architectures.
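
To make the forward-compatibility mechanism described above concrete, the following is a minimal sketch, not drawn from the report itself: a SAXPY kernel whose binary is built for the V100 (compute capability 7.0, sm_70) with embedded PTX, which the CUDA driver can JIT-compile at load time to run on the A100 (compute capability 8.0, sm_80). The kernel and file name (saxpy.cu) are illustrative placeholders.

// saxpy.cu -- minimal forward-compatibility sketch (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

// A simple SAXPY kernel: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Managed memory is supported on both V100 and A100.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}

// Compile V100 machine code plus compute_70 PTX; the embedded PTX is what
// lets the driver JIT-compile the kernel for the A100's sm_80 at load time:
//   nvcc -gencode arch=compute_70,code=sm_70 \
//        -gencode arch=compute_70,code=compute_70 saxpy.cu -o saxpy

This convenience is exactly what motivates the study: the binary runs unmodified on the newer architecture, but, as the benchmarks show, running is not the same as running well.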