Performance Portability of CUDA Across NVIDIA GPU Architectures
dc.contributor.author | Coyne, Timothy Patrick | en |
dc.contributor.committeechair | Nikolopoulos, Dimitrios S. | en |
dc.contributor.committeemember | Sandu, Adrian | en |
dc.contributor.committeemember | Feng, Wu-Chun | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2025-06-04T08:05:24Z | en |
dc.date.available | 2025-06-04T08:05:24Z | en |
dc.date.issued | 2025-06-03 | en |
dc.description.abstract | Graphics Processing Units (GPUs) provide impressive parallel performance that makes them invaluable to a number of computational workloads, such as machine learning and simulations. NVIDIA GPUs currently outperform all of their competitors and thus make up the lion's share of today's market. Importantly, they are natively programmed using the proprietary framework Compute Unified Device Architecture (CUDA), which only compiles to machine code for NVIDIA hardware. Moreover, NVIDIA releases a new GPU with an updated architecture roughly every two to three years. Since CUDA is commonly forward compatible with the next generation of GPUs, it is natural to reuse CUDA code built for a previous architecture on a newer one. Unfortunately, the performance of CUDA applications does not necessarily benefit from the newer generation of GPUs. This work investigates a variety of CUDA workloads that fail to show a performance uplift moving from the V100 to the A100 GPU. While some kernels perform as expected, others exhibit up to a 700% performance drop when running on the newer architecture. For each, an analysis of the benchmark is provided, and where possible, a direct solution for improving performance portability is highlighted. These issues are also cross-examined to identify several holistic portability concerns. Finally, a set of programmer recommendations is made to assist developers in more easily maintaining performance portability between architectures. | en |
dc.description.abstractgeneral | Graphics Processing Units (GPUs) provide high parallel performance by executing instructions across many smaller internal compute units. This high parallel performance greatly benefits numerous workloads, including machine learning and simulations. NVIDIA currently has the largest market share of GPUs, which are natively programmed using Compute Unified Device Architecture (CUDA). The company typically releases a new family of GPUs with updated architectures every two to three years. Given that CUDA is the standard language for programming these GPUs and that NVIDIA releases new architectures relatively frequently, it is essential for CUDA applications to exhibit strong performance portability. In other words, a new GPU should provide uplift for pre-existing kernels proportional to its generational improvements. Unfortunately, this is not always the case, which means developers must sometimes retrofit their code in order to obtain optimal performance. This research investigates the performance portability of a number of different workloads and provides a set of programmer recommendations to assist developers in maximizing performance portability. | en |
dc.description.degree | Master of Science | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:43653 | en |
dc.identifier.uri | https://hdl.handle.net/10919/135038 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | GPU | en |
dc.subject | CUDA | en |
dc.subject | Performance Portability | en |
dc.title | Performance Portability of CUDA Across NVIDIA GPU Architectures | en |
dc.type | Thesis | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | masters | en |
thesis.degree.name | Master of Science | en |