Can Large Language Models Predict Parallel Code Performance?

dc.contributor.author: Bolet, Gregory
dc.contributor.author: Georgakoudis, Giorgis
dc.contributor.author: Menon, Harshitha
dc.contributor.author: Parasyris, Konstantinos
dc.contributor.author: Hasabnis, Niranjan
dc.contributor.author: Estes, Hayden
dc.contributor.author: Cameron, Kirk
dc.contributor.author: Oren, Gal
dc.date.accessioned: 2025-10-01T17:56:26Z
dc.date.available: 2025-10-01T17:56:26Z
dc.date.issued: 2025-07-20
dc.date.updated: 2025-10-01T07:46:14Z
dc.description.abstract: Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware, an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a roofline classification task: given the source code of a GPU kernel and the hardware specifications of a target GPU, can an LLM predict whether the kernel is compute-bound or bandwidth-bound? For this study, we build a balanced dataset of 340 GPU kernels, obtained from the HeCBench benchmark suite and written in CUDA and OpenMP, along with ground-truth labels obtained via empirical GPU profiling. We evaluate LLMs across four scenarios: (1) with access to profiling data for the kernel source, (2) zero-shot with source code only, (3) few-shot with code and label pairs, and (4) fine-tuned on a small custom dataset. Our results show that state-of-the-art LLMs have a strong understanding of the Roofline model, achieving 100% classification accuracy when provided with explicit profiling data. We also find that reasoning-capable LLMs significantly outperform standard LLMs in zero- and few-shot settings, achieving up to 64% classification accuracy on GPU source code without any profiling information. Lastly, we find that model accuracy does not benefit meaningfully from few-shot prompting compared to zero-shot, and that LLM fine-tuning will require much more data than we currently have available. This work is among the first to use LLMs for source-level roofline performance prediction via classification, and it illustrates their potential to guide optimization efforts when runtime profiling is infeasible. Our findings suggest that with better datasets and prompt strategies, LLMs could become practical tools for HPC performance analysis and performance portability. Code and datasets are publicly available at https://github.com/Scientific-Computing-Lab/ParallelCodeEstimation.
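
The roofline classification the abstract describes reduces to a simple arithmetic test: a kernel is compute-bound when its arithmetic intensity (FLOPs per byte of memory traffic) exceeds the target GPU's ridge point (peak compute throughput divided by peak memory bandwidth). A minimal Python sketch of this decision rule follows; all figures in it are hypothetical, chosen only for illustration, and are not taken from the paper.

def classify_roofline(flops: float, bytes_moved: float,
                      peak_gflops: float, peak_gbps: float) -> str:
    """Label a kernel compute-bound or bandwidth-bound via the roofline model.

    flops       -- floating-point operations executed by the kernel
    bytes_moved -- bytes transferred to/from device memory
    peak_gflops -- peak compute throughput of the target GPU (GFLOP/s)
    peak_gbps   -- peak memory bandwidth of the target GPU (GB/s)
    """
    arithmetic_intensity = flops / bytes_moved  # FLOPs per byte
    # Ridge point: intensity at which the bandwidth and compute roofs meet
    # (GFLOP/s divided by GB/s, so the 1e9 factors cancel to FLOPs/byte).
    ridge_point = peak_gflops / peak_gbps
    return "compute-bound" if arithmetic_intensity > ridge_point else "bandwidth-bound"

# Hypothetical kernel: 2e9 FLOPs over 8e8 bytes, on a GPU with an assumed
# 9700 GFLOP/s peak and 2039 GB/s bandwidth (illustrative numbers only).
# Intensity = 2.5 FLOPs/byte < ridge point of about 4.76 FLOPs/byte.
print(classify_roofline(2e9, 8e8, 9700.0, 2039.0))  # -> "bandwidth-bound"

This is what "access to profiling data" buys the LLM in scenario (1): the FLOP and byte counts come from the profiler, and the remaining comparison is trivial, which is consistent with the 100% accuracy the abstract reports in that setting.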
dc.description.version: Published version
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1145/3731545.3743645
dc.identifier.uri: https://hdl.handle.net/10919/137885
dc.language.iso: en
dc.publisher: ACM
dc.rights: Creative Commons Attribution 4.0 International
dc.rights.holder: The author(s)
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.title: Can Large Language Models Predict Parallel Code Performance?
dc.type: Article - Refereed
dc.type.dcmitype: Text

Files

Original bundle
Name: 3731545.3743645.pdf
Size: 727.04 KB
Format: Adobe Portable Document Format
Description: Published version
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission