Browsing by Author "Cui, Xuewen"
- BLP: Block-Level Pipelining for GPUs
  Feng, Wu-Chun; Cui, Xuewen; Scogland, Thomas; De Supinski, Bronis (ACM, 2024-05-07)
  Programming models like OpenMP offer expressive interfaces for programming graphics processing units (GPUs) via directive-based offload. By default, these models copy data to or from the device without overlapping computation, which hurts performance. Rather than leave the onerous task of manually pipelining and tuning data communication and computation to the end user, we propose an OpenMP extension that supports block-level pipelining and, in turn, present our block-level pipelining (BLP) approach, which overlaps data communication and computation in a single kernel. BLP uses persistent thread blocks with cooperative thread groups to process sub-tasks on different streaming multiprocessors, and it uses GPU flag arrays to enforce task dependencies without CPU involvement. To demonstrate the efficacy of BLP, we evaluate its performance using multiple benchmarks on NVIDIA V100 GPUs. Our experimental results show that BLP achieves 95% to 114% of the performance of hand-tuned kernel-level pipelining. In addition, using BLP with buffer mapping can reduce memory usage to support GPU memory oversubscription. We also show that BLP can reduce memory usage by 75% to 86% for data sets that exceed GPU memory while providing significantly better performance than CUDA Unified Memory (UM) with prefetching optimizations.
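The flag-array mechanism in the abstract above can be pictured with a short CUDA sketch. This is a minimal illustration under assumptions not stated in the abstract: the buffer names, chunk sizes, and trivial compute stage are invented, and the paper's OpenMP extension syntax is not reproduced. Half of the persistent blocks stream chunks from mapped host memory into a device buffer and raise a per-chunk ready flag; the other half spin on that flag and then compute, so transfer and compute overlap inside one kernel with no CPU round trip.

```cuda
// Minimal sketch of a BLP-style flag-array handshake (hypothetical names
// and sizes; not the paper's actual implementation or extension syntax).
#include <cuda_runtime.h>
#include <cstdio>

constexpr int CHUNKS      = 64;
constexpr int CHUNK_ELEMS = 1 << 16;

__global__ void blp_pipeline(const float* __restrict__ host_src,
                             float* dev_buf, float* dev_out,
                             volatile int* ready)
{
    const bool copier = (blockIdx.x < gridDim.x / 2);
    const int  lane   = copier ? blockIdx.x : blockIdx.x - gridDim.x / 2;
    const int  groups = gridDim.x / 2;

    for (int c = lane; c < CHUNKS; c += groups) {
        const size_t base = (size_t)c * CHUNK_ELEMS;
        if (copier) {
            // Copy stage: pull one chunk over PCIe from mapped host memory.
            for (int i = threadIdx.x; i < CHUNK_ELEMS; i += blockDim.x)
                dev_buf[base + i] = host_src[base + i];
            __syncthreads();
            __threadfence();                      // publish the chunk device-wide
            if (threadIdx.x == 0) ready[c] = 1;   // raise the dependency flag
        } else {
            // Compute stage: wait on the flag, with no CPU involvement.
            if (threadIdx.x == 0)
                while (ready[c] == 0) { /* spin */ }
            __syncthreads();
            for (int i = threadIdx.x; i < CHUNK_ELEMS; i += blockDim.x)
                dev_out[base + i] = 2.0f * dev_buf[base + i]; // placeholder work
        }
    }
}

int main() {
    const size_t n = (size_t)CHUNKS * CHUNK_ELEMS;
    float *h_src, *d_src, *d_buf, *d_out; int* d_ready;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void**)&h_src, n * sizeof(float), cudaHostAllocMapped);
    for (size_t i = 0; i < n; ++i) h_src[i] = 1.0f;
    cudaHostGetDevicePointer((void**)&d_src, h_src, 0);
    cudaMalloc((void**)&d_buf, n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMalloc((void**)&d_ready, CHUNKS * sizeof(int));
    cudaMemset(d_ready, 0, CHUNKS * sizeof(int));

    // 4 copier + 4 compute blocks; all must be co-resident, since a
    // consumer spinning on a never-scheduled producer would deadlock.
    blp_pipeline<<<8, 256>>>(d_src, d_buf, d_out, d_ready);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

A real implementation would size the grid from occupancy (e.g., via cudaOccupancyMaxActiveBlocksPerMultiprocessor) and use cooperative thread groups as the paper describes; the sketch hard-codes a small grid for clarity.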
- Classification Team Project for IDEAL in CS5604, Spring 2015
  Cui, Xuewen; Tao, Rongrong; Zhang, Ruide (2015-05-10)
  Given the tweets from the instructor and the cleaned webpages from the Reducing Noise team, our group's planned tasks were to find the best: (1) way to extract information for document representation; (2) feature selection method for constructing feature vectors; and (3) way to classify each document into categories, considering the ontology developed in the IDEAL project. We settled on an information extraction method for document representation, a feature selection method for feature vector construction, and a classification method. The categories are associated with the documents to aid searching and browsing with Solr. Our team handles both tweets and webpages, which arrive as text files produced by the Reducing Noise team; the other input is a list of the specific events that the collections cover. After information extraction and feature selection, we construct feature vectors using Apache Mahout; for each document, a relational version of the raw data is generated for the corresponding feature vector. We applied the Naïve Bayes classification algorithm in Apache Mahout to generate the vector file and the trained model, feeding the feature vectors into Mahout's classifiers for training and testing. However, Mahout cannot predict class labels for new data, so we turned to Pangool.net, a low-level Java MapReduce API that provides a MapReduce Naïve Bayes classifier able to predict class labels for new data. After modification, this package can read from and write to Avro files in HDFS. The accuracy of our classification algorithms, evaluated with 5-fold cross-validation, was promising.
- Directive-Based Data Partitioning and Pipelining and Auto-Tuning for High-Performance GPU Computing
  Cui, Xuewen (Virginia Tech, 2020-12-15)
  The computer science community needs simpler mechanisms to achieve the performance potential of accelerators, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and co-processors (e.g., Intel Xeon Phi), due to their increasing use in state-of-the-art supercomputers. Over the past 10 years, both the computing power of accelerators and the bandwidth of their memory interconnects have improved significantly; however, compute power has grown much faster than the interconnect bandwidth between the central processing unit (CPU) and the accelerator. Because accelerators generally have their own discrete memory space, data must be copied from CPU host memory to accelerator (device) memory before computation starts on the accelerator. Programming models like CUDA, OpenMP, OpenACC, and OpenCL can efficiently offload compute-intensive workloads to these accelerators, but overlapping data transfers with computation in a kernel is neither simple nor straightforward in these models: by default, codes copy data to or from the device without overlap, and achieving overlap requires explicit user design and refactoring. Obtaining good performance can require extensive refactoring and hand-tuning to apply data-transfer optimizations, and users must manually partition their dataset whenever it exceeds device memory, which is especially difficult when the device memory size is not exposed to the user. As systems become increasingly heterogeneous, CPUs are responsible for many tasks related to the accelerators: computation and data-movement tasks, task-dependency checking, and task callbacks. Leaving all control logic to the CPU not only adds communication latency over the PCIe bus but also consumes CPU resources, which may affect the performance of other CPU tasks. This thesis provides efficient directive-based data-pipelining approaches for GPUs that tackle these issues and improve performance, programmability, and memory management.
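To make the "no overlap by default" problem concrete, here is a hedged C++ sketch (not the thesis's proposed extension; the array names and chunk size are illustrative). The first version is the default OpenMP offload pattern, in which copy-in, compute, and copy-out serialize; the second is the manual chunked pipelining users must write today, which the thesis's directive-based approach aims to automate.

```cpp
// A hedged sketch of the problem the thesis targets (not its proposed
// directive extension; array names and `chunk` are illustrative).
#include <omp.h>

// Default offload: all of `a` is copied in, the kernel runs, and `b` is
// copied out -- three serialized phases with no overlap.
void scale(const float* a, float* b, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n]) map(from: b[0:n])
    for (int i = 0; i < n; ++i)
        b[i] = 2.0f * a[i];
}

// Manual pipelining users must write today: split the array into chunks
// and issue each chunk as an asynchronous target task so that chunk k+1's
// transfer can overlap chunk k's compute (actual overlap depends on the
// OpenMP runtime and the hardware copy queues).
void scale_pipelined(const float* a, float* b, int n, int chunk) {
    for (int off = 0; off < n; off += chunk) {
        int len = (off + chunk < n) ? chunk : n - off;
        #pragma omp target teams distribute parallel for \
            map(to: a[off:len]) map(from: b[off:len]) nowait
        for (int i = off; i < off + len; ++i)
            b[i] = 2.0f * a[i];
    }
    #pragma omp taskwait   // wait for all in-flight chunk tasks
}
```

Note that the pipelined version also forces the user to pick the chunk size and to know that the data even fits on the device; those are exactly the partitioning decisions the thesis argues should be handled by the runtime.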
- IterML: Iterative Machine Learning for Intelligent Parameter Pruning and Tuning in Graphics Processing Units
  Cui, Xuewen; Feng, Wu-chun (Springer, 2020-11-06)
  With the rise of graphics processing units (GPUs), the parallel-computing community needs better tools to productively extract performance from the GPU. While modern compilers provide flags that activate different optimizations, the effectiveness of such automated optimization has been limited at best. As a consequence, extracting the best performance from an algorithm on a GPU requires significant expertise and manual effort to exploit both spatial and temporal sharing of computing resources. In particular, maximizing the performance of an algorithm on a GPU requires extensive hyperparameter (e.g., thread-block size) selection and tuning. Given the myriad hyperparameter dimensions to optimize across, the search space is extremely large and infeasible to evaluate exhaustively. This paper proposes an approach that uses statistical analysis with iterative machine learning (IterML) to prune and tune hyperparameters for better performance. During each iteration, we leverage machine-learning models to guide the pruning and tuning for subsequent iterations. We evaluate our IterML approach on the GPU thread-block size across many benchmarks running on an NVIDIA P100 or V100 GPU. Our experimental results show that our automated IterML approach reduces search effort by 40% to 80% compared to traditional (non-iterative) ML, and that the performance of our (unmodified) GPU applications can improve significantly, by 67% to 95%, simply by changing the thread-block size.
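The iterative structure can be sketched in a few dozen lines of CUDA/C++. This is a simplified stand-in, not the paper's method: IterML guides pruning with machine-learning models, whereas the loop below simply times the surviving thread-block sizes, drops the slower half each round, and spends a growing measurement budget on what remains. The SAXPY kernel and all names are invented for illustration.

```cuda
// Simplified stand-in for an IterML-style prune-and-tune loop
// (the paper's ML-guided pruning is replaced by a drop-the-slower-half rule).
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Average kernel time (ms) for one thread-block size over `reps` launches.
static float time_config(int n, const float* x, float* y, int block, int reps) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    int grid = (n + block - 1) / block;
    cudaEventRecord(t0);
    for (int r = 0; r < reps; ++r) saxpy<<<grid, block>>>(n, 2.0f, x, y);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms / reps;
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // Candidate thread-block sizes: every warp multiple up to the limit.
    std::vector<int> cand;
    for (int b = 32; b <= 1024; b += 32) cand.push_back(b);

    // Iteratively time survivors and prune the slower half, spending a
    // growing measurement budget on the increasingly promising remainder.
    for (int iter = 0, reps = 3; cand.size() > 1; ++iter, reps *= 2) {
        std::vector<std::pair<float, int>> scored;
        for (int b : cand)
            scored.push_back({time_config(n, x, y, b, reps), b});
        std::sort(scored.begin(), scored.end());   // fastest first
        cand.clear();
        for (size_t i = 0; i < (scored.size() + 1) / 2; ++i)
            cand.push_back(scored[i].second);
        printf("iter %d: best %d threads/block (%.4f ms)\n",
               iter, scored[0].second, scored[0].first);
    }
    printf("selected thread-block size: %d\n", cand[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```

Replacing the drop-the-slower-half rule with a regression model trained on the measurements gathered so far would recover the spirit of IterML's ML-guided pruning.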