Cui, Xuewen
Date: 2020-12-15 (issued); 2020-12-16 (deposited)
Identifier: vt_gsexam:27641
URI: http://hdl.handle.net/10919/101497

Abstract: The computer science community needs simpler mechanisms to realize the performance potential of accelerators, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and co-processors (e.g., Intel Xeon Phi), given their increasing use in state-of-the-art supercomputers. Over the past ten years, both the computing power and the memory bandwidth of accelerators have improved significantly. However, computing power has grown much faster than the interconnect bandwidth between the central processing unit (CPU) and the accelerator. Because accelerators generally have their own discrete memory space, data must be copied from the CPU (host) memory to the accelerator (device) memory before computation can start on the accelerator. Programming models such as CUDA, OpenMP, OpenACC, and OpenCL can efficiently offload compute-intensive workloads to these accelerators, but overlapping data transfers with kernel computation in these models is neither simple nor straightforward. Instead, codes either copy data to or from the device with no overlap at all, or they achieve overlap only through explicit user design and refactoring (illustrated in the sketch below). Achieving good performance can require extensive refactoring and hand-tuning to apply data-transfer optimizations, and users must manually partition their dataset whenever it is larger than device memory, which is especially difficult when the device memory size is not exposed to the user. As systems become increasingly heterogeneous, the CPU is responsible for many accelerator-related duties: launching computation and data-movement tasks, checking task dependencies, and handling task callbacks. Leaving all of this control logic to the CPU not only adds communication latency over the PCIe bus but also consumes CPU resources, which may degrade the performance of other CPU tasks. This dissertation provides efficient directive-based data-pipelining approaches for GPUs that tackle these issues and improve performance, programmability, and memory management.

Type: Dissertation (ETD)
Rights: In Copyright
Subjects: OpenMP; GPU; Pipeline; Partitioning; Unified Memory; High-performance Computing; Machine learning
Title: Directive-Based Data Partitioning and Pipelining and Auto-Tuning for High-Performance GPU Computing
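To make concrete the manual chunked pipelining that the abstract says users must currently write and tune by hand, the following is a minimal, illustrative C/OpenMP sketch; it is not taken from the dissertation. It partitions an array into chunks and issues asynchronous target tasks so that one chunk's host-to-device transfer can overlap with another chunk's kernel. The array name, chunk size, and the simple scaling kernel are assumptions made for illustration only.

    #include <stddef.h>

    #define N      (1 << 24)   /* total elements (illustrative) */
    #define CHUNK  (1 << 20)   /* user-chosen partition size (illustrative) */

    /* Scale an array on the device, pipelined chunk by chunk. */
    void scale_pipelined(float *x, float a)
    {
        for (size_t off = 0; off < N; off += CHUNK) {
            size_t len = (N - off < CHUNK) ? (N - off) : CHUNK;

            /* Asynchronous copy-in, compute, and copy-out for one chunk.
             * "nowait" makes each stage a deferrable task, so the runtime can
             * overlap chunk k's transfer with chunk k-1's kernel; the "depend"
             * clauses keep the three stages of the same chunk in order. */
            #pragma omp target enter data map(to: x[off:len]) depend(out: x[off]) nowait

            #pragma omp target teams distribute parallel for depend(inout: x[off]) nowait
            for (size_t i = off; i < off + len; ++i)
                x[i] *= a;

            #pragma omp target exit data map(from: x[off:len]) depend(in: x[off]) nowait
        }
        #pragma omp taskwait   /* wait for all in-flight chunks to finish */
    }

With a directive-based pipelining and partitioning extension of the kind the dissertation proposes, the chunking, dependency management, and transfer/compute overlap above would be handled by the compiler and runtime rather than written and hand-tuned by the user.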