Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems
MetadataShow full item record
The emergence of scientific applications embedded with multiple modes of parallelism has made heterogeneous computing systems indispensable in high performance computing. The popularity of such systems is evident from the fact that three out of the top five fastest supercomputers in the world employ heterogeneous computing, i.e., they use dissimilar computational units. A closer look at the performance of these supercomputers reveals that they achieve only around 50% of their theoretical peak performance. This suggests that applications that were tuned for erstwhile homogeneous computing may not be efficient for todayâ s heterogeneous computing and hence, novel optimization strategies are required to be exercised. However, optimizing an application for heterogeneous computing systems is extremely challenging, primarily due to the architectural differences in computational units in such systems. This thesis intends to act as a cookbook for optimizing applications on heterogeneous computing systems that employ graphics processing units (GPUs) as the preferred mode of accelerators. We discuss optimization strategies for multicore CPUs as well as for the two popular GPU platforms, i.e., GPUs from AMD and NVIDIA. Optimization strategies for NVIDIA GPUs have been well studied but when applied on AMD GPUs, they fail to measurably improve performance because of the differences in underlying architecture. To the best of our knowledge, this research is the first to propose optimization strategies for AMD GPUs. Even on NVIDIA GPUs, there exists a lesser known but an extremely severe performance pitfall called partition camping, which can affect application performance by up to seven-fold. To facilitate the detection of this phenomenon, we have developed a performance prediction model that analyzes and characterizes the effect of partition camping in GPU applications. We have used a large-scale, molecular modeling application to validate and verify all the optimization strategies. Our results illustrate that if appropriately optimized, AMD and NVIDIA GPUs can provide 371-fold and 328-fold improvement, respectively, over a hand-tuned, SSE-optimized serial implementation.
- Masters Theses