Browsing by Author "Hanafy, Yasser Y."
Now showing 1 - 4 of 4
Results Per Page
Sort Options
- AutoMatch: Automated Matching of Compute Kernels to Heterogeneous HPC ArchitecturesHelal, Ahmed E.; Feng, Wu-chun; Jung, Changhee; Hanafy, Yasser Y. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-12-13)HPC systems contain a wide variety of heterogeneous computing resources, ranging from general-purpose CPUs to specialized accelerators. Porting sequential applications to such systems for achieving high performance requires significant software and hardware expertise as well as extensive manual analysis of both the target architectures and applications to decide the best performing architecture and implementation technique for each application. To streamline this tedious process, this paper presents AutoMatch, a tool for automated matching of compute kernels to heterogeneous HPC architectures. AutoMatch analyzes the sequential application code and automatically predicts the performance of the best parallel implementation of its compute kernels on different hardware architectures. AutoMatch leverages such prediction results to identify the best device for each kernel from a set of devices including multi-core CPUs and many-core GPUs. In addition, it estimates the relative execution cost between the different architectures to drive a workload distribution scheme, which enables end users to efficiently exploit the available compute resources across multiple heterogeneous architectures. We demonstrate the efficacy of AutoMatch, using a set of open-source HPC applications and benchmarks with different parallelism profiles and memory-access patterns. The empirical evaluation shows that AutoMatch is highly accurate across five different heterogeneous architectures, identifying the best architecture for each workload in 96% of the test cases, and its workload distribution scheme has a comparable performance to a profiling-driven oracle.
- Automated Runtime Analysis and Adaptation for Scalable Heterogeneous ComputingHelal, Ahmed Elmohamadi Mohamed (Virginia Tech, 2020-01-29)In the last decade, there have been tectonic shifts in computer hardware because of reaching the physical limits of the sequential CPU performance. As a consequence, current high-performance computing (HPC) systems integrate a wide variety of compute resources with different capabilities and execution models, ranging from multi-core CPUs to many-core accelerators. While such heterogeneous systems can enable dramatic acceleration of user applications, extracting optimal performance via manual analysis and optimization is a complicated and time-consuming process. This dissertation presents graph-structured program representations to reason about the performance bottlenecks on modern HPC systems and to guide novel automation frameworks for performance analysis and modeling and runtime adaptation. The proposed program representations exploit domain knowledge and capture the inherent computation and communication patterns in user applications, at multiple levels of computational granularity, via compiler analysis and dynamic instrumentation. The empirical results demonstrate that the introduced modeling frameworks accurately estimate the realizable parallel performance and scalability of a given sequential code when ported to heterogeneous HPC systems. As a result, these frameworks enable efficient workload distribution schemes that utilize all the available compute resources in a performance-proportional way. In addition, the proposed runtime adaptation frameworks significantly improve the end-to-end performance of important real-world applications which suffer from limited parallelism and fine-grained data dependencies. Specifically, compared to the state-of-the-art methods, such an adaptive parallel execution achieves up to an order-of-magnitude speedup on the target HPC systems while preserving the inherent data dependencies of user applications.
- CommAnalyzer: Automated Estimation of Communication Cost on HPC Clusters Using Sequential CodeHelal, Ahmed E.; Jung, Changhee; Feng, Wu-chun; Hanafy, Yasser Y. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2017-08-14)MPI+X is the de facto standard for programming applications on HPC clusters. The performance and scalability on such systems is limited by the communication cost on different number of processes and compute nodes. Therefore, the current communication analysis tools play a critical role in the design and development of HPC applications. However, these tools require the availability of the MPI implementation, which might not exist in the early stage of the development process due to the parallel programming effort and time. This paper presents CommAnalyzer, an automated tool for communication model generation from a sequential code. CommAnalyzer uses novel compiler analysis techniques and graph algorithms to capture the inherent communication characteristics of sequential applications, and to estimate their communication cost on HPC systems. The experiments with real-world, regular and irregular scientific applications demonstrate the utility of CommAnalyzer in estimating the communication cost on HPC clusters with more than 95% accuracy on average.
- Using Workload Characterization to Guide High Performance Graph ProcessingHassan, Mohamed Wasfy Abdelfattah (Virginia Tech, 2021-05-24)Graph analytics represent an important application domain widely used in many fields such as web graphs, social networks, and Bayesian networks. The sheer size of the graph data sets combined with the irregular nature of the underlying problem pose a significant challenge for performance, scalability, and power efficiency of graph processing. With the exponential growth of the size of graph datasets, there is an ever-growing need for faster more power efficient graph solvers. The computational needs of graph processing can take advantage of the FPGAs' power efficiency and customizable architecture paired with CPUs' general purpose processing power and sophisticated cache policies. CPU-FPGA hybrid systems have the potential for supporting performant and scalable graph solvers if both devices can work coherently to make up for each other's deficits. This study aims to optimize graph processing on heterogeneous systems through interdisciplinary research that would impact both the graph processing community, and the FPGA/heterogeneous computing community. On one hand, this research explores how to harness the computational power of FPGAs and how to cooperatively work in a CPU-FPGA hybrid system. On the other hand, graph applications have a data-driven execution profile; hence, this study explores how to take advantage of information about the graph input properties to optimize the performance of graph solvers. The introduction of High Level Synthesis (HLS) tools allowed FPGAs to be accessible to the masses but they are yet to be performant and efficient, especially in the case of irregular graph applications. Therefore, this dissertation proposes automated frameworks to help integrate FPGAs into mainstream computing. This is achieved by first exploring the optimization space of HLS-FPGA designs, then devising a domain-specific performance model that is used to build an automated framework to guide the optimization process. Moreover, the architectural strengths of both CPUs and FPGAs are exploited to maximize graph processing performance via an automated framework for workload distribution on the available hardware resources.