Programming High-Performance Clusters with Heterogeneous Computing Devices

Aji, Ashwin M.

Files

Aji_AM_D_2015.pdf (9.92 MB)

Date

2015-05-19

Authors

Aji, Ashwin M.

Publisher

Virginia Tech

Abstract

Today's high-performance computing (HPC) clusters are seeing an increase in the adoption of accelerators like GPUs, FPGAs, and co-processors, leading to heterogeneity in the computation and memory subsystems. To program such systems, application developers typically employ a hybrid programming model of MPI across the compute nodes in the cluster and an accelerator-specific library (e.g., CUDA, OpenCL, OpenMP, OpenACC) across the accelerator devices within each compute node. Such explicit management of disjoint computation and memory resources leads to reduced productivity and performance. This dissertation focuses on designing, implementing, and evaluating a runtime system for HPC clusters with heterogeneous computing devices. This work also explores extending existing programming models to make use of our runtime system for easier code modernization of existing applications. Specifically, we present MPI-ACC, an extension to the popular MPI programming model and runtime system for efficient data movement and automatic task mapping across the CPUs and accelerators within a cluster, and discuss the lessons learned.

MPI-ACC's task-mapping runtime subsystem performs fast and automatic device selection for a given task. MPI-ACC's data-movement subsystem includes careful optimizations for end-to-end communication among CPUs and accelerators, which application developers can leverage seamlessly. MPI-ACC provides a familiar, flexible, and natural interface for programmers to choose the right computation or communication targets, while its runtime system achieves efficient cluster utilization.

Keywords

Runtime Systems, Programming Models, General Purpose Graphics Processing Units (GPGPUs), Message Passing Interface (MPI), CUDA, OpenCL

Persistent link

http://hdl.handle.net/10919/52366

Collections

Doctoral Dissertations
