Scalable and Productive Data Management for High-Performance Analytics

Youssef, Karim Yasser Mohamed Yousri

Files

Youssef_KY_D_2023.pdf (3.77 MB)

Downloads: 322

Date

2023-11-07

Authors

Youssef, Karim Yasser Mohamed Yousri

Publisher

Virginia Tech

Abstract

Advancements in data acquisition technologies across different domains, from genome sequencing to satellite and telescope imaging to large-scale physics simulations, are leading to exponential growth in dataset sizes. Extracting knowledge from this wealth of data enables scientific discoveries at unprecedented scales, but the sheer volume of the gathered datasets is a bottleneck for knowledge discovery. High-performance computing (HPC) provides a scalable infrastructure to extract knowledge from these massive datasets. However, multiple data management performance gaps exist between big data analytics software and HPC systems. These gaps arise from several factors, including the tradeoff between performance and programming productivity, data growing at a faster rate than memory capacity, and the high storage footprints of data analytics workflows. This dissertation bridges these gaps by combining productive data management interfaces with application-specific optimizations of data parallelism, memory operation, and storage management. First, we address the performance-productivity tradeoff by leveraging Spark and optimizing input data partitioning. Our solution optimizes programming productivity while achieving performance comparable to that of the Message Passing Interface (MPI) for scalable bioinformatics. Second, we address the operating system kernel's limitations for out-of-core data processing by autotuning memory management parameters in userspace. Finally, we address I/O and storage efficiency bottlenecks in data analytics workflows that iteratively and incrementally create and reuse persistent data structures such as graphs, data frames, and key-value datastores.

Keywords

high-performance computing (HPC), big data, performance, productivity, storage efficiency

Persistent link

http://hdl.handle.net/10919/116640

Collections

Doctoral Dissertations
