Scalable and Productive Data Management for High-Performance Analytics
Advances in data acquisition technologies across domains, from genome sequencing to satellite and telescope imaging to large-scale physics simulations, are driving exponential growth in dataset sizes. Extracting knowledge from this wealth of data enables scientific discoveries at unprecedented scales, but the sheer volume of the gathered datasets is itself a bottleneck for knowledge discovery. High-performance computing (HPC) provides a scalable infrastructure for extracting knowledge from these massive datasets. However, several data management performance gaps exist between big data analytics software and HPC systems. These gaps arise from multiple factors, including the tradeoff between performance and programming productivity, data growing faster than memory capacity, and the high storage footprints of data analytics workflows. This dissertation bridges these gaps by combining productive data management interfaces with application-specific optimizations of data parallelism, memory operations, and storage management. First, we address the performance-productivity tradeoff by leveraging Spark and optimizing input data partitioning. Our solution improves programming productivity while achieving performance comparable to that of the Message Passing Interface (MPI) for scalable bioinformatics. Second, we address the limitations of the operating system kernel for out-of-core data processing by autotuning memory management parameters in userspace. Finally, we address I/O and storage efficiency bottlenecks in data analytics workflows that iteratively and incrementally create and reuse persistent data structures such as graphs, data frames, and key-value datastores.