Exploring the Landscape of Big Data Analytics Through Domain-Aware Algorithm Design
Experimental and observational data emerging from various scientific domains necessitate fast, accurate, and low-cost analysis of the data. While exploring the landscape of big data analytics, multiple challenges arise from three characteristics of big data: the volume, the variety, and the velocity. High volume and velocity of the data warrant a large amount of storage, memory, and compute power while a large variety of data demands cognition across domains. Addressing domain-intrinsic properties of data can help us analyze the data efficiently through the frugal use of high-performance computing (HPC) resources. In this thesis, we present our exploration of the data analytics landscape with domain-aware approximate and incremental algorithm design. We propose three guidelines targeting three properties of big data for domain-aware big data analytics: (1) explore geometric and domain-specific properties of high dimensional data for succinct representation, which addresses the volume property, (2) design domain-aware algorithms through mapping of domain problems to computational problems, which addresses the variety property, and (3) leverage incremental arrival of data through incremental analysis and invention of problem-specific merging methodologies, which addresses the velocity property. We demonstrate these three guidelines through the solution approaches of three representative domain problems.
We present Claret, a fast and portable parallel weighted multi-dimensional scaling (WMDS) tool, to demonstrate the application of the first guideline. It combines algorithmic concepts extended from the stochastic force-based multi-dimensional scaling (SF-MDS) and Glimmer. Claret computes approximate weighted Euclidean distances by combining a novel data mapping called stretching and Johnson Lindestrauss' lemma to reduce the complexity of WMDS from O(f(n)d) to O(f(n) log d). In demonstrating the second guideline, we map the problem of identifying multi-hit combinations of genetic mutations responsible for cancers to weighted set cover (WSC) problem by leveraging the semantics of cancer genomic data obtained from cancer biology. Solving the mapped WSC with an approximate algorithm, we identified a set of multi-hit combinations that differentiate between tumor and normal tissue samples. To identify three- and four-hits, which require orders of magnitude larger computational power, we have scaled out the WSC algorithm on a hundred nodes of Summit supercomputer. In demonstrating the third guideline, we developed a tool iBLAST to perform an incremental sequence similarity search. Developing new statistics to combine search results over time makes incremental analysis feasible. iBLAST performs (1+δ)/δ times faster than NCBI BLAST, where δ represents the fraction of database growth. We also explored various approaches to mitigate catastrophic forgetting in incremental training of deep learning models.