Exploring the Landscape of Big Data Analytics Through Domain-Aware Algorithm Design
Abstract
Experimental and observational data emerging from various scientific domains necessitate
fast, accurate, and low-cost analysis of the data. While exploring the landscape
of big data analytics, multiple challenges arise from three characteristics of big data:
the volume, the variety, and the velocity. High volume and velocity of the data warrant
a large amount of storage, memory, and compute power while a large variety
of data demands cognition across domains. Addressing domain-intrinsic properties of
data can help us analyze the data efficiently through the frugal use of high-performance
computing (HPC) resources. In this thesis, we present our exploration of the data analytics
landscape with domain-aware approximate and incremental algorithm design.
We propose three guidelines targeting three properties of big data for domain-aware
big data analytics: (1) explore geometric and domain-specific properties of high dimensional
data for succinct representation, which addresses the volume property, (2) design
domain-aware algorithms through mapping of domain problems to computational problems,
which addresses the variety property, and (3) leverage incremental arrival of data
through incremental analysis and invention of problem-specific merging methodologies,
which addresses the velocity property. We demonstrate these three guidelines through
the solution approaches of three representative domain problems.
We present Claret, a fast and portable parallel weighted multi-dimensional scaling
(WMDS) tool, to demonstrate the application of the first guideline. It combines algorithmic
concepts extended from the stochastic force-based multi-dimensional scaling
(SF-MDS) and Glimmer. Claret computes approximate weighted Euclidean distances
by combining a novel data mapping called stretching and Johnson Lindestrauss' lemma
to reduce the complexity of WMDS from O(f(n)d) to O(f(n) log d). In demonstrating
the second guideline, we map the problem of identifying multi-hit combinations
of genetic mutations responsible for cancers to weighted set cover (WSC) problem by
leveraging the semantics of cancer genomic data obtained from cancer biology. Solving
the mapped WSC with an approximate algorithm, we identified a set of multi-hit
combinations that differentiate between tumor and normal tissue samples. To identify
three- and four-hits, which require orders of magnitude larger computational power, we
have scaled out the WSC algorithm on a hundred nodes of Summit supercomputer. In
demonstrating the third guideline, we developed a tool iBLAST to perform an incremental
sequence similarity search. Developing new statistics to combine search results
over time makes incremental analysis feasible. iBLAST performs (1+δ)/δ times faster
than NCBI BLAST, where δ represents the fraction of database growth. We also explored
various approaches to mitigate catastrophic forgetting in incremental training
of deep learning models.
General Audience Abstract
Experimental and observational data emerging from various scientific domains necessitate
fast, accurate, and low-cost analysis of the data. While exploring the landscape of
big data analytics, multiple challenges arise from three characteristics of big data: the
volume, the variety, and the velocity. Here volume represents the data's size, variety
represents various sources and formats of the data, and velocity represents the data
arrival rate. High volume and velocity of the data warrant a large amount of storage,
memory, and computational power. In contrast, a large variety of data demands
cognition across domains. Addressing domain-intrinsic properties of data can help
us analyze the data efficiently through the frugal use of high-performance computing
(HPC) resources. This thesis presents our exploration of the data analytics landscape
with domain-aware approximate and incremental algorithm design. We propose three
guidelines targeting three properties of big data for domain-aware big data analytics:
(1) explore geometric (pair-wise distance and distribution-related) and domain-specific
properties of high dimensional data for succinct representation, which addresses the volume
property, (2) design domain-aware algorithms through mapping of domain problems
to computational problems, which addresses the variety property, and (3) leverage
incremental data arrival through incremental analysis and invention of problem-specific
merging methodologies, which addresses the velocity property.
We demonstrate these three guidelines through the solution approaches of three representative
domain problems. We demonstrate the application of the first guideline
through the design and development of Claret. Claret is a fast and portable parallel
weighted multi-dimensional scaling (WMDS) tool that can reduce the dimension of
high-dimensional data points. In demonstrating the second guideline, we identify combinations
of cancer-causing gene mutations by mapping the problem to a well known
computational problem known as the weighted set cover (WSC) problem. We have
scaled out the WSC algorithm on a hundred nodes of Summit supercomputer to solve
the problem in less than two hours instead of an estimated hundred years. In demonstrating
the third guideline, we developed a tool iBLAST to perform an incremental
sequence similarity search. This analysis was made possible by developing new statistics
to combine search results over time. We also explored various approaches to mitigate
the catastrophic forgetting of deep learning models, where a model forgets to perform
machine learning tasks efficiently on older data in a streaming setting.
Collections
- Doctoral Dissertations [14901]