Algorithmic Distribution of Applied Learning on Big Data
MetadataShow full item record
Machine Learning and Graph techniques are complex and challenging to distribute. Generally, they are distributed by modeling the problem in a similar way as single node sequential techniques except applied on smaller chunks of data and compute and the results combined. These techniques focus on stitching the results from smaller chunks as the best possible way to have the outcome as close to the sequential results on entire data as possible. This approach is not feasible in numerous kernel, matrix, optimization, graph, and other techniques where the algorithm needs access to all the data during execution. In this work, we propose key-value pair based distribution techniques that are widely applicable to statistical machine learning techniques along with matrix, graph, and time series based algorithms. The crucial difference with previously proposed techniques is that all operations are modeled on key-value pair based fine or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only in robust disk-based frameworks but also in in-memory based systems without significant changes. Key-value pair based techniques also provide the ability to generate the same result as sequential techniques with no edge or overlap effects in structures such as graphs or matrices to resolve. This thesis focuses on key-value pair based distribution of applied machine learning techniques on a variety of problems. For the first method key-value pair distribution is used for storytelling at scale. Storytelling connects entities (people, organizations) using their observed relationships to establish meaningful storylines. When performed sequentially these computations become a bottleneck because the massive number of entities make space and time complexity untenable. We present DISCRN, or DIstributed Spatio-temporal ConceptseaRch based StorytelliNg, a distributed framework for performing spatio-temporal storytelling. The framework extracts entities from microblogs and event data, and links these entities using a novel ConceptSearch to derive storylines in a distributed fashion utilizing key-value pair paradigm. Performing these operations at scale allows deeper and broader analysis of storylines. The novel parallelization techniques speed up the generation and filtering of storylines on massive datasets. Experiments with microblog posts such as Twitter data and GDELT(Global Database of Events, Language and Tone) events show the efficiency of the techniques in DISCRN. The second work determines brand perception directly from people's comments in social media. Current techniques for determining brand perception, such as surveys of handpicked users by mail, in person, phone or online, are time consuming and increasingly inadequate. The proposed DERIV system distills storylines from open data representing direct consumer voice into a brand perception. The framework summarizes the perception of a brand in comparison to peer brands with in-memory key-value pair based distributed algorithms utilizing supervised machine learning techniques. Experiments performed with open data and models built with storylines of known peer brands show the technique as highly scalable and accurate in capturing brand perception from vast amounts of social data compared to sentiment analysis. The third work performs event categorization and prospect identification in social media. The problem is challenging due to endless amount of information generated daily. In our work, we present DISTL, an event processing and prospect identifying platform. It accepts as input a set of storylines (a sequence of entities and their relationships) and processes them as follows: (1) uses different algorithms (LDA, SVM, information gain, rule sets) to identify themes from storylines; (2) identifies top locations and times in storylines and combines with themes to generate events that are meaningful in a specific scenario for categorizing storylines; and (3) extracts top prospects as people and organizations from data elements contained in storylines. The output comprises sets of events in different categories and storylines under them along with top prospects identified. DISTL utilizes in-memory key-value pair based distributed processing that scales to high data volumes and categorizes generated storylines in near real-time. The fourth work builds flight paths of drones in a distributed manner to survey a large area taking images to determine growth of vegetation over power lines allowing for adjustment to terrain and number of drones and their capabilities. Drones are increasingly being used to perform risky and labor intensive aerial tasks cheaply and safely. To ensure operating costs are low and flights autonomous, their flight plans must be pre-built. In existing techniques drone flight paths are not automatically pre-calculated based on drone capabilities and terrain information. We present details of an automated flight plan builder DIMPL that pre-builds flight plans for drones tasked with surveying a large area to take photographs of electric poles to identify ones with hazardous vegetation overgrowth. DIMPL employs a distributed in-memory key-value pair based paradigm to process subregions in parallel and build flight paths in a highly efficient manner. The fifth work highlights scaling graph operations, particularly pruning and joins. Linking topics to specific experts in technical documents and finding connections between experts are crucial for detecting the evolution of emerging topics and the relationships between their influencers in state-of-the-art research. Current techniques that make such connections are limited to similarity measures. Methods based on weights such as TF-IDF and frequency to identify important topics and self joins between topics and experts are generally utilized to identify connections between experts. However, such approaches are inadequate for identifying emerging keywords and experts since the most useful terms in technical documents tend to be infrequent and concentrated in just a few documents. This makes connecting experts through joins on large dense graphs challenging. We present DIGDUG, a framework that identifies emerging topics by applying graph operations to technical terms. The framework identifies connections between authors of patents and journal papers by performing joins on connected topics and topics associated with the authors at scale. The problem of scaling the graph operations for topics and experts is solved through dense graph pruning and graph joins categorized under their own scalable separable dense graph class based on key-value pair distribution. Comparing our graph join and pruning technique against multiple graph and join methods in MapReduce revealed a significant improvement in performance using our approach.
General Audience Abstract
Distribution of Machine Learning and Graph algorithms is commonly performed by modeling the core algorithm in the same way as the sequential technique except implemented on distributed framework. This approach is satisfactory in very few cases, such as depth-first search and subgraph enumerations in graphs, k nearest neighbors, and few additional common methods. These techniques focus on stitching the results from smaller data or compute chunks as the best possible way to have the outcome as close to the sequential results on entire data as possible. This approach is not feasible in numerous kernel, matrix, optimization, graph, and other techniques where the algorithm needs to perform exhaustive computations on all the data during execution. In this work, we propose key-value pair based distribution techniques that are exhaustive and widely applicable to statistical machine learning algorithms along with matrix, graph, and time series based operations. The crucial difference with previously proposed techniques is that all operations are modeled as key-value pair based fine or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only in robust disk-based frameworks but also in in-memory based systems without significant changes. Key-value pair based techniques also provide the ability to generate the same result as sequential techniques with no edge or overlap effects in structures such as graphs or matrices to resolve.
- Doctoral Dissertations