Data Analytics for Statistical Learning
Files
TR Number
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
The prevalence of big data has rapidly changed the usage and mechanisms of data analytics within organizations. Big data is a widely-used term without a clear definition. The difference between big data and traditional data can be characterized by four Vs: velocity (speed at which data is generated), volume (amount of data generated), variety (the data can take on different forms), and veracity (the data may be of poor/unknown quality). As many industries begin to recognize the value of big data, organizations try to capture it through means such as: side-channel data in a manufacturing operation, unstructured text-data reported by healthcare personnel, various demographic information of households from census surveys, and the range of communication data that define communities and social networks.
Big data analytics generally follows this framework: first, a digitized process generates a stream of data, this raw data stream is pre-processed to convert the data into a usable format, the pre-processed data is analyzed using statistical tools. In this stage, called statistical learning of the data, analysts have two main objectives (1) develop a statistical model that captures the behavior of the process from a sample of the data (2) identify anomalies in the process.
However, several open challenges still exist in this framework for big data analytics. Recently, data types such as free-text data are also being captured. Although many established processing techniques exist for other data types, free-text data comes from a wide range of individuals and is subject to syntax, grammar, language, and colloquialisms that require substantially different processing approaches. Once the data is processed, open challenges still exist in the statistical learning step of understanding the data.
Statistical learning aims to satisfy two objectives, (1) develop a model that highlights general patterns in the data (2) create a signaling mechanism to identify if outliers are present in the data. Statistical modeling is widely utilized as researchers have created a variety of statistical models to explain everyday phenomena such as predicting energy usage behavior, traffic patterns, and stock market behaviors, among others. However, new applications of big data with increasingly varied designs present interesting challenges. Consider the example of free-text analysis posed above. There's a renewed interest in modeling free-text narratives from sources such as online reviews, customer complaints, or patient safety event reports, into intuitive themes or topics. As previously mentioned, documents describing the same phenomena can vary widely in their word usage and structure.
Another recent interest area of statistical learning is using the environmental conditions that people live, work, and grow in, to infer their quality of life. It is well established that social factors play a role in overall health outcomes, however, clinical applications of these social determinants of health is a recent and an open problem. These examples are just a few of many examples wherein new applications of big data pose complex challenges requiring thoughtful and inventive approaches to processing, analyzing, and modeling data.
Although a large body of research exists in the area of anomaly detection increasingly complicated data sources (such as side-channel related data or network-based data) present equally convoluted challenges. For effective anomaly-detection, analysts define parameters and rules, so that when large collections of raw data are aggregated, pieces of data that do not conform are easily noticed and flagged.
In this work, I investigate the different steps of the data analytics framework and propose improvements for each step, paired with practical applications, to demonstrate the efficacy of my methods. This paper focuses on the healthcare, manufacturing and social-networking industries, but the materials are broad enough to have wide applications across data analytics generally. My main contributions can be summarized as follows:
• In the big data analytics framework, raw data initially goes through a pre-processing step. Although many pre-processing techniques exist, there are several challenges in pre-processing text data and I develop a pre-processing tool for text data.
• In the next step of the data analytics framework, there are challenges in both statistical modeling and anomaly detection
o I address the research area of statistical modeling in two ways:
- There are open challenges in defining models to characterize text data. I introduce a community extraction model that autonomously aggregates text documents into intuitive communities/groups
- In health care, it is well established that social factors play a role in overall health outcomes however developing a statistical model that characterizes these relationships is an open research area. I developed statistical models for generalizing relationships between social determinants of health of a cohort and general medical risk factors
o I address the research area of anomaly detection in two ways:
- A variety of anomaly detection techniques exist already, however, some of these methods lack a rigorous statistical investigation thereby making them ineffective to a practitioner. I identify critical shortcomings to a proposed network based anomaly detection technique and introduce methodological improvements
- Manufacturing enterprises which are now more connected than ever are vulnerably to anomalies in the form of cyber-physical attacks. I developed a sensor-based side-channel technique for anomaly detection in a manufacturing process