Browsing by Author "Boedihardjo, Arnold P."
Now showing 1 - 6 of 6
- A Framework for the Expansion of Spatial Features Based on Semantic Footprints
  Dos Santos Jr, Raimundo F.; Boedihardjo, Arnold P.; Lu, Chang-Tien (Department of Computer Science, Virginia Polytechnic Institute & State University, 2011)
  Geographic feature expansion is a common task in Geographic Information Systems (GIS). Identifying and integrating geographic features is a challenging task since many of their spatial and non-spatial properties are described in different sources. We tackle this expansion problem by defining semantic footprints as a measure of similarity among features. Furthermore, we propose three quantifiers of semantic similarity: spatial, dimensional, and ontological affinity. We show how these measures dilute, concentrate, harden, or concede the feature space, and provide useful insights into the semantic relationships of the spatial entities. Experiments demonstrate the effectiveness of our approach in semantically associating the most appropriate spatial features.
- GLS-SOD: A Generalized Local Statistical Approach for Spatial Outlier Detection
  Chen, Feng; Lu, Chang-Tien; Boedihardjo, Arnold P. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2010-03-01)
  The local-based approach is a major category of methods for spatial outlier detection (SOD). Currently, there is a lack of systematic analysis of the statistical properties of this framework. For example, most methods assume independent and identically distributed normal distributions (i.i.d. normal) for the calculated local differences, but no justification for this critical assumption has been presented. The methods' detection performance on geostatistical data with a linear or nonlinear trend is also not well studied. In addition, there is a lack of theoretical connections and empirical comparisons between local- and global-based SOD approaches. This paper discusses all these fundamental issues under the proposed generalized local statistical (GLS) framework. Furthermore, robust estimation and outlier detection methods are designed for the new GLS model. Extensive simulations demonstrated that the SOD method based on the GLS model significantly outperformed all existing approaches when the spatial data exhibits a linear or nonlinear trend.
- Knowledge Discovery in Intelligence Analysis
  Butler, Patrick Julian Carey (Virginia Tech, 2014-06-03)
  Intelligence analysts today are faced with many challenges, chief among them being the need to fuse disparate streams of data, as well as rapidly arrive at analytical decisions and quantitative predictions for use by policy makers. These problems are further exacerbated by the sheer volume of data that is available to intelligence analysts. Machine learning methods enable the automated transduction of such large datasets from raw feeds to actionable knowledge, but successful use of such methods requires integrated frameworks for contextualizing them within the work processes of the analyst. Intelligence analysts typically distinguish between three classes of problems: collections, analysis, and operations. This dissertation specifically focuses on two problems in analysis: i) the reconstruction of shredded documents using a visual analytic framework combining computer vision techniques and user input, and ii) the design and implementation of a system for event forecasting which allows an analyst to not just consume forecasts of significant societal events but also understand the rationale behind these alerts and the use of data ablation techniques to determine the strength of conclusions. This work does not attempt to replace the role of the analyst with machine learning but instead outlines several methods to augment the analyst with machine learning. In doing so, this dissertation also explores the responsibilities of an analyst in evaluating complex models and decisions made by these models. Finally, this dissertation defines a list of responsibilities for models designed to aid the analyst's work in evaluating and verifying the models.
- On Locally Linear Classification by Pairwise Coupling
  Chen, Feng; Lu, Chang-Tien; Boedihardjo, Arnold P. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2008)
  Locally linear classification by pairwise coupling addresses a nonlinear classification problem in three basic phases: decompose the classes of complex concepts into linearly separable subclasses, learn a linear classifier for each pair, and combine pairwise classifiers into a single classifier. A number of methods have been proposed in this framework. However, these methods have several deficiencies: 1) lack of a systematic evaluation of the framework, 2) naive application of general clustering algorithms to generate subclasses, and 3) no valid method to estimate the optimal number of subclasses. This paper proves the equivalence between three popular combination schemas under general settings, defines several global criterion functions for measuring the goodness of subclasses, and presents a supervised greedy clustering algorithm to minimize the proposed criterion functions. Extensive experiments have also been conducted on a set of benchmark data to validate the effectiveness of the proposed techniques.
- Scalable Robust Models Under Adversarial Data Corruption
  Zhang, Xuchao (Virginia Tech, 2019-04-04)
  Noise and corruption in real-world data are inevitable and can be caused by accidental outliers, transmission loss, or even adversarial data attacks. Unlike traditional random noise, which is usually assumed to follow a specific distribution with a low corruption ratio, the data collected from crowdsourcing or labeled by weak annotators can contain adversarial data corruption. More challengingly, adversarial data corruption can be arbitrary and unbounded, and does not follow any specific distribution. In addition, in the era of data explosion, the fast-growing amount of data makes it more difficult for robust models to handle large-scale data sets. This thesis focuses on the development of methods for scalable robust models under the adversarial data corruption assumptions. Four methods are proposed, including robust regression via heuristic hard-thresholding, online and distributed robust regression with adversarial noises, self-paced robust learning for leveraging clean labels in noisy data, and robust regression via online feature selection with adversarial noises. Moreover, I extended the self-paced robust learning method to a distributed version for scalability, named distributed self-paced learning in alternating direction method of multipliers. Last, a robust multi-factor personality prediction model is proposed to handle correlated data noise. For the first method, existing solutions for robust regression lack a rigorous recovery guarantee for regression coefficients under adversarial data corruption with no prior knowledge of the corruption ratio.
  The proposed contributions of our work include: (1) Propose efficient algorithms to address the robust least-squares regression problem; (2) Design effective approaches to estimate the corruption ratio; (3) Provide a rigorous robustness guarantee for regression coefficient recovery; and (4) Conduct extensive experiments for performance evaluation. For the second method, existing robust learning methods typically focus on modeling the entire dataset at once; however, they may hit memory and computation bottlenecks as more and more datasets become too large to be handled as a whole. The proposed contributions of our work for this task include: (1) Formulate a framework for the scalable robust least-squares regression problem; (2) Propose online and distributed algorithms to handle the adversarial corruption; (3) Provide a rigorous robustness guarantee for regression coefficient recovery; and (4) Conduct extensive experiments for performance evaluation. For the third method, leveraging the prior knowledge of clean labels in noisy data is a crucial issue in practice, but existing robust learning methods typically focus more on eliminating noisy data. However, the data collected by weak annotators or crowdsourcing can be too noisy for existing robust methods to train an accurate model. Moreover, existing works that utilize additional clean labels are usually designed for specific problems such as image classification. These methods typically utilize clean labels in large-scale noisy data based on additional domain knowledge; however, they have difficulty handling extremely noisy data and rely heavily on that domain knowledge, which makes them difficult to use in more general problems.
  The proposed contributions of our work for this task include: (1) Formulating a framework to leverage the clean labels in noisy data; (2) Proposing a self-paced robust learning algorithm to train models under the supervision of clean labels; (3) Providing a theoretical analysis for the convergence of the proposed algorithm; and (4) Conducting extensive experiments for performance evaluation. For the fourth method, the presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem: learning reliable regression coefficients when features are not accessible entirely at one time. Until now, several important challenges still cannot be handled concurrently: 1) corrupted data estimation when only partial features are accessible; 2) online feature selection when data contains adversarial corruption; and 3) scaling to a massive dataset. This paper proposes a novel RObust regression algorithm via Online Feature Selection (RoOFS) that concurrently addresses all the above challenges. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. We also prove that our algorithm has a restricted error bound compared to the optimal solution. Extensive empirical experiments on both synthetic and real-world data sets demonstrated that our new method is more effective than existing methods in the recovery of both feature selection and regression coefficients, with very competitive efficiency. For the fifth method, existing self-paced learning approaches typically focus on modeling the entire dataset at once; however, this may introduce a bottleneck in terms of memory and computation, as today's fast-growing datasets are becoming too large to be handled as a whole.
  The proposed contributions of our work for this task include: (1) Reformulate the self-paced problem into a distributed setting; (2) Propose a distributed self-paced learning algorithm based on consensus ADMM to solve the SPL problem in a distributed setting; (3) Provide a theoretical analysis of the convergence of the proposed DSPL algorithm; and (4) Conduct extensive experiments on both synthetic and real-world data based on a robust regression task. For the last method, personality prediction in multiple factors, such as openness and agreeableness, is growing in interest, especially in the context of social media, which contains massive numbers of online posts and likes that can potentially reveal an individual's personality. However, the data collected from social media inevitably contains massive amounts of noise and corruption. In addressing it, traditional robust methods still suffer from several important challenges, including 1) existence of correlated corruption among multiple factors, 2) difficulty in estimating the corruption ratio in multi-factor data, and 3) scalability to massive datasets. This paper proposes a novel robust multi-factor personality prediction model that concurrently addresses all the above challenges by developing a distributed robust regression algorithm. Specifically, the algorithm optimizes the regression coefficients of each factor in parallel with a heuristically estimated corruption ratio and then consolidates the uncorrupted set from multiple factors using two strategies: global consensus and majority voting. We also prove that our algorithm benefits from strong guarantees in terms of convergence rates and coefficient recovery, and that it can be utilized as a generic framework for the multi-factor robust regression problem with the correlated corruption property.
  Extensive experiments on synthetic and real datasets demonstrate that our algorithm is superior to existing methods in both effectiveness and efficiency.
- Spatio-Temporal Storytelling on Twitter
  Dos Santos Jr, Raimundo F.; Shah, Sumit; Chen, Feng; Boedihardjo, Arnold P.; Butler, Patrick; Lu, Chang-Tien; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2013-12-16)
  Social media, e.g., Twitter, have provided us an unprecedented opportunity to observe events unfolding in real time. The rapid pace at which situations play out on social media necessitates new tools for capturing and summarizing the spatio-temporal progression of events. This technical report describes methods for generating dynamic real-world storylines from Twitter sources and shares the results of related experiments.
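
Illustrative note on the "Scalable Robust Models Under Adversarial Data Corruption" entry above: the abstract mentions robust regression via hard-thresholding and an iteratively updated "uncorrupted set". The sketch below is not the thesis's algorithm. It is a generic, simplified illustration of the idea, assuming a one-dimensional linear model and a known upper bound on the number of corrupted points (the thesis instead estimates the corruption ratio). The names `fit_line`, `robust_fit`, and `n_corrupted` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def robust_fit(xs, ys, n_corrupted, max_iter=50):
    """Alternate between fitting on the kept set and re-selecting it.

    n_corrupted is an ASSUMED known bound on the number of corrupted
    points; each round keeps the n - n_corrupted smallest residuals
    (the hard-thresholding step).
    """
    n = len(xs)
    keep = list(range(n))  # start by trusting every point
    for _ in range(max_iter):
        a, b = fit_line([xs[i] for i in keep], [ys[i] for i in keep])
        # Rank all points by absolute residual under the current fit.
        by_residual = sorted(range(n), key=lambda i: abs(ys[i] - (a * xs[i] + b)))
        new_keep = sorted(by_residual[: n - n_corrupted])
        if new_keep == keep:  # kept set is stable, so stop
            break
        keep = new_keep
    # Final fit on the final kept set.
    a, b = fit_line([xs[i] for i in keep], [ys[i] for i in keep])
    return a, b, keep


# Toy data: y = 2x + 1 with two planted corruptions at indices 3 and 7.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[3] = 100
ys[7] = -50

slope, intercept, kept = robust_fit(xs, ys, n_corrupted=2)
print(slope, intercept, kept)
```

On this toy data the two planted points are excluded from the kept set and the fit recovers a slope of 2 and an intercept of 1. With heavier or more adversarial corruption, a plain alternation like this can settle on a poor kept set, which is the kind of gap the recovery guarantees described in the abstract are meant to address.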