Browsing by Author "Lu, Chang-Tien"
- Abnormal Pattern Recognition in Spatial Data
  Kou, Yufeng (Virginia Tech, 2006-11-29)
  In recent years, abnormal spatial pattern recognition has received a great deal of attention from both industry and academia, and has become an important branch of data mining. Abnormal spatial patterns, or spatial outliers, are those observations whose characteristics are markedly different from those of their spatial neighbors. The identification of spatial outliers can reveal hidden but valuable knowledge in many applications. For example, it can help locate extreme meteorological events such as tornadoes and hurricanes, identify aberrant genes or tumor cells, discover highway traffic congestion points, pinpoint military targets in satellite images, determine possible locations of oil reservoirs, and detect water pollution incidents. Numerous traditional outlier detection methods have been developed, but they cannot be directly applied to spatial data to extract abnormal patterns. Traditional outlier detection mainly focuses on "global comparison" and identifies deviations from the remainder of the entire data set. In contrast, spatial outlier detection concentrates on discovering neighborhood instabilities that break spatial continuity. In recent years, a number of techniques have been proposed for spatial outlier detection, but they have the following limitations. First, most of them focus primarily on single-attribute outlier detection. Second, they may not accurately locate outliers when multiple outliers exist in a cluster and correlate with each other. Third, the existing algorithms tend to abstract spatial objects as isolated points and do not consider their geometrical and topological properties, which may lead to inexact results. This dissertation reports a study of the problem of abnormal spatial pattern recognition and proposes a suite of novel algorithms. Contributions include: (1) formal definitions of various spatial outliers, including single-attribute outliers, multi-attribute outliers, and region outliers; (2) a set of algorithms for the accurate detection of single-attribute spatial outliers; (3) a systematic approach to identifying and tracking region outliers in continuous meteorological data sequences; (4) a novel Mahalanobis-distance-based algorithm to detect outliers with multiple attributes; (5) a set of graph-based algorithms to identify point outliers and region outliers; and (6) extensive analysis of experiments on several spatial data sets (e.g., West Nile virus data and NOAA meteorological data) to evaluate the effectiveness and efficiency of the proposed algorithms.
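For intuition, a minimal sketch of the multi-attribute idea behind contribution (4): score each observation by the Mahalanobis distance between its attribute vector and the average of its k nearest spatial neighbors. This is a generic reconstruction, not the dissertation's exact algorithm; all names and data below are illustrative.

```python
# Illustrative sketch (not the dissertation's algorithm): Mahalanobis-
# distance-based multi-attribute spatial outlier scoring. Each point is
# compared against the mean attribute vector of its k nearest neighbors.
import numpy as np

def spatial_outlier_scores(coords, attrs, k=5):
    """coords: (n, 2) locations; attrs: (n, d) attribute vectors."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)             # exclude self from neighbors
    nbrs = np.argsort(d2, axis=1)[:, :k]     # k nearest spatial neighbors
    diffs = attrs - attrs[nbrs].mean(axis=1) # deviation from neighborhood
    cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(diffs, rowvar=False)))
    # squared Mahalanobis distance of each neighborhood deviation
    return np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs)

rng = np.random.default_rng(0)
scores = spatial_outlier_scores(rng.random((50, 2)), rng.random((50, 3)))
# the largest scores mark points that break local spatial continuity
```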
- Adaptive graph convolutional imputation network for environmental sensor data recovery
  Chen, Fanglan; Wang, Dongjie; Lei, Shuo; He, Jianfeng; Fu, Yanjie; Lu, Chang-Tien (Frontiers, 2022-11)
  Environmental sensors are essential for tracking weather conditions and changing trends, thus preventing adverse effects on species and the environment. Missing values are inevitable in sensor recordings due to equipment malfunctions and measurement errors. Recent representation learning methods attempt to reconstruct missing values by capturing the temporal dependencies of sensor signals, as when handling time series data. However, existing approaches fall short of simultaneously capturing spatio-temporal dependencies in the network and fail to explicitly model sensor relations in a data-driven manner. In this work, we propose a novel Adaptive Graph Convolutional Imputation Network for missing value imputation in environmental sensor networks. A bidirectional graph convolutional gated recurrent unit module is introduced to extract spatio-temporal features, taking full advantage of the available observations from the target sensor and its neighboring sensors to recover the missing values. In addition, we design an adaptive graph learning layer that learns the sensor network topology in an end-to-end framework, so that no prior network information is needed to capture spatial dependencies. Extensive experiments on three real-world environmental sensor datasets (solar radiation, air quality, relative humidity) in both in-sample and out-of-sample settings demonstrate the superior performance of the proposed framework for completing missing values in environmental sensor networks, which could potentially support environmental monitoring and assessment.
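A hedged sketch of what an "adaptive graph learning" layer can look like: the sensor adjacency matrix is not given a priori but is parameterized by trainable node embeddings and learned end-to-end. The class and parameter names are assumptions, not the paper's code.

```python
# Minimal sketch of an adaptive adjacency learned from node embeddings;
# gradients from the imputation loss would flow into src/dst.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphLayer(nn.Module):
    def __init__(self, num_sensors, emb_dim=16):
        super().__init__()
        # two trainable embedding tables define a directed adjacency
        self.src = nn.Parameter(torch.randn(num_sensors, emb_dim))
        self.dst = nn.Parameter(torch.randn(num_sensors, emb_dim))

    def forward(self, x):
        # adjacency inferred purely from data; rows normalized via softmax
        adj = F.softmax(F.relu(self.src @ self.dst.t()), dim=1)
        return adj @ x        # one graph-convolution step over neighbors

layer = AdaptiveGraphLayer(num_sensors=10)
signals = torch.randn(10, 4)      # 4 observed time steps per sensor
smoothed = layer(signals)         # no prior topology was required
```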
- Advanced Projection Ultrasound Imaging with CMOS-based Sensor Array: Development, Characterization, and Potential Medical Applications
  Liu, Chu Chuan (Virginia Tech, 2009-12-17)
  Since the early 1960s, ultrasound has become one of the most widely used medical imaging modalities, as a diagnostic tool or an image guide for surgical intervention, because of its high portability, non-ionizing radiation, non-invasiveness, and low cost. Although continuous improvements in commercial equipment have been underway for many years, almost all systems are developed with pulse-echo geometry. In this research, a newly invented ultrasound sensor array was incorporated into the development of a projection imaging system. Three C-scan prototypes, comprising prototypes #1 and #2 and an ultrasound mammography system, were constructed. Systematic and evaluative studies, covering ultrasound CT, 3-D ultrasound, and multi-modality investigations, were also performed. Furthermore, a new analytical method to model the ultrasound forward scattering distribution (FSD) was developed by employing a specific annular apparatus. After applying this method, the scattering-corrected C-scan images revealed more detailed structures than the unprocessed images. This new analytical modelling approach is believed to be effective for most imaging systems operating in projection geometry. In summary, while awaiting additional clinical validation, the C-scan ultrasound prototypes with state-of-the-art PE-CMOS sensor arrays provide genuine value and hold real promise in medical diagnostic imaging. Potential future uses of C-scan ultrasound include, but are not limited to, computerized tomography, biopsy guidance, therapeutic device placement, foreign object detection, pediatric imaging, breast imaging, prostate imaging, human extremities imaging, and live animal imaging. With continuous research and development, we believe that C-scan ultrasound has the potential to make a significant impact in the field of medical ultrasound imaging.
- ALERTA-Net: A Temporal Distance-Aware Recurrent Networks for Stock Movement and Volatility Prediction
  Wang, Shengkun; Bai, Yangxiao; Fu, Kaiqun; Wang, Linhan; Lu, Chang-Tien; Ji, Taoran (ACM, 2023-11-06)
  For both investors and policymakers, forecasting the stock market is essential as it serves as an indicator of economic well-being. To this end, we harness the power of social media data, a rich source of public sentiment, to enhance the accuracy of stock market predictions. Diverging from conventional methods, we pioneer an approach that integrates sentiment analysis, macroeconomic indicators, search engine data, and historical prices within a multi-attention deep learning model, masterfully decoding the complex patterns inherent in the data. We showcase the state-of-the-art performance of our proposed model using a dataset, specifically curated by us, for predicting stock market movements and volatility.
- Algorithmic Distribution of Applied Learning on Big Data
  Shukla, Manu (Virginia Tech, 2020-10-16)
  Machine learning and graph techniques are complex and challenging to distribute. Generally, they are distributed by modeling the problem in much the same way as single-node sequential techniques, except applied to smaller chunks of data and compute, with the results then combined. These techniques focus on stitching the results from smaller chunks together so that the outcome is as close as possible to the sequential result on the entire data. This approach is not feasible in numerous kernel, matrix, optimization, graph, and other techniques where the algorithm needs access to all the data during execution. In this work, we propose key-value pair based distribution techniques that are widely applicable to statistical machine learning techniques along with matrix, graph, and time series based algorithms. The crucial difference from previously proposed techniques is that all operations are modeled as key-value pair based fine- or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only in robust disk-based frameworks but also in in-memory based systems without significant changes. Key-value pair based techniques also provide the ability to generate the same result as sequential techniques, with no edge or overlap effects to resolve in structures such as graphs or matrices. This thesis focuses on key-value pair based distribution of applied machine learning techniques on a variety of problems. In the first method, key-value pair distribution is used for storytelling at scale. Storytelling connects entities (people, organizations) using their observed relationships to establish meaningful storylines. When performed sequentially, these computations become a bottleneck because the massive number of entities makes space and time complexity untenable. We present DISCRN, or DIstributed Spatio-temporal ConceptseaRch based StorytelliNg, a distributed framework for performing spatio-temporal storytelling. The framework extracts entities from microblogs and event data, and links these entities using a novel ConceptSearch to derive storylines in a distributed fashion utilizing the key-value pair paradigm. Performing these operations at scale allows deeper and broader analysis of storylines. The novel parallelization techniques speed up the generation and filtering of storylines on massive datasets. Experiments with microblog posts such as Twitter data and GDELT (Global Database of Events, Language and Tone) events show the efficiency of the techniques in DISCRN. The second work determines brand perception directly from people's comments in social media. Current techniques for determining brand perception, such as surveys of handpicked users by mail, in person, by phone, or online, are time consuming and increasingly inadequate. The proposed DERIV system distills storylines from open data representing direct consumer voice into a brand perception. The framework summarizes the perception of a brand in comparison to peer brands with in-memory key-value pair based distributed algorithms utilizing supervised machine learning techniques. Experiments performed with open data and models built with storylines of known peer brands show the technique to be highly scalable and accurate in capturing brand perception from vast amounts of social data compared to sentiment analysis.
  The third work performs event categorization and prospect identification in social media. The problem is challenging due to the endless amount of information generated daily. In this work, we present DISTL, an event processing and prospect identifying platform. It accepts as input a set of storylines (a sequence of entities and their relationships) and processes them as follows: (1) uses different algorithms (LDA, SVM, information gain, rule sets) to identify themes from storylines; (2) identifies top locations and times in storylines and combines them with themes to generate events that are meaningful in a specific scenario for categorizing storylines; and (3) extracts top prospects as people and organizations from data elements contained in storylines. The output comprises sets of events in different categories, the storylines under them, and the top prospects identified. DISTL utilizes in-memory key-value pair based distributed processing that scales to high data volumes and categorizes generated storylines in near real time. The fourth work builds flight paths for drones in a distributed manner to survey a large area, taking images to determine the growth of vegetation over power lines, while adjusting to the terrain and to the number of drones and their capabilities. Drones are increasingly being used to perform risky and labor intensive aerial tasks cheaply and safely. To ensure operating costs are low and flights autonomous, their flight plans must be pre-built. In existing techniques, drone flight paths are not automatically pre-calculated based on drone capabilities and terrain information. We present details of an automated flight plan builder, DIMPL, that pre-builds flight plans for drones tasked with surveying a large area to take photographs of electric poles and identify ones with hazardous vegetation overgrowth. DIMPL employs a distributed in-memory key-value pair based paradigm to process subregions in parallel and build flight paths in a highly efficient manner. The fifth work highlights scaling graph operations, particularly pruning and joins. Linking topics to specific experts in technical documents and finding connections between experts are crucial for detecting the evolution of emerging topics and the relationships between their influencers in state-of-the-art research. Current techniques that make such connections are limited to similarity measures. Methods based on weights such as TF-IDF and frequency to identify important topics, and self-joins between topics and experts, are generally utilized to identify connections between experts. However, such approaches are inadequate for identifying emerging keywords and experts, since the most useful terms in technical documents tend to be infrequent and concentrated in just a few documents. This makes connecting experts through joins on large dense graphs challenging. We present DIGDUG, a framework that identifies emerging topics by applying graph operations to technical terms. The framework identifies connections between authors of patents and journal papers by performing joins on connected topics and topics associated with the authors at scale. The problem of scaling the graph operations for topics and experts is solved through dense graph pruning and graph joins, categorized under their own scalable separable dense graph class based on key-value pair distribution. Comparing our graph join and pruning technique against multiple graph and join methods in MapReduce revealed a significant improvement in performance using our approach.
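To make the key-value pair paradigm concrete, a toy sketch: every operation is expressed as a map that emits key-value pairs plus a reduce that merges values per key, so chunks can be processed independently with no edge or overlap effects to reconcile. The entity co-occurrence example is invented for illustration.

```python
# Toy map/reduce over key-value pairs; maps can run on separate chunks
# or nodes, and the reduce merges partial results per key.
from collections import defaultdict
from itertools import combinations

def map_storyline(doc):
    # emit one key-value pair per co-occurring entity pair (illustrative)
    for a, b in combinations(sorted(doc["entities"]), 2):
        yield (a, b), 1

def reduce_counts(pairs):
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value      # combine values per key
    return acc

docs = [{"entities": ["acme", "bob"]}, {"entities": ["acme", "bob", "cara"]}]
links = reduce_counts(kv for d in docs for kv in map_storyline(d))
# {('acme', 'bob'): 2, ('acme', 'cara'): 1, ('bob', 'cara'): 1}
```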
- Algorithms for Modeling Mass Movements and their Adoption in Social Networks
  Jin, Fang (Virginia Tech, 2016-08-23)
  Online social networks have become a staging ground for many modern movements, with the Arab Spring being the most prominent example. In an effort to understand and predict those movements, social media can be regarded as a valuable social sensor for disclosing underlying behaviors and patterns. To fully understand mass movement information propagation patterns in social networks, several problems need to be considered and addressed. Specifically, modeling mass movements that incorporate multiple spaces, a dynamic network structure, and misinformation propagation can be exceptionally useful in understanding information propagation in social media. This dissertation explores four research problems underlying efforts to identify and track the adoption of mass movements in social media. First, how do mass movements become mobilized on Twitter, especially in a specific geographic area? Second, can we detect protest activity in social networks by observing group anomalies in a graph? Third, how can we distinguish real movements from rumors or misinformation campaigns? And fourth, how can we infer the indicators of a specific type of protest, say a climate related protest? A fundamental objective of this research has been to conduct a comprehensive study of how mass movement adoption functions in social networks. For example, it may cross multiple spaces, evolve with dynamic network structures, or consist of swift outbreaks or long term slowly evolving transmissions. In many cases, it may also be mixed with misinformation campaigns, either deliberate or in the form of rumors. Each of those issues requires the development of new mathematical models and algorithmic approaches such as those explored here. This work aims to facilitate advances in information propagation, group anomaly detection, and misinformation distinction and, ultimately, help improve our understanding of mass movements and their adoption in social networks.
- Anomalous Information Detection in Social Media
  Tao, Rongrong (Virginia Tech, 2021-03-10)
  This dissertation focuses on identifying various types of anomalous information patterns in social media and news outlets. We focus on three types of anomalous information: (1) media censorship in news outlets, which is information that should be published but is actually missing; (2) fake news in social media, which is unreliable information shown to the public; and (3) media propaganda in news outlets, which is trustworthy information that is over-populated. For the first problem, existing approaches to censorship detection mostly rely on monitoring posts in social media. However, media censorship in news outlets has not received nearly as much attention, mostly because it is difficult to detect systematically. The contributions of our work include: (1) a hypothesis testing framework to identify and evaluate censored clusters of keywords, (2) a near-linear-time algorithm to identify the highest scoring clusters as indicators of censorship, and (3) extensive experiments on six Latin American countries for performance evaluation. For the second problem, existing approaches studying fake news in social media primarily focus on topic-level modeling or prediction based on a set of aggregated features from a collection of posts. However, the credibility of various information components within the same topic can be quite different. The contributions of our work in this space include: (1) a new benchmark dataset for fake news research, (2) a cluster-based approach to improve instance-level prediction of information credibility, and (3) extensive experiments for performance evaluation. For the last problem, existing approaches to media propaganda detection primarily focus on investigating the patterns of information shared over social media or evaluation by domain experts. However, these approaches cannot be generalized to a large-scale analysis of media propaganda in news outlets. The contributions of our work include: (1) non-parametric scan statistics to identify clusters of over-populated keywords, (2) a near-linear-time algorithm to identify the highest scoring clusters as indicators of propaganda, and (3) extensive experiments on two Latin American countries for performance evaluation.
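For intuition about the cluster-scoring step shared by the censorship and propaganda detectors, a much-simplified sketch using a Poisson likelihood-ratio score; the dissertation's nonparametric scan statistics differ in detail, so treat this purely as an illustration of the idea.

```python
# Simplified scoring of a candidate keyword cluster: how strongly does
# the observed count deviate from the expected (baseline) count?
import numpy as np

def cluster_score(observed, expected):
    """Poisson log-likelihood ratio; large when observed >> expected
    (propaganda-style over-population) or observed << expected
    (censorship-style suppression)."""
    o, e = float(observed), float(expected)
    if o == 0:
        return e                 # pure suppression term
    return o * np.log(o / e) - (o - e)

# a keyword cluster expected ~40 mentions but observed only 5
print(cluster_score(5, 40))      # large score flags potential censorship
```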
- Applying the 5S Framework To Integrating Digital Libraries
  Shen, Rao (Virginia Tech, 2006-04-17)
  We formalize the digital library (DL) integration problem and propose an overall approach based on the 5S (Streams, Structures, Spaces, Scenarios, and Societies) framework. We then apply that framework to integrate domain-specific (archaeological) DLs, illustrating our solutions for key problems in DL integration. An integrated archaeological DL, ETANA-DL, is used as a case study to justify and evaluate our DL integration approach. We develop a minimum metamodel for archaeological DLs within the 5S theory. We implement the 5SSuite toolkit set to cover the process of union DL generation, including requirements gathering, conceptual modeling, rapid prototyping, and code generation. 5SSuite consists of 5SGraph, 5SGen, and SchemaMapper, which plays an important role during integration. SchemaMapper, a visual mapping tool, maps the schemas of diverse DLs into a global schema for a union DL and generates a wrapper for each individual DL. Each wrapper transforms the metadata catalog of its DL into one conforming to the global schema. The converted catalogs are stored in the union catalog, so that the union DL has a global metadata format and union catalog. We also propose a formal approach to DL exploring services for integrated DLs based on 5S, which provides a systematic and functional method to design and implement DL exploring services. Finally, we propose a DL success model to assess integrated DLs from the perspective of DL end users by integrating 5S theory with diverse research on information systems success and adoption models, and information-seeking behavior models.
- Augmenting Dynamic Query Expansion in Microblog Texts
  Khandpur, Rupinder P. (Virginia Tech, 2018-08-17)
  Dynamic query expansion is a method of automatically identifying terms relevant to a target domain based on an incomplete query input. With the explosive growth of online media, such tools are essential for efficient search result refinement when tracking emerging themes in noisy, unstructured text streams. This is crucial for large-scale predictive analytics and decision-making systems, which use open source indicators to find meaningful information rapidly and accurately. The problems of information overload and semantic mismatch are systemic in the Information Retrieval (IR) tasks undertaken by such systems. In this dissertation, we develop approaches to dynamic query expansion algorithms that can help improve the efficacy of such systems using only a small set of seed queries, requiring no training or labeled samples. We primarily investigate four significant problems related to the retrieval and assessment of event-related information, viz. (1) How can we adapt the query expansion process to support rank-based analysis when tracking a fixed set of entities? A scalable framework is essential to allow relative assessment of emerging themes such as airport threats. (2) What visual knowledge discovery framework should we adopt to incorporate users' feedback into the search result refinement process? This is a crucial step toward efficiently integrating real-time 'situational awareness' when monitoring specific themes using open source indicators. (3) How can we contextualize query expansions? We focus on capturing the semantic relatedness between a query and reference text so that the process can quickly adapt to different target domains. (4) How can we synchronously perform knowledge discovery and characterization (unstructured to structured) during the retrieval process? We mainly aim to model high-order, relational aspects of event-related information from microblog texts.
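A minimal sketch of the dynamic query expansion loop, assuming term embeddings are available: starting from seed terms, repeatedly add the terms closest to the current query centroid, with no training or labels. The vocabulary and vectors below are toy data, not the dissertation's pipeline.

```python
# Iterative embedding-based query expansion from a small seed query.
import numpy as np

def expand_query(seeds, vocab_vecs, vocab, rounds=3, per_round=2):
    query = set(seeds)
    for _ in range(rounds):
        # centroid of the current query's term vectors
        centroid = np.mean([vocab_vecs[vocab.index(t)] for t in query], axis=0)
        sims = vocab_vecs @ centroid / (
            np.linalg.norm(vocab_vecs, axis=1) * np.linalg.norm(centroid) + 1e-9)
        ranked = [vocab[i] for i in np.argsort(-sims) if vocab[i] not in query]
        query.update(ranked[:per_round])   # the query adapts each round
    return query

vocab = ["protest", "rally", "march", "weather", "rain"]
vecs = np.array([[1, 0], [0.9, 0.1], [0.8, 0.2], [0, 1], [0.1, 0.9]])
print(expand_query(["protest"], vecs, vocab, rounds=2, per_round=1))
# {'protest', 'rally', 'march'} -- related terms pulled in, noise left out
```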
- Automated Urban Planning for Reimagining City Configuration via Adversarial Learning: Quantification, Generation, and Evaluation
  Wang, Dongjie; Fu, Yanjie; Liu, Kunpeng; Chen, Fanglan; Wang, Pengyang; Lu, Chang-Tien (ACM, 2022)
  Urban planning refers to the effort of designing land-use configurations for a given region. However, designing effective configurations is a time-consuming and labor-intensive process, which motivates us to ask: can AI accelerate the urban planning process, so that human planners only adjust generated configurations for specific needs? The recent advance of deep generative models inspires us to automate urban planning from an adversarial learning perspective. However, three major challenges arise: 1) how to define a quantitative land-use configuration? 2) how to automate configuration planning? 3) how to evaluate the quality of a generated configuration? In this paper, we systematically address the three challenges. Specifically, 1) we define a land-use configuration as a longitude-latitude-channel tensor. 2) We formulate the automated urban planning problem as a task of deep generative learning. The objective is to generate a configuration tensor given the surrounding contexts of a target region. In particular, we first construct spatial graphs using geographic and human mobility data to learn graph representations. We then combine each target area and its surrounding context representations as a tuple, and categorize all tuples into positive (well-planned areas) and negative samples (poorly-planned areas). Next, we develop an adversarial learning framework, in which a generator takes the surrounding context representations as input to generate a land-use configuration, and a discriminator learns to distinguish between positive and negative samples. 3) We provide quantitative evaluation metrics and conduct extensive experiments to demonstrate the effectiveness of our framework.
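A small sketch of challenge 1), the quantitative land-use configuration: a region is discretized into a longitude-latitude grid whose cells count points of interest per land-use channel. The function and inputs are illustrative, not the paper's code.

```python
# A land-use configuration as a longitude-latitude-channel tensor.
import numpy as np

def landuse_tensor(pois, lon_bins, lat_bins, n_channels):
    """pois: iterable of (lon_idx, lat_idx, channel) triples."""
    config = np.zeros((lon_bins, lat_bins, n_channels))
    for lon, lat, ch in pois:
        config[lon, lat, ch] += 1    # e.g. one more shop in this cell
    return config

# in the adversarial framework, a generator would output such a tensor
# and a discriminator would judge it against well-planned examples
config = landuse_tensor([(0, 1, 2), (0, 1, 2), (3, 3, 0)],
                        lon_bins=10, lat_bins=10, n_channels=5)
```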
- Automatic Question Answering and Knowledge Discovery from Electronic Health Records
  Wang, Ping (Virginia Tech, 2021-08-25)
  Electronic Health Records (EHR) data contain comprehensive longitudinal patient information, usually stored in databases in the form of either multi-relational structured tables or unstructured texts, e.g., clinical notes. EHRs provide a useful resource to assist doctors' decision making; however, they also present many unique challenges that limit the efficient use of the valuable information, such as large data volume, heterogeneous and dynamic information, medical term abbreviations, and the noise caused by misspelled words. This dissertation focuses on the development and evaluation of advanced machine learning algorithms to address the following research questions: (1) how to seek answers from EHRs for clinical activity related questions posed in human language without the assistance of database and natural language processing (NLP) domain experts; (2) how to discover underlying relationships of different events and entities in structured tabular EHRs; and (3) how to predict when a medical event will occur and estimate its probability based on previous medical information of patients. First, to automatically retrieve answers for natural language questions from the structured tables in EHRs, we study the question-to-SQL generation task by generating the corresponding SQL query of the input question. We propose a translation-edit model driven by a language generation module and an editing module for the SQL query generation task. This model helps automatically translate clinical activity related questions to SQL queries, so that doctors only need to provide their questions in natural language to get the answers they need. We also create a large-scale dataset for question answering on tabular EHRs to simulate a more realistic setting. Our performance evaluation shows that the proposed model is effective in handling the unique challenges of clinical terminology, such as abbreviations and misspelled words. Second, to automatically identify answers for natural language questions from unstructured clinical notes in EHRs, we propose to achieve this goal by querying a knowledge base constructed from fine-grained document-level expert annotations of clinical records for various NLP tasks. We first create a dataset for clinical knowledge base question answering with two parts: a clinical knowledge base and question-answer pairs. An attention-based aspect-level reasoning model is developed and evaluated on the new dataset. Our experimental analysis shows that it is effective in identifying answers and also allows us to analyze the impact of different answer aspects in predicting correct answers. Third, we focus on discovering underlying relationships of different entities (e.g., patient, disease, medication, and treatment) in tabular EHRs, which can be formulated as a link prediction problem in the graph domain. We develop a self-supervised learning framework for better representation learning of entities across a large corpus that also considers local contextual information for the downstream link prediction task. We demonstrate the effectiveness, interpretability, and scalability of the proposed model on the healthcare network built from tabular EHRs. It has also been successfully applied to solve link prediction problems in a variety of domains, such as e-commerce, social networks, and academic networks.
  Finally, to dynamically predict the occurrence of multiple correlated medical events, we formulate the problem as a temporal (multiple time-points) multi-task learning problem using a tensor representation. We propose an algorithm to jointly and dynamically predict several survival problems at each time point and optimize it with the Alternating Direction Method of Multipliers (ADMM) algorithm. The model allows us to consider both the dependencies between different tasks and the correlations of each task at different time points. We evaluate the proposed model on two real-world applications and demonstrate its effectiveness and interpretability.
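To make the question-to-SQL contract concrete, a toy sketch in which the "model" is a hard-coded template over a hypothetical labs table; in the dissertation, a trained translation-edit model generates the query instead.

```python
# Toy question-to-SQL setting: natural-language question in, SQL over
# tabular EHR out. The table schema and query here are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labs (patient_id INT, test TEXT, value REAL)")
conn.executemany("INSERT INTO labs VALUES (?, ?, ?)",
                 [(1, "glucose", 110.0), (1, "glucose", 95.0)])

question = "What is patient 1's average glucose?"
# a trained translation-edit model would generate this query
sql = "SELECT AVG(value) FROM labs WHERE patient_id = 1 AND test = 'glucose'"
print(conn.execute(sql).fetchone()[0])    # 102.5
```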
- Bayesian Integration and Modeling for Next-generation Sequencing Data Analysis
  Chen, Xi (Virginia Tech, 2016-07-01)
  Computational biology currently faces challenges in a big data world with thousands of data samples across multiple disease types, including cancer. The challenging problem is how to extract biologically meaningful information from large-scale genomic data. Next-generation Sequencing (NGS) can now produce high quality data at the DNA and RNA levels. However, in cells there exist many non-specific (background) signals that affect the detection accuracy of true (foreground) signals. In this dissertation work, under a Bayesian framework, we aim to develop and apply approaches to learn the distribution of genomic signals in each type of NGS data for reliable identification of specific foreground signals. We propose a novel Bayesian approach (ChIP-BIT) to reliably detect transcription factor (TF) binding sites (TFBSs) within promoter or enhancer regions by jointly analyzing the sample and input ChIP-seq data for one specific TF. Specifically, a Gaussian mixture model is used to capture both binding and background signals in the sample data, and background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. An Expectation-Maximization algorithm is used to learn the model parameters according to the distributions of binding signal intensity and binding locations. Extensive simulation studies and experimental validation both demonstrate that ChIP-BIT has significantly improved performance on TFBS detection over conventional methods, particularly on weak binding signal detection. To infer cis-regulatory modules (CRMs) of multiple TFs, we propose a Bayesian integration approach, BICORN, to integrate ChIP-seq and RNA-seq data from the same tissue. Each TFBS identified from ChIP-seq data can be either a functional binding event mediating target gene transcription or a non-functional binding. The functional bindings of a set of TFs usually work together as a CRM to regulate the transcription processes of a group of genes. We develop a Gibbs sampling approach to learn the distribution of CRMs (a joint distribution of multiple TFs) based on their functional bindings and target gene expression. The robustness of BICORN has been validated on simulated regulatory network and gene expression data under different noise settings. BICORN is further applied to breast cancer MCF-7 ChIP-seq and RNA-seq data to identify CRMs functional in promoter or enhancer regions. In tumor cells, the normal regulatory mechanism may be interrupted by genome mutations, especially somatic mutations that uniquely occur in tumor cells. Focusing on a specific type of genome mutation, structural variation (SV), we develop a novel pattern-based probabilistic approach, PSSV, to identify somatic SVs from whole genome sequencing (WGS) data. PSSV features a mixture model with hidden states representing different mutation patterns; PSSV can thus differentiate heterozygous and homozygous SVs in each sample, enabling the identification of somatic SVs with a heterozygous status in the normal sample and a homozygous status in the tumor sample. Simulation studies demonstrate that PSSV outperforms existing tools. PSSV has been successfully applied to breast cancer patient WGS data for identifying somatic SVs of key factors associated with breast cancer development.
In this dissertation research, we demonstrate the advantage of the proposed distributional learning-based approaches over conventional methods for NGS data analysis. Distributional learning is a very powerful approach to gain biological insights from high quality NGS data. Successful applications of the proposed Bayesian methods to breast cancer NGS data shed light on underlying molecular mechanisms of breast cancer, enabling biologists or clinicians to identify major cancer drivers and develop new therapeutics for cancer treatment.
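For intuition on the mixture modeling at the heart of ChIP-BIT, a compact sketch: read intensities as a two-component (background vs. binding) one-dimensional Gaussian mixture fit by expectation-maximization. This omits the model's binding-location priors and input-data background estimation; all data below are simulated.

```python
# Generic 1-D two-component GMM fit by EM (background vs. signal).
import numpy as np

def em_two_gaussians(x, iters=50):
    mu = np.array([x.min(), x.max()], dtype=float)
    sd = np.array([x.std(), x.std()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior that each point is background vs. binding
        lik = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and spreads
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return pi, mu, sd

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500),    # background reads
                    rng.normal(5, 1, 100)])   # binding signal
print(em_two_gaussians(x))
```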
- Bayesian Modeling for Isoform Identification and Phenotype-specific Transcript Assembly
  Shi, Xu (Virginia Tech, 2017-10-24)
  The rapid development of biotechnology has enabled researchers to collect high-throughput data for studying various biological processes at the genomic, transcriptomic, and proteomic levels. Due to the large noise in the data and the high complexity of diseases (such as cancer), it is a challenging task for researchers to extract biologically meaningful information that can help reveal the underlying molecular mechanisms. These challenges call for more effort in developing efficient and effective computational methods to analyze the data at different levels, so as to understand biological systems in different aspects. In this dissertation research, we have developed novel Bayesian approaches to infer alternative splicing mechanisms in biological systems using RNA sequencing data. Specifically, we focus on two research topics: isoform identification and phenotype-specific transcript assembly. For isoform identification, we develop a computational approach, SparseIso, to jointly model the existence and abundance of isoforms in a Bayesian framework. A spike-and-slab prior is incorporated into the model to enforce the sparsity of expressed isoforms. A Gibbs sampler is developed to iteratively sample the existence and abundance of isoforms. For transcript assembly, we develop a Bayesian approach, IntAPT, to assemble phenotype-specific transcripts from multiple RNA sequencing profiles. A two-layer Bayesian framework is used to model the existence of phenotype-specific transcripts and the transcript abundance in individual samples. Based on the hierarchical Bayesian model, a Gibbs sampling algorithm is developed to estimate the joint posterior distribution for phenotype-specific transcript assembly. The performance of our proposed methods is evaluated with simulation data, compared with existing methods, and benchmarked with real cell line data. We then apply our methods to breast cancer data to identify biologically meaningful splicing mechanisms associated with breast cancer. In future work, we will extend our methods for de novo transcript assembly to identify novel isoforms in biological systems, and we will incorporate isoform-specific networks into our methods to better understand splicing mechanisms in biological systems.
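A hedged sketch of the spike-and-slab idea in SparseIso: each candidate isoform carries a binary existence indicator, with abundance drawn from a broad slab distribution if it exists and pinned at zero otherwise. A full Gibbs sampler would resample indicators and abundances from the posterior given the reads; here we only draw from the prior to show how sparsity is enforced.

```python
# Drawing from a spike-and-slab prior over isoform abundances.
import numpy as np

rng = np.random.default_rng(0)

def spike_and_slab_prior(n_isoforms, p_exist=0.2, slab_scale=10.0):
    exists = rng.random(n_isoforms) < p_exist           # spike vs. slab
    abundance = np.where(exists,
                         rng.exponential(slab_scale, n_isoforms),  # slab
                         0.0)                                      # spike
    return exists, abundance

exists, abundance = spike_and_slab_prior(8)
# most candidate isoforms get zero abundance: the prior enforces sparsity
print(exists, abundance.round(2))
```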
- ‘Beating the news’ with EMBERS: Forecasting Civil Unrest using Open Source Indicators
  Ramakrishnan, Naren; Butler, Patrick; Self, Nathan; Khandpur, Rupinder P.; Saraf, Parang; Wang, Wei; Cadena, Jose; Vullikanti, Anil Kumar S.; Korkmaz, Gizem; Kuhlman, Christopher J.; Marathe, Achla; Zhao, Liang; Ting, Hua; Huang, Bert; Srinivasan, Aravind; Trinh, Khoa; Getoor, Lise; Katz, Graham; Doyle, Andy; Ackermann, Chris; Zavorin, Ilya; Ford, Jim; Summers, Kristen; Fayed, Youssef; Arredondo, Jaime; Gupta, Dipak; Mares, David; Muthia, Sathappan; Chen, Feng; Lu, Chang-Tien (2014)
  We describe the design, implementation, and evaluation of EMBERS, an automated, 24x7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources. Unlike retrospective studies, EMBERS has been making forecasts into the future since Nov 2012 which have been (and continue to be) evaluated by an independent T&E team (MITRE). Of note, EMBERS has successfully forecast the uptick and downtick of incidents during the June 2013 protests in Brazil. We outline the system architecture of EMBERS, individual models that leverage specific data sources, and a fusion and suppression engine that supports trading off specific evaluation criteria. EMBERS also provides an audit trail interface that enables the investigation of why specific predictions were made along with the data utilized for forecasting. Through numerous evaluations, we demonstrate the superiority of EMBERS over baserate methods and its capability to forecast significant societal happenings.
- Bivariate Best First Searches to Process Category Based Queries in a Graph for Trip Planning Applications in Transportation
  Lu, Qifeng (Virginia Tech, 2009-02-25)
  With technological advancements in computer science, Geographic Information Science (GIScience), and transportation, more and more complex path finding queries, including category based queries, have been proposed and studied across diverse disciplines. A category based query, such as an Optimal Sequenced Routing (OSR) query or a Trip Planning Query (TPQ), asks for a minimum-cost path that traverses a set of categories, with or without a predefined order, in a graph. Due to the extensive computing time required to process these complex queries in a large-scale environment, efficient algorithms are highly desirable whenever processing time is a consideration. In Artificial Intelligence (AI), a best first search is an informed heuristic path finding algorithm that uses domain knowledge as heuristics to expedite the search process. Traditional best first searches are single-variate in terms of the number of variables used to describe a state, and thus not appropriate for processing these queries in a graph. In this dissertation, 1) two new types of category based queries, the Category Sequence Traversal Query (CSTQ) and the Optimal Sequence Traversal Query (OSTQ), are proposed; 2) existing single-variate best first searches are extended to multivariate best first searches in terms of the state specified, and a class of new concepts (state graph, sub state graph, sub state graph space, local heuristic, local admissibility, local consistency, global heuristic, global admissibility, and global consistency) is introduced into best first searches; 3) two bivariate best first search algorithms, C* and O*, are developed to process CSTQ and OSTQ in a graph, respectively; 4) for each of C* and O*, theorems on optimality and optimal efficiency in a sub state graph space are developed and identified; 5) a family of algorithms including C*-P, C-Dijkstra, O*-MST, O*-SCDMST, O*-Dijkstra, and O*-Greedy is identified, and case studies are performed on path finding in transportation networks and/or fully connected graphs, either directed or undirected; and 6) O*-SCDMST is adopted to efficiently retrieve optimal solutions for OSTQ using the network distance metric in a large transportation network.
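As a much-simplified illustration of the bivariate-state idea (not the dissertation's C* or O* algorithms), the sketch below runs a best first search over (vertex, categories-visited) states with a zero heuristic, i.e. Dijkstra-like; the graph, categories, and costs are invented.

```python
# Best first search where the state is (vertex, set of categories
# visited), so the cheapest path covering all categories is found.
import heapq
from itertools import count

def category_search(graph, cats, start):
    """graph: {u: [(v, cost), ...]}; cats: {vertex: category}."""
    want = frozenset(cats.values())
    got0 = frozenset({cats[start]}) if start in cats else frozenset()
    tie = count()                      # tiebreaker so states never compare
    pq, seen = [(0, next(tie), start, got0)], set()
    while pq:
        cost, _, u, got = heapq.heappop(pq)
        if got == want:
            return cost                # all required categories covered
        if (u, got) in seen:
            continue
        seen.add((u, got))
        for v, w in graph.get(u, []):
            got2 = got | {cats[v]} if v in cats else got
            heapq.heappush(pq, (cost + w, next(tie), v, got2))
    return None

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
print(category_search(g, {"b": "gas", "c": "food"}, "a"))   # -> 2
```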
- Bridging the Gap between Spatial and Spectral Domains: A Unified Framework for Graph Neural Networks
  Chen, Zhiqian; Chen, Fanglan; Zhang, Lei; Ji, Taoran; Fu, Kaiqun; Zhao, Liang; Chen, Feng; Wu, Lingfei; Aggarwal, Charu; Lu, Chang-Tien (ACM, 2023-10)
  Deep learning's performance has been extensively recognized recently. Graph neural networks (GNNs) are designed to deal with graph-structural data that classical deep learning does not easily manage. Since most GNNs were created using distinct theories, direct comparisons are impossible. Prior research has primarily concentrated on categorizing existing models, with little attention paid to their intrinsic connections. The purpose of this study is to establish a unified framework that integrates GNNs based on spectral graph and approximation theory. The framework incorporates a strong integration between spatial- and spectral-based GNNs while tightly associating approaches that exist within each respective domain.
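One concrete instance of the spectral-spatial bridge the paper formalizes: a spectral filter written as a K-order polynomial of the graph Laplacian is exactly a spatial operation over K-hop neighborhoods. A minimal numpy sketch, with an invented toy graph:

```python
# Apply sum_k theta[k] * L^k @ x, L the symmetric normalized Laplacian;
# the k-th term touches exactly k-hop neighbors (spatial view).
import numpy as np

def poly_filter(adj, x, theta):
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    out, power = np.zeros_like(x), x.copy()
    for t in theta:
        out += t * power       # accumulate theta[k] * L^k x
        power = lap @ power
    return out

adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # path graph
x = np.array([[1.], [0.], [0.]])
print(poly_filter(adj, x, theta=[0.5, 0.3, 0.2]))
```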
- Building Matlab Standalone Package from Java for Differential Dependence Network Analysis Bioinformatics Toolkit
  Jin, Lu (Virginia Tech, 2010-05-26)
  This thesis reports a software development effort to transplant a Matlab algorithm into Matlab-license-free, platform dependent, Java based software. The result is almost equivalent to a direct translation of the Matlab source code into Java or any other programming language. Since the compiled library is platform dependent, an MCR (Matlab Compiler Runtime environment) is required and has been developed to deploy the transplanted algorithm to end users. As a result, the implemented MCR is free to distribute, and the streamlined transplantation process is much simpler and more reliable than manual translation work. In addition, the implementation methodology reported here can be reused for other similar software engineering tasks. There are four main construction steps in our software package development. First, all Matlab *.m files or *.mex files associated with the algorithms of interest (to be transplanted) are gathered, and the corresponding shared library is created by the Matlab Compiler. Second, a Java driver is created that will serve as the final user interface. This Java based user interface takes care of all the input and output of the original Matlab algorithm and prepares all native methods. Third, assisted by JNI, a C driver is implemented to manage the variable transfer between Matlab and Java. Lastly, the Matlab mbuild function is used to compile the C driver and the aforementioned shared library into a dependent library, ready to be called from the standalone Java interface. We use a caBIG™ (Cancer Biomedical Informatics Grid) data analytic toolkit, namely the DDN (differential dependence network) algorithm, as the testbed in the software development. The developed DDN standalone package can be used on any Matlab-supported platform with a Java GUI (Graphic User Interface) or command line parameters. As a caBIG™ toolkit, the DDN package can be integrated into other information systems such as Taverna or G-DOC. The major benefits provided by the proposed methodology can be summarized as follows. First, the proposed software development framework offers a simple and effective way for algorithm developers to provide novel bioinformatics tools to biomedical end-users, where the frequent obstacle is the lack of a language-specific software runtime environment and incompatibility between the compiled software and the computer platforms available at users' sites. Second, the proposed software development framework offers software developers a significant time- and effort-saving method for translating code between different programming languages, where the majority of a software developer's time and effort is spent on understanding the specific analytic algorithm and its language-specific codes rather than on developing efficient and platform/user-friendly software. Third, the proposed methodology allows software engineers to focus their effort on the quality of software rather than the details of the original source codes, where the only required information is the inputs and outputs of the algorithm. Specifically, all used variables and functions are mapped between Matlab, C, and Java, handled solely by our designated C driver.
- Change Management of Long Term Composed Services
  Liu, Xumin (Virginia Tech, 2009-07-28)
  We propose a framework for managing changes in Long term Composed Services (LCSs). The key components of the proposed framework include a Web Service Change Management Language (SCML), change enactment, and change optimization. The SCML is a formal language to specify top-down changes. It is built upon a formal model which consists of a Web service ontology and an LCS schema. The Web service ontology gives a semantic description of the important features of a service, including functionality, quality, and context. The LCS schema gives a high-level overview of an LCS's key features. A top-down change is first specified as a modification of an LCS schema. Change enactment is the process of reacting to a top-down change. It consists of two subcomponents: change reaction and change verification. The change reaction component implements the proposed change operators by modifying an LCS schema and the membership of Web services. The change verification component ensures that the correctness of an LCS is maintained during the process of change reaction. We propose a set of algorithms for the processes of change reaction and verification. The change optimization component selects the Web services that participate in an LCS to ensure that the change has been reacted to in the best way. We propose a two-phase optimization process to select services using both service reputation and service quality. We present a change management system that implements the proposed approaches. We also conduct a set of simulations to assess its performance.
- Citation Forecasting with Multi-Context Attention-Aided Dependency Modeling
  Ji, Taoran; Self, Nathan; Fu, Kaiqun; Chen, Zhiqian; Ramakrishnan, Naren; Lu, Chang-Tien (ACM, 2024)
  Forecasting citations of scientific patents and publications is a crucial task for understanding the evolution and development of technological domains and for foresight into emerging technologies. By construing citations as a time series, the task can be cast into the domain of temporal point processes. Most existing work on forecasting with temporal point processes, both conventional and neural network-based, only performs single-step forecasting. In citation forecasting, however, the more salient goal is n-step forecasting: predicting the arrival of the next n citations. In this paper, we propose Dynamic Multi-Context Attention Networks (DMA-Nets), a novel deep learning sequence-to-sequence (Seq2Seq) model with a novel hierarchical dynamic attention mechanism for long-term citation forecasting. Extensive experiments on two real-world datasets demonstrate that the proposed model learns better representations of conditional dependencies over historical sequences compared to state-of-the-art counterparts and thus achieves significant performance for citation predictions.
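A toy sketch of the temporal point process framing, assuming a classical self-exciting (Hawkes) intensity with exponential decay: n-step forecasting amounts to simulating the next n arrival times, here via Ogata thinning. DMA-Nets replaces this fixed-form intensity with a learned neural model; this only illustrates the setting.

```python
# Each past citation raises the rate of future ones; simulate the next
# n arrivals by thinning (intensity decays between events, so the
# intensity at the current time is a valid upper bound).
import numpy as np

rng = np.random.default_rng(1)

def intensity(t, history, mu=0.2, alpha=0.5, beta=1.0):
    return mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in history)

def forecast_next_n(history, n):
    events, t = list(history), max(history, default=0.0)
    for _ in range(n):
        while True:
            lam_bar = intensity(t, events)        # upper bound at time t
            t += rng.exponential(1.0 / lam_bar)   # candidate arrival
            if rng.random() <= intensity(t, events) / lam_bar:
                events.append(t)                  # accepted citation
                break
    return events[len(history):]

print(forecast_next_n([0.5, 1.1, 1.3], n=5))      # next 5 arrival times
```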
- CLUR: Uncertainty Estimation for Few-Shot Text Classification with Contrastive Learning
  He, Jianfeng; Zhang, Xuchao; Lei, Shuo; Alhamadani, Abdulaziz; Chen, Fanglan; Xiao, Bei; Lu, Chang-Tien (ACM, 2023-08-06)
  Few-shot text classification has extensive application where sample collection is expensive or complicated. When the penalty for classification errors is high, such as early threat event detection with scarce data, we expect to know “whether we should trust the classification results or reexamine them.” This paper investigates Uncertainty Estimation for Few-shot Text Classification (UEFTC), an unexplored research area. Given limited samples, a UEFTC model predicts an uncertainty score for a classification result, which is the likelihood that the classification result is false. However, many traditional uncertainty estimation models in text classification are unsuitable for implementing a UEFTC model. These models require numerous training samples, whereas the few-shot setting in UEFTC provides only a few, or just one, support sample for each class in an episode. We propose Contrastive Learning from Uncertainty Relations (CLUR) to address UEFTC. CLUR can be trained with only one support sample for each class with the help of pseudo uncertainty scores. Unlike previous works that manually set the pseudo uncertainty scores, CLUR self-adaptively learns them using our proposed uncertainty relations. Specifically, we explore four model structures in CLUR to investigate the performance of three commonly used contrastive learning components in UEFTC and find that two of the components are effective. Experimental results show that CLUR outperforms six baselines on four datasets, including an improvement of 4.52% AUPR on an RCV1 dataset in a 5-way 1-shot setting. Our code and data split for UEFTC are at https://github.com/he159ok/CLUR_UncertaintyEst_FewShot_TextCls.
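A loose sketch of one commonly used contrastive learning component of the kind CLUR investigates: an NT-Xent-style loss that pulls two views of the same example together and pushes different examples apart. CLUR's pseudo uncertainty score learning is more involved and not shown here.

```python
# NT-Xent-style contrastive loss over two views of a batch of examples.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: (n, d) embeddings of two views of the same n examples."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))   # a view never matches itself
    n = z1.size(0)
    # row i's positive is its other view: i+n for the first half, i-n after
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 32), torch.randn(8, 32))
```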