Browsing by Author "Lu, Chang Tien"
- Analyzing Networks with Hypergraphs: Detection, Classification, and Prediction. Alkulaib, Lulwah Ahmad KH M. (Virginia Tech, 2024-04-02). Recent advances in large graph-based models have shown great performance in a variety of tasks, including node classification, link prediction, and influence modeling. However, these graph-based models struggle to capture high-order relations and interactions among entities effectively, leading them to underperform in many real-world scenarios. This thesis focuses on analyzing networks using hypergraphs for detection, classification, and prediction methods in social media-related problems. In particular, we study five specific applications with five proposed novel methods: detecting topic-specific influential users and tweets via hypergraphs; detecting spatiotemporal, topic-specific, influential users and tweets using hypergraphs; augmenting data in hypergraphs to mitigate class imbalance issues; introducing a novel hypergraph convolutional network model designed for the multiclass classification of mental health advice in Arabic tweets; and extending that model to sarcasm detection in multiple low-resource languages. For the first method, existing solutions for influential user detection did not consider topics, which could produce incorrect results and inadequate performance in that task. The proposed contributions of our work include: 1) Developing a hypergraph framework that detects influential users and tweets. 2) Proposing an effective topic modeling method for short texts. 3) Performing extensive experiments to demonstrate the efficacy of our proposed framework. For the second method, we extend the first method by incorporating spatiotemporal information into our solution. Existing influencer detection methods do not consider spatiotemporal influencers in social media, although influence can be greatly affected by geolocation and time. The contributions of our work for this task include: 1) Proposing a hypergraph framework that spatiotemporally detects influential users and tweets. 2) Developing an effective topic modeling method for short texts that geographically provides the topic distribution. 3) Designing a spatiotemporal topic-specific influencer user ranking algorithm. 4) Performing extensive experiments to demonstrate the efficacy of our proposed framework. For the third method, we address the challenge of bot detection on the social media platform X, where there is an inherent imbalance between genuine users and bots, a key factor leading to biased classifiers. Our approach leverages the rich structure of hypergraphs to represent X users and their interactions, providing a novel foundation for effective bot detection. The contributions of our work include: 1) Introducing a hypergraph representation of the X platform, where user accounts are nodes and their interactions form hyperedges, capturing the intricate relationships between users. 2) Developing HyperSMOTE to generate synthetic bot accounts within the hypergraph, ensuring a balanced training dataset while preserving the hypergraph's structure and semantics. 3) Designing a hypergraph neural network specifically for bot detection, utilizing node and hyperedge information for accurate classification. 4) Conducting comprehensive experiments to validate the effectiveness of our methods, particularly in scenarios with pronounced class imbalances. For the fourth method, we introduce a Hypergraph Convolutional Network model for classifying mental health advice in Arabic tweets.
Our model distinguishes between valid and misleading advice, leveraging high-order word relations in short texts through hypergraph structures. Our extensive experiments demonstrate its effectiveness over existing methods. The key contributions of our work include: 1) Developing a hypergraph-based model for short text multiclass classification, capturing complex word relationships through hypergraph convolution. 2) Defining four types of hyperedges to encapsulate local and global contexts and semantic similarities in our dataset. 3) Conducting comprehensive experiments in which the proposed model outperforms several baseline models in classifying Arabic tweets, demonstrating its superiority. For the fifth method, we extended our previous Hypergraph Convolutional Network (HCN) model to be tailored for sarcasm detection across multiple low-resource languages. Our model excels in interpreting the subtle and context-dependent nature of sarcasm in short texts by exploiting the power of hypergraph structures to capture complex, high-order relationships among words. Through the construction of three hyperedge types, our model navigates the intricate semantic and sentiment differences that characterize sarcastic expressions. The key contributions of our research are as follows: 1) A hypergraph-based model was adapted for the task of sarcasm detection in five short low-resource language texts, allowing the model to capture semantic relationships and contextual cues through advanced hypergraph convolution techniques. 2) Introducing a comprehensive framework for constructing hyperedges, incorporating short text, semantic similarity, and sentiment discrepancy hyperedges, which together enrich the model's ability to understand and detect sarcasm across diverse linguistic contexts. 3) The extensive evaluations reveal that the proposed hypergraph model significantly outperforms a range of established baseline methods in the domain of multilingual sarcasm detection, establishing new benchmarks for accuracy and generalizability in detecting sarcasm within low-resource languages.
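As an aside for readers unfamiliar with the core operation these hypergraph methods share, the following is a minimal sketch of a single hypergraph convolution layer using the standard incidence-matrix normalization found in HGNN-style formulations; the class name, shapes, and toy data are illustrative assumptions, not the thesis code.

```python
# Minimal hypergraph convolution sketch: X' = relu(Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta)
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, X, H):
        # X: (num_nodes, in_dim) node features
        # H: (num_nodes, num_edges) binary incidence matrix (node i belongs to hyperedge e)
        Dv = H.sum(dim=1).clamp(min=1)           # node degrees
        De = H.sum(dim=0).clamp(min=1)           # hyperedge degrees
        Dv_inv_sqrt = torch.diag(Dv.pow(-0.5))
        De_inv = torch.diag(De.pow(-1.0))
        A = Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt   # normalized node-to-node propagation
        return torch.relu(A @ self.theta(X))

# toy usage: 4 nodes, 2 hyperedges
H = torch.tensor([[1., 0.], [1., 1.], [0., 1.], [1., 0.]])
X = torch.randn(4, 8)
out = HypergraphConv(8, 16)(X, H)   # (4, 16)
```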
- Autonomous Cyber Defense for Resilient Cyber-Physical Systems. Zhang, Qisheng (Virginia Tech, 2024-01-09). In this dissertation research, we design and analyze resilient cyber-physical systems (CPSs) under high network dynamics, adversarial attacks, and various uncertainties. We focus on three key system attributes to build resilient CPSs by developing a suite of autonomous cyber defense mechanisms. First, we consider network adaptability to achieve the resilience of a CPS. Network adaptability represents the network's ability to maintain its security and connectivity level when faced with incoming attacks. We address this by network topology adaptation, which can quickly identify and update the network topology to confuse attacks by changing attack paths. We leverage deep reinforcement learning (DRL) to develop CPSs using network topology adaptation. Second, we consider the fault tolerance of a CPS as another attribute to ensure system resilience. We aim to build a resilient CPS under severe resource constraints, adversarial attacks, and various uncertainties. We choose a solar sensor-based smart farm as one example of a CPS application and develop a resource-aware monitoring system for smart farms. We leverage DRL and uncertainty quantification using a belief theory called Subjective Logic to optimize critical tradeoffs between system performance and security in contested CPS environments. Lastly, we study system resilience in terms of system recoverability, which refers to the system's ability to recover from performance degradation or failure. In this task, we mainly focus on developing an automated intrusion response system (IRS) for CPSs. We aim to design the IRS with effective and efficient responses by reducing the false alarm rate and defense cost, respectively. Specifically, we build a lightweight IRS for an in-vehicle controller area network (CAN) bus system operating with DRL-based autonomous driving.
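For readers unfamiliar with reinforcement-learning-driven topology adaptation, the toy sketch below illustrates the underlying idea with a tabular Q-learning update over hypothetical "rewire one link" actions; the dissertation uses deep RL, so treat this purely as a conceptual stand-in with made-up states and rewards.

```python
# Illustrative tabular Q-learning loop for a toy topology-adaptation defense action.
import random

n_states, n_actions = 5, 3          # e.g., coarse security levels x candidate rewirings
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, eps = 0.1, 0.9, 0.2

def env_step(state, action):
    # hypothetical environment: reward trades off connectivity against attack exposure
    reward = random.uniform(-1, 1)
    next_state = random.randrange(n_states)
    return next_state, reward

state = 0
for _ in range(1000):
    action = random.randrange(n_actions) if random.random() < eps \
             else max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = env_step(state, action)
    # standard Q-learning temporal-difference update
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state
```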
- Bilevel Optimization in the Deep Learning Era: Methods and ApplicationsZhang, Lei (Virginia Tech, 2024-01-05)Neural networks, coupled with their associated optimization algorithms, have demonstrated remarkable efficacy and versatility across an extensive array of tasks, encompassing image recognition, speech recognition, object detection, sentiment analysis, and more. The inherent strength of neural networks lies in their capability to autonomously learn intricate representations that map input data to corresponding output labels seamlessly. Nevertheless, not all tasks can be neatly encapsulated within the confines of an end-to-end learning paradigm. The complexity and diversity of real-world challenges necessitate innovative approaches that extend beyond conventional formulations. This calls for the exploration of specialized architectures and optimization strategies tailored to the unique intricacies of specific tasks, ensuring a more nuanced and effective solution to the myriad demands of diverse applications. The bi-level optimization problem stands out as a distinctive form of optimization, characterized by the embedding or nesting of one problem within another. Its relevance persists significantly in the current era dominated by deep learning. A notable instance of its application in the realm of deep learning is observed in hyperparameter optimization. In the context of neural networks, the automatic training of weights through backpropagation represents a crucial aspect. However, certain hyperparameters, such as the learning rate (lr) and the number of layers, must be predetermined and cannot be optimized through the conventional chain rule employed in backpropagation. This underscores the importance of bi-level optimization in addressing the intricate task of fine-tuning these hyperparameters to enhance the overall performance of deep learning models. The domain of deep learning presents a fertile ground for further exploration and discoveries in optimization. The untapped potential for refining hyperparameters and optimizing various aspects of neural network architectures highlights the ongoing opportunities for advancements and breakthroughs in this dynamic field. Within this thesis, we delve into significant bi-level optimization challenges, applying these techniques to pertinent real-world tasks. Given that bi-level optimization entails dual layers of optimization, we explore scenarios where neural networks are present in the upper-level, the inner-level, or both. To be more specific, we systematically investigate four distinct tasks: optimizing neural networks towards optimizing neural networks, optimizing attractors towards optimizing neural networks, optimizing graph structures towards optimizing neural network performance, and optimizing architecture towards optimizing neural networks. For each of these tasks, we formulate the problems using the bi-level optimization approach mathematically, introducing more efficient optimization strategies. Furthermore, we meticulously evaluate the performance and efficiency of our proposed techniques. Importantly, our methodologies and insights transcend the realm of bi-level optimization, extending their applicability broadly to various deep learning models. The contributions made in this thesis offer valuable perspectives and tools for advancing optimization techniques in the broader landscape of deep learning.
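The hyperparameter-optimization example discussed above can be made concrete with a minimal one-step-unrolled bilevel sketch, in which an outer loop tunes a log learning rate by differentiating through a single inner gradient step; the quadratic losses and variable names are placeholders, not the thesis's formulation.

```python
# One-step-unrolled bilevel optimization sketch: inner problem trains weights w,
# outer problem tunes a hyperparameter (log learning rate) via the hypergradient.
import torch

w = torch.randn(10, requires_grad=True)          # inner-level weights
log_lr = torch.tensor(-2.0, requires_grad=True)  # outer-level hyperparameter
outer_opt = torch.optim.Adam([log_lr], lr=0.01)

def inner_loss(w):  return ((w - 1.0) ** 2).sum()     # stand-in training loss
def outer_loss(w):  return ((w - 1.2) ** 2).sum()     # stand-in validation loss

for _ in range(100):
    lr = log_lr.exp()
    g = torch.autograd.grad(inner_loss(w), w, create_graph=True)[0]
    w_next = w - lr * g                     # differentiable inner update
    val = outer_loss(w_next)                # outer objective on updated weights
    outer_opt.zero_grad()
    val.backward()                          # hypergradient with respect to log_lr
    outer_opt.step()
    w = w_next.detach().requires_grad_(True)
```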
- Building a Trustworthy Question Answering System for Covid-19 Tracking. Liu, Yiqing (Virginia Tech, 2021-09-02). During the unprecedented global pandemic of Covid-19, the general public is suffering from inaccurate Covid-19 related information, including outdated information and fake news. The most widely used media (TV, social media, newspapers, and radio) struggle to provide the certainty and rapid updates that people are seeking. To cope with this challenge, several public data resources dedicated to providing Covid-19 information emerged. They rallied experts from different fields to provide authoritative and up-to-date pandemic updates. However, the general public still cannot make full use of such resources, since the learning curve is too steep, especially for older and younger users. To address this problem, in this Thesis we propose a question answering system that can be interacted with using simple natural language-based sentences. While building this system, we investigate qualified public data resources and, from the data content they provide, collect a set of frequently asked questions for Covid-19 tracking. We further build a dedicated dataset named CovidQA for evaluating the performance of the question answering system with different models. Based on the new dataset, we assess multiple machine learning-based models built for retrieving relevant information from databases, and then propose two empirical models which utilize pre-defined templates to generate SQL queries. In our experiments, we demonstrate both quantitative and qualitative results and provide a comprehensive comparison between different types of methods. The results show that the proposed template-based methods are simple but effective in building question answering systems for specific domain problems.
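A toy illustration of the template-based SQL idea described above follows; the table name, columns, and question patterns are hypothetical and only show how a pre-defined template might be slot-filled from a natural-language question.

```python
# Sketch of template-based SQL generation for Covid-19 tracking questions.
import re

TEMPLATES = [
    (re.compile(r"how many (?:new )?cases in (?P<loc>[\w\s]+) on (?P<date>[\d-]+)", re.I),
     "SELECT new_cases FROM covid_stats WHERE location = '{loc}' AND date = '{date}';"),
    (re.compile(r"total deaths in (?P<loc>[\w\s]+)", re.I),
     "SELECT SUM(deaths) FROM covid_stats WHERE location = '{loc}';"),
]

def question_to_sql(question: str):
    for pattern, template in TEMPLATES:
        match = pattern.search(question)
        if match:
            return template.format(**{k: v.strip() for k, v in match.groupdict().items()})
    return None  # no template matched; fall back to a learned model

print(question_to_sql("How many new cases in Virginia on 2021-08-15?"))
# SELECT new_cases FROM covid_stats WHERE location = 'Virginia' AND date = '2021-08-15';
```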
- Can an LLM find its way around a Spreadsheet?Lee, Cho Ting (Virginia Tech, 2024-06-05)Spreadsheets are routinely used in business and scientific contexts, and one of the most vexing challenges data analysts face is performing data cleaning prior to analysis and evaluation. The ad-hoc and arbitrary nature of data cleaning problems, such as typos, inconsistent formatting, missing values, and a lack of standardization, often creates the need for highly specialized pipelines. We ask whether an LLM can find its way around a spreadsheet and how to support end-users in taking their free-form data processing requests to fruition. Just like RAG retrieves context to answer users' queries, we demonstrate how we can retrieve elements from a code library to compose data processing pipelines. Through comprehensive experiments, we demonstrate the quality of our system and how it is able to continuously augment its vocabulary by saving new codes and pipelines back to the code library for future retrieval.
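The retrieval idea described above can be sketched as embedding-similarity lookup over a small code library; the hashing-based embedding function and the snippet names below are stand-ins, not the system's actual encoder or library.

```python
# Sketch of retrieving reusable data-cleaning snippets by embedding similarity.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # hypothetical stand-in encoder: hash words into a normalized bag-of-words vector
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

code_library = {
    "drop missing rows": "df.dropna(inplace=True)",
    "strip whitespace in column": "df[col] = df[col].str.strip()",
    "parse dates in column": "df[col] = pd.to_datetime(df[col], errors='coerce')",
}
index = {name: embed(name) for name in code_library}

def retrieve(request: str, k: int = 2):
    q = embed(request)
    ranked = sorted(index.items(), key=lambda kv: -float(q @ kv[1]))
    return [(name, code_library[name]) for name, _ in ranked[:k]]

print(retrieve("remove rows with missing values"))
```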
- CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study. Djoufack Basso, Larissa (Virginia Tech, 2022-06-21). Satellite imagery research used to be an expensive research topic for companies and organizations due to limited data and compute resources. As computing power and storage capacity grow exponentially, a large amount of aerial and satellite images are generated and analyzed every day for various applications. Current technological advancement and extensive data collection by numerous Internet of Things (IoT) devices and platforms have greatly increased the amount of labeled natural images. Such data availability catalyzed the development and performance of current state-of-the-art image classification and cross-modal models. Despite the abundance of publicly available remote sensing images, very few remote sensing (RS) images are labeled and even fewer are multi-captioned. These scarcities limit the scope of fine-tuned state-of-the-art models to at most 38 classes, based on PatternNet, one of the largest publicly available labeled RS datasets. Recent state-of-the-art image-to-image retrieval and detection models in RS have shown great results. Because text-to-image retrieval of RS images is still emerging, it still faces challenges, namely the inaccurate retrieval of image categories that were not present in the training dataset and the retrieval of images from descriptive input. Motivated by those shortcomings in current cross-modal remote sensing image retrieval, we propose CLIP-RS, a cross-modal remote sensing image retrieval platform. Our proposed framework combines a fine-tuned implementation of a recent state-of-the-art cross-modal and text-based image retrieval model, Contrastive Language Image Pre-training (CLIP), with FAISS (Facebook AI Similarity Search), a library for efficient similarity search. Our implementation is deployed as a web app for text-to-image and image-to-image retrieval of RS images collected via the Mapbox GL JS API. We used the free tier of the Mapbox GL JS API and took advantage of its raster tiles option to locate the retrieved results on a local map composed of the downloaded raster tiles. Other options offered on our platform are image similarity search, locating an image on the map, and viewing images' geocoordinates and addresses. In this work we also propose two remote sensing fine-tuned models and conduct a comparative analysis of our proposed models with a different fine-tuned model as well as the zero-shot CLIP model on remote sensing data.
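A minimal sketch of the similarity-search component follows, indexing precomputed CLIP-style embeddings with FAISS; the random vectors below stand in for real CLIP text and image features, so this shows only the indexing and lookup pattern.

```python
# Text-to-image retrieval over precomputed embeddings with FAISS (cosine via inner product).
import numpy as np
import faiss

dim, n_images = 512, 1000
image_embeddings = np.random.randn(n_images, dim).astype("float32")  # stand-in CLIP image features
faiss.normalize_L2(image_embeddings)

index = faiss.IndexFlatIP(dim)
index.add(image_embeddings)

query = np.random.randn(1, dim).astype("float32")  # stand-in CLIP text embedding, e.g. "parking lot"
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)               # indices and scores of the 5 most similar images
print(ids[0], scores[0])
```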
- Continuously Extensible Information Systems: Extending the 5S Framework by Integrating UX and WorkflowsChandrasekar, Prashant (Virginia Tech, 2021-06-11)In Virginia Tech's Digital Library Research Laboratory, we support subject-matter-experts (SMEs) in their pursuit of research goals. Their goals include everything from data collection to analysis to reporting. Their research commonly involves an analysis of an extensive collection of data such as tweets or web pages. Without support -- such as by our lab, developers, or data analysts/scientists -- they would undertake the data analysis themselves, using available analytical tools, frameworks, and languages. Then, to extract and produce the information needed to achieve their goals, the researchers/users would need to know what sequences of functions or algorithms to run using such tools, after considering all of their extensive functionality. Our research addresses these problems directly by designing a system that lowers the information barriers. Our approach is broken down into three parts. In the first two parts, we introduce a system that supports discovery of both information and supporting services. In the first part, we describe the methodology that incorporates User eXperience (UX) research into the process of workflow design. Through the methodology, we capture (a) what are the different user roles and goals, (b) how we break down the user goals into tasks and sub-tasks, and (c) what functions and services are required to solve each (sub-)task. In the second part, we identify and describe key components of the infrastructure implementation. This implementation captures the various goals/tasks/services associations in a manner that supports information inquiry of two types: (1) Given an information goal as query, what is the workflow to derive this information? and (2) Given a data resource, what information can we derive using this data resource as input? We demonstrate both parts of the approach, describing how we teach and apply the methodology, with three case studies. In the third part of this research, we rely on formalisms used in describing digital libraries to explain the components that make up the information system. The formal description serves as a guide to support the development of information systems that generate workflows to support SME information needs. We also specifically describe an information system meant to support information goals that relate to Twitter data.
- Crowd Compositions for Bias Detection and Mitigation in Predicting RecidivismMhatre, Sakshi Manish (Virginia Tech, 2024-09-30)This thesis explores an approach to predicting recidivism by leveraging crowdsourcing, contrasting traditional judicial discretion and algorithmic models. Instead of relying on judges or algorithms, participants predicted the likelihood of re-offending using the COMPAS dataset, which includes demographic and criminal record information. The study analyzed both quantitative and qualitative data to assess biases in human versus algorithmic predictions. Findings reveal that homogeneous crowds reflect the biases of their composition, leading to more pronounced gender and racial biases. In contrast, heterogeneous crowds, with equal and random distributions, present a more balanced view, though underlying biases still emerge. Both gender and racial biases influence how re-offending risk is perceived, significantly impacting risk evaluations. Specifically, crowds rated African American offenders as less likely to re-offend compared to COMPAS, which assigned them higher risk scores, while Caucasian and Hispanic offenders were perceived as more likely to re-offend by crowds. Gender differences also emerged, with males rated as less likely to re-offend and females as more likely. This study highlights crowdsourcing's potential to mitigate biases and provides insights into balancing consistency and fairness in risk assessments.
- Detecting Irregular Network Activity with Adversarial Learning and Expert Feedback. Rathinavel, Gopikrishna (Virginia Tech, 2022-06-15). Anomaly detection is a ubiquitous and challenging task relevant across many disciplines. With the vital role communication networks play in our daily lives, the security of these networks is imperative for the smooth functioning of society. This thesis proposes a novel self-supervised deep learning framework, CAAD, for anomaly detection in wireless communication systems. Specifically, CAAD employs powerful adversarial learning and contrastive learning techniques to learn effective representations of normal and anomalous behavior in wireless networks. Rigorous performance comparisons of CAAD with several state-of-the-art anomaly detection techniques have been conducted, verifying that CAAD yields a mean performance improvement of 92.84%. Additionally, CAAD is augmented with the ability to systematically incorporate expert feedback through a novel contrastive learning feedback loop to improve the learned representations and thereby reduce prediction uncertainty (CAAD-EF). CAAD-EF is a novel, holistic, and widely applicable solution to anomaly detection.
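As an illustration of the contrastive-learning ingredient named above, here is a minimal NT-Xent-style loss over two augmented views of a batch; it shows the general technique, not the CAAD implementation.

```python
# Minimal NT-Xent contrastive loss sketch for self-supervised representation learning.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # z1, z2: (batch, dim) embeddings of two augmented views of the same samples
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2B, d)
    sim = z @ z.T / temperature                              # pairwise cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])    # each view's positive pair
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 32), torch.randn(8, 32))
```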
- Estimate Flood Damage Using Satellite Images and Twitter Data. Sun, Stephen Wei-Hao (Virginia Tech, 2022-06-03). It has become obvious that climate change is a critical topic for human society. As climate change becomes more severe, natural disasters caused by it have increasingly impacted humans. Most recently, Hurricane Ida killed 43 people across four states. Hurricane Ida's damage could top $95 billion, and many meteorologists predict that climate change is making storms wetter and wider. Thus, there is an urgent need to predict how much damage a flood will cause and prepare for possible destruction. Most current flood damage estimation systems do not use social media data. The theme of this thesis was to evaluate the feasibility of using machine learning models to predict hurricane damage, with social media and satellite imagery as the input data. This work involves developing a data mining approach and several machine learning models that extract features from the data. Satellite imagery is used to identify changes in building structures as well as landscapes, and Twitter data is used to identify damaged locations and the severity of the damage. The features of Twitter posts and satellite imagery were extracted through pre-trained GloVe, ResNet, and VGG models separately. The embedding features were then fed to MLP models for damage level estimation. The models were trained and evaluated on the data. Finally, a case study was performed on the test dataset for hints on improving the models.
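A minimal sketch of the fusion step, concatenating text and image embeddings before an MLP classifier, is shown below; the dimensions echo typical GloVe and ResNet feature sizes, but the module, layer widths, and number of damage levels are illustrative assumptions.

```python
# Sketch of fusing text and image embeddings in an MLP for damage-level classification.
import torch
import torch.nn as nn

class DamageMLP(nn.Module):
    def __init__(self, text_dim=300, image_dim=2048, n_levels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, n_levels),
        )

    def forward(self, text_feat, image_feat):
        # concatenate the two modalities, then classify
        return self.net(torch.cat([text_feat, image_feat], dim=-1))

model = DamageMLP()
logits = model(torch.randn(16, 300), torch.randn(16, 2048))   # (16, 4)
```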
- A Framework for Automated Discovery and Analysis of Suspicious Trade Records. Datta, Debanjan (Virginia Tech, 2022-05-27). Illegal logging and timber trade present a persistent threat to global biodiversity and national security due to their ties with illicit financial flows, and they cause revenue loss. The scale of global commerce in timber and associated products, combined with the complexity and geographical spread of the supply chain entities, presents a non-trivial challenge in detecting such transactions. International shipment records, specifically those containing bills of lading, are a key source of data that can be used to detect, investigate, and act upon such transactions. The comprehensive problem can be described as building a framework that can perform automated discovery and facilitate actionability on detected transactions. A data-driven, machine learning-based approach is necessitated by the volume, velocity, and complexity of international shipping data. Such an automated framework can immensely benefit our targeted end-users---specifically the enforcement agencies. This overall problem comprises multiple connected sub-problems with associated research questions. We incorporate crucial domain knowledge---in terms of data as well as modeling---by employing the expertise of collaborating domain specialists from ecological conservationist agencies. The collaborators provide formal and informal inputs spanning the stages from requirement specification to design. Following the paradigm of similar problems such as fraud detection explored in prior literature, we formulate the core problem of discovering suspicious transactions as an anomaly detection task. The first sub-problem is to build a system that can be used to find suspicious transactions in shipment data pertaining to imports and exports of multiple countries with different country-specific schemas. We present a novel anomaly detection approach for multivariate categorical data, following constraints of data characteristics, combined with a data pipeline that incorporates domain knowledge. The focus of the second problem is U.S.-specific imports, where data characteristics differ from the prior sub-problem, with heterogeneous attributes present. This problem is important since the U.S. is a top consumer and there is scope for actionable enforcement. For this we present a contrastive learning-based anomaly detection model for heterogeneous tabular data, with performance and scalability characteristics applicable to real-world trade data. While the first two problems address the task of detecting suspicious trades through anomaly detection, a practical challenge with anomaly detection-based systems is that of relevancy or scenario-specific precision. The third sub-problem addresses this through a human-in-the-loop approach augmented by visual analytics, to re-rank anomalies in terms of relevance, providing explanations for the causes of anomalies and soliciting feedback. The last sub-problem pertains to explainability and actionability towards suspicious records, through algorithmic recourse. Algorithmic recourse aims to provide meaningful alternatives for flagged anomalous records, such that those counterfactual examples are not judged anomalous by the underlying anomaly detection system. This can help enforcement agencies advise verified trading entities in modifying their trading patterns to avoid false detection, thus streamlining the process.
We present a novel formulation and metrics for this unexplored problem of algorithmic recourse in anomaly detection, and a deep learning-based approach towards explaining anomalies and generating counterfactuals. Thus the overall research contributions presented in this dissertation address the requirements of the framework and have general applicability in similar scenarios beyond the scope of this framework.
- Generative Chatbot Framework for Cybergrooming PreventionWang, Pei (Virginia Tech, 2021-12-20)Cybergrooming refers to the crime of establishing personal close relationships with potential victims, commonly teens, for the purpose of sexual exploitation or abuse via online social media platforms. Cybergrooming has been recognized as a serious social problem. However, there have been insufficient programs to provide proactive prevention to protect the youth users from cybergrooming. In this thesis, we present a generative chatbot framework, called SERI (Stop cybERgroomIng), that can generate simulated conversations between a perpetrator chatbot and a potential victim chatbot. To realize the simulation of authentic conversations in the context of cybergrooming, we take deep reinforcement learning (DRL)-based dialogue generation to simulate the authentic conversations between a perpetrator and a potential victim. The design and development of the SERI are motivated to provide a safe and authentic chatting environment to enhance the youth's precautionary awareness and sensitivity of cybergrooming while any unnecessary ethical issues (e.g., the potential misuse of the SERI) are removed or minimized. We developed the SERI as a preliminary platform that the perpetrator chatbot can be deployed in social media environments to interact with human users (i.e., youth) and observe the conversations that the youth users respond to strangers or acquaintances when they are asked for private or sensitive information by the perpetrator. We evaluated the quality of conversations generated by the SERI based on open-source, referenced, and unreferenced metrics as well as human evaluation. The evaluation results show that the SERI can generate authentic conversations between two chatbots compared to the original conversations from the used datasets in perplexity and MaUde scores.
- Information Extraction of Technical Details From Scholarly ArticlesKaushal, Kulendra Kumar (Virginia Tech, 2021-06-16)Researchers have made significant progress in information extraction from short documents in the last few years, including social media interaction, news articles, and email excerpts. This research aims to extract technical entities like hardware resources, computing platforms, compute time, programming language, and libraries from scholarly research articles. Research articles are generally long documents having both salient as well as non-salient entities. Analyzing the cross-sectional relation, filtering the relevant information, measuring the saliency of mentioned entities, and extracting novel entities are some of the technical challenges involved in this research. This work presents a detailed study about the performance, effectiveness, and scalability of rule-based weakly supervised algorithms. We also develop an automated end-to-end Research Entity and Relationship Extractor (E2R Extractor). Additionally, we perform a comprehensive study about the effectiveness of existing deep learning-based information extraction tools like Dygie, Dygie++, SciREX. The research also contributes a dataset containing novel entities annotated in BILUO format and represents the baseline results using the E2R extractor on the proposed dataset. The results indicate that the E2R extractor successfully extracts salient entities from research articles.
- Integrated Predictive Modeling and Analytics for Crisis ManagementAlhamadani, Abdulaziz Abdulrhman (Virginia Tech, 2024-05-15)The surge in the application of big data and predictive analytics in fields of crisis management, such as pandemics and epidemics, highlights the vital need for advanced research in these areas, particularly in the wake of the COVID-19 pandemic. Traditional methods, which typically rely on historical data to forecast future trends, fall short in addressing the complex and ever-changing nature of challenges like pandemics and public health crises. This inadequacy is further underscored by the pandemic's significant impact on various sectors, notably healthcare, government, and the hotel industry. Current models often overlook key factors such as static spatial elements, socioeconomic conditions, and the wealth of data available from social media, which are crucial for a comprehensive understanding and effective response to these multifaceted crises. This thesis employs spatial forecasting and predictive analytics to address crisis management in several distinct but interrelated contexts: the COVID-19 pandemic, the opioid crisis, and the impact of the pandemic on the hotel industry. The first part of the study focuses on using big data analytics to explore the relationship between socioeconomic factors and the spread of COVID-19 at the zip code level, aiming to predict high-risk areas for infection. The second part delves into the opioid crisis, utilizing semi-supervised deep learning techniques to monitor and categorize drug-related discussions on Reddit. The third part concentrates on developing spatial forecasting and providing explanations of the rising epidemic of drug overdose fatalities. The fourth part of the study extends to the realm of the hotel industry, aiming to optimize customer experience by analyzing online reviews and employing a localized Large Language Model to generate future customer trends and scenarios. Across these studies, the thesis aims to provide actionable insights and comprehensive solutions for effectively managing these major crises. For the first work, the majority of current research in pandemic modeling primarily relies on historical data to predict dynamic trends such as COVID-19. This work makes the following contributions in spatial COVID-19 pandemic forecasting: 1) the development of a unique model solely employing a wide range of socioeconomic indicators to forecast areas most susceptible to COVID-19, using detailed static spatial analysis, 2) identification of the most and least influential socioeconomic variables affecting COVID-19 transmission within communities, 3) construction of a comprehensive dataset that merges state-level COVID-19 statistics with corresponding socioeconomic attributes, organized by zip code. For the second work, we make the following contributions in detecting drug Abuse crisis via social media: 1) enhancing the Dynamic Query Expansion (DQE) algorithm to dynamically detect and extract evolving drug names in Reddit comments, utilizing a list curated from government and healthcare agencies, 2) constructing a textual Graph Convolutional Network combined with word embeddings to achieve fine-grained drug abuse classification in Reddit comments, identifying seven specific drug classes for the first time, 3) conducting extensive experiments to validate the framework, outperforming six baseline models in drug abuse classification and demonstrating effectiveness across multiple types of embeddings. 
The third study focuses on developing spatial forecasting and providing explanations of the escalating epidemic of drug overdose fatalities. Current research in this field has shown a deficiency in comprehensive explanations of the crisis, spatial analyses, and predictions of high-risk zones for drug overdoses. Addressing these gaps, this study contributes in several key areas: 1) Establishing a framework for spatially forecasting drug overdose fatalities predominantly affecting U.S. counties, 2) Proposing solutions for dealing with scarce and heterogeneous data sets, 3) Developing an algorithm that offers clear and actionable insights into the crisis, and 4) Conducting extensive experiments to validate the effectiveness of our proposed framework. In the fourth study, we address the profound impact of the pandemic on the hotel industry, focusing on the optimization of customer experience. Traditional methodologies in this realm have predominantly relied on survey data and limited segments of social media analytics. Those methods are informative but fall short of providing a full picture due to their inability to include diverse perspectives and broader customer feedback. Our study aims to make the following contributions: 1) the development of an integrated platform that distinguishes and extracts positive and negative Memorable Experiences (MEs) from online customer reviews within the hotel industry, 2) The incorporation of an advanced analytical module that performs temporal trend analysis of MEs, utilizing sophisticated data mining algorithms to dissect customer feedback on a monthly and yearly scale, 3) the implementation of an advanced tool that generates prospective and unexplored Memorable Experiences (MEs) by utilizing a localized Large Language Model (LLM) with keywords extracted from authentic customer experiences to aid hotel management in preparing for future customer trends and scenarios. Building on the integrated predictive modeling approaches developed in the earlier parts of this dissertation, this final section explores the significant impacts of the COVID-19 pandemic on the airline industry. The pandemic has precipitated substantial financial losses and operational disruptions, necessitating innovative crisis management strategies within this sector. This study introduces a novel analytical framework, EAGLE (Enhancing Airline Groundtruth Labels and Review rating prediction), which utilizes Large Language Models (LLMs) to improve the accuracy and objectivity of customer sentiment analysis in strategic airline route planning. EAGLE leverages LLMs for zero-shot pseudo-labeling and zero-shot text classification, to enhance the processing of customer reviews without the biases of manual labeling. This approach streamlines data analysis, and refines decision-making processes which allows airlines to align route expansions with nuanced customer preferences and sentiments effectively. The comprehensive application of LLMs in this context underscores the potential of predictive analytics to transform traditional crisis management strategies by providing deeper, more actionable insights.
- Learning with Limited Labeled Data: Techniques and ApplicationsLei, Shuo (Virginia Tech, 2023-10-11)Recent advances in large neural network-style models have demonstrated great performance in various applications, such as image generation, question answering, and audio classification. However, these deep and high-capacity models require a large amount of labeled data to function properly, rendering them inapplicable in many real-world scenarios. This dissertation focuses on the development and evaluation of advanced machine learning algorithms to solve the following research questions: (1) How to learn novel classes with limited labeled data, (2) How to adapt a large pre-trained model to the target domain if only unlabeled data is available, (3) How to boost the performance of the few-shot learning model with unlabeled data, and (4) How to utilize limited labeled data to learn new classes without the training data in the same domain. First, we study few-shot learning in text classification tasks. Meta-learning is becoming a popular approach for addressing few-shot text classification and has achieved state-of-the-art performance. However, the performance of existing approaches heavily depends on the interclass variance of the support set. To address this problem, we propose a TART network for few-shot text classification. The model enhances the generalization by transforming the class prototypes to per-class fixed reference points in task-adaptive metric spaces. In addition, we design a novel discriminative reference regularization to maximize divergence between transformed prototypes in task-adaptive metric spaces to improve performance further. In the second problem we focus on self-learning in cross-lingual transfer task. Our goal here is to develop a framework that can make the pretrained cross-lingual model continue learning the knowledge with large amount of unlabeled data. Existing self-learning methods in crosslingual transfer tasks suffer from the large number of incorrectly pseudo-labeled samples used in the training phase. We first design an uncertainty-aware cross-lingual transfer framework with pseudo-partial-labels. We also propose a novel pseudo-partial-label estimation method that considers prediction confidences and the limitation to the number of candidate classes. Next, to boost the performance of the few-shot learning model with unlabeled data, we propose a semi-supervised approach for few-shot semantic segmentation task. Existing solutions for few-shot semantic segmentation cannot easily be applied to utilize image-level weak annotations. We propose a class-prototype augmentation method to enrich the prototype representation by utilizing a few image-level annotations, achieving superior performance in one-/multi-way and weak annotation settings. We also design a robust strategy with softmasked average pooling to handle the noise in image-level annotations, which considers the prediction uncertainty and employs the task-specific threshold to mask the distraction. Finally, we study the cross-domain few-shot learning in the semantic segmentation task. Most existing few-shot segmentation methods consider a setting where base classes are drawn from the same domain as the new classes. Nevertheless, gathering enough training data for meta-learning is either unattainable or impractical in many applications. 
We extend few-shot semantic segmentation to a new task, called Cross-Domain Few-Shot Semantic Segmentation (CD-FSS), which aims to generalize the meta-knowledge from domains with sufficient training labels to low-resource domains. Then, we establish a new benchmark for the CD-FSS task and evaluate both representative few-shot segmentation methods and transfer learning-based methods on the proposed benchmark. We then propose a novel Pyramid-Anchor-Transformation-based few-shot segmentation network (PATNet), in which domain-specific features are transformed into domain-agnostic ones so that downstream segmentation modules can rapidly adapt to unseen domains.
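The class-prototype idea underlying these few-shot methods can be illustrated with a minimal prototypical-network episode, classifying queries by distance to class prototypes; this sketches the general metric-space approach rather than the TART or PATNet models, and the embedding dimensions and labels are toy values.

```python
# Minimal prototypical-network episode: prototypes = mean support embeddings,
# queries are scored by negative distance to each prototype.
import torch
import torch.nn.functional as F

def proto_classify(support, support_labels, query, n_classes):
    # support: (n_support, d), query: (n_query, d)
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(n_classes)])       # (n_classes, d)
    dists = torch.cdist(query, prototypes)                       # (n_query, n_classes)
    return F.log_softmax(-dists, dim=1)                          # closer prototype = higher score

support = torch.randn(10, 64)
labels = torch.tensor([0] * 5 + [1] * 5)
query = torch.randn(5, 64)
log_probs = proto_classify(support, labels, query, n_classes=2)  # (5, 2)
```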
- Leverage Fusion of Sentiment Features and Bert-based Approach to Improve Hate Speech Detection. Cheng, Kai Hsiang (Virginia Tech, 2022-06-23). Social media has become an important place for modern people to conveniently share and exchange their ideas and opinions. However, not all content on social media has a positive impact. Hate speech is one kind of harmful content, in which people use abusive language to attack or promote hatred towards a specific group or an individual. With online hate speech on the rise these days, people have explored ways to automatically recognize hate speech, and among the approaches studied, the BERT-based approach is promising and thus dominated SemEval-2019 Task 6, a hate speech detection competition. In this work, a method that fuses sentiment features with a BERT-based approach is proposed. The classic BERT architecture for hate speech detection is modified to fuse in additional sentiment features, provided by an extractor pre-trained on Sentiment140. The proposed model is compared with the top-3 models in SemEval-2019 Task 6 Subtask A and achieves an 83.1% F1 score, better than the models in the competition. Also, to see whether additional sentiment features benefit the detection of hate speech, the features are fused with three kinds of deep learning architectures respectively. The results show that the models with sentiment features perform better than those without.
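A minimal sketch of the fusion idea, concatenating a BERT sentence representation with externally supplied sentiment features before a linear classifier, is shown below; the three-dimensional sentiment vector is a stand-in for the Sentiment140-pretrained extractor's output, and the checkpoint name and layer sizes are assumptions for illustration.

```python
# Sketch of fusing a BERT [CLS] embedding with sentiment features for hate speech classification.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(768 + 3, 2)                    # hate / not-hate over fused features

def predict(text, sentiment_feat):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        cls = bert(**enc).last_hidden_state[:, 0]     # (1, 768) [CLS] representation
    fused = torch.cat([cls, sentiment_feat], dim=-1)  # append sentiment features
    return classifier(fused)

logits = predict("some tweet text", torch.tensor([[0.1, 0.7, 0.2]]))
```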
- Machine learning enabled bioinformatics tools for analysis of biologically diverse samples. Lu, Yingzhou (Virginia Tech, 2023-08-25). Advanced molecular profiling technologies, utilizing the entire human genome, have opened new avenues to study biological systems. Recent decades have seen the generation of vast volumes of multi-omics data spanning a broad range of phenotypes, making the development of advanced bioinformatics tools to identify informative biomarkers from these data increasingly important. Such tools are crucial for extracting meaningful biomarkers, especially for understanding the biological pathways responsible for disease development. The identification of signature genes and the analysis of differentially networked genes are two fundamental and critically important tasks. However, the test statistics employed by many prevailing methods fall short of fulfilling the exact definition of a marker gene, leaving them susceptible to deriving inaccurate features and imprecise signatures. The problem is further compounded when attempting to identify marker genes across biologically diverse samples, especially when comparing more than two biological conditions. Additionally, traditional differential group analysis or co-expression analysis under a single condition often falls short in certain scenarios. For instance, the subtle expression levels of transcription factors (TFs) make their detection daunting, despite their pivotal role in guiding gene expression. Pinpointing the intricate network landscape of complex ailments and isolating core genes for subsequent analysis are challenging tasks, yet these marker genes are instrumental in identifying potential pivotal pathways. Multi-omics data, with its inherent complexity and diversity, presents unique challenges that traditional methods struggle to address effectively. To overcome these challenges, it is vital to develop and adopt innovative methods tailored to this complexity and diversity. In response, we have pioneered the Cosine-based One-sample Test (COT), a method meticulously crafted for the analysis of biologically diverse samples. Tailored to discern marker genes across a spectrum of subtypes using their expression profiles, COT employs a one-sample test framework. The test statistic within COT utilizes cosine similarity, comparing a molecule's expression profile across various subtypes with the precise mathematical representation of ideal marker genes. To ensure ease of application and accessibility, we have encapsulated the COT workflow within a Python package. To assess its effectiveness, we undertook an exhaustive evaluation, juxtaposing the marker gene detection capabilities of COT against its contemporaries using realistic simulation data. Our findings indicated that COT was not only adept at handling gene expression data but was also proficient with proteomics data sourced from enriched tissue or cell subtype samples, further accentuating COT's strong performance.
We demonstrated the heightened effectiveness of COT when applied to gene expression and proteomics data originating from distinct tissue or cell subtypes, leading to innovative findings and hypotheses in several biomedical case studies. Additionally, we have enhanced the Differential Dependency Network (DDN) framework to detect network rewiring between different conditions, where significantly rewired network modes serve as informative biomarkers. Using cross-condition data and a block-wise Lasso network model, DDN detects significant network rewiring together with a subnetwork of hub molecular entities. In DDN 3.0, we took imbalanced sample sizes into consideration, integrated several acceleration strategies to enable it to handle large datasets, and enhanced the network presentation for more informative displays, including a color-coded differential dependency network and a gradient heatmap. We applied it to simulated and real data to detect critical changes in molecular network topology. The current tool stands as a valuable blueprint for the development and validation of mechanistic disease models. This foundation aids in offering a coherent interpretation of data, deepening our understanding of disease biology, and sparking new hypotheses ripe for subsequent validation and exploration. Looking ahead, our vision is to expand the scope of tools like COT and DDN 3.0 to the vast realm of multi-omics data, including datasets from longitudinal studies and clinical trials, where data complexity scales to new heights. We believe that these tools can facilitate a more nuanced and comprehensive understanding of disease development and progression. Furthermore, by integrating these methods with other advanced bioinformatics and machine learning tools, we aim to create a holistic pipeline that allows for seamless extraction of significant biomarkers and actionable insights from multi-omics data. This is a promising step towards precision medicine, where individual genomic information can guide personalized treatment strategies.
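As a rough illustration of the cosine-based marker-gene idea behind COT, the sketch below scores a gene's mean subtype expression profile against ideal one-hot marker patterns; it is a simplification for intuition, not the published method or its Python package.

```python
# Sketch of a cosine-based marker-gene score: compare a gene's subtype profile
# against the ideal "high in exactly one subtype" pattern.
import numpy as np

def cosine_marker_score(profile):
    # profile: mean expression of one gene in each of K subtypes
    profile = np.asarray(profile, dtype=float)
    k = len(profile)
    best = (-1.0, None)
    for subtype in range(k):
        ideal = np.zeros(k)
        ideal[subtype] = 1.0                              # ideal marker pattern for this subtype
        cos = profile @ ideal / (np.linalg.norm(profile) * np.linalg.norm(ideal))
        if cos > best[0]:
            best = (cos, subtype)
    return best                                           # (max cosine, candidate subtype)

print(cosine_marker_score([9.1, 0.4, 0.3]))   # near 1.0 -> strong candidate marker for subtype 0
```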
- Message Authentication Codes On Ultra-Low SWaP Devices. Liao, Che-Hsien (Virginia Tech, 2022-05-27). This thesis focuses on specific crypto algorithms, Message Authentication Codes (MACs), running on ultra-low SWaP devices. The MACs we used are hash-based message authentication codes (HMAC) and the cipher-block-chaining message authentication code (CBC-MAC). The most important consideration for ultra-low SWaP devices is their energy usage. This thesis measures the execution times of different implementations on ultra-low SWaP devices, so that we can understand which implementation is suitable for a specific device. To explain the crypto algorithms we used, this thesis briefly introduces HMAC and CBC-MAC from a high level, including their usage and advantages. The research method is empirical: this thesis determines the execution times of three implementations of these two algorithms (HMAC and CBC-MAC), and the results come from running those implementations on the devices we used.
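A minimal sketch of the timing-measurement approach, using the Python standard library's HMAC for concreteness, is shown below; on actual ultra-low SWaP targets the same idea would wrap the C implementations under test, and a CBC-MAC measurement would additionally require a block-cipher library, so treat this purely as an illustration of the methodology.

```python
# Sketch of timing an HMAC-SHA256 computation over many iterations.
import hmac
import hashlib
import os
import time

key = os.urandom(32)
message = os.urandom(1024)          # 1 KiB test message
iterations = 10_000

start = time.perf_counter()
for _ in range(iterations):
    tag = hmac.new(key, message, hashlib.sha256).digest()
elapsed = time.perf_counter() - start

print(f"HMAC-SHA256, 1 KiB message: {elapsed / iterations * 1e6:.2f} microseconds per call")
```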
- Metrohelper: A Real-time Web-based System for Metro Incident Detection Using Social Media. Chen, Chih Fang (Virginia Tech, 2022-05-26). In recent years the usage of public transit services has rapidly increased, thanks to huge progress in network technologies. However, disruptions in modern public transit services have also increased, due to aging infrastructure, non-comprehensive system design, and the need for maintenance. Any disruption in current transit networks can cause major problems for passengers who use these networks for their daily commutes. Despite this heavy usage, most current disruption detection systems either lack network coverage or do not operate in real time. The goal of this thesis was to create a system that can leverage Twitter data to help detect service disruptions in their early stage. This work involves a web application comprising a front end, back end, and database, along with data mining techniques that obtain Tweets from a live Twitter stream related to the Washington Metropolitan Area Transit Authority (WMATA) metro system. The fundamental features of the system include a real-time incidents panel, historical events review, a search for activities near a specific metro station, and recent news review, allowing people to find relevant information based on their needs. After the initial functionality was settled, we further developed storytelling and sentiment analysis applications, which give people more comprehensive information about incidents that happened around metro stations. Also, with the emergency report feature we developed, the developer receives an immediate notification when an urgent event occurs. After fully testing the system's case studies on storytelling, sentiment analysis, and emergency reporting, the outcomes are convincing and trustworthy.
- Multimodal Representation Learning for Textual Reasoning over Knowledge GraphsChoudhary, Nurendra (Virginia Tech, 2023-05-18)Knowledge graphs (KGs) store relational information in a flexible triplet schema and have become ubiquitous for information storage in domains such as web search, e-commerce, social networks, and biology. Retrieval of information from KGs is generally achieved through logical reasoning, but this process can be computationally expensive and has limited performance due to the large size and complexity of relationships within the KGs. Furthermore, to extend the usage of KGs to non-expert users, retrieval over them cannot solely rely on logical reasoning but also needs to consider text-based search. This creates a need for multi-modal representations that capture both the semantic and structural features from the KGs. The primary objective of the proposed work is to extend the accessibility of KGs to non-expert users/institutions by enabling them to utilize non-technical textual queries to search over the vast amount of information stored in KGs. To achieve this objective, the research aims to solve four limitations: (i) develop a framework for logical reasoning over KGs that can learn representations to capture hierarchical dependencies between entities, (ii) design an architecture that can effectively learn the logic flow of queries from natural language text, (iii) create a multi-modal architecture that can capture inherent semantic and structural features from the entities and KGs, respectively, and (iv) introduce a novel hyperbolic learning framework to enable the scalability of hyperbolic neural networks over large graphs using meta-learning. The proposed work is distinct from current research because it models the logical flow of textual queries in hyperbolic space and uses it to perform complex reasoning over large KGs. The models developed in this work are evaluated on both the standard research setting of logical reasoning, as well as, real-world scenarios of query matching and search, specifically, in the e-commerce domain. In summary, the proposed work aims to extend the accessibility of KGs to non-expert users by enabling them to use non-technical textual queries to search vast amounts of information stored in KGs. To achieve this objective, the work proposes the use of multi-modal representations that capture both semantic and structural features from the KGs, and a novel hyperbolic learning framework to enable scalability of hyperbolic neural networks over large graphs. The work also models the logical flow of textual queries in hyperbolic space to perform complex reasoning over large KGs. The models developed in this work are evaluated on both the standard research setting of logical reasoning and real-world scenarios in the e-commerce domain.
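For readers unfamiliar with hyperbolic embeddings, the sketch below computes the standard Poincaré-ball distance that such models rely on; the two example points are illustrative, not learned entity embeddings from the thesis.

```python
# Poincare-ball distance: d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2)*(1 - ||v||^2))).
# Points near the origin act like general entities; points near the boundary like specific ones.
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    u, v = np.asarray(u, float), np.asarray(v, float)
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2)) + eps
    return np.arccosh(1.0 + num / den)

root = np.array([0.0, 0.0])        # near the origin: abstract / general entity
leaf = np.array([0.70, 0.55])      # near the boundary: specific entity
print(poincare_distance(root, leaf))
```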