Browsing by Author "Reddy, Chandan K."
Now showing 1 - 20 of 55
- Addressing Challenges of Modern News Agencies via Predictive Modeling, Deep Learning, and Transfer Learning
Keneshloo, Yaser (Virginia Tech, 2019-07-22)
Today's news agencies are moving from traditional journalism, where publishing just a few news articles per day was sufficient, to modern content generation mechanisms, which create thousands of news pieces every day. With the growth of these modern news agencies comes the arduous task of properly handling the massive amount of data that is generated for each news article. Therefore, news agencies are constantly seeking solutions to facilitate and automate some of the tasks that have previously been done by humans. In this dissertation, we focus on some of these problems and provide solutions for two broad problems, which help a news agency not only gain a wider view of reader behaviour around an article but also provide automated tools to ease the job of editors in summarizing news articles. These two disjoint problems aim at improving the users' reading experience by helping the content generator monitor and focus on poorly performing content while allowing them to promote the well-performing pieces. We first focus on the task of popularity prediction of news articles via a combination of regression, classification, and clustering models. We next focus on the problem of generating automated text summaries for a long news article using deep learning models. The first problem aims at helping the content developer understand how a news article performs over the long run, while the second provides automated tools for content developers to generate summaries for each news article.
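As an illustrative sketch of the first task's shape (clustering articles by early engagement, then regressing long-run popularity per cluster), the following Python snippet uses synthetic data; the feature choices, cluster count, and regressor are hypothetical, not the dissertation's actual models.

```python
# Sketch: cluster articles by their early engagement trace, then fit a
# separate regressor per cluster to predict long-run popularity.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
early = rng.poisson(50, size=(500, 6)).astype(float)  # hourly views, first 6 hours (synthetic)
final = early.sum(axis=1) * rng.uniform(5, 15, 500)   # synthetic long-run view counts

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(early)
models = {c: GradientBoostingRegressor().fit(early[km.labels_ == c],
                                             final[km.labels_ == c])
          for c in range(3)}

x_new = early[:1]                                     # a "new" article's early trace
pred = models[km.predict(x_new)[0]].predict(x_new)
print(f"predicted long-run views: {pred[0]:.0f}")
```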
- Anomalous Information Detection in Social Media
Tao, Rongrong (Virginia Tech, 2021-03-10)
This dissertation focuses on identifying various types of anomalous information patterns in social media and news outlets. We focus on three types of anomalous information: (1) media censorship in news outlets, which is information that should be published but is actually missing, (2) fake news in social media, which is unreliable information shown to the public, and (3) media propaganda in news outlets, which is trustworthy information that is over-populated. For the first problem, existing approaches to censorship detection mostly rely on monitoring posts in social media. However, media censorship in news outlets has not received nearly as much attention, mostly because it is difficult to systematically detect. The contributions of our work include: (1) a hypothesis testing framework to identify and evaluate censored clusters of keywords, (2) a near-linear-time algorithm to identify the highest scoring clusters as indicators of censorship, and (3) extensive experiments on six Latin American countries for performance evaluation. For the second problem, existing approaches studying fake news in social media primarily focus on topic-level modeling or prediction based on a set of aggregated features from a collection of posts. However, the credibility of various information components within the same topic can be quite different. The contributions of our work in this space include: (1) a new benchmark dataset for fake news research, (2) a cluster-based approach to improve instance-level prediction of information credibility, and (3) extensive experiments for performance evaluations. For the last problem, existing approaches to media propaganda detection primarily focus on investigating the pattern of information shared over social media or evaluation from domain experts. However, these approaches cannot be generalized to a large-scale analysis of media propaganda in news outlets. The contributions of our work include: (1) nonparametric scan statistics to identify clusters of over-populated keywords, (2) a near-linear-time algorithm to identify the highest scoring clusters as indicators of propaganda, and (3) extensive experiments on two Latin American countries for performance evaluation.
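For background, here is a toy version of a nonparametric scan statistic of the Berk-Jones family, which scores a keyword cluster by how many of its empirical p-values fall below a threshold compared to what the null (uniform p-values) would predict. This sketches the general family named above, not necessarily the dissertation's exact statistic.

```python
# Berk-Jones style score for a candidate cluster of keywords.
import math
import numpy as np

def kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    t1 = a * math.log(a / b) if a > 0 else 0.0
    t2 = (1 - a) * math.log((1 - a) / (1 - b)) if a < 1 else 0.0
    return t1 + t2

def bj_score(pvalues, alpha=0.05):
    """High when more p-values fall below alpha than expected under the null."""
    p = np.asarray(pvalues)
    a_hat = float(np.mean(p < alpha))
    return len(p) * kl(a_hat, alpha) if a_hat > alpha else 0.0

# keyword-level empirical p-values for one candidate cluster
print(bj_score([0.01, 0.003, 0.2, 0.04, 0.5]))
```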
- ANTHEM: Attentive Hyperbolic Entity Model for Product Search
Choudhary, Nurendra; Rao, Nikhil; Katariya, Sumeet; Subbian, Karthik; Reddy, Chandan K. (ACM, 2022-02-11)
Product search is a fundamentally challenging problem due to the large size of product catalogues and the complexity of extracting semantic information from products. In addition, the black-box nature of most search systems also hampers a smooth customer experience. Current approaches in this area utilize lexical and semantic product information to match user queries against products. However, these models lack (i) a hierarchical query representation, (ii) a mechanism to detect and capture inter-entity relationships within a query, and (iii) a query composition method specific to the e-commerce domain. To address these challenges, in this paper, we propose an AtteNTive Hyperbolic Entity Model (ANTHEM), a novel attention-based product search framework that models query entities as two-vector hyperboloids, learns inter-entity intersections, and utilizes attention to unionize individual entities and inter-entity intersections to predict product matches from the search space. ANTHEM utilizes the first and second vector of the hyperboloids to determine the query’s semantic position and to tune its surrounding search volume, respectively. The attention networks capture the significance of intra-entity and inter-entity intersections to the final query space. Additionally, we provide a mechanism to comprehend ANTHEM and understand the significance of query entities towards the final resultant products. We evaluate the performance of our model on real data collected from popular e-commerce sites. Our experimental study on the offline data demonstrates compelling evidence of ANTHEM’s superior performance over state-of-the-art product search methods with an improvement of more than 10% on various metrics. We also demonstrate the quality of ANTHEM’s query encoder using a query matching task.
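To make the two-vector intuition concrete, here is a schematic Euclidean "box" analogue: one vector gives an entity's semantic position (center) and the other its surrounding search volume (offset), with intersection shrinking the region. This deliberately ignores ANTHEM's actual hyperboloid geometry and attention networks; it only illustrates the position-plus-volume idea.

```python
# Euclidean stand-in for a two-vector entity: (center, offset) regions.
import numpy as np

def intersect(center_a, offset_a, center_b, offset_b):
    """Intersect two axis-aligned regions given as (center, offset) pairs."""
    lo = np.maximum(center_a - offset_a, center_b - offset_b)
    hi = np.minimum(center_a + offset_a, center_b + offset_b)
    # empty intersections collapse to a zero-volume region
    return (lo + hi) / 2, np.clip((hi - lo) / 2, 0, None)

def contains(center, offset, point):
    return bool(np.all(np.abs(point - center) <= offset))

c, o = intersect(np.array([0.0, 0.0]), np.array([2.0, 2.0]),
                 np.array([1.0, 1.0]), np.array([2.0, 2.0]))
print(c, o, contains(c, o, np.array([0.5, 0.5])))
```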
- Augmenting Dynamic Query Expansion in Microblog Texts
Khandpur, Rupinder P. (Virginia Tech, 2018-08-17)
Dynamic query expansion is a method of automatically identifying terms relevant to a target domain based on an incomplete query input. With the explosive growth of online media, such tools are essential for efficient search result refining to track emerging themes in noisy, unstructured text streams. This capability is crucial for large-scale predictive analytics and decision-making systems, which use open source indicators to find meaningful information rapidly and accurately. The problems of information overload and semantic mismatch are systemic during the Information Retrieval (IR) tasks undertaken by such systems. In this dissertation, we develop approaches to dynamic query expansion algorithms that can help improve the efficacy of such systems using only a small set of seed queries, requiring no training or labeled samples. We primarily investigate four significant problems related to the retrieval and assessment of event-related information, viz. (1) How can we adapt the query expansion process to support rank-based analysis when tracking a fixed set of entities? A scalable framework is essential to allow relative assessment of emerging themes such as airport threats. (2) What visual knowledge discovery framework can incorporate users' feedback into the search result refinement process? This is a crucial step toward efficiently integrating real-time 'situational awareness' when monitoring specific themes using open source indicators. (3) How can we contextualize query expansions? We focus on capturing semantic relatedness between a query and reference text so that it can quickly adapt to different target domains. (4) How can we synchronously perform knowledge discovery and characterization (unstructured to structured) during the retrieval process? We mainly aim to model high-order, relational aspects of event-related information from microblog texts.
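A minimal sketch of the generic dynamic query expansion loop described above: starting from seed terms, repeatedly add the terms that co-occur most with the current query. The scoring and stopping rules in the dissertation are more sophisticated; this only shows the expansion mechanism.

```python
# Generic dynamic query expansion: grow a seed query by co-occurrence.
from collections import Counter

def expand(seed, documents, rounds=2, k=2):
    query = set(seed)
    for _ in range(rounds):
        cooc = Counter()
        for doc in documents:
            words = set(doc.lower().split())
            if query & words:                 # doc matches current query
                cooc.update(words - query)    # count candidate new terms
        query |= {w for w, _ in cooc.most_common(k)}
    return query

docs = ["airport security threat reported", "threat at the airport gate",
        "weather delays at the gate", "security screening delays"]
print(expand(["airport"], docs))
```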
- Automatic Question Answering and Knowledge Discovery from Electronic Health Records
Wang, Ping (Virginia Tech, 2021-08-25)
Electronic Health Records (EHR) contain comprehensive longitudinal patient information, which is usually stored in databases in the form of either multi-relational structured tables or unstructured texts, e.g., clinical notes. EHR data provide a useful resource to assist doctors' decision making; however, they also present many unique challenges that limit the efficient use of the valuable information, such as large data volume, heterogeneous and dynamic information, medical term abbreviations, and the noise caused by misspelled words. This dissertation focuses on the development and evaluation of advanced machine learning algorithms to solve the following research questions: (1) How to seek answers from EHR for clinical activity related questions posed in human language without the assistance of database and natural language processing (NLP) domain experts, (2) How to discover underlying relationships of different events and entities in structured tabular EHRs, and (3) How to predict when a medical event will occur and estimate its probability based on previous medical information of patients. First, to automatically retrieve answers for natural language questions from the structured tables in EHR, we study the question-to-SQL generation task by generating the corresponding SQL query of the input question. We propose a translation-edit model driven by a language generation module and an editing module for the SQL query generation task. This model helps automatically translate clinical activity related questions to SQL queries, so that doctors only need to provide their questions in natural language to get the answers they need. We also create a large-scale dataset for question answering on tabular EHR to simulate a more realistic setting. Our performance evaluation shows that the proposed model is effective in handling the unique challenges of clinical terminologies, such as abbreviations and misspelled words. Second, to automatically identify answers for natural language questions from unstructured clinical notes in EHR, we propose to achieve this goal by querying a knowledge base constructed from fine-grained document-level expert annotations of clinical records for various NLP tasks. We first create a dataset for clinical knowledge base question answering with two sets: a clinical knowledge base and question-answer pairs. An attention-based aspect-level reasoning model is developed and evaluated on the new dataset. Our experimental analysis shows that it is effective in identifying answers and also allows us to analyze the impact of different answer aspects in predicting correct answers. Third, we focus on discovering underlying relationships of different entities (e.g., patient, disease, medication, and treatment) in tabular EHR, which can be formulated as a link prediction problem in the graph domain. We develop a self-supervised learning framework for better representation learning of entities across a large corpus, and also consider local contextual information for the downstream link prediction task. We demonstrate the effectiveness, interpretability, and scalability of the proposed model on the healthcare network built from tabular EHR. It is also successfully applied to solve link prediction problems in a variety of domains, such as e-commerce, social networks, and academic networks.
Finally, to dynamically predict the occurrence of multiple correlated medical events, we formulate the problem as a temporal (multiple time-points) multi-task learning problem using a tensor representation. We propose an algorithm to jointly and dynamically predict several survival problems at each time point and optimize it with the Alternating Direction Method of Multipliers (ADMM). The model allows us to consider both the dependencies between different tasks and the correlations of each task at different time points. We evaluate the proposed model on two real-world applications and demonstrate its effectiveness and interpretability.
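For reference, these are the standard scaled-form ADMM updates for a problem of the form min f(x) + g(z) subject to Ax + Bz = c; the multi-task survival objective above would be optimized as a specific instance of this template, so these are background equations rather than the exact updates used in the dissertation.

```latex
\begin{align*}
x^{k+1} &= \operatorname*{arg\,min}_x \; f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^{k} - c + u^{k}\rVert_2^2 \\
z^{k+1} &= \operatorname*{arg\,min}_z \; g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^{k}\rVert_2^2 \\
u^{k+1} &= u^{k} + Ax^{k+1} + Bz^{k+1} - c
\end{align*}
```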
- Building a Trustworthy Question Answering System for Covid-19 Tracking
Liu, Yiqing (Virginia Tech, 2021-09-02)
During the unprecedented global pandemic of Covid-19, the general public is suffering from inaccurate Covid-19 related information, including outdated information and fake news. The most widely used media (TV, social media, newspapers, and radio) fall short of providing the certainty and rapid updates that people are seeking. To cope with this challenge, several public data resources dedicated to providing Covid-19 information were created. They brought together experts from different fields to provide authoritative and up-to-date pandemic updates. However, the general public still cannot make full use of such resources, since the learning curve is too steep, especially for elderly and young users. To address this problem, in this thesis, we propose a question answering system that can be interacted with using simple natural language sentences. While building this system, we investigate qualified public data resources and, based on the data content they provide, collect a set of frequently asked questions for Covid-19 tracking. We further build a dedicated dataset named CovidQA for evaluating the performance of the question answering system with different models. Based on the new dataset, we assess multiple machine learning-based models built for retrieving relevant information from databases, and then propose two empirical models which utilize pre-defined templates to generate SQL queries. In our experiments, we demonstrate both quantitative and qualitative results and provide a comprehensive comparison between different types of methods. The results show that the proposed template-based methods are simple but effective in building question answering systems for specific domain problems.
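The template-based idea can be pictured as a small pattern-to-SQL lookup. In the sketch below, the question patterns, table name, and columns are invented for illustration; they are not the thesis's actual templates or schema.

```python
# Map recognized question patterns to parameterized SQL queries.
import re

TEMPLATES = [
    (re.compile(r"how many (?:new )?cases in (\w+)", re.I),
     "SELECT SUM(new_cases) FROM covid_stats WHERE region = ?"),
    (re.compile(r"total deaths in (\w+)", re.I),
     "SELECT SUM(deaths) FROM covid_stats WHERE region = ?"),
]

def to_sql(question: str):
    for pattern, sql in TEMPLATES:
        m = pattern.search(question)
        if m:
            return sql, m.groups()   # query plus its bound parameters
    return None, None

print(to_sql("How many new cases in Virginia?"))
```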
- A Deep Learning Based Pipeline for Image Grading of Diabetic Retinopathy
Wang, Yu (Virginia Tech, 2018-06-21)
Diabetic Retinopathy (DR) is one of the principal sources of blindness due to diabetes mellitus. It can be identified by lesions of the retina, namely microaneurysms, hemorrhages, and exudates. DR can be effectively prevented or delayed if discovered early enough and well-managed. Prior studies on diabetic retinopathy typically extracted features manually, which is time-consuming and inaccurate. In this research, we propose a research framework using advanced retina image processing, deep learning, and a boosting algorithm for high-performance DR grading. First, we preprocess the retina image datasets to highlight signs of DR, then apply a convolutional neural network to extract features of retina images, and finally apply a boosting tree algorithm to make a prediction based on the extracted features. Experimental results show that our pipeline has excellent performance when grading diabetic retinopathy images, as evidenced by scores for both the Kaggle dataset and the IDRiD dataset.
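A minimal sketch of the pipeline shape (CNN as feature extractor, boosted trees as the final classifier) on random stand-in data; the real system uses retina-specific preprocessing and a far larger network.

```python
# CNN features feeding a gradient-boosted tree classifier.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import GradientBoostingClassifier

cnn = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),                       # 8 * 4 * 4 = 128-dim feature vector
)

images = torch.randn(64, 3, 64, 64)     # stand-in "retina" images
labels = np.random.randint(0, 5, 64)    # DR grades 0-4 (random here)
with torch.no_grad():
    feats = cnn(images).numpy()

clf = GradientBoostingClassifier().fit(feats, labels)
print(clf.predict(feats[:3]))
```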
- Deep Learning for Code Generation using Snippet Level Parallel Data
Jain, Aneesh (Virginia Tech, 2023-01-05)
In the last few years, interest in the application of deep learning methods for software engineering tasks has surged. A variety of approaches, such as transformer based methods, statistical machine translation models, and models inspired by natural language settings, have been proposed and shown to be effective at tasks like code summarization, code synthesis, and code translation. Multiple benchmark data sets have also been released, but all suffer from one limitation or another. Some data sets support only a select few programming languages, while others support only certain tasks. These limitations restrict researchers' ability to perform thorough analyses of their proposed methods. In this work, we aim to alleviate some of the limitations faced by researchers who work in the paradigm of deep learning applications for software engineering tasks. We introduce a large, parallel, multi-lingual programming language data set that supports tasks like code summarization, code translation, code synthesis, and code search in 7 different languages. We provide benchmark results for the current state of the art models on all these tasks, and we also explore some limitations of current evaluation metrics for code related tasks. We provide a detailed analysis of the compilability of code generated by deep learning models, because that is a better measure of ascertaining the usability of code than scores like BLEU and CodeBLEU. Motivated by our findings about compilability, we also propose a reinforcement learning based method that incorporates code compilability and syntax level feedback as rewards, and we demonstrate its effectiveness in generating code with fewer syntax errors than baselines. In addition, we develop a web portal that hosts the models we have trained for code translation. The portal allows translation between 42 possible language pairs and also allows users to check the compilability of the generated code. The intent of this website is to give researchers and other audiences a chance to interact with and probe our work in a user-friendly way, without requiring them to write their own code to load and run inference on the models.
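The compilability feedback can be illustrated with a simple reward function that attempts to compile a generated program and returns a binary reward. This is a sketch of the mechanism, assuming a JDK (`javac`) is on PATH; the thesis's reward shaping and syntax-level feedback are richer than this.

```python
# Binary compilability reward for RL over generated Java code.
import os
import subprocess
import tempfile

def compile_reward(java_source: str, class_name: str = "Main") -> float:
    """Return +1.0 if the source compiles, -1.0 otherwise."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, f"{class_name}.java")
        with open(path, "w") as f:
            f.write(java_source)
        result = subprocess.run(["javac", path], capture_output=True)
        return 1.0 if result.returncode == 0 else -1.0

print(compile_reward("public class Main { public static void main(String[] a) {} }"))
```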
- Deep Representation Learning on Labeled Graphs
Fan, Shuangfei (Virginia Tech, 2020-01-27)
We introduce recurrent collective classification (RCC), a variant of ICA analogous to recurrent neural network prediction. RCC accommodates any differentiable local classifier and relational feature functions. We provide gradient-based strategies for optimizing over model parameters to more directly minimize the loss function. In our experiments, this direct loss minimization translates to improved accuracy and robustness on real network data. We demonstrate the robustness of RCC in settings where local classification is very noisy, settings that are particularly challenging for ICA. As a new way to train generative models, generative adversarial networks (GANs) have achieved considerable success in image generation, and this framework has also recently been applied to data with graph structures. We identify the drawbacks of existing deep frameworks for generating graphs, and we propose labeled-graph generative adversarial networks (LGGAN) to train deep generative models for graph-structured data with node labels. We test the approach on various types of graph datasets, such as collections of citation networks and protein graphs. Experimental results show that our model can generate diverse labeled graphs that match the structural characteristics of the training data, and that it outperforms all baselines in terms of quality, generality, and scalability. To further evaluate the quality of the generated graphs, we apply them to a downstream graph classification task, and the results show that LGGAN can better capture the important aspects of the graph structure.
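For background on the ICA family that RCC extends, here is a toy iterative classification loop: nodes are repeatedly re-labeled from their own features plus their neighbors' current label estimates. RCC itself replaces this hand-rolled update with differentiable classifiers trained end to end; everything below is invented for illustration.

```python
# Toy iterative collective classification (ICA-style).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
feat = {0: 0.9, 1: 0.8, 2: 0.6, 3: 0.2, 4: 0.1}     # local evidence for class 1

labels = {n: int(feat[n] > 0.5) for n in adj}        # bootstrap from local classifier
for _ in range(5):                                    # iterate toward a fixed point
    for n in adj:
        neigh = sum(labels[m] for m in adj[n]) / len(adj[n])
        labels[n] = int(0.5 * feat[n] + 0.5 * neigh > 0.5)
print(labels)
```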
- Defending Against Trojan Attacks on Neural Network-based Language Models
Azizi, Ahmadreza (Virginia Tech, 2020-05-15)
Backdoor (Trojan) attacks are a major threat to the security of deep neural network (DNN) models. They are created by an attacker who adds a certain pattern to a portion of a given training dataset, causing the DNN model to misclassify any inputs that contain the pattern. These infected classifiers are called Trojan models, and the added pattern is referred to as the trigger. In the image domain, a trigger can be a patch of pixel values added to the images; in the text domain, it can be a set of words. In this thesis, we propose Trojan-Miner (T-Miner), a defense scheme against such backdoor attacks on text classification deep learning models. The goal of T-Miner is to detect whether a given classifier is a Trojan model or not. T-Miner is based on a sequence-to-sequence text generation model; it uses feedback from the suspicious (test) classifier to perturb input sentences such that their resulting class label is changed. These perturbations can be different for each of the inputs. T-Miner then extracts the perturbations to determine whether they include any backdoor trigger and correspondingly flags the suspicious classifier as a Trojan model. We evaluate T-Miner on three text classification datasets: Yelp Restaurant Reviews, Twitter Hate Speech, and Rotten Tomatoes Movie Reviews. To illustrate the effectiveness of T-Miner, we evaluate it on attack models over text classifiers: we build a set of clean classifiers with no trigger in their training datasets and, using several trigger phrases, create a set of Trojan models, then compute how many of these models are correctly marked by T-Miner. We show that our system is able to detect Trojan and clean models with 97% overall accuracy over 400 classifiers. Finally, we discuss the robustness of T-Miner in the case where the attacker knows the T-Miner framework and wants to use this knowledge to weaken its performance. To this end, we propose four different scenarios for the attacker and report the performance of T-Miner under these new attack methods.
- Design and Maintenance of Event Forecasting Systems
Muthiah, Sathappan (Virginia Tech, 2021-03-26)
With significant growth in modern forms of communication such as social media and microblogs, we are able to gain a real-time understanding of events happening in many parts of the world. In addition, these modern forms of communication have helped shed light on the increasing instabilities across the world via the design of anticipatory intelligence systems [45, 43, 20] that can forecast population-level events like civil unrest and disease occurrences with reasonable accuracy. Event forecasting systems are generally prone to becoming outdated (model drift) as they fail to keep up with constantly changing patterns, and thus require regular re-training in order to sustain their accuracy and reliability. In this dissertation, we address some of the issues associated with the design and maintenance of event forecasting systems in general. We propose and showcase performance results for a drift adaptation technique in event forecasting systems, and we also build a hybrid system for event coding that is cognizant of uncertain prediction contexts and seeks human intervention in them, maintaining a good balance between prediction fidelity and the cost of human effort. Specifically, we identify several micro-tasks for event coding and build separate pipelines for each, with uncertainty estimation capabilities, thereby being able to seek human feedback whenever required for each micro-task independently of the rest.
- Detecting and Mitigating Rumors in Social Media
Islam, Mohammad Raihanul (Virginia Tech, 2020-06-19)
The penetration of social media today enables the rapid spread of breaking news and other developments to millions of people across the globe within hours. However, such pervasive use of social media by the general masses to receive and consume news is not without its attendant negative consequences, as it also opens opportunities for nefarious elements to spread rumors or misinformation. A rumor generally refers to an interesting piece of information that is widely disseminated through a social network and whose credibility cannot be easily substantiated. A rumor can later turn out to be true or false, or remain unverified. The spread of misinformation and fake news can have deleterious effects on users and society. The objective of the proposed research is to develop a range of machine learning methods that can effectively detect and characterize rumor veracity in social media. Since users are the primary protagonists on social media, analyzing the characteristics of information spread w.r.t. users can be effective for our purpose. For our first problem, we propose a method of computing user embeddings from underlying social networks. For our second problem, we propose a long short-term memory (LSTM) based model that can classify whether a story discussed in a thread can be categorized as a false, true, or unverified rumor. We demonstrate the utility of the user features computed in the first problem to address the second. For our third problem, we propose a method that uses user profile information to detect rumor veracity. This method has the advantage of not requiring the underlying social network, which can be tedious to compute. For the last problem, we investigate a rumor mitigation technique that recommends fact-checking URLs to rumor debunkers, i.e., social network users who are very passionate about disseminating true news. Here, we incorporate the influence of other users on rumor debunkers, in addition to their previous URL sharing history, to recommend relevant fact-checking URLs.
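A minimal PyTorch sketch of the second component's shape: an LSTM over a thread's post representations, classifying the story as false, true, or unverified. The inputs here are random stand-ins, not the dissertation's learned user embeddings or features.

```python
# LSTM over a thread of post vectors, three-way veracity output.
import torch
import torch.nn as nn

class RumorLSTM(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, thread):              # thread: (batch, posts, feat_dim)
        _, (h, _) = self.lstm(thread)       # final hidden state summarizes the thread
        return self.out(h[-1])

model = RumorLSTM()
logits = model(torch.randn(4, 10, 32))      # 4 threads of 10 posts each
print(logits.argmax(dim=1))                 # 0=false, 1=true, 2=unverified
```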
- Detecting Irregular Network Activity with Adversarial Learning and Expert Feedback
Rathinavel, Gopikrishna (Virginia Tech, 2022-06-15)
Anomaly detection is a ubiquitous and challenging task relevant across many disciplines. With the vital role communication networks play in our daily lives, the security of these networks is imperative for the smooth functioning of society. This thesis proposes CAAD, a novel self-supervised deep learning framework for anomaly detection in wireless communication systems. Specifically, CAAD employs powerful adversarial learning and contrastive learning techniques to learn effective representations of normal and anomalous behavior in wireless networks. Rigorous performance comparisons of CAAD with several state-of-the-art anomaly detection techniques have been conducted, verifying that CAAD yields a mean performance improvement of 92.84%. Additionally, CAAD is augmented with the ability to systematically incorporate expert feedback through a novel contrastive learning feedback loop to improve the learned representations and thereby reduce prediction uncertainty (CAAD-EF). CAAD-EF is a novel, holistic, and widely applicable solution to anomaly detection.
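As background on the contrastive ingredient, below is a minimal NT-Xent (InfoNCE) loss of the kind contrastive representation learners typically use: two views of the same sample are pulled together while all other pairs are pushed apart. This is a generic sketch, not CAAD's actual objective.

```python
# NT-Xent contrastive loss over two batches of paired views.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau                      # cosine similarities / temperature
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))          # exclude self-similarity
    # row i's positive is the other view of the same sample
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

print(nt_xent(torch.randn(8, 16), torch.randn(8, 16)))
```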
- Differentially private synthetic medical data generation using convolutional GANs
Torfi, Amirsina; Fox, Edward A.; Reddy, Chandan K. (Elsevier, 2022)
Deep learning models have demonstrated superior performance in several real-world application problems such as image classification and speech processing. However, creating these models in sensitive domains like healthcare typically requires addressing certain privacy challenges that bring unique concerns. One effective way to handle such private data concerns is to generate realistic synthetic data that can provide practically acceptable data quality as well as be used to improve model performance. To tackle this challenge, we develop a differentially private framework for synthetic data generation using Rényi differential privacy. Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve critical characteristics of the generated synthetic data. In addition, our model can capture the temporal information and feature correlations present in the original data. We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget using several publicly available benchmark medical datasets in both supervised and unsupervised settings. The source code of this work is available at https://github.com/astorfi/differentially-private-cgan.
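The core mechanism behind differentially private training of such generative models can be sketched as gradient sanitization: clip each per-example gradient, then add Gaussian noise before averaging into an update. This generic DP-SGD-style sketch is not the paper's exact procedure, which additionally relies on Rényi DP accounting.

```python
# Clip per-example gradients and add Gaussian noise (Gaussian mechanism).
import torch

def sanitize(per_example_grads, clip=1.0, sigma=1.0):
    clipped = []
    for g in per_example_grads:                    # one gradient vector per example
        norm = g.norm()
        clipped.append(g * min(1.0, clip / (norm + 1e-12)))
    mean = torch.stack(clipped).mean(dim=0)
    # noise on the averaged gradient, scaled by clip norm and batch size
    noise = torch.normal(0.0, sigma * clip / len(clipped), size=mean.shape)
    return mean + noise

grads = [torch.randn(10) for _ in range(8)]        # stand-in per-example grads
print(sanitize(grads))
```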
- Domain-based Frameworks and Embeddings for Dynamics over Networks
Adhikari, Bijaya (Virginia Tech, 2020-06-01)
Broadly, this thesis looks into network and time-series mining problems pertaining to dynamics over networks in various domains. Which locations and staff should we monitor in order to detect C. Difficile outbreaks in hospitals? How do we predict the peak intensity of influenza incidence in an interpretable fashion? How do we infer the states of all nodes in a critical infrastructure network where failures have occurred? Leveraging domain-based information should make it possible to answer these questions. However, several new challenges arise, such as (a) more complex dynamics. The dynamics over networks that we consider are complex. For example, C. Difficile spreads via both people-to-people and surface-to-people interactions, and correlations between failures in critical infrastructures go beyond the network structure and depend on the geography as well. Traditional approaches either rely on models like Susceptible-Infectious (SI) and Independent Cascade (IC), which are too restrictive because they focus only on single pathways, or do not incorporate a model at all, resulting in sub-optimality. (b) data sparsity. Data sparsity persists in this space: it is difficult to collect the exact state of each node in the network, as it is high-dimensional and difficult to directly sample from. (c) mismatch between data and process. In many situations, the underlying dynamical process is unknown or depends on a mixture of several models. In such cases, there is a mismatch between the data collected and the model representing the dynamics. For example, the weighted influenza-like illness (wILI) count released by the CDC, which is meant to represent the raw fraction of the total population infected by influenza, actually depends on multiple factors like the number of health-care providers reporting and the public tendency to seek medical advice. In such cases, methods which generalize well to unobserved (or unknown) models are required. Current approaches often fail to tackle these challenges, as they either rely on restrictive models, require a large volume of data, and/or work only for predefined models. In this thesis, we propose to leverage domain-based frameworks, which include novel models and analysis techniques, and domain-based low-dimensional representation learning to tackle the challenges mentioned above for network and time-series mining tasks. By developing novel frameworks, we can capture the complex dynamics accurately and analyze them more efficiently. For example, to detect C. Difficile outbreaks in a hospital setting, we use a two-mode disease model to capture multiple outbreak pathways and a discrete lattice-based optimization framework. Similarly, we propose an information theoretic framework, which includes geographically correlated failures in critical infrastructure networks, to infer the status of network components. Moreover, as we use more realistic frameworks to accurately capture and analyze the mechanistic processes themselves, our approaches are effective even with sparse data. At the same time, learning low-dimensional domain-aware embeddings captures domain-specific properties (like incidence-based similarity between historical influenza seasons) more efficiently from sparse data, which is useful for subsequent tasks.
Since the domain-aware embeddings capture the model information directly from the data without any modeling assumptions, they also generalize better to new models. Our domain-aware frameworks and embeddings enable many applications in critical domains. For example, our domain-aware framework for C. Difficile allows different monitoring rates for people and locations, thus detecting more than 95% of outbreaks. Our framework for product recommendation in e-commerce for queries with sparse engagement data resulted in a 34% improvement over the current Walmart.com search engine. Likewise, our novel framework leads to near-optimal algorithms, with additive approximation guarantees, for inferring network states given a partial observation of the failures in a network. Additionally, by exploiting domain-aware embeddings, we outperform non-trivial competitors by up to 40% in influenza forecasting, and domain-aware representations of subgraphs helped us outperform non-trivial baselines by up to 68% in the graph classification task. We believe our techniques will be useful for a variety of other applications in many areas like social networks, urban computing, and so on.
- A dual boundary classifier for predicting acute hypotensive episodes in critical care
Bhattacharya, Sakyajit; Huddar, Vijay; Rajan, Vaibhav; Reddy, Chandan K. (PLOS, 2018-02-23)
An Acute Hypotensive Episode (AHE) is the sudden onset of a sustained period of low blood pressure and is among the most critical conditions in Intensive Care Units (ICUs). Without timely medical care, it can lead to irreversible organ damage and death. By identifying patients at risk for AHE early, adequate medical intervention can save lives and improve patient outcomes. In this paper, we design a novel dual-boundary classification based approach for identifying patients at risk for AHE. Our algorithm uses only simple summary statistics of past blood pressure measurements and can be used in an online environment, facilitating real-time updates and prediction. We perform extensive experiments with more than 4,500 patient records and demonstrate that our method outperforms the previous best approaches to AHE prediction. Our method can identify AHE patients two hours in advance of onset, giving sufficient time for appropriate clinical intervention, with nearly 80% sensitivity at 95% specificity, and thus very few false positives.
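To illustrate the flavor of "simple summary statistics" features, the sketch below computes a few statistics over a window of mean arterial pressure readings and applies two hypothetical boundaries. The paper's actual dual-boundary classifier and its thresholds are learned from data; everything here is invented for illustration.

```python
# Summary statistics of a blood-pressure window plus two toy boundaries.
import numpy as np

def summarize(map_window):                 # mean arterial pressure, mmHg
    w = np.asarray(map_window, dtype=float)
    return {"mean": w.mean(), "min": w.min(), "std": w.std(),
            "slope": np.polyfit(np.arange(len(w)), w, 1)[0]}

def at_risk(stats, lower=60.0, upper=75.0):
    # hypothetical boundaries: clearly hypotensive below `lower`,
    # watch-zone if trending downward inside [lower, upper]
    return stats["min"] < lower or (stats["mean"] < upper and stats["slope"] < 0)

print(at_risk(summarize([82, 79, 76, 73, 70, 68])))
```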
- Evaluating, Understanding, and Mitigating Unfairness in Recommender Systems
Yao, Sirui (Virginia Tech, 2021-06-10)
Recommender systems are information filtering tools that discover potential matches between users and items and benefit both parties. This benefit can be considered a social resource that should be equitably allocated across users and items, especially in critical domains such as education and employment. Biases and unfairness in recommendations raise both ethical and legal concerns. In this dissertation, we investigate the concept of unfairness in the context of recommender systems. In particular, we study appropriate unfairness evaluation metrics, examine the relation between bias in recommender models and inequality in the underlying population, and propose effective unfairness mitigation approaches. We start by exploring the implications of fairness in recommendation and formulating unfairness evaluation metrics. We focus on the task of rating prediction. We identify the insufficiency of demographic parity for scenarios where the target variable is justifiably dependent on demographic features. We then propose an alternative set of unfairness metrics, measured based on how much the average predicted ratings deviate from the average true ratings. We also reduce this unfairness in matrix factorization (MF) models by explicitly adding the metrics as penalty terms to the learning objectives. Next, we target a form of unfairness in matrix factorization models observed as disparate model performance across user groups. We identify four types of biases in the training data that contribute to higher subpopulation error. We then propose personalized regularization learning (PRL), which learns personalized regularization parameters that directly address the data biases. PRL poses the hyperparameter search problem as a secondary learning task. It enables back-propagation to learn the personalized regularization parameters by leveraging the closed-form solutions of alternating least squares (ALS) for solving MF. Furthermore, the learned parameters are interpretable and provide insights into how fairness is improved. Third, we conduct a theoretical analysis of the long-term dynamics of inequality in the underlying population, in terms of the fit between users and items. We view the task of recommendation as solving a set of classification problems through threshold policies. We mathematically formulate the transition dynamics of user-item fit in one step of recommendation. We then prove that a system with the formulated dynamics always has at least one equilibrium, and we provide sufficient conditions for the equilibrium to be unique. We also show that, depending on the item category relationships and the recommendation policies, recommendations in one item category can reshape the user-item fit in another item category. To summarize, in this research, we examine different fairness criteria in rating prediction and recommendation, study the dynamics of interactions between recommender systems and users, and propose mitigation methods to promote fairness and equality.
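A sketch of the deviation-based idea described above: per group, compare the average predicted rating to the average true rating, then measure the gap between the groups' deviations. This is one reading of the metric family, not necessarily the dissertation's exact definitions (which may, for example, average per item).

```python
# Gap between two groups' prediction deviations as an unfairness signal.
import numpy as np

def value_unfairness(y_true, y_pred, groups):
    devs = []
    for g in np.unique(groups):
        m = groups == g
        devs.append(y_pred[m].mean() - y_true[m].mean())  # group-level deviation
    return abs(devs[0] - devs[1])

y_true = np.array([4.0, 3.5, 2.0, 5.0, 3.0, 4.5])
y_pred = np.array([3.6, 3.2, 2.5, 4.1, 3.4, 4.0])
groups = np.array([0, 0, 0, 1, 1, 1])
print(value_unfairness(y_true, y_pred, groups))
```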
- Event-related Collections Understanding and Services
Li, Liuqing (Virginia Tech, 2020-03-18)
Event-related collections, including both tweets and webpages, contain valuable information and are worth exploring in interdisciplinary research and education. Unfortunately, such data is noisy, so this variety of information has not been adequately exploited. Further, for better understanding, more of the knowledge hidden behind events needs to be unearthed. Different communities may have different requirements for these collections in particular scenarios. Some may need relatively clean datasets for data exploration and data mining. Social researchers require preprocessing of information so they can conduct analyses. The general public is interested in overall descriptions of events. However, few systems, tools, or methods exist to support the flexible use of event-related collections. In this research, we propose a new, integrated system to process and analyze event-related collections at different levels (i.e., data, information, and knowledge). It also provides various services and covers the most important stages in a system pipeline, including collection development, curation, analysis, integration, and visualization. Firstly, we propose a query likelihood model with pre-query design and post-query expansion to rank a webpage corpus by query generation probability, and retrieve relevant webpages from event-related tweet collections. We further preserve webpage data in WARC files and enrich original tweets with webpages in JSON format. As an application of data management, we conduct an empirical study of the embedded URLs in tweets based on collection development and data curation techniques. Secondly, we develop TwiRole, an integrated model for 3-way user classification on Twitter, which detects brand-related, female-related, and male-related tweeters through multiple features with both machine learning (i.e., a random forest classifier) and deep learning (i.e., an 18-layer ResNet) techniques. As guidance for user-centered social research at the information level, we combine TwiRole with a pre-trained recurrent neural network-based emotion detection model and carry out tweeting pattern analyses on disaster-related collections. Finally, we propose a tweet-guided multi-document summarization (TMDS) model, which generates summaries of event-related collections by using tweets associated with those events. The TMDS model also considers three aspects of named entities (i.e., importance, relatedness, and diversity), as well as topics, to score sentences in webpages, and then ranks the selected relevant sentences in proper order for summarization. The entire system is realized using many technologies, such as collection development, natural language processing, machine learning, and deep learning. For each part, comprehensive evaluations were carried out that confirm the effectiveness and accuracy of our proposed approaches. Regarding broader impact, the outcomes proposed in our study can be easily adopted or extended for further event analyses and service development.
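The query likelihood family referenced above scores a document by the probability that its language model generates the query; a common Dirichlet-smoothed variant is sketched below. The pre-query design and post-query expansion steps are not shown, and the smoothing parameter mu is arbitrary.

```python
# Dirichlet-smoothed query likelihood ranking over a tiny corpus.
import math
from collections import Counter

def ql_score(query, doc_tokens, collection_tf, collection_len, mu=2000):
    tf = Counter(doc_tokens)
    dlen = len(doc_tokens)
    score = 0.0
    for term in query:
        p_c = collection_tf.get(term, 0.5) / collection_len  # background prob
        score += math.log((tf[term] + mu * p_c) / (dlen + mu))
    return score

docs = [["flood", "rescue", "virginia"], ["election", "results", "county"]]
coll = Counter(t for d in docs for t in d)
clen = sum(coll.values())
print(sorted(range(2), key=lambda i: -ql_score(["flood"], docs[i], coll, clen)))
```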
- A Framework for Automated Discovery and Analysis of Suspicious Trade Records
Datta, Debanjan (Virginia Tech, 2022-05-27)
Illegal logging and timber trade present a persistent threat to global biodiversity and national security due to their ties with illicit financial flows, and cause revenue loss. The scale of global commerce in timber and associated products, combined with the complexity and geographical spread of the supply chain entities, presents a non-trivial challenge in detecting such transactions. International shipment records, specifically those containing bills of lading, are a key source of data that can be used to detect, investigate, and act upon such transactions. The comprehensive problem can be described as building a framework that can perform automated discovery and facilitate action on detected transactions. A data driven, machine learning based approach is necessitated by the volume, velocity, and complexity of international shipping data. Such an automated framework can immensely benefit our targeted end-users, specifically the enforcement agencies. This overall problem comprises multiple connected sub-problems with associated research questions. We incorporate crucial domain knowledge, in terms of both data and modeling, by drawing on the expertise of collaborating domain specialists from ecological conservationist agencies. The collaborators provide formal and informal inputs spanning the stages from requirement specification to design. Following the paradigm of similar problems such as fraud detection explored in prior literature, we formulate the core problem of discovering suspicious transactions as an anomaly detection task. The first sub-problem is to build a system that can be used to find suspicious transactions in shipment data pertaining to the imports and exports of multiple countries, each with a different country-specific schema. We present a novel anomaly detection approach for multivariate categorical data that follows the constraints of the data characteristics, combined with a data pipeline that incorporates domain knowledge. The focus of the second sub-problem is U.S.-specific imports, where the data characteristics differ from the prior sub-problem, with heterogeneous attributes present. This problem is important since the U.S. is a top consumer and there is scope for actionable enforcement. For this, we present a contrastive learning based anomaly detection model for heterogeneous tabular data, with performance and scalability characteristics applicable to real-world trade data. While the first two problems address the task of detecting suspicious trades through anomaly detection, a practical challenge with anomaly detection based systems is that of relevancy, or scenario-specific precision. The third sub-problem addresses this through a human-in-the-loop approach augmented by visual analytics, re-ranking anomalies in terms of relevance, providing explanations for the cause of anomalies, and soliciting feedback. The last sub-problem pertains to explainability and actionability for suspicious records, through algorithmic recourse. Algorithmic recourse aims to provide meaningful alternatives to flagged anomalous records, such that those counterfactual examples are not judged anomalous by the underlying anomaly detection system. This can help enforcement agencies advise verified trading entities on modifying their trading patterns to avoid false detection, thus streamlining the process.
We present a novel formulation and metrics for this unexplored problem of algorithmic recourse in anomaly detection, along with a deep learning based approach to explaining anomalies and generating counterfactuals. Thus, the overall research contributions presented in this dissertation address the requirements of the framework, and have general applicability in similar scenarios beyond its scope.
- A Framework for Hadoop Based Digital Libraries of Tweets
Bock, Matthew (Virginia Tech, 2017-07-17)
The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. Researchers across varying disciplines have an interest in leveraging DLRL's collections of tweets for their own analyses. However, due to the steep learning curve involved with the required tools (Spark, Scala, HBase, etc.), simply converting the Twitter data into a workable format can be a cumbersome task in itself. This prompted the effort to build a framework that helps in developing code to analyze the Twitter data, runs on arbitrary tweet collections, and enables developers to leverage projects designed with this general use in mind. The intent of this thesis work is to create an extensible framework of tools and data structures to represent Twitter data at a higher level and eliminate the need to work with raw text, so as to make the development of new analytics tools faster, easier, and more efficient. To represent this data, several data structures were designed to operate on top of the Hadoop and Spark libraries of tools. The first set of data structures is an abstract representation of a tweet at a basic level, along with several concrete implementations representing varying levels of detail to correspond with common sources of tweet data. The second major data structure is a collection structure designed to represent collections of tweet data structures and provide ways to filter, clean, and process the collections. All of these data structures went through an iterative design process based on the needs of the developers. The effectiveness of this effort was demonstrated in four distinct case studies. In the first case study, the framework was used to build a new tool that selects Twitter data from DLRL's archive of tweets, cleans those tweets, and performs sentiment analysis within the topics of a collection's topic model. The second case study applies the provided tools for the purpose of sociolinguistic studies. The third case study explores large datasets to accumulate all possible analyses of the datasets. The fourth case study builds metadata by expanding the shortened URLs contained in the tweets and storing them as metadata about the collections. The framework proved to be useful and cut development time for all four of the case studies.
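A hedged Python sketch of the framework's shape: an abstract tweet type plus a collection wrapper exposing filter and clean operations. The names are hypothetical; DLRL's actual data structures are implemented in Scala on Spark over HBase.

```python
# Tweet abstraction and a collection wrapper with filter/clean operations.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Tweet:
    id: str
    text: str
    user: str
    hashtags: List[str]

class TweetCollection:
    def __init__(self, tweets: List[Tweet]):
        self.tweets = tweets

    def filter(self, pred: Callable[[Tweet], bool]) -> "TweetCollection":
        return TweetCollection([t for t in self.tweets if pred(t)])

    def clean(self) -> "TweetCollection":
        return self.filter(lambda t: bool(t.text.strip()))  # drop empty texts

coll = TweetCollection([Tweet("1", "Flood in #NRV", "a", ["NRV"]),
                        Tweet("2", "   ", "b", [])])
print(len(coll.clean().filter(lambda t: "Flood" in t.text).tweets))
```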