Scholarly Works, Computer Science

Research articles, presentations, and other scholarship


Recent Submissions

  • Towards Semantically-Rich Spatial Network Representation Learning via Automated Feature Topic Pairing
    Wang, Dongjie; Liu, Kunpeng; Mohaisen, David; Wang, Pengyang; Lu, Chang-Tien; Fu, Yanjie (Frontiers, 2021-10-20)
    Automated characterization of spatial data is a kind of critical geographical intelligence. As an emerging technique for characterization, Spatial Representation Learning (SRL) uses deep neural networks (DNNs) to learn non-linear embedded features of spatial data for characterization. However, SRL extracts features by internal layers of DNNs, and thus suffers from lacking semantic labels. Texts of spatial entities, on the other hand, provide semantic understanding of latent feature labels, but are insensible to deep SRL models. How can we teach an SRL model to discover appropriate topic labels in texts and pair learned features with the labels? This paper formulates a new problem: feature-topic pairing, and proposes a novel Particle Swarm Optimization (PSO) based deep learning framework. Specifically, we formulate the feature-topic pairing problem into an automated alignment task between 1) a latent embedding feature space and 2) a textual semantic topic space. We decompose the alignment of the two spaces into: 1) point-wise alignment, denoting the correlation between a topic distribution and an embedding vector; 2) pair-wise alignment, denoting the consistency between a feature-feature similarity matrix and a topic-topic similarity matrix. We design a PSO based solver to simultaneously select an optimal set of topics and learn corresponding features based on the selected topics. We develop a closed loop algorithm to iterate between 1) minimizing losses of representation reconstruction and feature-topic alignment and 2) searching the best topics. Finally, we present extensive experiments to demonstrate the enhanced performance of our method.
  • Fast and adaptive dynamics-on-graphs to dynamics-of-graphs translation
    Zhang, Lei; Chen, Zhiqian; Lu, Chang-Tien; Zhao, Liang (Frontiers, 2023-11-17)
    Numerous networks in the real world change with time, producing dynamic graphs such as human mobility networks and brain networks. Typically, the “dynamics on graphs” (e.g., changing node attribute values) are visible, and they may be connected to and suggestive of the “dynamics of graphs” (e.g., evolution of the graph topology). Due to two fundamental obstacles, modeling and mapping between them have not been thoroughly explored: (1) the difficulty of developing a highly adaptable model without solid hypotheses and (2) the ineffectiveness and slowness of processing data with varying granularity. To solve these issues, we offer a novel scalable deep echo-state graph dynamics encoder for networks with significant temporal duration and dimensions. A novel neural architecture search (NAS) technique is then proposed and tailored for the deep echo-state encoder to ensure strong learnability. Extensive experiments on synthetic and actual application data illustrate the proposed method's exceptional effectiveness and efficiency.
  • Shark detection and classification with machine learning
    Jenrette, Jeremy; Liu, Zac; Chimote, Pranav; Hastie, Trevor; Fox, Edward; Ferretti, Francesco (Elsevier, 2022-07-01)
  • Computation of Direct Sensitivities of Spatial Multibody Systems with Joint Friction
    Verulkar, Adwait; Sandu, Corina; Dopico, Daniel; Sandu, Adrian (ASME, 2022-07)
    Friction exists in most mechanical systems and may have a major influence on the dynamic performance of the system. The incorporation of friction in dynamic systems has been a subject of active research for several years owing to its high nonlinearity and its dependence on several parameters. Consequently, optimization of dynamic systems with friction becomes a challenging task. Gradient-based optimization of dynamical systems is a prominent technique for optimal design and requires the computation of model sensitivities with respect to the design parameters. The novel contribution of this paper is the derivation of the analytical methodology for the computation of direct sensitivities for smooth multibody systems with joint friction using the Lagrangian index-1 formulation. System dynamics have been computed using two different friction models; the Brown and McPhee, and the Gonthier et al. model. The methodology proposed to obtain model sensitivities has also been validated using the complex finite difference method. A case study has been conducted on a spatial multibody system to observe the effect of friction on the dynamics and model sensitivities, compare sensitivities with respect to different parameters and demonstrate the numerical and validation aspects. Since design parameters can have very different magnitudes and units, the sensitivities have been scaled with the parameters for comparison. Finally, a discussion has been presented on the interpretation of the case study results. Due to the incorporation of joint friction, ‘jumps’ or discontinuities are observed in the model sensitivities akin to those observed for hybrid dynamical systems.
  • ARGem: a new metagenomics pipeline for antibiotic resistance genes: metadata, analysis, and visualization
    Liang, Xiao; Zhang, Jingyi; Kim, Yoonjin; Ho, Josh; Liu, Kevin; Keenum, Ishi M.; Gupta, Suraj; Davis, Benjamin; Hepp, Shannon L.; Zhang, Liqing; Xia, Kang; Knowlton, Katharine F.; Liao, Jingqiu; Vikesland, Peter J.; Pruden, Amy; Heath, Lenwood S. (Frontiers, 2023-09-15)
    Antibiotic resistance is of crucial interest to both human and animal medicine. It has been recognized that increased environmental monitoring of antibiotic resistance is needed. Metagenomic DNA sequencing is becoming an attractive method to profile antibiotic resistance genes (ARGs), including a special focus on pathogens. A number of computational pipelines are available and under development to support environmental ARG monitoring; the pipeline we present here is promising for general adoption for the purpose of harmonized global monitoring. Specifically, ARGem is a user-friendly pipeline that provides full-service analysis, from the initial DNA short reads to the final visualization of results. The capture of extensive metadata is also facilitated to support comparability across projects and broader monitoring goals. The ARGem pipeline offers efficient analysis of a modest number of samples along with affordable computational components, though the throughput could be increased through cloud resources, based on the user’s configuration. The pipeline components were carefully assessed and selected to satisfy tradeoffs, balancing efficiency and flexibility. It was essential to provide a step to perform short read assembly in a reasonable time frame to ensure accurate annotation of identified ARGs. Comprehensive ARG and mobile genetic element databases are included in ARGem for annotation support. ARGem further includes an expandable set of analysis tools that include statistical and network analysis and supports various useful visualization techniques, including Cytoscape visualization of co-occurrence and correlation networks. The performance and flexibility of the ARGem pipeline is demonstrated with analysis of aquatic metagenomes. The pipeline is freely available at
  • How are Multilingual Systems Constructed: Characterizing Language Use and Selection in Open-Source Multilingual Software
    Li, Wen; Marino, Austin; Yang, Haoran; Meng, Na; Li, Li; Cai, Haipeng (ACM, 2023-12)
    For many years now, modern software is known to be developed in multiple languages (hence termed as multilingual or multi-language software). Yet to this date we still only have very limited knowledge about how multilingual software systems are constructed. For instance, it is not yet really clear how different languages are used, selected together, and why they have been so in multilingual software development. Given the fact that using multiple languages in a single software project has become a norm, understanding language use and selection (i.e., language profile) as a basic element of the multilingual construction in contemporary software engineering is an essential first step. In this paper, we set out to fill this gap with a large-scale characterization study on language use and selection in open-source multilingual software. We start with presenting an updated overview of language use in 7,113 GitHub projects spanning five past years by characterizing overall statistics of language profiles, followed by a deeper look into the functionality relevance/justification of language selection in these projects through association rule mining. We proceed with an evolutionary characterization of 1,000 GitHub projects for each of 10 past years to provide a longitudinal view of how language use and selection have changed over the years, as well as how the association between functionality and language selection has been evolving. Among many other findings, our study revealed a growing trend of using 3 to 5 languages in one multilingual software project and noticeable stableness of top language selections. We found a non-trivial association between language selection and certain functionality domains, which was less stable than that with individual languages over time. In a historical context, we also have observed major shifts in these characteristics of multilingual systems both in contrast to earlier peer studies and along the evolutionary timeline. Our findings offer essential knowledge on the multilingual construction in modern software development. Based on our results, we also provide insights and actionable suggestions for both researchers and developers of multilingual systems.
  • Poster: Cybersecurity Usage in the Wild: A look at Deployment Challenges in Intrusion Detection and Alert Handling
    Sweat, Wyatt; Yao, Danfeng (Daphne) (ACM, 2023-11-15)
    We examine the challenges cybersecurity practitioners face during their daily activities, employing a survey and semi-directed interview for data gathering. Practitioners report on the frequency and level of threats as well as other factors like burnout. These factors are observed to vary with organization size and field (e.g. Medical, E-commerce).
  • Photo Steward: A Deliberative Collective Intelligence Workflow for Validating Historical Archives
    Mohanty, Vikram; Luther, Kurt (ACM, 2023-11-06)
    Historical photographs of people generate significant cultural and economic value, but correctly identifying the subjects of photos can be a difficult task, requiring careful attention to detail while synthesizing large amounts of data from diverse sources. When photos are misidentified, the negative consequences can include financial losses and inaccuracies in the historical record, and even the spread of mis- and disinformation. To address this challenge, we introduce Photo Steward, an information stewardship architecture that leverages a deliberative workflow for validating historical photo IDs. We explored Photo Steward in the context of Civil War Photo Sleuth (CWPS), a popular online community dedicated to identifying photos from the American Civil War era (1861–65) using facial recognition and crowdsourcing. While the platform has been successful in identifying hundreds of unknown photographs, there have been concerns about unverified identifications and misidentifications. Our exploratory evaluation of Photo Steward on CWPS showed that its validation workflow encouraged users to deliberate while making photo ID decisions. Further, its stewardship visualizations helped users to assess photo ID information accurately, while fostering diverse forms of stigmergic collaboration.
  • ML-Assisted Optimization of Securities Lending
    Prasad, Abhinav; Arunachalam, Prakash; Motamedi, Ali; Bhattacharya, Ranjeeta; Liu, Beibei; McCormick, Hays; Xu, Shengzhe; Muralidhar, Nikhil; Ramakrishnan, Naren (ACM, 2023-11-27)
    This paper presents an integrated methodology to forecast the direction and magnitude of movements of lending rates in security markets. We develop a sequence-to-sequence (seq2seq) modeling framework that integrates feature engineering, motif mining, and temporal prediction in a unified manner to perform forecasting at scale in real-time or near real-time. We have deployed this approach in a large custodial setting demonstrating scalability to a large number of equities as well as newly introduced IPO-based securities in highly volatile environments.
  • TGEditor: Task-Guided Graph Editing for Augmenting Temporal Financial Transaction Networks
    Zhang, Shuaicheng; Zhu, Yada; Zhou, Dawei (ACM, 2023-11-27)
    Recent years have witnessed a growth of research interest in designing powerful graph mining algorithms to discover and characterize the structural pattern of interests from financial transaction networks, motivated by impactful applications including anti-money laundering, identity protection, product promotion, and service promotion. However, state-of-the-art graph mining algorithms often suffer from high generalization errors due to data sparsity, data noisiness, and data dynamics. In the context of mining information from financial transaction networks, the issues of data sparsity, noisiness, and dynamics become particularly acute. Ensuring accuracy and robustness in such evolving systems is of paramount importance. Motivated by these challenges, we propose a fundamental transition from traditional mining to augmentation in the context of financial transaction networks. To navigate this paradigm shift, we introduce TGEditor, a versatile task-guided temporal graph augmentation framework. This framework has been crafted to concurrently preserve the temporal and topological distribution of input financial transaction networks, whilst leveraging the label information from pertinent downstream tasks, denoted as T, inclusive of crucial downstream tasks like fraudulent transaction classification. In particular, to efficiently conduct task-specific augmentation, we propose two network editing operators that can be seamlessly optimized via adversarial training, while simultaneously capturing the dynamics of the data: Add operator aims to recover the missing temporal links due to data sparsity, and Prune operator is formulated to remove irrelevant/noisy temporal links due to data noisiness. Extensive results on financial transaction networks demonstrate that TGEditor 1) well preserves the data distribution of the original graph and 2) notably boosts the performance of the prediction models in the tasks of vertex classification and fraudulent transaction detection.
  • Exploring and Evaluating the Potential of 2D Computational Notebooks
    Harden, Jesse (ACM, 2023-11-05)
    Computational notebooks are popular tools for data science and presentation of computational narratives. However, their 1D structure introduces and exacerbates user issues, such as messiness, tedious navigation, inefficient use of large screen space, performance of non-linear analyses, and presentation of non-linear narratives. In this Ph.D., we address these issues through the design, exploration, and evaluation of computational notebooks which use 2D space to organize cells, or 2D computational notebooks. Specifically, we explore whether users would use 2D space, design and evaluate a 2D computational notebook prototype for individual work, explore how users collaborate in 2D space for data science and education, create and validate a theoretical understanding of how nonlinear processes in data science cause problems when forced into a linear, 1D computational notebook, and build upon the foundation we have made to refine 2D computational notebooks. Our work contributes insights on if and how expanded space usage can improve computational notebooks.
  • SAGE3 for Interactive Collaborative Visualization, Analysis, and Storytelling
    Harden, Jesse; Kirshenbaum, Nurit; Tabalba Jr., Roderick S.; Leigh, Jason; Renambot, Luc; North, Chris (ACM, 2023-11-05)
    SAGE3, the newest and most advanced generation of the Smart Amplified Group Environment, is an open-source software designed to facilitate collaboration among scientists, researchers, students, and professionals across various fields. This tutorial aims to introduce attendees to the capabilities of SAGE3, demonstrating its ability to enhance collaboration and productivity in diverse settings, from co-located office collaboration to remote collaboration to both at once, with diverse displays, from personal laptops to large-scale display walls. Participants will learn how to effectively use SAGE3 for brainstorming, data analysis, and presentation purposes, as well as installation of private collaboration servers and development of custom applications.
  • No Root Store Left Behind
    Larisch, James; Aqeel, Waqar; Chung, Taejoong; Kohler, Eddie; Levin, Dave; Maggs, Bruce; Parno, Bryan; Wilson, Christo (ACM, 2023-11-28)
    When a root certificate authority (CA) in the Web PKI misbehaves, primary root-store operators such as Mozilla and Google respond by distrusting that CA. However, full distrust is often too broad, so root stores often implement partial distrust of roots, such as only accepting a root for a subset of domains. Unfortunately, derivative root stores (e.g., Debian and Android) that mirror decisions made by primary root stores are often out-of-date and cannot implement partial distrust, leaving TLS applications vulnerable. We propose augmenting root stores with per-certificate programs called General Certificate Constraints (GCCs) that precisely control the trust of root certificates. We propose that primary root-store operators write GCCs and distribute them, along with routine root certificate additions and removals, to all root stores in the Web PKI. To justify our arguments, we review specific instances of CA certificate mis-issuance over the last decade that resulted in partial distrust of roots that derivative root stores were unable to precisely mirror. We also review prior work that illustrates the alarming lag between primary and derivative root stores. We discuss preliminary designs for GCC deployment and how GCCs could enable pre-emptive restrictions on CA power.
  • Motivational climate predicts effort and achievement in a large computer science course: examining differences across sexes, races/ethnicities, and academic majors
    Jones, Brett D.; Ellis, Margaret; Gu, Fei; Fenerci, Hande (2023-11-13)
    Background: The motivational climate within a course has been shown to be an important predictor of students’ engagement and course ratings. Because little is known about how students’ perceptions of the motivational climate in a computer science (CS) course vary by sex, race/ethnicity, and academic major, we investigated these questions: (1) To what extent do students’ achievement and perceptions of motivational climate, cost, ease, and effort vary by sex, race/ethnicity, or major? and (2) To what extent do the relationships between students’ achievement and perceptions of motivational climate, cost, and effort vary by sex, race/ethnicity, and major? Participants were enrolled in a large CS course at a large public university in the southeastern U.S. A survey was administered to 981 students in the course over three years. Path analyses and one-way MANOVAs and ANOVAs were conducted to examine differences between groups. Results: Students’ perceptions of empowerment, usefulness, interest, and caring were similar across sexes and races/ethnicities. However, women and Asian students reported lower success expectancies. Students in the same academic major as the course topic (i.e., CS) generally reported higher perceptions of the motivational climate than students who did not major or minor in the course topic. Final grades in the course did not vary by sex or race/ethnicity, except that the White and Asian students obtained higher grades than the Black students. Across sex, race/ethnicity, and major, students’ perceptions of the motivational climate were positively related to effort, which was positively related to achievement. Conclusions: One implication is that females, Asian students, and non-CS students may need more support, or different types of support, to help them believe that they can succeed in computer science courses. On average, these students were less confident in their abilities to succeed in the course and were more likely to report that they did not have the time needed to do well in the course. A second implication for instructors is that it may be possible to increase students’ effort and achievement by increasing students’ perceptions of the five key constructs in the MUSIC Model of Motivation: eMpowerment, Usefulness, Success, Interest, and Caring.
  • RoVista: Measuring and Analyzing the Route Origin Validation (ROV) in RPKI
    Li, Weitong; Lin, Zhexiao; Ashiq, Md. Ishtiaq; Aben, Emile; Fontugne, Romain; Phokeer, Amreesh; Chung, Taejoong (ACM, 2023-10-24)
    The Resource Public Key Infrastructure (RPKI) is a system to add security to the Internet routing. In recent years, the publication of Route Origin Authorization (ROA) objects, which bind IP prefixes to their legitimate origin ASN, has been rapidly increasing. However, ROAs are effective only if the routers use them to verify and filter invalid BGP announcements, a process called Route Origin Validation (ROV). There are many proposed approaches to measure the status of ROV in the wild, but they are limited in scalability or accuracy. In this paper, we present RoVista, an ROV measurement framework that leverages IP-ID side channel and in-the-wild RPKI-invalid prefix. With over 20 months of longitudinal measurement, RoVista successfully covers more than 28K ASes where 63.8% of ASes have derived benefits from ROV, although the percentage of fully protected ASes remains relatively low at 12.3%. In order to validate our findings, we have also sought input from network operators. We then evaluate the security impact of current ROV deployment and reveal misconfigurations that will weaken the protection of ROV. Lastly, we compare RoVista with other approaches and conclude with a discussion of our findings and limitations.
  • Physics-Guided Deep Generative Model For New Ligand Discovery
    Sagar, Dikshant; Risheh, Ali; Sheikh, Nida; Forouzesh, Negin (ACM, 2023-09-03)
    Structure-based drug discovery aims to identify small molecules that can attach to a specific target protein and change its functionality. Recently, deep learning has shown great promise in generating drug-like molecules with specific biochemical features and conditioned with structural features. However, they usually fail to incorporate an essential factor: the underlying physics which guides molecular formation and binding in real-world scenarios. In this work, we describe a physics-guided deep generative model for new ligand discovery, conditioned not only on the binding site but also on physics-based features that describe the binding mechanism between a receptor and a ligand. The proposed hybrid model has been tested on large protein-ligand complexes and small host-guest systems. Using the top-N methodology, on average more than 75% of the generated structures by our hybrid model were stronger binders than the original reference ligand. All of them had higher ΔG_bind (affinity) values than the ones generated by the previous state-of-the-art method by an average margin of 1.88 kcal/mol. The visualization of the top-5 ligands generated by the proposed physics-guided model and the reference deep learning model demonstrates more feasible conformations and orientations by the former. The future directions include training and testing the hybrid model on larger datasets, adding more relevant physics-based features, and interpreting the deep learning outcomes from biophysical perspectives.
  • Text-to-ESQ: A Two-Stage Controllable Approach for Efficient Retrieval of Vaccine Adverse Events from NoSQL Database
    Zhang, Wenlong; Zeng, Kangping; Yang, Xinming; Shi, Tian; Wang, Ping (ACM, 2023-09-03)
    The Vaccine Adverse Event Reporting System (VAERS) contains detailed reports of adverse events following vaccine administration. However, efficiently and accurately searching for specific information from VAERS poses significant challenges, especially for medical experts. Natural language querying (NLQ) methods tackle the challenge by translating the input questions into executable queries, allowing for the exploration of complex databases with large amounts of information. Most existing studies focus on the relational database and solve the Text-to-SQL task. However, the capability of full-text for Text-to-SQL is greatly limited by the data structures and functionality of the SQL databases. In addition, the potential of natural language querying has not been comprehensively explored in the healthcare domain. To overcome these limitations, we investigate the potential of NoSQL databases, specifically Elasticsearch, and forge a new research direction for NLQ, which we refer to as Text-to-ESQ generation. This exploration requires us to re-design various aspects of NLQ, such as the target application and the advantages of NoSQL databases. In our approach, we develop a two-stage controllable (TSC) framework consisting of a question-to-question (Q2Q) translation module and an ESQ condition extraction (ECE) module. These modules are carefully designed to efficiently retrieve information from the VAERS data stored in a NoSQL database. Additionally, we construct a dedicated question-ESQ pair dataset called VAERSESQ, to support the task in the healthcare domain. Extensive experiments were conducted on the VAERSESQ dataset to evaluate the proposed methods. The results, both quantitative and qualitative, demonstrate the accuracy and efficiency of our approach in generating queries for NoSQL databases, thus enabling efficient retrieval of VAERS data.
  • GRAPPEL: A Graph-based Approach for Early Risk Assessment of Acute Hypertension in Critical Care
    Jha, Sonal; Feng, Wu-chun (ACM, 2023-09-03)
    An acute hypertensive episode (AHE) refers to a period of extremely high blood pressure (BP) that can arise suddenly in critical care, and, if not identified early, can subject patients to the risk of severe organ damage and even potential mortality. The early assessment of AHE risk saves lives by enabling proactive medical intervention. We propose GRAPPEL, a novel graph-based approach that assesses a patient’s risk of experiencing an AHE before it occurs based on the analysis of their BP recorded over time in critical care. Our algorithm consists of two major components: (1) the construction of a time-evolving graph representation of a patient’s time-series BP data to encode the temporal BP variations into a graph and (2) the generation of real-time AHE risk scores based on quantifying the graph changes at each time step, triggered by the arrival of a new BP record. Notably, GRAPPEL provides real-time and early AHE risk assessment based solely on BP records that can be irregularly spaced in time, making it suitable for critical care environments. Via our extensive experiments on 3,476 critical-care visit records, we demonstrate the superiority of our approach over existing methods by achieving an AUC-ROC score of 91% in identifying patients at risk of experiencing an AHE up to 170 minutes in advance (and an AUC-ROC score of 94% up to 20 minutes in advance).
  • MArBLE: Hierarchical Multi-Armed Bandits for Human-in-the-Loop Set Expansion
    Wahed, Muntasir; Gruhl, Daniel; Lourentzou, Ismini (ACM, 2023-10-21)
    The modern-day research community has an embarrassment of riches regarding pre-trained AI models. Even for a simple task such as lexicon set expansion, where an AI model suggests new entities to add to a predefined seed set of entities, thousands of models are available. However, deciding which model to use for a given set expansion task is non-trivial. In hindsight, some models can be ‘off topic’ for specific set expansion tasks, while others might work well initially but quickly exhaust what they have to offer. Additionally, certain models may require more careful priming in the form of samples or feedback before being fine-tuned to the task at hand. In this work, we frame this model selection as a sequential non-stationary problem, where there exist a large number of diverse pre-trained models that may or may not fit a task at hand, and an expert is shown one suggestion at a time to include in the set or not, i.e., accept or reject the suggestion. The goal is to expand the list with the most entities as quickly as possible. We introduce MArBLE, a hierarchical multi-armed bandit method for this task, and two strategies designed to address cold-start problems. Experimental results on three set expansion tasks demonstrate MArBLE’s effectiveness compared to baselines.
  • Knowledge-Enhanced Multi-Label Few-Shot Product Attribute-Value Extraction
    Gong, Jiaying; Chen, Wei-Te; Eldardiry, Hoda (ACM, 2023-10-21)
    Existing attribute-value extraction (AVE) models require large quantities of labeled data for training. However, new products with new attribute-value pairs enter the market every day in real-world e-Commerce. Thus, we formulate AVE in multi-label few-shot learning (FSL), aiming to extract unseen attribute value pairs based on a small number of training examples. We propose a Knowledge-Enhanced Attentive Framework (KEAF) based on prototypical networks, leveraging the generated label description and category information to learn more discriminative prototypes. Besides, KEAF integrates with hybrid attention to reduce noise and capture more informative semantics for each class by calculating the label-relevant and query-related weights. To achieve multi-label inference, KEAF further learns a dynamic threshold by integrating the semantic information from both the support set and the query set. Extensive experiments with ablation studies conducted on two datasets demonstrate that our proposed model significantly outperforms other SOTA models for information extraction in few-shot learning.