Scholarly Works, Computer Science

Research articles, presentations, and other scholarship


Recent Submissions

  • Fastmove: A Comprehensive Study of On-Chip DMA and its Demonstration for Accelerating Data Movement in NVM-based Storage Systems
    Li, Jiahao; Su, Jingbo; Chen, Luofan; Li, Cheng; Zhang, Kai; Yang, Liang; Noh, Sam; Xu, Yinlong (ACM, 2024)
    Data-intensive applications executing on NVM-based storage systems experience serious bottlenecks when moving data between DRAM and NVM. We advocate for the use of the long-existing but recently neglected on-chip DMA to expedite data movement with three contributions. First, we explore new latency-oriented optimization directions, driven by a comprehensive DMA study, to design a high-performance DMA module, which significantly lowers the I/O size threshold to observe benefits. Second, we propose a new data movement engine, Fastmove, that coordinates the use of the DMA along with the CPU with judicious scheduling and load splitting such that the DMA's limitations are compensated, and the overall gains are maximized. Finally, with a general kernel-based design, simple APIs, and DAX file system integration, Fastmove allows applications to transparently exploit the DMA and its new features without code change. We run three data-intensive applications MySQL, GraphWalker, and Filebench atop NOVA, ext4-DAX, and XFS-DAX, with standard benchmarks like TPC-C, and popular graph algorithms like PageRank. Across single- and multi-socket settings, compared to the conventional CPU-only NVM accesses, Fastmove introduces to TPC-C with MySQL 1.13-2.16× speedups of peak throughput, reduces the average latency by 17.7-60.8%, and saves 37.1-68.9% CPU usage spent in data movement. It also shortens the execution time of graph algorithms with GraphWalker by 39.7-53.4%, and introduces 1.12-1.27× throughput speedups for Filebench.
  • Neural Methods for Data-to-text Generation
    Sharma, Mandar; Gogineni, Ajay; Ramakrishnan, Naren (ACM, 2024)
    The neural boom that has sparked natural language processing (NLP) research throughout the last decade has similarly led to significant innovations in data-to-text generation (DTG). This survey offers a consolidated view into the neural DTG paradigm with a structured examination of the approaches, benchmark datasets, and evaluation protocols. This survey draws boundaries separating DTG from the rest of the natural language generation (NLG) landscape, encompassing an up-to-date synthesis of the literature, and highlighting the stages of technological adoption from within and outside the greater NLG umbrella. With this holistic view, we highlight promising avenues for DTG research that not only focus on the design of linguistically capable systems but also systems that exhibit fairness and accountability.
  • CHUNAV: Analyzing Hindi Hate Speech and Targeted Groups in Indian Election Discourse
    Jafri, Farhan; Rauniyar, Kritesh; Thapa, Surendrabikram; Siddiqui, Mohammad; Khushi, Matloob; Naseem, Usman (ACM, 2024)
    In the ever-evolving landscape of online discourse and political dialogue, the rise of hate speech poses a significant challenge to maintaining a respectful and inclusive digital environment. The context becomes particularly complex when considering the Hindi language—a low-resource language with limited available data. To address this pressing concern, we introduce the CHUNAV dataset—a collection of 11,457 Hindi tweets gathered during assembly elections in various states. CHUNAV is purpose-built for hate speech categorization and the identification of target groups. The dataset is a valuable resource for exploring hate speech within the distinctive socio-political context of Indian elections. The tweets within CHUNAV have been meticulously categorized into "Hate" and "Non-Hate" labels, and further subdivided to pinpoint the specific targets of hate speech, including "Individual", "Organization", and "Community" labels (as shown in Figure 1). Furthermore, this paper presents multiple benchmark models for hate speech detection, along with an innovative ensemble and oversampling-based method. The paper also delves into the results of topic modeling, all aimed at effectively addressing hate speech and target identification in the Hindi language. This contribution seeks to advance the field of hate speech analysis and foster a safer and more inclusive online space within the distinctive realm of Indian Assembly Elections. The dataset is available at
  • Multi-Label Zero-Shot Product Attribute-Value Extraction
    Gong, Jiaying; Eldardiry, Hoda (ACM, 2024-05-13)
    E-commerce platforms should provide detailed product descriptions (attribute values) for effective product search and recommendation. However, attribute value information is typically not available for new products. To predict unseen attribute values, large quantities of labeled training data are needed to train a traditional supervised learning model. Typically, it is difficult, time-consuming, and costly to manually label large quantities of new product profiles. In this paper, we propose a novel method to efficiently and effectively extract unseen attribute values from new products in the absence of labeled data (zero-shot setting). We propose HyperPAVE, a multilabel zero-shot attribute value extraction model that leverages inductive inference in heterogeneous hypergraphs. In particular, our proposed technique constructs heterogeneous hypergraphs to capture complex higher-order relations (i.e. user behavior information) to learn more accurate feature representations for graph nodes. Furthermore, our proposed HyperPAVE model uses an inductive link prediction mechanism to infer future connections between unseen nodes. This enables HyperPAVE to identify new attribute values without the need for labeled training data. We conduct extensive experiments with ablation studies on different categories of the MAVE dataset. The results demonstrate that our proposed HyperPAVE model significantly outperforms existing classification-based, generation-based large language models for attribute value extraction in the zero-shot setting.
  • Towards Understanding Family Privacy and Security Literacy Conversations at Home: Design Implications for Privacy Literacy Interfaces
    Alghythee, Kenan; Hrncic, Adel; Singh, Karthik; Kunisetty, Sumanth; Yao, Yaxing; Soni, Nikita (ACM, 2024-05-11)
    Policymakers and researchers have emphasized the crucial role of parent-child conversations in shaping children’s digital privacy and security literacy. Despite this emphasis, little is known about the current nature of these parent-child conversations, including their content, structure, and children’s engagement during these conversations. This paper presents the findings of an interview study involving 13 parents of children ages under 13 reflecting on their privacy literacy practices at home. Through qualitative thematic analysis, we identify five categories of parent-child privacy and security conversations and examine parents’ perceptions of their children’s engagement during these discussions. Our findings show that although parents used different conversation approaches, rule-based conversations were one of the most common approaches taken by our participants, with example-based conversations perceived to be effective by parents. We propose important design implications for developing effective privacy educational technologies for families to support parent-child conversations.
  • Leveraging Prompt-Based Large Language Models: Predicting Pandemic Health Decisions and Outcomes Through Social Media Language
    Ding, Xiaohan; Carik, Buse; Gunturi, Uma Sushmitha; Reyna, Valerie; Rho, Eugenia (ACM, 2024-05-11)
    We introduce a multi-step reasoning framework using prompt-based LLMs to examine the relationship between social media language patterns and trends in national health outcomes. Grounded in fuzzy-trace theory, which emphasizes the importance of “gists” of causal coherence in effective health communication, we introduce Role-Based Incremental Coaching (RBIC), a prompt-based LLM framework, to identify gists at-scale. Using RBIC, we systematically extract gists from subreddit discussions opposing COVID-19 health measures (Study 1). We then track how these gists evolve across key events (Study 2) and assess their influence on online engagement (Study 3). Finally, we investigate how the volume of gists is associated with national health trends like vaccine uptake and hospitalizations (Study 4). Our work is the first to empirically link social media linguistic patterns to real-world public health trends, highlighting the potential of prompt-based LLMs in identifying critical online discussion patterns that can form the basis of public health communication strategies.
  • An Empathy-Based Sandbox Approach to Bridge the Privacy Gap among Attitudes, Goals, Knowledge, and Behaviors
    Chen, Chaoran; Li, Weijun; Song, Wenxin; Ye, Yanfang; Yao, Yaxing; Li, Toby (ACM, 2024-05-11)
    Managing privacy to reach privacy goals is challenging, as evidenced by the privacy attitude-behavior gap. Mitigating this discrepancy requires solutions that account for both system opaqueness and users’ hesitations in testing different privacy settings due to fears of unintended data exposure. We introduce an empathy-based approach that allows users to experience how privacy attributes may alter system outcomes in a risk-free sandbox environment from the perspective of artificially generated personas. To generate realistic personas, we introduce a novel pipeline that augments the outputs of large language models (e.g., GPT-4) using few-shot learning, contextualization, and chain of thoughts. Our empirical studies demonstrated the adequate quality of generated personas and highlighted the changes in privacy-related applications (e.g., online advertising) caused by different personas. Furthermore, users demonstrated cognitive and emotional empathy towards the personas when interacting with our sandbox. We offered design implications for downstream applications in improving user privacy literacy.
  • Exploring the Effectiveness of Time-lapse Screen Recording for Self-Reflection in Work Context
    Hu, Donghan; Lee, Sang Won (ACM, 2024-05-11)
    Effective self-tracking in working contexts empowers individuals to explore and reflect on past activities. Recordings of computer activities contain rich metadata that can offer valuable insight into users’ previous tasks and endeavors. However, presenting a simple summary of time usage may not effectively engage users with data because it is not contextualized, and users may not understand what to do with the data. This work explores time-lapse videos as a visual-temporal medium to facilitate self-reflection among workers in productivity contexts. To explore this space, we conducted a four-week study (n = 15) to investigate how a computer screen’s history of states can help workers recall previous undertakings and gain comprehensive insights via self-reflection. Our results support that watching time-lapse videos can enhance self-reflection more effectively than traditional self-tracking tools by providing contextual clues about users’ past activities. The experience with both traditional tools and time-lapse videos resulted in increased productivity. Additionally, time-lapse videos assist users in cultivating a positive understanding of their work. We discuss how multimodal cues, such as time-lapse videos, can complement personal informatics tools.
  • Griot-Style Methodology: Longitudinal Study of Navigating Design With Unwritten Stories
    Kotut, Lindah; Bhatti, Neelma; Hassan, Taha; Haqq, Derek; Saaty, Morva (ACM, 2024-05-11)
    We describe a seven-year longitudinal study conducted in collaboration with an indigenous community in Kenya. We detail the process of conducting research with an oral community: the deliberate practice of understanding and collecting stories; working with inter-generational community to envision and design technologies that support their ways of storytelling and story preservation; and to influence the design of other technologies. We chronicle how we contended with translating oral stories with rich metaphors to new mediums, and the dimensions of trust we have established and continue to reinforce. We offer our griot-style methodology, informed by working with the community and retrofitting existing HCI approaches: as an example model of what has worked, and the dimensions of challenges at each stage of the research work. The griot-style methodology has prompted a reflection on how we approach research, and present opportunities for other HCI research and practice of handling community stories.
  • Evaluating Navigation and Comparison Performance of Computational Notebooks on Desktop and in Virtual Reality
    In, Sungwon; Krokos, Eric; Whitley, Kirsten; North, Christopher L.; Yang, Yalong (ACM, 2024-05-11)
    The computational notebook serves as a versatile tool for data analysis. However, its conventional user interface falls short of keeping pace with the ever-growing data-related tasks, signaling the need for novel approaches. With the rapid development of interaction techniques and computing environments, there is a growing interest in integrating emerging technologies in data-driven workflows. Virtual reality, in particular, has demonstrated its potential in interactive data visualizations. In this work, we aimed to experiment with adapting computational notebooks into VR and verify the potential benefits VR can bring. We focus on the navigation and comparison aspects as they are primitive components in analysts’ workflow. To further improve comparison, we have designed and implemented a Branching&Merging functionality. We tested computational notebooks on the desktop and in VR, both with and without the added Branching&Merging capability. We found VR significantly facilitated navigation compared to desktop, and the ability to create branches enhanced comparison.
  • Broadly Enabling KLEE to Effortlessly Find Unrecoverable Errors in Rust
    Zhang, Ying; Li, Peng; Ding, Yu; Wang, Lingxiang; Williams, Dan; Meng, Na (ACM, 2024)
    Rust is a general-purpose programming language designed for performance and safety. Unrecoverable errors (e.g., Divide by Zero) in Rust programs are critical, as they signal bad program states and terminate programs abruptly. Previous work has contributed to utilizing KLEE, a dynamic symbolic test engine, to verify the program would not panic. However, it is difficult for engineers who lack domain expertise to write test code correctly. Besides, the effectiveness of KLEE in finding panics in production Rust code has not been evaluated. We created an approach, called PanicCheck, to hide the complexity of verifying Rust programs with KLEE. Using PanicCheck, engineers only need to annotate the function-to-verify with #[panic_check]. The annotation guides PanicCheck to generate test code, compile the function together with tests, and execute KLEE for verification. After applying PanicCheck to 21 open-source and 2 closed-source projects, we found 61 test inputs that triggered panics; 59 of the 61 panics have been addressed by developers so far. Our research shows promising verification results by KLEE, while revealing technical challenges in using KLEE. Our experience will shed light on future practice and research in program verification.
  • A First Look at the General Data Protection Regulation (GDPR) in Open-Source Software
    Franke, Lucas; Liang, Huayu; Brantly, Aaron F.; Davis, James C.; Brown, Chris (ACM, 2024-04-14)
    This poster describes work on the General Data Protection Regulation (GDPR) in open-source software. Although open-source software is commonly integrated into regulated software, and thus must be engineered or adapted for compliance, we do not know how such laws impact open-source software development. We surveyed open-source developers (N=47) to understand their experiences and perceptions of GDPR. We learned many engineering challenges, primarily regarding the management of users’ data and assessments of compliance. We call for improved policy-related resources, especially tools to support data privacy regulation implementation and compliance in open-source software.
  • GazeIntent: Adapting Dwell-time Selection in VR Interaction with Real-time Intent Modeling
    Narkar, Anish; Michalak, Jan; Peacock, Candace; David-John, Brendan (ACM, 2024-05-28)
    The use of ML models to predict a user’s cognitive state from behavioral data has been studied for various applications which includes predicting the intent to perform selections in VR. We developed a novel technique that uses gaze-based intent models to adapt dwell-time thresholds to aid gaze-only selection. A dataset of users performing selection in arithmetic tasks was used to develop intent prediction models (F1 = 0.94). We developed GazeIntent to adapt selection dwell times based on intent model outputs and conducted an end-user study with returning and new users performing additional tasks with varied selection frequencies. Personalized models for returning users effectively accounted for prior experience and were preferred by 63% of users. Our work provides the field with methods to adapt dwell-based selection to users, account for experience over time, and consider tasks that vary by selection frequency.
  • An Interpretable Ensemble of Graph and Language Models for Improving Search Relevance in E-Commerce
    Choudhary, Nurendra; Huang, Edward W.; Subbian, Karthik; Reddy, Chandan (ACM, 2024-05-13)
    The problem of search relevance in the E-commerce domain is a challenging one since it involves understanding the intent of a user’s short nuanced query and matching it with the appropriate products in the catalog. This problem has traditionally been addressed using language models (LMs) and graph neural networks (GNNs) to capture semantic and inter-product behavior signals, respectively. However, the rapid development of new architectures has created a gap between research and the practical adoption of these techniques. Evaluating the generalizability of these models for deployment requires extensive experimentation on complex, real-world datasets, which can be non-trivial and expensive. Furthermore, such models often operate on latent space representations that are incomprehensible to humans, making it difficult to evaluate and compare the effectiveness of different models. This lack of interpretability hinders the development and adoption of new techniques in the field. To bridge this gap, we propose Plug and Play Graph LAnguage Model (PP-GLAM), an explainable ensemble of plug and play models. Our approach uses a modular framework with uniform data processing pipelines. It employs additive explanation metrics to independently decide whether to include (i) language model candidates, (ii) GNN model candidates, and (iii) inter-product behavioral signals. For the task of search relevance, we show that PP-GLAM outperforms several state-of-the-art baselines as well as a proprietary model on real-world multilingual, multi-regional e-commerce datasets. To promote better model comprehensibility and adoption, we also provide an analysis of the explainability and computational complexity of our model. We also provide the public codebase and provide a deployment strategy for practical implementation.
  • The probability of chromatin to be at the nuclear lamina has no systematic effect on its transcription level in fruit flies
    Afanasyev, Alexander Y.; Kim, Yoonjin; Tolokh, Igor S.; Sharakhov, Igor V.; Onufriev, Alexey V. (2024-05-06)
    Background: Multiple studies have demonstrated a negative correlation between gene expression and positioning of genes at the nuclear envelope (NE) lined by nuclear lamina, but the exact relationship remains unclear, especially in light of the highly stochastic, transient nature of the gene association with the NE. Results: In this paper, we ask whether there is a causal, systematic, genome-wide relationship between the expression levels of the groups of genes in topologically associating domains (TADs) of Drosophila nuclei and the probabilities of TADs to be found at the NE. To investigate the nature of this possible relationship, we combine a coarse-grained dynamic model of the entire Drosophila nucleus with genome-wide gene expression data; we analyze the TAD averaged transcription levels of genes against the probabilities of individual TADs to be in contact with the NE in the control and lamins-depleted nuclei. Our findings demonstrate that, within the statistical error margin, the stochastic positioning of Drosophila melanogaster TADs at the NE does not, by itself, systematically affect the mean level of gene expression in these TADs, while the expected negative correlation is confirmed. The correlation is weak and disappears completely for TADs not containing lamina-associated domains (LADs) or TADs containing LADs, considered separately. Verifiable hypotheses regarding the underlying mechanism for the presence of the correlation without causality are discussed. These include the possibility that the epigenetic marks and affinity to the NE of a TAD are determined by various non-mutually exclusive mechanisms and remain relatively stable during interphase. Conclusions: At the level of TADs, the probability of chromatin being in contact with the nuclear envelope has no systematic, causal effect on the transcription level in Drosophila. The conclusion is reached by combining model-derived time-evolution of TAD locations within the nucleus with their experimental gene expression levels.
  • SEVeriFast: Minimizing the root of trust for fast startup of SEV microVMs
    Holmes, Benjamin; Waterman, Jason; Williams, Dan (ACM, 2024-04-27)
    Serverless computing platforms rely on fast container initialization to provide low latency and high throughput for requests. While hardware enforced trusted execution environments (TEEs) have gained popularity, confidential computing has yet to be widely adopted by latency-sensitive platforms due to its additional initialization overhead. We investigate the application of AMD’s Secure Encrypted Virtualization (SEV) to microVMs and find that current startup times for confidential VMs are prohibitively slow due to the high cost of establishing a root of trust for each new VM. We present SEVeriFast, a new bootstrap scheme for SEV VMs that reevaluates current microVM techniques for fast boot, such as eliminating bootstrap stages and bypassing guest kernel decompression. Counter-intuitively, we find that introducing an additional bootstrap component and reintroducing kernel compression optimizes the cold boot performance of SEV microVMs by reducing the cost of measurement on the critical boot path and producing a minimal root of trust. To our knowledge, SEVeriFast is the first work to explore the trade-offs associated with booting confidential microVMs and provide a set of guiding principles as a step toward confidential serverless. We show that SEVeriFast improves cold start performance of SEV VMs over current methods by 86-93%.
  • Energy-Adaptive Buffering for Efficient, Responsive, and Persistent Batteryless Systems
    Williams, Harrison; Hicks, Matthew (ACM, 2024-04-27)
    Batteryless energy harvesting systems enable a wide array of new sensing, computation, and communication platforms untethered by power delivery or battery maintenance demands. Energy harvesters charge a buffer capacitor from an unreliable environmental source until enough energy is stored to guarantee a burst of operation despite changes in power input. Current platforms use a fixed-size buffer chosen at design time to meet constraints on charge time or application longevity, but static energy buffers are a poor fit for the highly volatile power sources found in real-world deployments: fixed buffers waste energy both as heat when they reach capacity during a power surplus and as leakage when they fail to charge the system during a power deficit. To maximize batteryless system performance in the face of highly dynamic input power, we propose REACT: a responsive buffering circuit which varies total capacitance according to net input power. REACT uses a variable capacitor bank to expand capacitance to capture incoming energy during a power surplus and reconfigures internal capacitors to reclaim additional energy from each capacitor as power input falls. Compared to fixed-capacity systems, REACT captures more energy, maximizes usable energy, and efficiently decouples system voltage from stored charge, enabling low-power and high-performance designs previously limited by ambient power. Our evaluation on real-world platforms shows that REACT eliminates the tradeoff between responsiveness, efficiency, and longevity, increasing the energy available for useful work by an average 25.6% over static buffers optimized for reactivity and capacity, improving event responsiveness by an average 7.7× without sacrificing capacity, and enabling programmer directed longevity guarantees.
  • Totoro: A Scalable Federated Learning Engine for the Edge
    Ching, Cheng-Wei; Chen, Xin; Kim, Taehwan; Ji, Bo; Wang, Qingyang; Da Silva, Dilma; Hu, Liting (ACM, 2024-04-22)
    Federated Learning (FL) is an emerging distributed machine learning (ML) technique that enables in-situ model training and inference on decentralized edge devices. We propose Totoro, a novel scalable FL engine, that enables massive FL applications to run simultaneously on edge networks. The key insight is to explore a distributed hash table (DHT)-based peer-to-peer (P2P) model to re-architect the centralized FL system design into a fully decentralized one. In contrast to previous studies where many FL applications shared one centralized parameter server, Totoro assigns a dedicated parameter server to each individual application. Any edge node can act as any application’s coordinator, aggregator, client selector, worker (participant device), or any combination of the above, thereby radically improving scalability and adaptivity. Totoro introduces three innovations to realize its design: a locality-aware P2P multi-ring structure, a publish/subscribe-based forest abstraction, and a bandit-based exploitation-exploration path planning model. Real-world experiments on 500 Amazon EC2 servers show that Totoro scales gracefully with the number of FL applications and N edge nodes, speeds up the total training time by 1.2×-14.0×, achieves O(log N) hops for model dissemination and gradient aggregation with millions of nodes, and efficiently adapts to the practical edge networks and churns.
  • FLOAT: Federated Learning Optimizations with Automated Tuning
    Khan, Ahmad; Khan, Azal Ahmad; Abdelmoniem, Ahmed M.; Fountain, Samuel; Butt, Ali R.; Anwar, Ali (ACM, 2024-04-22)
    Federated Learning (FL) has emerged as a powerful approach that enables collaborative distributed model training without the need for data sharing. However, FL grapples with inherent heterogeneity challenges leading to issues such as stragglers, dropouts, and performance variations. Selection of clients to run an FL instance is crucial, but existing strategies introduce biases and participation issues and do not consider resource efficiency. Communication and training acceleration solutions proposed to increase client participation also fall short due to the dynamic nature of system resources. We address these challenges in this paper by designing FLOAT, a novel framework designed to boost FL client resource awareness. FLOAT optimizes resource utilization dynamically for meeting training deadlines, and mitigates stragglers and dropouts through various optimization techniques; leading to enhanced model convergence and improved performance. FLOAT leverages multi-objective Reinforcement Learning with Human Feedback (RLHF) to automate the selection of the optimization techniques and their configurations, tailoring them to individual client resource conditions. Moreover, FLOAT seamlessly integrates into existing FL systems, maintaining non-intrusiveness and versatility for both asynchronous and synchronous FL settings. As per our evaluations, FLOAT increases accuracy by up to 53%, reduces client dropouts by up to 78×, and improves communication, computation, and memory utilization by up to 81×, 44×, and 20× respectively.
  • From Awareness to Action: Exploring End-User Empowerment Interventions for Dark Patterns in UX
    Lu, Yuwen; Zhang, Chao; Yang, Yuewen; Yao, Yaxing; Li, Toby (ACM, 2024-04-23)
    The study of UX dark patterns, i.e., UI designs that seek to manipulate user behaviors, often for the benefit of online services, has drawn significant attention in the CHI and CSCW communities in recent years. To complement previous studies in addressing dark patterns from (1) the designer’s perspective on education and advocacy for ethical designs; and (2) the policymaker’s perspective on new regulations, we propose an end-user-empowerment intervention approach that helps users (1) raise the awareness of dark patterns and understand their underlying design intents; (2) take actions to counter the effects of dark patterns using a web augmentation approach. Through a two-phase co-design study, including 5 co-design workshops (N=12) and a 2-week technology probe study (N=15), we reported findings on the understanding of users' needs, preferences, and challenges in handling dark patterns and investigated the feedback and reactions to users' awareness of and action on dark patterns being empowered in a realistic in-situ setting.