Scholarly Works, Computer Science

Permanent URI for this collection

Research articles, presentations, and other scholarship


Recent Submissions

Now showing 1 - 20 of 625
  • Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals
    Singhal, Vidush; Sakka, Laith; Sundararajah, Kirshanthan; Newton, Ryan; Kulkarni, Milind (ACM, 2024)
    Many applications are designed to perform traversals on tree-like data structures. Fusing and parallelizing these traversals can substantially improve performance: fusing multiple traversals improves locality, while extracting parallelism and utilizing multi-threading can significantly reduce runtime. Prior frameworks have tried to fuse and parallelize tree traversals using coarse-grained approaches, missing fine-grained opportunities for improving performance. Other frameworks have successfully supported fine-grained fusion on heterogeneous tree types but fall short regarding parallelization. We introduce Orchard, a new framework built on top of Grafter. Orchard’s novelty lies in allowing the programmer to transform tree traversal applications by automatically applying fine-grained fusion and extracting heterogeneous parallelism. Orchard lets the programmer write general tree traversal applications in a simple and elegant embedded Domain-Specific Language (eDSL). We show that the combination of fine-grained fusion and heterogeneous parallelism performs better than either alone when the appropriate conditions are met.
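    To make the fusion idea concrete, here is a minimal Python sketch of our own (not Orchard’s eDSL or Grafter): two recursive traversals over the same tree are fused into a single pass, so each node is visited, and brought into cache, only once.

      class Node:
          def __init__(self, value, left=None, right=None):
              self.value, self.left, self.right = value, left, right

      def scale(node, factor):          # traversal 1: writes every node
          if node is None:
              return
          node.value *= factor
          scale(node.left, factor)
          scale(node.right, factor)

      def total(node):                  # traversal 2: reads every node
          if node is None:
              return 0
          return node.value + total(node.left) + total(node.right)

      def scale_and_total(node, factor):  # fused: one visit per node
          if node is None:
              return 0
          node.value *= factor
          return (node.value
                  + scale_and_total(node.left, factor)
                  + scale_and_total(node.right, factor))

      tree = Node(1, Node(2), Node(3))
      assert scale_and_total(tree, 2) == 12   # same result as scale() then total()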
  • ALERTA-Net: A Temporal Distance-Aware Recurrent Networks for Stock Movement and Volatility Prediction
    Wang, Shengkun; Bai, Yangxiao; Fu, Kaiqun; Wang, Linhan; Lu, Chang-Tien; Ji, Taoran (ACM, 2023-11-06)
    For both investors and policymakers, forecasting the stock market is essential as it serves as an indicator of economic well-being. To this end, we harness the power of social media data, a rich source of public sentiment, to enhance the accuracy of stock market predictions. Diverging from conventional methods, we pioneer an approach that integrates sentiment analysis, macroeconomic indicators, search engine data, and historical prices within a multi-attention deep learning model that decodes the complex patterns inherent in the data. We showcase the state-of-the-art performance of our proposed model using a dataset we specifically curated for predicting stock market movements and volatility.
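    As a rough illustration of the fusion idea, and not the authors’ architecture, the NumPy sketch below combines hypothetical per-source feature vectors (sentiment, macroeconomic, search, price) with scaled dot-product attention weights; the query vector stands in for a learned parameter.

      import numpy as np

      rng = np.random.default_rng(0)
      d = 16
      sources = {                        # hypothetical per-day encodings
          "sentiment": rng.normal(size=d),
          "macro":     rng.normal(size=d),
          "search":    rng.normal(size=d),
          "prices":    rng.normal(size=d),
      }
      query = rng.normal(size=d)         # stands in for a learned parameter

      feats = np.stack(list(sources.values()))         # (4, d)
      scores = feats @ query / np.sqrt(d)              # scaled dot-product scores
      weights = np.exp(scores) / np.exp(scores).sum()  # softmax over sources
      fused = weights @ feats                          # (d,) fused representation
      print(dict(zip(sources, weights.round(3))))      # per-source attention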
  • Hypergraph Text Classification for Mental Health Misleading Advice
    Alkulaib, Lulwah; Alhamadani, Abdulaziz; Sarkar, Shailik; Lu, Chang-Tien (ACM, 2023-11-06)
    This paper introduces HyperMAD, a novel Hypergraph Convolutional Network model designed for the multiclass classification of mental health advice in Arabic tweets. The model distinguishes between misleading and valid advice, further categorizing each tweet into specific classes of advice. HyperMAD leverages high-order relations between words in short texts, captured through the definition of four types of hyperedges that represent local and global contexts as well as semantic similarity. Extensive experiments demonstrate the effectiveness of HyperMAD, with results outperforming those from existing baselines. The study also includes an ablation study to investigate the significance and contribution of each hyperedge type. The paper presents a case study analyzing the accuracy and types of Arabic mental health advice on Twitter, revealing that only about 9% of the advice given in response to mental health expressions on Twitter was accurate. The paper concludes with the hope that HyperMAD can be used to flag misleading responses on social media and to point those who choose to share their mental health struggles online toward correct resources.
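    For intuition about one of the four hyperedge types, the sketch below constructs local-context hyperedges by treating each sliding window over a tweet’s tokens as a hyperedge; the window size and tokens are illustrative, and HyperMAD additionally builds global-context and semantic-similarity hyperedges.

      def local_context_hyperedges(tokens, window=3):
          """Each window of `window` consecutive tokens forms one hyperedge."""
          edges = []
          for i in range(max(1, len(tokens) - window + 1)):
              edges.append(frozenset(tokens[i:i + window]))
          return edges

      tweet = ["sleep", "eight", "hours", "to", "cure", "depression"]
      for edge in local_context_hyperedges(tweet):
          print(sorted(edge))   # each hyperedge connects all words in its window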
  • From Guest to Family: An Innovative Framework for Enhancing Memorable Experiences in the Hotel Industry
    Alhamadani, Abdulaziz; Althubiti, Khadija; Sarkar, Shailik; He, Jianfeng; Alkulaib, Lulwah; Behal, Srishti; Khan, Mahmood; Lu, Chang-Tien (ACM, 2023-11-06)
    This paper presents an innovative framework developed to identify, analyze, and generate memorable experiences in the hotel industry. People prefer memorable experiences over traditional services or products in today’s ever-changing consumer world. As a result, the hospitality industry has shifted its focus toward creating unique and unforgettable experiences rather than just providing essential services. Despite the inherent subjectivity and difficulties in quantifying experiences, the quest to capture and understand these critical elements in the hospitality context has persisted. However, traditional methods have proven inadequate due to their reliance on objective surveys or limited social media data, resulting in a lack of diversity and potential bias. Our framework addresses these issues, offering a holistic solution that effectively identifies and extracts memorable experiences from online customer reviews, discerns trends on a monthly or yearly basis, and utilizes a local LLM to generate potential, unexplored experiences. As the first successfully deployed, fast, and accurate product of its kind in the industry, this framework significantly contributes to the hotel industry’s efforts to enhance services and create compelling, personalized experiences for its customers.
  • Towards Establishing a Training Program to Support Future CS Teaching-focused Faculty
    Farghally, Mohammed; Seyam, Mohammed; Shaffer, Clifford A. (ACM, 2024-03-07)
    Computer Science programs have seen high enrollments in recent years, which has contributed to a widening capacity gap. One way to address this problem is to hire more teaching-focused faculty at both research and non-doctoral-granting institutions. Although this kind of hiring has already been taking place at several institutions, PhD-granting CS departments have not been able to produce enough PhDs to meet the increasing demand, especially for PhD holders with interest in - and capacity for - teaching. In this paper, we describe our experience with the initial phase of building a training program within our (large, land-grant, R1) institution, targeting graduate students interested in pursuing an academic teaching-focused career in CS. Through a semester-long set of meetings, conversations, and activities, we worked with participants on improving their teaching skills and applying effective pedagogies in the classroom. At the end of the semester, we surveyed participants about the value of those meetings to them, ideas for improvement, and perspectives on future directions. Most participants rated the meetings positively in terms of content relevance and usefulness, and the opportunity to connect and interact with other participants and invited faculty members. We also discuss lessons learned and best practices, which can be widely applied by other departments looking to better prepare their graduate students for CS teaching-focused faculty positions.
  • Transforming Grading Practices in the Computing Education Community
    Decker, Adrienne; Edwards, Stephen H.; McSkimming, Brian; Edmison, Bob; Rorrer, Audrey; Pérez-Quiñones, Manuel A. (ACM, 2024-03-07)
    It is often the case that computer science classrooms use traditional grading practices where points are allocated to assignments, mistakes result in point deductions, and assignment scores are combined using some form of weighted averaging to determine grades. Unfortunately, traditional grading practices have been shown to reduce achievement, discourage students, and suppress effort to such an extent that some common elements of traditional grading practices have been termed toxic. Using grades to reward or punish student behavior does not encourage learning and instead increases anxiety and stress. These toxic elements are present throughout computing education and computer science classrooms in the form of late penalties, lack of credit for code that doesn’t compile or pass certain unit tests, and more. These types of metrics, which evaluate behavior rather than learning, are often influenced by implicit bias, factors outside the classroom (e.g., part-time employment), and family situations (e.g., students who are caregivers). Often, students in these situations are disproportionately from low-socioeconomic backgrounds and predominantly students of color. Through this paper, we present a case for the adoption of equitable grading practices and a call for additional support in classroom and teaching technologies, as well as support from administration at both the department and university levels. By adopting a community of practice approach, we argue that we can support new faculty in making these changes, which would be more equitable and inclusive. Further, these practices have been shown to better support student learning and can help increase student learning gains and retention.
  • Arabic Sentiment Analysis with Noisy Deep Explainable Model
    Atabuzzaman, Md.; Shajalal, Md; Baby, Maksuda Bilkis; Boden, Alexander (ACM, 2023-12-15)
    Sentiment Analysis (SA) is an essential task for numerous real-world applications. However, the majority of SA research focuses on high-resource languages such as English and Chinese, while limited-resource languages like Arabic and Bengali receive less attention. Additionally, existing Arabic sentiment analysis methods based on advanced artificial intelligence (AI) approaches tend to operate as black boxes, making it challenging to comprehend the reasoning behind their predictions. This paper proposes an explainable sentiment classification framework for the Arabic language. We introduce a noise layer to different deep learning (DL) models, including BiLSTM and CNN-BiLSTM, to address the issue of overfitting. The proposed framework enables the explanation of specific predictions by training a local surrogate explainable model, shedding light on the reasons behind each sentiment prediction (positive or negative). Experiments were conducted on publicly available benchmark Arabic SA datasets, and the results demonstrated that the inclusion of noise layers in the DL model improves performance for the Arabic language by mitigating overfitting. Our method also outperformed several state-of-the-art approaches. Moreover, the introduction of explainability alongside the noise layer enhances transparency and accountability, making the model suitable for practical adoption in AI-enabled systems.
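    A minimal sketch of the noise-layer idea, with illustrative shapes and noise scale rather than the paper’s exact configuration: Gaussian noise perturbs intermediate representations during training only, discouraging the network from memorizing the training set.

      import numpy as np

      def noise_layer(x, stddev=0.1, training=True):
          if not training:
              return x               # identity at inference time
          return x + np.random.default_rng().normal(scale=stddev, size=x.shape)

      h = np.ones((2, 4))            # stand-in for a BiLSTM hidden state
      print(noise_layer(h))                   # perturbed during training
      print(noise_layer(h, training=False))   # unchanged at inference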
  • Mitigating Propagation Failures in Physics-informed Neural Networks using Retain-Resample-Release (R3) Sampling
    Daw, Arka; Bu, Jie; Wang, Sifan; Perdikaris, Paris; Karpatne, Anuj (2022)
    Despite the success of physics-informed neural networks (PINNs) in approximating partial differential equations (PDEs), PINNs can sometimes fail to converge to the correct solution in problems involving complicated PDEs. This is reflected in several recent studies on characterizing the “failure modes” of PINNs, although a thorough understanding of the connection between PINN failure modes and sampling strategies is missing. In this paper, we provide a novel perspective on the failure modes of PINNs by hypothesizing that training PINNs relies on successful “propagation” of the solution from initial and/or boundary condition points to interior points. We show that PINNs with poor sampling strategies can get stuck at trivial solutions if there are propagation failures, characterized by highly imbalanced PDE residual fields. To mitigate propagation failures, we propose a novel Retain-Resample-Release (R3) sampling algorithm that can incrementally accumulate collocation points in regions of high PDE residuals with little to no computational overhead. We provide an extension of R3 sampling to respect the principle of causality while solving time-dependent PDEs. We theoretically analyze the behavior of R3 sampling and empirically demonstrate its efficacy and efficiency in comparison with baselines on a variety of PDE problems.
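    The core loop is easy to state; below is a compact sketch with a stand-in residual function (the real algorithm evaluates the PDE residual of the PINN being trained): points whose residual exceeds the population mean are retained, and the rest are released and replaced by fresh uniform samples, so collocation points accumulate in high-residual regions.

      import numpy as np

      rng = np.random.default_rng(0)

      def residual(x):                   # stand-in for |PDE residual| of a PINN
          return np.abs(np.sin(8 * x))   # pretend errors concentrate near peaks

      x = rng.uniform(0, 1, size=1000)   # initial collocation points
      for step in range(50):
          r = residual(x)
          keep = x[r > r.mean()]                               # Retain
          fresh = rng.uniform(0, 1, size=x.size - keep.size)   # Release + Resample
          x = np.concatenate([keep, fresh])

      frac = (residual(x) > residual(x).mean()).mean()
      print(f"fraction of points now in high-residual regions: {frac:.2f}")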
  • SamBaS: Sampling-Based Stochastic Block Partitioning
    Wanye, Frank; Gleyzer, Vitaliy; Kao, Edward; Feng, Wu-chun (IEEE, 2024-01-25)
    Community detection is a well-studied problem with applications in domains ranging from networking to bioinformatics. Due to the rapid growth in the volume of real-world data, there is growing interest in accelerating contemporary community detection algorithms. However, the more accurate and statistically robust methods tend to be hard to parallelize. One such method is stochastic block partitioning (SBP), a community detection algorithm that works well on graphs with complex and heterogeneous community structure. In this paper, we present a sampling-based SBP (SamBaS) for accelerating SBP on sparse graphs. We characterize how various graph parameters affect the speedup and result quality of community detection with SamBaS and quantify the trade-offs therein. To evaluate SamBaS on real-world web graphs without known ground-truth communities, we introduce the partition quality score (PQS), an evaluation metric that outperforms modularity in terms of correlation with F1 score. Overall, SamBaS achieves speedups of up to 10× while maintaining result quality (and even improving result quality by over 150% on certain graphs, relative to F1 score).
  • Parallel Programming with Pictures: Choosing Your Own Adventure
    Feng, Wu-chun; Davis-Wallace, Liam (IEEE, 2023)
    Given the ubiquity of parallel computing hardware, we introduced parallel programming with pictures to the block-based Snap! environment and called it pSnap!, short for parallel Snap! We then created an accessible curriculum for students of all ages to learn how to program serially and then how to program with explicit parallelism. This paper presents a new and innovative extension to our curriculum on parallel programming with pSnap!, one that broadens its appeal by teaching the application of parallel programming as a "choose your own learning adventure" activity, inspired by the Choose Your Own Adventure book series of the 1980s and 1990s. Specifically, after students learn the basics of parallel programming with pictures, they are ready to choose their next learning adventure, which applies their newfound parallel-programming skills to create a video game of their choice, i.e., Missile Command or Do You Want to Build a Snowman?
  • On the Multi-Dimensional Acceleration of Stochastic Blockmodeling for Community Detection
    Wanye, Frank; Feng, Wu-chun (IEEE, 2023-01-01)
    Stochastic block partitioning (SBP) is a community detection algorithm that is highly accurate even on graphs with a complex community structure. However, SBP is much slower than more commonly used algorithms, such as Louvain, making SBP impractical for analyzing large real-world graphs with millions of edges. Thus, we aim to realize fast and accurate community detection on large graphs by accelerating the highly accurate SBP algorithm via sampling, parallel and distributed computing on a cluster, and algorithmic optimization. We compare our approach to other community detection algorithms, showing that SBP accelerated with our methods on 64 compute nodes is up to 1,163× faster than the official "Graph Challenge" baseline SBP implementation, while still being more accurate than the Louvain and Leiden algorithms on large graphs.
  • Exact Distributed Stochastic Block Partitioning
    Wanye, Frank; Gleyzer, Vitaliy; Kao, Edward; Feng, Wu-chun (IEEE, 2023-01-01)
    Stochastic block partitioning (SBP) is a community detection algorithm that is highly accurate even on graphs with a complex community structure, but its inherently serial nature hinders its widespread adoption by the wider scientific community. To make it practical to analyze large real-world graphs with SBP, there is a growing need to parallelize and distribute the algorithm. The current state-of-the-art distributed SBP algorithm is a divide-and-conquer approach that limits communication between compute nodes until the end of inference. This leads to the breaking of computational dependencies, which causes convergence issues as the number of compute nodes increases and when the graph is sufficiently sparse. To address this shortcoming, we introduce EDiSt, an exact distributed stochastic block partitioning algorithm. Under EDiSt, compute nodes periodically share community assignments during inference. Due to this additional communication, EDiSt improves upon the divide-and-conquer algorithm by scaling out to a larger number of compute nodes without suffering from convergence issues, even on sparse graphs. We show that EDiSt provides speedups of up to 26.9× over the divide-and-conquer approach and of up to 44.0× over shared-memory parallel SBP when scaled out to 64 compute nodes.
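    The communication pattern, though not SBP’s Bayesian inference itself, can be sketched schematically in a few lines of Python (no MPI): each worker refines labels for its own vertex partition, and every few iterations all workers exchange and merge assignments rather than waiting until the end of inference. The update rule below is a toy label-propagation stand-in.

      def edist_like(neighbors, workers=2, iterations=6, share_every=2):
          vertices = sorted(neighbors)
          assignment = {v: v for v in vertices}       # singleton communities
          parts = [vertices[i::workers] for i in range(workers)]
          for it in range(iterations):
              views = []
              for part in parts:                      # conceptually parallel
                  view = dict(assignment)             # worker-local copy
                  for v in part:
                      view[v] = min([view[u] for u in neighbors[v]] + [view[v]])
                  views.append(view)
              if (it + 1) % share_every == 0:         # periodic share + merge
                  for view, part in zip(views, parts):
                      for v in part:                  # owners publish their labels
                          assignment[v] = view[v]
          return assignment

      # Two disjoint triangles converge to two communities: {0,1,2} and {3,4,5}.
      g = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
      print(edist_like(g))   # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3, 5: 3}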
  • On the Characterization of the Performance-Productivity Gap for FPGA
    Gondhalekar, Atharva; Twomey, Thomas; Feng, Wu-chun (IEEE, 2022)
    Today, FPGA vendors provide a C++/C-based programming environment to enhance programmer productivity over using a hardware-description language at the register-transfer level. The common perception is that this enhanced productivity comes at the expense of significantly less performance, e.g., as much as an order of magnitude worse. To characterize this performance-productivity tradeoff, we propose a new composite metric, Π, that quantitatively captures the perceived discrepancy between the performance and productivity of any two given FPGA programming languages, e.g., Verilog vs. OpenCL. We then present the implications of our metric via a case study on the design of a Sobel filter (i.e., edge detector) using three different programming models (Verilog, OpenCL, and oneAPI) on an Intel Arria 10 GX FPGA accelerator. Relative to performance, our results show that an optimized OpenCL kernel achieves 84% of the performance of an optimized Verilog version of the code on a 7680×4320 (8K) image. Conversely, relative to productivity, OpenCL offers a 6.1× improvement in productivity over Verilog, while oneAPI improves productivity by an additional factor of 1.25× over OpenCL.
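    The abstract reports the measurements below but does not spell out how Π combines them, so the following is a purely hypothetical sketch of a discrepancy-style computation using the paper’s reported numbers; the combining function is our assumption, not the paper’s definition.

      def discrepancy(perf_ratio, productivity_ratio):
          # Hypothetical: > 1 means the productivity gain outweighs the
          # performance loss of the higher-level language.
          return perf_ratio * productivity_ratio

      # Numbers reported for OpenCL relative to Verilog on the Sobel case study:
      perf_opencl = 0.84   # OpenCL reaches 84% of Verilog's performance
      prod_opencl = 6.1    # OpenCL is 6.1x more productive than Verilog
      print(f"OpenCL vs. Verilog: {discrepancy(perf_opencl, prod_opencl):.2f}")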
  • AUTOPAGER: Auto-tuning Memory-Mapped I/O Parameters in Userspace
    Youssef, Karim; Shah, Niteya; Gokhale, Maya; Pearce, Roger; Feng, Wu-chun (IEEE, 2022)
    The exponential growth in dataset sizes has shifted the bottleneck of high-performance data analytics from the compute subsystem to the memory and storage subsystems. This bottleneck has led to the proliferation of non-volatile memory (NVM). To bridge the performance gap between the Linux I/O subsystem and NVM, userspace memory-mapped I/O enables application-specific I/O optimizations. Specifically, UMap, an open-source userspace memory-mapping tool, exposes tunable paging parameters to application users, such as page size and degree of paging concurrency. Tuning these parameters is computationally intractable due to the vast search space and the cost of evaluating each parameter combination. To address this challenge, we present Autopager, a tool for auto-tuning userspace paging parameters. Our evaluation, using five data-intensive applications with UMap, shows that Autopager automatically achieves performance comparable to exhaustive tuning with 10× less tuning overhead, and 16.3× and 1.52× speedups over UMap with default parameters and UMap with page-size-only tuning, respectively.
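    The abstract does not detail Autopager’s search strategy, so the sketch below simply frames the tuning problem: sample (page size, concurrency) configurations and keep the one with the lowest measured runtime, where the measurement function is a stand-in for actually running the workload under UMap with those parameters.

      import random

      PAGE_SIZES = [4 << 10, 64 << 10, 1 << 20, 16 << 20]   # bytes
      CONCURRENCY = [1, 2, 4, 8, 16, 32]                    # paging threads

      def measure_runtime(page_size, threads):
          # Stand-in cost model; a real tuner would execute and time the
          # application under UMap configured with these parameters.
          return (1e9 / page_size) / threads + 0.01 * threads * (page_size / 1e6)

      random.seed(0)
      samples = [(random.choice(PAGE_SIZES), random.choice(CONCURRENCY))
                 for _ in range(12)]                        # sampled, not exhaustive
      best = min(samples, key=lambda cfg: measure_runtime(*cfg))
      print(f"best sampled config: page_size={best[0]}, threads={best[1]}")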
  • On the Three P's of Parallel Programming for Heterogeneous Computing: Performance, Productivity, and Portability
    Gondhalekar, Atharva; Feng, Wu-chun (IEEE, 2023-01-01)
    As FPGAs and GPUs continue to make inroads into high-performance computing (HPC), the need for languages and frameworks that offer performance, productivity, and portability across heterogeneous platforms, such as FPGAs and GPUs, continues to grow. OpenCL and SYCL have emerged as frameworks that offer cross-platform functional portability between FPGAs and GPUs. While functional portability across a diverse set of platforms is an important feature of portable frameworks, achieving performance portability often requires vendor- and platform-specific optimizations. Achieving performance portability, therefore, comes at the expense of productivity. This paper presents a quantification of the tradeoffs between performance, portability, and productivity of OpenCL and SYCL. It extends and complements our prior work on quantifying performance-productivity tradeoffs between Verilog and OpenCL for the FPGA. In addition to evaluating the performance-productivity tradeoffs between OpenCL and SYCL, this work quantifies the performance portability (PP) of OpenCL and SYCL as well as their code convergence (CC), i.e., a measure of productivity across different platforms (e.g., FPGA and GPU). Using two applications as case studies (i.e., edge detection using the Sobel filter, and graph link prediction using the Jaccard similarity index), we characterize the tradeoffs between performance, portability, and productivity. Our results show that OpenCL and SYCL offer complementary tradeoffs. While OpenCL delivers better performance portability than SYCL, SYCL offers better code convergence and a 1.6× improvement in source lines of code over OpenCL.
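    Performance portability here most likely refers to the widely used metric of Pennycook et al.: the harmonic mean of an application’s architectural efficiency over a platform set, defined as zero if any platform is unsupported. The sketch below uses made-up efficiency numbers for illustration.

      def performance_portability(efficiency):
          """efficiency: platform -> achieved/peak in (0, 1]; 0 if unsupported."""
          if any(e == 0 for e in efficiency.values()):
              return 0.0   # unsupported anywhere => not portable at all
          return len(efficiency) / sum(1.0 / e for e in efficiency.values())

      print(performance_portability({"fpga": 0.70, "gpu": 0.55}))  # ~0.616
      print(performance_portability({"fpga": 0.70, "gpu": 0.0}))   # 0.0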
  • An Integrated Approach for Accelerating Stochastic Block Partitioning
    Wanye, Frank; Gleyzer, Vitaliy; Kao, Edward; Feng, Wu-chun (IEEE, 2023-01-01)
    Community detection, or graph partitioning, is a fundamental problem in graph analytics with applications in a wide range of domains including bioinformatics, social media analysis, and anomaly detection. Stochastic block partitioning (SBP) is a community detection algorithm based on sequential Bayesian inference. SBP is highly accurate even on graphs with a complex community structure. However, it does not scale well to large real-world graphs that can contain upwards of a million vertices due to its sequential nature. Approximate methods that break computational dependencies improve the scalability of SBP via parallelization and data reduction. However, these relaxations can lead to low accuracy on graphs with complex community structure. In this paper, we introduce additional synchronization steps through vertex-level data batching to improve the accuracy of such methods. We then leverage batching to develop a high-performance parallel approach that improves the scalability of SBP while maintaining accuracy. Our approach is the first to integrate data reduction, shared-memory parallelization, and distributed computation, thus efficiently utilizing distributed computing resources to accelerate SBP. On a one-million-vertex graph processed on 64 compute nodes with 128 cores each, our approach delivers a speedup of 322× over the sequential baseline and 6.8× over the distributed-only implementation. To the best of our knowledge, this Graph Challenge submission is the highest-performing SBP implementation to date and the first to process the one-million-vertex graph using SBP.
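    The batching idea can be sketched in a few lines: within a batch, moves are proposed against a frozen snapshot of the assignments (so the proposals can run in parallel), and a synchronization step applies them after each batch. The proposal rule below is a placeholder, not SBP’s Bayesian inference.

      def batched_sweep(vertices, assignment, propose, batch_size=64):
          for start in range(0, len(vertices), batch_size):
              batch = vertices[start:start + batch_size]
              snapshot = dict(assignment)                  # frozen view for the batch
              moves = {v: propose(v, snapshot) for v in batch}  # parallelizable
              assignment.update(moves)                     # synchronization step
          return assignment

      verts = list(range(10))
      result = batched_sweep(verts, {v: v for v in verts},
                             propose=lambda v, snap: snap[v] % 3, batch_size=4)
      print(result)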
  • Optimizing Performance and Storage of Memory-Mapped Persistent Data Structures
    Youssef, Karim; Raqibul Islam, Abdullah A.; Iwabuchi, Keita; Feng, Wu-chun; Pearce, Roger (IEEE, 2022-01-01)
    Persistent data structures represent a core component of high-performance data analytics. Multiple data processing systems persist data structures using memory-mapped files. Memory-mapped file I/O provides a productive and unified programming interface to different types of storage systems. However, it suffers from multiple limitations, including performance bottlenecks caused by system-wide configurations and a lack of support for efficient incremental versioning. Therefore, many such systems only support versioning via full-copy snapshots, resulting in poor performance and storage capacity bottlenecks. To address these limitations, we present Privateer 2.0, a virtual memory and storage interface that optimizes performance and storage capacity for versioned persistent data structures. Privateer 2.0 improves over the previous version by supporting userspace virtual memory management and block compression. We integrated Privateer 2.0 into Metall, a C++ persistent data structure allocator, and LMDB, a widely-used key-value store database. Privateer 2.0 yielded up to 7.5× speedup and up to 300× storage space reduction for Metall incremental snapshots, and 1.25× speedup with 11.7× storage space reduction for LMDB incremental snapshots.
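    As a schematic of block-level incremental versioning (the real system operates on virtual-memory regions in C++, not Python byte strings), the sketch below persists only the blocks whose content hash changed since the previous version instead of taking a full-copy snapshot.

      import hashlib

      BLOCK = 4096

      def snapshot(data: bytes, prev_hashes: dict) -> tuple[dict, dict]:
          """Return (new_hashes, changed_blocks) for one incremental version."""
          hashes, changed = {}, {}
          for i in range(0, len(data), BLOCK):
              block = data[i:i + BLOCK]
              h = hashlib.sha256(block).hexdigest()
              hashes[i] = h
              if prev_hashes.get(i) != h:
                  changed[i] = block               # persist only this block
          return hashes, changed

      v1 = bytearray(BLOCK * 8)
      h1, c1 = snapshot(bytes(v1), {})             # first version: all blocks
      v1[5000] = 1                                 # dirty a single block
      h2, c2 = snapshot(bytes(v1), h1)             # incremental version
      print(len(c1), "blocks in full first version;", len(c2), "block changed")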
  • Edge-Connected Jaccard Similarity for Graph Link Prediction on FPGA
    Sathre, Paul; Gondhalekar, Atharva; Feng, Wu-chun (IEEE, 2022-01-01)
    Graph analysis is a critical task in many fields, such as social networking, epidemiology, bioinformatics, and fraud detection. In particular, understanding and inferring relationships between graph elements lies at the core of many graph-based workloads. Real-world graph workloads and their associated data structures create irregular computational patterns that complicate the realization of high-performance kernels. Given these complications, there does not exist a de facto 'best' architecture, language, or algorithmic approach that simultaneously balances performance, energy efficiency, portability, and productivity. In this paper, we realize different algorithms of edge-connected Jaccard similarity for graph link prediction and characterize their performance across a broad spectrum of graphs on an Intel Stratix 10 FPGA. By utilizing a high-level synthesis (HLS)-driven, high-productivity approach (via the C++-based SYCL language), we rapidly prototype two implementations (a from-scratch edge-centric version and a faithfully ported commodity GPU implementation) that would have been intractable via a hardware description language. With these implementations, we further consider the benefit and necessity of four HLS-enabled optimizations, both in isolation and in concert, totaling seven distinct synthesized hardware pipelines. Leveraging real-world graphs of up to 516 million edges, we show empirically measured speedups of up to 9.5× over the initial HLS implementations when all optimizations work in concert.
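    For reference, the per-edge computation that the synthesized pipelines accelerate can be written as follows in plain Python (a sketch of the math, not the FPGA implementation): for each existing edge (u, v), score the overlap of the two endpoint neighborhoods.

      def jaccard_per_edge(adj):
          """adj: vertex -> set of neighbors; returns {(u, v): similarity}."""
          scores = {}
          for u, nu in adj.items():
              for v in nu:
                  if u < v:                        # score each edge once
                      nv = adj[v]
                      inter = len(nu & nv)
                      union = len(nu | nv)
                      scores[(u, v)] = inter / union if union else 0.0
          return scores

      adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
      print(jaccard_per_edge(adj))   # e.g., edge (0, 1) scores 1/3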
  • ComputeCOVID19+: Accelerating COVID-19 Diagnosis and Monitoring via High-Performance Deep Learning on CT Images
    Goel, Garvit; Gondhalekar, Atharva; Qi, Jingyuan; Zhang, Zhicheng; Cao, Guohua; Feng, Wu-chun (ACM, 2021-10-05)
    The COVID-19 pandemic has highlighted the importance of diagnosing and monitoring the disease as early and as accurately as possible. However, the reverse-transcription polymerase chain reaction (RT-PCR) test suffers from two issues: (1) protracted turnaround time from sample collection to testing result and (2) compromised test accuracy, as low as 67%, due to when and how the samples are collected, packaged, and delivered to the lab to conduct the RT-PCR test. Thus, we present ComputeCOVID19+, our computed tomography-based framework to improve the testing speed and accuracy of COVID-19 (plus its variants) via a deep learning-based network for CT image enhancement called DDnet, short for DenseNet and Deconvolution network. To demonstrate its speed and accuracy, we evaluate ComputeCOVID19+ across several sources of computed tomography (CT) images and on many heterogeneous platforms, including multi-core CPU, many-core GPU, and even FPGA. Our results show that ComputeCOVID19+ can significantly shorten the turnaround time from days to minutes and improve the testing accuracy to 91%.
  • Scaling out a combinatorial algorithm for discovering carcinogenic gene combinations to thousands of GPUs
    Dash, Sajal; Al-Hajri, Qais; Feng, Wu-chun; Garner, Harold R.; Anandakrishnan, Ramu (IEEE, 2021-05-01)
    Cancer is a leading cause of death in the US, second only to heart disease. It is primarily a result of a combination of an estimated two to nine genetic mutations (multi-hit combinations). Although a body of research has identified hundreds of cancer-causing genetic mutations, we don’t know the specific combination of mutations responsible for specific instances of cancer for most cancer types. An approximate algorithm for solving the weighted set cover problem was previously adapted to identify combinations of genes with mutations that may be responsible for individual instances of cancer. However, the algorithm’s computational requirement scales exponentially with the number of genes, making it impractical for identifying more than three-hit combinations, even after the algorithm was parallelized and scaled up to a V100 GPU. Since most cancers have been estimated to require more than three hits, we scaled out the algorithm to identify combinations of four or more hits using 1000 nodes (6000 V100 GPUs with ≈48 × 10⁶ processing cores) on the Summit supercomputer at Oak Ridge National Laboratory. Efficiently scaling out the algorithm required a series of algorithmic innovations and optimizations for balancing an exponentially divergent workload across processors and for minimizing memory latency and inter-node communication. We achieved an average strong-scaling efficiency of 90.14% (80.96%-97.96% for 200 to 1000 nodes), compared to a 100-node run, with 84.18% scaling efficiency for 1000 nodes. With experimental validation, the multi-hit combinations identified here could provide further insight into the etiology of different cancer subtypes and provide a rational basis for targeted combination therapy.
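    The underlying combinatorial kernel is the greedy approximation for weighted set cover: genes (or gene combinations) "cover" the tumor samples in which they are mutated, and the algorithm repeatedly picks the candidate covering the most still-uncovered samples per unit weight. The sketch below uses illustrative gene names and unit weights, not the paper's data.

      def greedy_weighted_set_cover(universe, sets, weight):
          uncovered, chosen = set(universe), []
          while uncovered:
              best = max(sets, key=lambda s: len(sets[s] & uncovered) / weight[s])
              if not sets[best] & uncovered:
                  break                    # remaining samples cannot be covered
              chosen.append(best)
              uncovered -= sets[best]
          return chosen

      samples = range(6)                   # hypothetical tumor samples 0..5
      mutated_in = {"TP53": {0, 1, 2}, "KRAS": {2, 3},
                    "BRCA1": {4, 5}, "EGFR": {1}}
      weights = {g: 1.0 for g in mutated_in}
      print(greedy_weighted_set_cover(samples, mutated_in, weights))
      # -> ['TP53', 'BRCA1', 'KRAS']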