Scholarly Works, Computer Science

Research articles, presentations, and other scholarship

Recent Submissions

Now showing 1 - 20 of 589
  • Mitigating Propagation Failures in Physics-informed Neural Networks using Retain-Resample-Release (R3) Sampling
    Daw, Arka; Bu, Jie; Wang, Sifan; Perdikaris, Paris; Karpatne, Anuj (2022)
    Despite the success of physics-informed neural networks (PINNs) in approximating partial differential equations (PDEs), PINNs can sometimes fail to converge to the correct solution in problems involving complicated PDEs. This is reflected in several recent studies on characterizing the “failure modes” of PINNs, although a thorough understanding of the connection between PINN failure modes and sampling strategies is missing. In this paper, we provide a novel perspective on the failure modes of PINNs by hypothesizing that training PINNs relies on successful “propagation” of the solution from initial and/or boundary condition points to interior points. We show that PINNs with poor sampling strategies can get stuck at trivial solutions if there are propagation failures, characterized by highly imbalanced PDE residual fields. To mitigate propagation failures, we propose a novel Retain-Resample-Release (R3) sampling algorithm that can incrementally accumulate collocation points in regions of high PDE residuals with little to no computational overhead. We provide an extension of R3 sampling to respect the principle of causality while solving time-dependent PDEs. We theoretically analyze the behavior of R3 sampling and empirically demonstrate its efficacy and efficiency in comparison with baselines on a variety of PDE problems.
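The core of R3 sampling, as described in the abstract, is to keep collocation points whose PDE residual is high and resample the rest. Below is a minimal NumPy sketch of one such retain/resample step; the residual function, the mean-residual retain threshold, the unit-hypercube domain, and the fixed population size are illustrative assumptions rather than the paper's exact procedure.
```python
import numpy as np

def r3_step(points, residual_fn, n_total, rng):
    """One retain/resample step: keep high-residual points, resample the rest.

    points      : (n, d) array of current collocation points
    residual_fn : callable mapping points -> absolute PDE residuals (assumed)
    n_total     : fixed population size across iterations
    """
    res = residual_fn(points)                      # PDE residual magnitude at each point
    keep = points[res > res.mean()]                # retain points in high-residual regions
    n_new = n_total - len(keep)                    # release the remaining budget
    fresh = rng.uniform(0.0, 1.0, (n_new, points.shape[1]))  # resample uniformly in [0, 1]^d
    return np.vstack([keep, fresh])

# toy usage: residual peaked near x = 0.5 on the unit interval
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, (128, 1))
for _ in range(5):
    pts = r3_step(pts, lambda p: np.exp(-100 * (p[:, 0] - 0.5) ** 2), 128, rng)
print(np.mean(np.abs(pts - 0.5)))  # shrinks as points concentrate near the residual peak
```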
  • SamBaS: Sampling-Based Stochastic Block Partitioning
    Wanye, Frank; Gleyzer, Vitaliy; Kao, Edward; Feng, Wu-chun (IEEE, 2024-01-25)
    Community detection is a well-studied problem with applications in domains ranging from networking to bioinformatics. Due to the rapid growth in the volume of real-world data, there is growing interest in accelerating contemporary community detection algorithms. However, the more accurate and statistically robust methods tend to be hard to parallelize. One such method is stochastic block partitioning (SBP) – a community detection algorithm that works well on graphs with complex and heterogeneous community structure. In this paper, we present a sampling-based SBP (SamBaS) for accelerating SBP on sparse graphs. We characterize how various graph parameters affect the speedup and result quality of community detection with SamBaS and quantify the trade-offs therein. To evaluate SamBaS on real-world web graphs without known ground-truth communities, we introduce partition quality score (PQS), an evaluation metric that outperforms modularity in terms of correlation with F1 score. Overall, SamBaS achieves speedups of up to 10× while maintaining result quality (and even improving result quality by over 150% on certain graphs, relative to F1 score).
  • An Integrated Approach for Accelerating Stochastic Block Partitioning
    Wanye, Frank; Gleyzer, Vitaliy; Kao, Edward; Feng, Wu-chun (IEEE, 2023-01-01)
    Community detection, or graph partitioning, is a fundamental problem in graph analytics with applications in a wide range of domains including bioinformatics, social media analysis, and anomaly detection. Stochastic block partitioning (SBP) is a community detection algorithm based on sequential Bayesian inference. SBP is highly accurate even on graphs with a complex community structure. However, it does not scale well to large real-world graphs that can contain upwards of a million vertices due to its sequential nature. Approximate methods that break computational dependencies improve the scalability of SBP via parallelization and data reduction. However, these relaxations can lead to low accuracy on graphs with complex community structure. In this paper, we introduce additional synchronization steps through vertex-level data batching to improve the accuracy of such methods. We then leverage batching to develop a high-performance parallel approach that improves the scalability of SBP while maintaining accuracy. Our approach is the first to integrate data reduction, shared-memory parallelization, and distributed computation, thus efficiently utilizing distributed computing resources to accelerate SBP. On a one-million vertex graph processed on 64 compute nodes with 128 cores each, our approach delivers a speedup of 322× over the sequential baseline and 6.8× over the distributed-only implementation. To the best of our knowledge, this Graph Challenge submission is the highest-performing SBP implementation to date and the first to process the one-million vertex graph using SBP.
  • Edge-Connected Jaccard Similarity for Graph Link Prediction on FPGA
    Sathre, Paul; Gondhalekar, Atharva; Feng, Wu-chun (IEEE, 2022-01-01)
    Graph analysis is a critical task in many fields, such as social networking, epidemiology, bioinformatics, and fraud detection. In particular, understanding and inferring relationships between graph elements lies at the core of many graph-based workloads. Real-world graph workloads and their associated data structures create irregular computational patterns that complicate the realization of high-performance kernels. Given these complications, there does not exist a de facto 'best' architecture, language, or algorithmic approach that simultaneously balances performance, energy efficiency, portability, and productivity. In this paper, we realize different algorithms of edge-connected Jaccard similarity for graph link prediction and characterize their performance across a broad spectrum of graphs on an Intel Stratix 10 FPGA. By utilizing a high-level synthesis (HLS)-driven, high-productivity approach (via the C++-based SYCL language), we rapidly prototype two implementations - a from-scratch edge-centric version and a faithfully-ported commodity GPU implementation - which would have been intractable via a hardware description language. With these implementations, we further consider the benefit and necessity of four HLS-enabled optimizations, both in isolation and in concert - totaling seven distinct synthesized hardware pipelines. Leveraging real-world graphs of up to 516 million edges, we show empirically-measured speedups of up to 9.5× over the initial HLS implementations when all optimizations work in concert.
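Edge-connected Jaccard similarity scores each edge by comparing the neighbor sets of its two endpoints. A minimal pure-Python reference of that computation is sketched below, assuming a toy undirected edge list; the paper's contribution lies in the FPGA hardware pipelines that accelerate this kernel, which the sketch does not capture.
```python
from collections import defaultdict

def edge_jaccard(edges):
    """Jaccard similarity of endpoint neighborhoods for every edge (CPU reference sketch)."""
    adj = defaultdict(set)
    for u, v in edges:                      # build undirected adjacency sets
        adj[u].add(v)
        adj[v].add(u)
    scores = {}
    for u, v in edges:
        inter = len(adj[u] & adj[v])        # shared neighbors
        union = len(adj[u] | adj[v])        # combined neighborhoods
        scores[(u, v)] = inter / union if union else 0.0
    return scores

# toy usage
print(edge_jaccard([(0, 1), (1, 2), (0, 2), (2, 3)]))
```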
  • ComputeCOVID19+: Accelerating COVID-19 Diagnosis and Monitoring via High-Performance Deep Learning on CT Images
    Goel, Garvit; Gondhalekar, Atharva; Qi, Jingyuan; Zhang, Zhicheng; Cao, Guohua; Feng, Wu (ACM, 2021-10-05)
    The COVID-19 pandemic has highlighted the importance of diagnosis and monitoring as early and accurately as possible. However, the reverse-transcription polymerase chain reaction (RT-PCR) test results in two issues: (1) protracted turnaround time from sample collection to testing result and (2) compromised test accuracy, as low as 67%, due to when and how the samples are collected, packaged, and delivered to the lab to conduct the RT-PCR test. Thus, we present ComputeCOVID19+, our computed tomography-based framework to improve the testing speed and accuracy of COVID-19 (plus its variants) via a deep learning-based network for CT image enhancement called DDnet, short for DenseNet and Deconvolution network. To demonstrate its speed and accuracy, we evaluate ComputeCOVID19+ across several sources of computed tomography (CT) images and on many heterogeneous platforms, including multi-core CPU, many-core GPU, and even FPGA. Our results show that ComputeCOVID19+ can significantly shorten the turnaround time from days to minutes and improve the testing accuracy to 91%.
  • Scaling out a combinatorial algorithm for discovering carcinogenic gene combinations to thousands of GPUs
    Dash, Sajal; Al-Hajri, Qais; Feng, Wu-chun; Garner, Harold R.; Anandakrishnan, Ramu (IEEE, 2021-05-01)
    Cancer is a leading cause of death in the US, second only to heart disease. It is primarily a result of a combination of an estimated two to nine genetic mutations (multi-hit combinations). Although a body of research has identified hundreds of cancer-causing genetic mutations, we don't know the specific combination of mutations responsible for specific instances of cancer for most cancer types. An approximate algorithm for solving the weighted set cover problem was previously adapted to identify combinations of genes with mutations that may be responsible for individual instances of cancer. However, the algorithm's computational requirement scales exponentially with the number of genes, making it impractical for identifying more than three-hit combinations, even after the algorithm was parallelized and scaled up to a V100 GPU. Since most cancers have been estimated to require more than three hits, we scaled out the algorithm to identify combinations of four or more hits using 1000 nodes (6000 V100 GPUs with ≈48 × 10⁶ processing cores) on the Summit supercomputer at Oak Ridge National Laboratory. Efficiently scaling out the algorithm required a series of algorithmic innovations and optimizations for balancing an exponentially divergent workload across processors and for minimizing memory latency and inter-node communication. We achieved an average strong scaling efficiency of 90.14% (80.96%-97.96% for 200 to 1000 nodes), compared to a 100 node run, with 84.18% scaling efficiency for 1000 nodes. With experimental validation, the multi-hit combinations identified here could provide further insight into the etiology of different cancer subtypes and provide a rational basis for targeted combination therapy.
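The underlying combinatorial step adapts an approximate algorithm for weighted set cover. The sketch below shows the classic greedy approximation with hypothetical gene/sample data, to give a feel for the baseline being scaled out; the paper's multi-hit adaptation, GPU parallelization, and load balancing are not reflected here.
```python
def greedy_weighted_set_cover(universe, sets, weights):
    """Classic greedy approximation: repeatedly pick the set with the best
    (weight / newly covered elements) ratio until the universe is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = min(
            (s for s in sets if sets[s] & uncovered),
            key=lambda s: weights[s] / len(sets[s] & uncovered),
        )
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# toy usage: hypothetical genes "covering" tumour samples
samples = {1, 2, 3, 4, 5}
genes = {"TP53": {1, 2, 3}, "KRAS": {3, 4}, "EGFR": {4, 5}, "BRAF": {1, 5}}
print(greedy_weighted_set_cover(samples, genes, {g: 1.0 for g in genes}))
```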
  • Rebuttal How-to: Strategies, Tactics, and the Big Picture in Research
    Yao, Danfeng (Daphne) (ACM, 2023-12-21)
    Rebuttals are not published; thus, it is difficult for junior researchers to read successful rebuttals and improve. This article demystifies rebuttal writing by showing the arm-the-champion strategy and a few key tactics. More importantly, we also discuss the conformity nature of conference reviewing and why researchers should not be defeated by paper rejections.
  • StructCoder: Structure-Aware Transformer for Code Generation
    Tipirneni, Sindhu; Zhu, Ming; Reddy, Chandan (ACM, 2024)
    There has been a recent surge of interest in automating software engineering tasks using deep learning. This paper addresses the problem of code generation, where the goal is to generate target code given source code in a different language or a natural language description. Most state-of-the-art deep learning models for code generation use training strategies primarily designed for natural language. However, understanding and generating code requires a more rigorous comprehension of the code syntax and semantics. With this motivation, we develop an encoder-decoder Transformer model where both the encoder and decoder are explicitly trained to recognize the syntax and data flow in the source and target codes, respectively. We not only make the encoder structure-aware by leveraging the source code's syntax tree and data flow graph, but we also support the decoder in preserving the syntax and data flow of the target code by introducing two novel auxiliary tasks: AST (Abstract Syntax Tree) paths prediction and data flow prediction. To the best of our knowledge, this is the first work to introduce a structure-aware Transformer decoder that models both syntax and data flow to enhance the quality of generated code. The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark, and improves over baselines of similar size on the APPS code generation benchmark. Our code is publicly available at https://github.com/reddy-lab-code-research/StructCoder/.
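One of the auxiliary decoder tasks is predicting AST paths of the target code. As a rough illustration of what such paths look like, the sketch below extracts root-to-leaf node-type paths from a Python snippet using the standard ast module; StructCoder's actual path definition, target languages, and training setup are not reproduced here.
```python
import ast

def ast_paths(source):
    """Collect root-to-leaf node-type paths from a Python snippet (illustrative only)."""
    paths = []

    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]   # extend the path with this node's type
        children = list(ast.iter_child_nodes(node))
        if not children:
            paths.append(prefix)                  # reached a leaf: record the full path
        for child in children:
            walk(child, prefix)

    walk(ast.parse(source), [])
    return paths

for p in ast_paths("x = a + b"):
    print(" > ".join(p))
```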
  • Graph Time-series Modeling in Deep Learning: A Survey
    Chen, Hongjie; Eldardiry, Hoda (ACM, 2024)
    Time-series and graphs have been extensively studied for their ubiquitous existence in numerous domains. Both topics have been separately explored in the field of deep learning. For time-series modeling, recurrent neural networks or convolutional neural networks model the relations between values across time steps, while for graph modeling, graph neural networks model the inter-relations between nodes. Recent research in deep learning requires simultaneous modeling for time-series and graphs when both representations are present. For example, both types of modeling are necessary for time-series classification, regression, and anomaly detection in graphs. This paper aims to provide a comprehensive summary of these models, which we call graph time-series models. To the best of our knowledge, this is the first survey paper that provides a picture of related models from the perspective of deep graph time-series modeling to address a range of time-series tasks, including regression, classification, and anomaly detection. Graph time-series models are split into two categories, a) graph recurrent/convolutional neural networks and b) graph attention neural networks. Under each category, we further categorize models based on their properties. Additionally, we compare representative models and discuss how distinctive model characteristics are utilized with respect to various model components and data challenges. Pointers to commonly used datasets and code are included to facilitate access for further research. In the end, we discuss potential directions for future research.
  • Self-Correlation and Cross-Correlation Learning for Few-Shot Remote Sensing Image Semantic Segmentation
    Wang, Linhan; Lei, Shuo; He, Jianfeng; Wang, Shengkun; Zhang, Min; Lu, Chang-Tien (ACM, 2023-11-13)
    Remote sensing image semantic segmentation is an important problem for remote sensing image interpretation. Although remarkable progress has been achieved, existing deep neural network methods suffer from the reliance on massive training data. Few-shot remote sensing semantic segmentation aims at learning to segment target objects from a query image using only a few annotated support images of the target class. Most existing few-shot learning methods suffer primarily from their sole focus on extracting information from support images, thereby failing to effectively address the large variance in appearance and scales of geographic objects. To tackle these challenges, we propose a Self-Correlation and Cross-Correlation Learning Network for few-shot remote sensing image semantic segmentation. Our model enhances generalization by considering both self-correlation and cross-correlation between support and query images to make segmentation predictions. To further explore the self-correlation with the query image, we propose to adopt a classical spectral method to produce a class-agnostic segmentation mask based on the basic visual information of the image. Extensive experiments on two remote sensing image datasets demonstrate the effectiveness and superiority of our model in few-shot remote sensing image semantic segmentation. The code is available at https://github.com/linhanwang/SCCNet.
  • Spatial Temporal Graph Neural Networks for Decentralized Control of Robot Swarms
    Chen, Siji; Sun, Yanshen; Li, Peihan; Zhou, Lifeng; Lu, Chang-Tien (ACM, 2023-11-13)
    Recent research has explored the use of graph neural networks (GNNs) for decentralized control in swarm robotics. However, it has been observed that relying solely on local states is insufficient to imitate a centralized control policy. To address this limitation, previous studies proposed incorporating K-hop delayed states into the computation. While this approach shows promise, it can lead to a lack of consensus among distant flock members and the formation of small localized groups, ultimately resulting in task failure. Our approach includes the delayed states to build a spatiotemporal GNN model (ST-GNN) through two levels of expansion: spatial expansion and temporal expansion. The spatial expansion utilizes K-hop delayed states to broaden the network, while the temporal expansion can effectively predict the trend of swarm behavior, making it more robust against local noise. To validate the effectiveness of our approach, we conducted simulations in two distinct scenarios: free flocking and flocking with a leader. In both scenarios, the simulation results demonstrated that our decentralized ST-GNN approach successfully overcomes the limitations of local controllers. We performed a comprehensive analysis of the effectiveness of the spatial and temporal expansions independently. The results clearly demonstrate that both significantly improve overall performance. Furthermore, when combined, they achieve the best performance compared to the global solution and the delayed-states solutions. The performance of ST-GNN underscores its potential as an effective and reliable approach for achieving cohesive flocking behavior while ensuring safety and maintaining desired swarm characteristics.
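The spatial expansion described above aggregates information from K-hop neighbors while using states delayed by the corresponding number of steps. A small NumPy sketch of that delayed aggregation is given below; the adjacency powers, the plain summation, and the toy path graph are assumptions, and the learned ST-GNN layers are omitted.
```python
import numpy as np

def k_hop_delayed_aggregate(adj, history, K):
    """Aggregate neighbor features where hop k uses the state delayed by k steps.

    adj     : (n, n) adjacency matrix (0/1)
    history : list of (n, d) state matrices, history[-1] is the current step
    K       : number of hops / delay steps to use (requires len(history) > K)
    """
    out = history[-1].copy()                                # hop 0: each robot's own current state
    hop = np.eye(adj.shape[0])
    for k in range(1, K + 1):
        hop = hop @ adj                                      # nonzero entries: a walk of length k exists
        delayed = history[-1 - k]                            # state as it was k steps ago
        out = out + (hop > 0).astype(float) @ delayed        # accumulate delayed neighborhood info
    return out

# toy usage: 4 robots on a path graph, 2 hops, random 3-d states
rng = np.random.default_rng(1)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
hist = [rng.normal(size=(4, 3)) for _ in range(3)]
print(k_hop_delayed_aggregate(A, hist, K=2).shape)  # (4, 3)
```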
  • Triggering Modes in Spectrum-Based Multi-location Fault Localization
    Dao, Tung; Meng, Na; Nguyen, ThanhVu (ACM, 2023-11-30)
    Spectrum-based fault localization (SBFL) techniques can aid in debugging, but their practicality in industrial settings has been limited due to the large number of tests that need to be executed before applying SBFL. Previous research has explored different trigger modes for SBFL and found that applying it immediately after the first test failure is also effective. However, this study only considered single-location bugs, while multi-location bugs are prevalent in real-world scenarios and especially at our company Cvent, which is interested in integrating SBFL into its CI/CD workflow. In this work, we investigate the effectiveness of SBFL on multi-location bugs and propose a framework called Instant Fault Localization for Multi-location Bugs (IFLM). We compare and evaluate four trigger modes of IFLM using open-source (Defects4J) and closed-source (Cvent) bug datasets. Our study showed that it is not necessary to execute all test cases before applying SBFL. However, we also found that applying SBFL right after the first failed test is less effective than applying it after executing all tests for multi-location bugs, which is contrary to the single-location bug study. We also observe differences in performance between real and artificial bugs. Our contributions include the development of IFLM and the Cvent bug datasets, analysis of SBFL effectiveness for multi-location bugs, and practical implications for integrating SBFL in industrial environments.
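Spectrum-based fault localization ranks program elements by how strongly their execution correlates with failing tests. The sketch below uses the Ochiai formula on toy coverage data as one common example of such a suspiciousness metric; it is not necessarily the formula used in IFLM.
```python
from math import sqrt

def ochiai(coverage, outcomes):
    """coverage: {test: set of executed lines}, outcomes: {test: 'pass' | 'fail'}."""
    total_fail = sum(1 for o in outcomes.values() if o == "fail")
    lines = set().union(*coverage.values())
    scores = {}
    for line in lines:
        ef = sum(1 for t, cov in coverage.items() if line in cov and outcomes[t] == "fail")
        ep = sum(1 for t, cov in coverage.items() if line in cov and outcomes[t] == "pass")
        denom = sqrt(total_fail * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))   # most suspicious first

# toy usage: line 3 is only executed by the failing test, so it ranks highest
cov = {"t1": {1, 2}, "t2": {1, 2, 3}, "t3": {1}}
out = {"t1": "pass", "t2": "fail", "t3": "pass"}
print(ochiai(cov, out))
```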
  • Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics
    Humayun, Ahmad; Kim, Miryung; Gulzar, Muhammad Ali (ACM, 2023-11-30)
    Data-intensive scalable computing (DISC) has become popular due to the increasing demands of analyzing big data. For example, Apache Spark and Hadoop allow developers to write dataflow-based applications with user-defined functions to process data with custom logic. Testing such applications is difficult. (1) These applications often take multiple datasets as input. (2) Unlike in SQL, there is no explicit schema for these datasets and each unstructured (or semi-structured) dataset is segmented and parsed at runtime. (3) Dataflow operators (e.g., join) create implicit co-dependence constraints between the fields of multiple datasets. An efficient and effective testing technique must analyze co-dependence among different regions of multiple datasets at the level of rows and columns and orchestrate input mutations jointly on co-dependent regions. We propose DepFuzz to increase the effectiveness and efficiency of fuzz testing dataflow-based big data applications. The key insight behind DepFuzz is twofold. It keeps track of which code segments operate on which datasets, which rows, and which columns. By analyzing the use of dataflow operators (e.g., join and groupByKey) in tandem with the semantics of UDFs, DepFuzz generates test data that subsequently reach hard-to-reach regions of the application code. In real-world big data applications, DepFuzz finds 3.4× more faults than Jazzer, a state-of-the-art commercial fuzzer for Java bytecode, achieving 29% more statement coverage in half the time. It outperforms prior DISC testing by exposing deeper semantic faults beyond simpler input formatting errors, especially when multiple datasets have complex interactions through dataflow operators.
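A central idea is mutating co-dependent regions of multiple datasets jointly, for example the key columns on both sides of a join. The sketch below illustrates that kind of joint mutation on toy row data; the helper name, the key-column convention, and the single string mutation are illustrative assumptions, not DepFuzz's actual mutation engine.
```python
import random

def co_mutate_join_keys(left_rows, right_rows, key_idx=0, seed=0):
    """Mutate a shared join key in both datasets at once so the mutated rows
    still satisfy the join's co-dependence constraint (illustrative only)."""
    rng = random.Random(seed)
    shared = {r[key_idx] for r in left_rows} & {r[key_idx] for r in right_rows}
    old_key = rng.choice(sorted(shared))
    new_key = old_key + "_fuzz"                    # same mutation applied to both sides

    def rewrite(rows):
        return [[new_key if (i == key_idx and v == old_key) else v
                 for i, v in enumerate(row)] for row in rows]

    return rewrite(left_rows), rewrite(right_rows)

# toy usage: the record pair joined on "u1" still joins after the mutation
left = [["u1", "alice"], ["u2", "bob"]]
right = [["u1", "2024-01-01"], ["u3", "2024-02-02"]]
print(co_mutate_join_keys(left, right))
```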
  • BigDataflow: A Distributed Interprocedural Dataflow Analysis Framework
    Sun, Zewen; Xu, Duanchen; Zhang, Yiyu; Qi, Yun; Wang, Yueyang; Zuo, Zhiqiang; Wang, Zhaokang; Li, Yue; Li, Xuandong; Lu, Qingda; Peng, Wenwen; Guo, Shengjian (ACM, 2023-11-30)
    Apart from forming the backbone of compiler optimization, static dataflow analysis has been widely applied in a vast variety of applications, such as bug detection, privacy analysis, program comprehension, etc. Despite its importance, performing interprocedural dataflow analysis on large-scale programs is well known to be challenging. In this paper, we propose a novel distributed analysis framework supporting general interprocedural dataflow analysis. Inspired by large-scale graph processing, we devise a dedicated distributed worklist algorithm tailored for interprocedural dataflow analysis. We implement the algorithm and develop a distributed framework called BigDataflow running on a large-scale cluster. The experimental results validate the promising performance of BigDataflow – it can finish analyzing programs with millions of lines of code in minutes. Compared with the state-of-the-art, BigDataflow achieves much higher analysis efficiency.
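At the heart of the framework is a worklist algorithm for dataflow analysis. The sketch below shows a generic single-machine, intraprocedural worklist fixpoint in the reaching-definitions style, as a reference point for what is being distributed; the partitioning across a cluster and the interprocedural handling are not modeled.
```python
from collections import deque

def worklist_fixpoint(cfg, gen, kill):
    """Forward may-analysis (reaching-definitions style) via a worklist.

    cfg  : {node: [successor nodes]}
    gen  : {node: set of facts generated at node}
    kill : {node: set of facts killed at node}
    """
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)

    out = {n: set() for n in cfg}
    work = deque(cfg)
    while work:
        n = work.popleft()
        in_n = set().union(*(out[p] for p in preds[n])) if preds[n] else set()
        new_out = gen[n] | (in_n - kill[n])        # transfer function
        if new_out != out[n]:                      # only re-enqueue successors on change
            out[n] = new_out
            work.extend(cfg[n])
    return out

# toy usage: a three-node straight-line CFG
cfg = {"a": ["b"], "b": ["c"], "c": []}
gen = {"a": {"d1"}, "b": {"d2"}, "c": set()}
kill = {"a": set(), "b": {"d1"}, "c": set()}
print(worklist_fixpoint(cfg, gen, kill))
```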
  • FedDefender: Backdoor Attack Defense in Federated Learning
    Gill, Waris; Anwar, Ali; Gulzar, Muhammad Ali (ACM, 2023-12-04)
    Federated Learning (FL) is a privacy-preserving distributed machine learning technique that enables individual clients (e.g., user participants, edge devices, or organizations) to train a model on their local data in a secure environment and then share the trained model with an aggregator to build a global model collaboratively. In this work, we propose FedDefender, a defense mechanism against targeted poisoning attacks in FL by leveraging differential testing. FedDefender first applies differential testing on clients’ models using a synthetic input. Instead of comparing the output (predicted label), which is unavailable for synthetic input, FedDefender fingerprints the neuron activations of clients’ models to identify a potentially malicious client containing a backdoor. We evaluate FedDefender using MNIST and FashionMNIST datasets with 20 and 30 clients, and our results demonstrate that FedDefender effectively mitigates such attacks, reducing the attack success rate (ASR) to 10% without deteriorating the global model performance.
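FedDefender fingerprints the neuron activations that each client's model produces on a synthetic input and flags clients whose fingerprints stand out. The sketch below assumes the fingerprints are already available as plain vectors and uses a simple distance-to-centroid z-score as the outlier rule; the paper's exact fingerprinting and scoring are not reproduced.
```python
import numpy as np

def flag_suspect_clients(fingerprints, z_thresh=2.0):
    """fingerprints: {client_id: 1-D activation vector from the same synthetic input}.
    Flags clients whose fingerprint lies unusually far from the average fingerprint."""
    ids = list(fingerprints)
    mat = np.stack([fingerprints[c] for c in ids])
    centroid = mat.mean(axis=0)
    dists = np.linalg.norm(mat - centroid, axis=1)
    z = (dists - dists.mean()) / (dists.std() + 1e-9)   # standardize the distances
    return [c for c, score in zip(ids, z) if score > z_thresh]

# toy usage: client "c9" carries a very different activation pattern
rng = np.random.default_rng(42)
fps = {f"c{i}": rng.normal(0, 1, 64) for i in range(9)}
fps["c9"] = rng.normal(5, 1, 64)
print(flag_suspect_clients(fps))  # expected to flag 'c9'
```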
  • Arguments for and Approaches to Computing Education in Undergraduate Computer Science Programmes
    Cutts, Quintin; Kallia, Maria; Anderson, Ruth; Crick, Tom; Devlin, Marie; Farghally, Mohammed; Mirolo, Claudio; Runde, Ragnhild Kobro; Seppälä, Otto; Urquiza-Fuentes, Jaime; Vahrenhold, Jan (ACM, 2023-12-22)
    Computing education (CE), the scientific foundation of the teaching and learning of subject matter specific to computing, has matured into a field with its own research journals and conferences as well as graduate programmes. Yet, and unlike other mature subfields of computer science (CS), it is rarely taught as part of undergraduate CS programmes. In this report, we present a gap analysis resulting from semi-structured interviews with various types of stakeholders and derive a set of arguments for teaching CE courses in undergraduate CS programmes. This analysis and the arguments highlight a number of opportunities for the discipline of CS at large, in academia, in industry, and in school education, that would be opened up with undergraduate CE courses, as well as potential barriers to implementation that will need to be overcome. We also report on the results of a Delphi process performed to elicit topics for such a course with various audiences in mind. The Delphi process yielded 19 high-level categories that encompass the subject matter CE courses should incorporate, tailored to the specific needs of their intended student audiences. This outcome underscores the extensive range of content that can be integrated into a comprehensive CE programme. Based on these two stakeholder interactions as well as a systematic literature review aiming to explore the current practices in teaching CE to undergraduate students, we develop two prototypical outlines of such a course, keeping in mind that departments may have different preferences and affordances resulting in different kinds of CE offerings. Overall, input from external stakeholders underscores the clear significance of undergraduate CE courses. We anticipate leveraging this valuable feedback to actively promote these courses on a broader scale.
  • Modeling Women's Elective Choices in Computing
    Bradley, Steven; Parker, Miranda; Altin, Rukiye; Barker, Lecia; Hooshangi, Sara; Kunkeler, Thom; Lennon, Ruth; McNeill, Fiona; Minguillón, Julià; Parkinson, Jack; Peltsverger, Svetlana; Sibia, Naaz (ACM, 2023-12-22)
    Evidence-based strategies suggest ways to reduce the gender gap in computing. For example, elective classes are valuable in enabling students to choose in which directions to expand their computing knowledge in areas aligned with their interests. The availability of electives of interest may also make computing programs of study more meaningful to women. However, research on which elective computing topics are more appealing to women is often class or institution specific. In this study, we investigate differences in enrollment within undergraduate-level elective classes in computing to study differences between women and men. The study combined data from nine institutions from both Western Europe and North America and included 272 different classes with 49,710 student enrollments. These classes were encoded using ACM curriculum guidelines and combined with the enrollment data to build a hierarchical statistical model of factors affecting student choice. Our model shows which elective topics are less popular with all students (including fundamentals of programming languages and parallel and distributed computing), and which elective topics are more popular with women students (including mathematical and statistical foundations, human computer interaction and society, ethics, and professionalism). Understanding which classes appeal to different students can help departments gain insight of student choices and develop programs accordingly. Additionally, these choices can also help departments explore whether some students are less likely to choose certain classes than others, indicating potential barriers to participation in computing.
  • A First Look at Toxicity Injection Attacks on Open-domain Chatbots
    Weeks, Connor; Cheruvu, Aravind; Abdullah, Sifat Muhammad; Kanchi, Shravya; Yao, Daphne; Viswanath, Bimal (ACM, 2023-12-04)
    Chatbot systems have improved significantly because of the advances made in language modeling. These machine learning systems follow an end-to-end data-driven learning paradigm and are trained on large conversational datasets. Imperfections or harmful biases in the training datasets can cause the models to learn toxic behavior, and thereby expose their users to harmful responses. Prior work has focused on measuring the inherent toxicity of such chatbots, by devising queries that are more likely to produce toxic responses. In this work, we ask the question: How easy or hard is it to inject toxicity into a chatbot after deployment? We study this in a practical scenario known as Dialog-based Learning (DBL), where a chatbot is periodically trained on recent conversations with its users after deployment. A DBL setting can be exploited to poison the training dataset for each training cycle. Our attacks would allow an adversary to manipulate the degree of toxicity in a model and also enable control over what type of queries can trigger a toxic response. Our fully automated attacks only require LLM-based software agents masquerading as (malicious) users to inject high levels of toxicity. We systematically explore the vulnerability of popular chatbot pipelines to this threat. Lastly, we show that several existing toxicity mitigation strategies (designed for chatbots) can be significantly weakened by adaptive attackers.
  • Delegation of TLS Authentication to CDNs using Revocable Delegated Credentials
    Yoon, Daegeun; Chung, Taejoong; Kim, Yongdae (ACM, 2023-12-04)
    When using a Content Delivery Network (CDN), domain owners typically delegate Transport Layer Security (TLS) authentication to the CDN by sharing their TLS certificate’s private key. However, this practice not only delegates TLS authentication but also grants the CDN complete control over the certificate. To mitigate these concerns, Delegated Credential (DC) was proposed as a solution; DC, which contains both the CDN’s public key and the domain owner’s signature, allows the domain owners to delegate their own credentials for TLS authentication, thereby avoiding the need to share their private keys. However, the absence of a mechanism to distribute the revocation status of a DC renders it non-revocable, even when a compromise of a credential has been detected. DCs were thus designed to be short-lived, necessitating frequent renewal for continued use. To overcome this limitation, we designed Revocable Delegated Credential (RDC), which provides a revocation method for DCs. With RDCs, there is no need for frequent renewals as they can be revoked, allowing for a longer validity period. The revocation status of RDCs is distributed via DNS, an essential component of web communication. RDCs utilize the NSEC record, a type of DNSSEC record, as a means to store, validate, and easily manage their revocation status. When domain owners no longer trust their CDNs or detect compromise in their RDCs, they can distribute the RDC’s revocation status by simply creating a subdomain named with an RDC identifier. The browser then confirms the existence of this subdomain using the NSEC record to validate the revocation status. We implemented RDC in the Go tls package and Firefox Nightly to demonstrate and evaluate its feasibility.
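On the client side, checking revocation reduces to asking whether a subdomain named after the credential exists. The sketch below shows that lookup with dnspython; the <rdc-id>.<domain> label layout and the TXT query are assumptions, and unlike the real design it does not validate the DNSSEC NSEC proof of (non-)existence.
```python
# pip install dnspython
import dns.resolver

def rdc_revoked(rdc_id: str, domain: str) -> bool:
    """Return True if a revocation subdomain for this credential exists.

    Illustrative check only: the real scheme authenticates the (non-)existence
    proof with DNSSEC NSEC records rather than trusting a plain resolver answer.
    """
    name = f"{rdc_id}.{domain}"            # assumed label layout: <credential-id>.<domain>
    try:
        dns.resolver.resolve(name, "TXT")  # any answer means the owner published a revocation
        return True
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False

# usage sketch: reject the credential if it has been revoked
# if rdc_revoked("a1b2c3", "example.com"): abort_handshake()
```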
  • An End-to-End High-performance Deduplication Scheme for Docker Registries and Docker Container Storage Systems
    Zhao, Nannan; Lin, Muhui; Albahar, Hadeel; Paul, Arnab K.; Huan, Zhijie; Abraham, Subil; Chen, Keren; Tarasov, Vasily; Skourtis, Dimitrios; Anwar, Ali; Butt, Ali R. (ACM, 2024)
    The wide adoption of Docker containers for supporting agile and elastic enterprise applications has led to a broad proliferation of container images. The associated storage performance and capacity requirements place high pressure on the infrastructure of container registries that store and distribute images and container storage systems on the Docker client side that manage image layers and store ephemeral data generated at container runtime. The storage demand is worsened by the large amount of duplicate data in images. Moreover, container storage systems that use Copy-on-Write (CoW) file systems as storage drivers exacerbate the redundancy. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the growing storage requirements of container registries and improve the space efficiency of container storage systems. However, existing deduplication techniques significantly degrade the performance of both registries and container storage systems because of data reconstruction overhead as well as the deduplication cost. We propose DupHunter, an end-to-end deduplication scheme that deduplicates layers for both Docker registries and container storage systems while maintaining a high image distribution speed and container I/O performance. DupHunter is divided into three tiers: a Docker registry tier, a middle tier, and a client tier. Specifically, we first build a high-performance deduplication engine at the Docker registry tier that not only natively deduplicates layers for space savings but also reduces layer restore overhead. Then, we use deduplication offloading at the middle tier that utilizes the deduplication engine to eliminate the redundant files from the client tier, which avoids introducing deduplication overhead to the Docker client side. To further reduce the data duplicates caused by CoW and improve the container I/O performance, we use a container-aware backing file system at the client tier that preallocates space for each container and ensures that files in a container and its modifications are placed and redirected closer on the disk to maintain locality. Under real workloads, DupHunter reduces storage space by up to 6.9× and reduces the GET layer latency by up to 2.8× compared to the state-of-the-art. Moreover, DupHunter can improve the container I/O performance by up to 93% for reads and 64% for writes.
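The space savings ultimately come from storing each unique file only once across image layers. The sketch below shows plain file-level, content-hash deduplication over extracted layer directories; the SHA-256 keying, directory layout, and flat blob store are assumptions, and DupHunter's three-tier design and layer-restore path are not modeled.
```python
import hashlib
import os
import shutil

def dedupe_layers(layer_dirs, store_dir):
    """Copy each unique file (by SHA-256) into store_dir once and return a
    manifest mapping every original path to its content hash."""
    os.makedirs(store_dir, exist_ok=True)
    manifest = {}
    for layer in layer_dirs:
        for root, _, files in os.walk(layer):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                target = os.path.join(store_dir, digest)
                if not os.path.exists(target):      # first copy wins; duplicates are skipped
                    shutil.copyfile(path, target)
                manifest[path] = digest
    return manifest

# usage sketch:
# manifest = dedupe_layers(["/tmp/layers/a", "/tmp/layers/b"], "/tmp/dedup-store")
# print(len(manifest), "files,", len(set(manifest.values())), "unique blobs")
```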