Browsing by Author "Butt, Ali"
Now showing 1 - 20 of 22
- Analyzing Networks with Hypergraphs: Detection, Classification, and Prediction. Alkulaib, Lulwah Ahmad KH M. (Virginia Tech, 2024-04-02). Recent advances in large graph-based models have shown great performance in a variety of tasks, including node classification, link prediction, and influence modeling. However, these graph-based models struggle to capture high-order relations and interactions among entities effectively, leading them to underperform in many real-world scenarios. This thesis focuses on analyzing networks using hypergraphs for detection, classification, and prediction methods in social media-related problems. In particular, we study five specific applications with five proposed novel methods: detecting topic-specific influential users and tweets via hypergraphs; detecting spatiotemporal, topic-specific, influential users and tweets using hypergraphs; augmenting data in hypergraphs to mitigate class imbalance issues; introducing a novel hypergraph convolutional network model designed for the multiclass classification of mental health advice in Arabic tweets; and adapting the hypergraph model to sarcasm detection in multiple low-resource languages. For the first method, existing solutions for influential user detection did not consider topics, which could produce incorrect results and inadequate performance on that task. The proposed contributions of our work include: 1) Developing a hypergraph framework that detects influential users and tweets. 2) Proposing an effective topic modeling method for short texts. 3) Performing extensive experiments to demonstrate the efficacy of our proposed framework. For the second method, we extend the first method by incorporating spatiotemporal information into our solution. Existing influencer detection methods do not consider spatiotemporal influencers in social media, although influence can be greatly affected by geolocation and time. The contributions of our work for this task include: 1) Proposing a hypergraph framework that spatiotemporally detects influential users and tweets. 2) Developing an effective topic modeling method for short texts that geographically provides the topic distribution. 3) Designing a spatiotemporal topic-specific influencer user ranking algorithm. 4) Performing extensive experiments to demonstrate the efficacy of our proposed framework. For the third method, we address the challenge of bot detection on the social media platform X, where there is an inherent imbalance between genuine users and bots, a key factor leading to biased classifiers. Our approach leverages the rich structure of hypergraphs to represent X users and their interactions, providing a novel foundation for effective bot detection. The contributions of our work include: 1) Introducing a hypergraph representation of the X platform, where user accounts are nodes and their interactions form hyperedges, capturing the intricate relationships between users. 2) Developing HyperSMOTE to generate synthetic bot accounts within the hypergraph, ensuring a balanced training dataset while preserving the hypergraph's structure and semantics. 3) Designing a hypergraph neural network specifically for bot detection, utilizing node and hyperedge information for accurate classification. 4) Conducting comprehensive experiments to validate the effectiveness of our methods, particularly in scenarios with pronounced class imbalances. For the fourth method, we introduce a Hypergraph Convolutional Network model for classifying mental health advice in Arabic tweets.
Our model distinguishes between valid and misleading advice, leveraging high-order word relations in short texts through hypergraph structures. Our extensive experiments demonstrate its effectiveness over existing methods. The key contributions of our work include: 1) Developing a hypergraph-based model for short text multiclass classification, capturing complex word relationships through hypergraph convolution. 2) Defining four types of hyperedges to encapsulate local and global contexts and semantic similarities in our dataset. 3) Conducting comprehensive experiments in which the proposed model outperforms several baseline models in classifying Arabic tweets, demonstrating its superiority. For the fifth method, we extend our previous Hypergraph Convolutional Network (HCN) model, tailoring it to sarcasm detection across multiple low-resource languages. Our model excels in interpreting the subtle and context-dependent nature of sarcasm in short texts by exploiting the power of hypergraph structures to capture complex, high-order relationships among words. Through the construction of three hyperedge types, our model navigates the intricate semantic and sentiment differences that characterize sarcastic expressions. The key contributions of our research are as follows: 1) Adapting a hypergraph-based model to sarcasm detection in short texts across five low-resource languages, allowing the model to capture semantic relationships and contextual cues through advanced hypergraph convolution techniques. 2) Introducing a comprehensive framework for constructing hyperedges, incorporating short text, semantic similarity, and sentiment discrepancy hyperedges, which together enrich the model's ability to understand and detect sarcasm across diverse linguistic contexts. 3) Demonstrating through extensive evaluations that the proposed hypergraph model significantly outperforms a range of established baseline methods in multilingual sarcasm detection, establishing new benchmarks for accuracy and generalizability in detecting sarcasm within low-resource languages.
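All of the methods in this entry build on hypergraph convolution, which propagates node features through shared hyperedges rather than pairwise edges. The following minimal NumPy sketch illustrates one generic hypergraph convolution layer of the form X' = Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2) X Theta; the toy incidence matrix, feature sizes, and fixed weights are illustrative assumptions, not the models or datasets evaluated in the thesis.

```python
import numpy as np

np.random.seed(0)

# Toy hypergraph: 5 nodes (e.g., users) and 3 hyperedges (e.g., users sharing a topic).
# H[i, j] = 1 if node i participates in hyperedge j.
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [1, 0, 1]], dtype=float)

X = np.random.rand(5, 8)       # node features (e.g., text embeddings)
W = np.eye(3)                  # hyperedge weights (identity = unweighted)
Theta = np.random.rand(8, 4)   # projection weights (fixed here; learned in practice)

dv = H @ np.diag(W)            # node degrees, weighted by hyperedge weights
de = H.sum(axis=0)             # hyperedge degrees
Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
De_inv = np.diag(1.0 / de)

# One hypergraph convolution layer followed by a ReLU nonlinearity.
X_next = Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt @ X @ Theta
X_next = np.maximum(X_next, 0.0)
print(X_next.shape)            # (5, 4): updated node embeddings
```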
- Circuit Support for Practical and Performant Batteryless Systems. Williams, Harrison Ridgway (Virginia Tech, 2024-06-03). Tiny, ultra-low-power embedded processors enable sophisticated computing deployments in a myriad of areas previously off limits to computing power, ranging from intelligent medical implants to massive scale 'smart dust'-type sensing deployments. While today's computing and sensing hardware is well-suited for these next generation deployments, the batteries powering them are not: the size and weight of today's mobile and Internet-of-Things devices are dominated by their batteries, which also limit systems' lifespans and potential for deployment in sensitive contexts. Academic efforts have demonstrated the feasibility of harvesting energy on-demand from the environment as a practical alternative to classical battery power, instead buffering harvested energy in a capacitor to power intermittent bursts of operation. Energy harvesting circuits are miniaturizable, inexpensive, and enable effectively indefinite operation when compared to batteries---but introduce new problems stemming from the lack of a reliable power source. Unfortunately, these problems have so far confined batteryless systems to small-scale research deployments. The central design challenge for effective batteryless operation is efficiently using scarce input power from the energy harvesting frontend. Despite advances in both harvester and processor efficiency, digital systems often consume orders of magnitude more power than can be supplied by harvesting circuits---forcing systems to operate in short bursts punctuated by power failure and a long recharge period. Today's batteryless systems pay a steep price to sustain operation across these common-case power losses: current platforms depend on high-performance non-volatile memory to quickly and efficiently checkpoint program state before power loss, limiting batteryless operation to a small selection of devices which integrate these novel memory technologies. Choosing exactly when to checkpoint to non-volatile memory represents a challenge in itself: the hardware required to detect impending power failure often represents a large proportion of the system's overall energy consumption, forcing designers to choose between the energy overhead of voltage monitoring or the runtime overhead of 'energy-oblivious' checkpointing models. Finally, the choice of buffer capacitor size has a large impact on overall energy efficiency---but the optimal choice depends on runtime energy dynamics which are difficult to predict at design time, leaving designers to make at best educated guesses about future environmental conditions. This work approaches energy harvesting system design from a circuits perspective, answering the following research questions towards practical and performant batteryless operation: 1. Can the emergent properties of today's low-power systems be used to enable efficient intermittent operation on new classes of devices? 2. What compromises can we make in voltage monitor design to minimize power consumption while maintaining just enough functionality for batteryless operation? 3. How can we buffer harvested energy in a way that maximizes energy efficiency despite unpredictable system-level power dynamics? This work answers these questions by producing the following research artifacts: 1. The first non-volatile memory invariant system to enable intermittent operation on embedded devices lacking high-performance memory (Chapter 2). 2.
The first voltage monitoring circuit designed for batteryless systems to enable energy-aware operation without sacrificing efficiency (Chapter 3). 3. The first highly efficient power-adaptive energy buffer to store harvested energy without compromising on efficiency or performance (Chapter 4).
- Compiler Support for Long-life, Low-overhead Intermittent Computation on Energy Harvesting Flash-based Devices. Ahmad, Saim (Virginia Tech, 2021-05-19). With the advent of energy harvesters, supporting fast and efficient computation on energy harvesting devices has become a key challenge in the field of energy harvesting on ubiquitous devices. Computation on energy harvesting devices amounts to spreading the execution of a long-running application over short, frequent cycles of power. However, we must ensure that intermittently executing an application produces results congruent to those produced by executing the application on a device with a continuous source of power. The current state-of-the-art systems that enable intermittent computation on energy harvesters make use of novel compiler analysis techniques as well as on-board hardware to measure the energy remaining for useful computation. However, currently available programming models, which mostly target devices with FRAM as the NVM, would cause failure on devices that employ Flash as the primary NVM, thereby resulting in a non-universal solution that is restricted by the choice of NVM. This is primarily the result of Flash's limited read/write endurance. This research aims to contribute to the world of energy harvesting devices by providing solutions that enable intermittent computation regardless of the choice of NVM on a device, by utilizing only the SRAM to save state and perform computation. Utilizing the SRAM further reduces run-time overhead, as SRAM reads/writes are less costly than NVM reads/writes. Our proposed solutions rely on programmer guidance and compiler analysis to enable correct and efficient intermittent computation. We then extend our system to provide a complete compiler-based solution without programmer intervention. Our system is able to run applications that would otherwise render any device with Flash as NVM useless in a matter of hours.
- A Deep Learning Approach to Side-Channel Analysis of Cryptographic Hardware. Ramezanpour, Keyvan (Virginia Tech, 2020-09-08). With increased growth of the Internet of Things (IoT) and physical exposure of devices to adversaries, a class of physical attacks called side-channel analysis (SCA) has emerged which compromises the security of systems. While security claims of cryptographic algorithms are based on the complexity of classical cryptanalysis attacks, they exclude information leakage by implementations on hardware platforms. Recent standardization processes require assessment of hardware security against SCA. In this dissertation, we study SCA based on deep learning techniques (DL-SCA) as a universal analysis toolbox for assessing the leakage of secret information by hardware implementations. We demonstrate that DL-SCA techniques provide a trade-off between the amount of prior knowledge of a hardware implementation and the amount of measurements required to identify the secret key. A DL-SCA based on supervised learning requires a training set, including information about the details of the hardware implementation, for a successful attack. Supervised learning has been widely used in power analysis (PA) to recover the secret key with a limited size of measurements. We demonstrate a similar trend in fault injection analysis (FIA) by introducing fault intensity map analysis with a neural network key distinguisher (FIMA-NN). We use dynamic timing simulations on an ASIC implementation of AES to develop a statistical model for biased fault injection. We employ the model to train a convolutional neural network (CNN) key distinguisher that achieves a superior efficiency, nearly 10x, compared to classical FIA techniques. When a priori knowledge of the details of hardware implementations is limited, we propose DL-SCA techniques based on unsupervised learning, called SCAUL, to extract the secret information from measurements without requiring a training set. We further demonstrate the application of reinforcement learning by introducing the SCARL attack, to estimate a proper model for the leakage of secret data in a self-supervised approach. We demonstrate the success of SCAUL and SCARL attacks using power measurements from FPGA implementations of the AES and Ascon authenticated ciphers, respectively, to recover entire 128-bit secret keys without using any prior knowledge or training data.
- Designing RDMA-based efficient Communication for GPU Remoting. Bhandare, Shreya Amit (Virginia Tech, 2023-08-24). The use of General Purpose Graphics Processing Units (GPGPUs) has become crucial for accelerating high-performance applications. However, the procurement, setup, and maintenance of GPUs can be costly, and their continuous energy consumption poses additional challenges. Moreover, many applications exhibit suboptimal GPU utilization. To address these concerns, GPU virtualization techniques have been proposed. Among them, GPU Remoting stands out as a promising technology that enables applications to transparently harness the computational capabilities of GPUs remotely. GVirtuS, a GPU Remoting software, facilitates transparent and hypervisor-independent access to GPGPUs within virtual machines. This research focuses on the middleware communication layer implemented in GVirtuS and presents a comprehensive redesign that leverages the power of Remote Direct Memory Access (RDMA) technology. Experimental evaluations, conducted using a matrix multiplication application, demonstrate that the newly proposed protocol achieves approximately 50% reduced execution time for data sizes ranging from 1 to 16 MB, and around 12% reduced execution time for sizes ranging from 500 MB up to 1 GB. These findings highlight the significant performance improvements attained through the redesign of the communication layer in GVirtuS, showcasing its potential for enhancing GPU Remoting efficiency.
- Detecting Persistence Bugs from Non-volatile Memory Programs by Inferring Likely-correctness Conditions. Fu, Xinwei (Virginia Tech, 2022-03-10). Non-volatile main memory (NVM) technologies are revolutionizing the entire computing stack thanks to their storage-and-memory-like characteristics. The ability to persist data in memory provides a new opportunity to build crash-consistent software without paying a storage stack I/O overhead. A crash-consistent NVM program can recover back to a consistent state from a persistent NVM in the event of a software crash or a sudden power loss. In the presence of a volatile cache, data held in a volatile cache is lost after a crash. So NVM programming requires users to manually control the durability and the persistence ordering of NVM writes. To avoid performance overhead, developers have devised customized persistence mechanisms to enforce proper persistence ordering and atomicity guarantees, rendering NVM programs error-prone. The problem statement of this dissertation is how one can effectively detect persistence bugs from NVM programs. However, detecting persistence bugs in NVM programs is challenging because of the huge test space and the manual consistency validation required. The thesis of this dissertation is that we can detect persistence bugs from NVM programs in a scalable and automatic manner by inferring likely-correctness conditions from programs. A likely-correctness condition is a possible correctness condition, which is a condition a program must maintain to make the program crash-consistent. This dissertation proposes to infer two forms of likely-correctness conditions from NVM programs to detect persistence bugs. The first proposed solution is to infer likely-ordering and likely-atomicity conditions by analyzing program dependencies among NVM accesses. The second proposed solution is to infer likely-linearization points to understand a program's operation-level behavior. Using these two forms of likely-correctness conditions, we test only those NVM states and thread interleavings that violate the likely-correctness conditions. This significantly reduces the test space that must be examined. We then leverage the durable linearizability model to validate consistency automatically without manual consistency validation. In this way, we can detect persistence bugs from NVM programs in a scalable and automatic manner. In total, we detect 47 (36 new) persistence correctness bugs and 158 (113 new) persistence performance bugs from 20 single-threaded NVM programs. Additionally, we detect 27 (15 new) persistence correctness bugs from 12 multi-threaded NVM data structures.
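To illustrate why the crash-state test space explodes and what an ordering violation looks like, here is a deliberately simplified Python sketch: stores become durable only after a flush and a fence, yet any not-yet-fenced store may also persist at a crash, so a recovery invariant (here, "if valid is persisted, data must be too") must hold in every reachable crash state. The cache model, trace format, and invariant are toy assumptions for illustration, not the dissertation's testing tool.

```python
from itertools import combinations

# Model: stores are buffered in a volatile cache and become durable only after a
# flush plus fence; any store not yet fenced may *also* reach NVM at any moment
# (cache eviction), so a crash exposes the durable image plus an arbitrary subset
# of the pending stores.

def crash_states(durable, pending):
    """Yield every NVM image reachable if the program crashed right now."""
    pend = list(pending.items())
    for r in range(len(pend) + 1):
        for subset in combinations(pend, r):
            img = dict(durable)
            img.update(subset)
            yield img

# Buggy pattern: 'valid' is set before 'data' is guaranteed durable.
trace = [("store", "data", 42), ("store", "valid", 1),
         ("flush", "data"), ("flush", "valid"), ("fence",)]

durable, pending, flushed, violations = {}, {}, set(), []
for op in trace:
    if op[0] == "store":
        pending[op[1]] = op[2]
    elif op[0] == "flush":
        flushed.add(op[1])
    elif op[0] == "fence":
        for addr in list(flushed):
            if addr in pending:
                durable[addr] = pending.pop(addr)
        flushed.clear()
    # Likely-correctness condition to check: valid == 1 implies data is persisted.
    for img in crash_states(durable, pending):
        if img.get("valid") == 1 and "data" not in img:
            violations.append(img)

print("ordering violation possible:", len(violations) > 0)   # True for this trace
```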
- EdgeFn: A Lightweight Customizable Data Store for Serverless Edge Computing. Paidiparthy, Manoj Prabhakar (Virginia Tech, 2023-06-01). Serverless Edge Computing is an extension of the serverless computing paradigm that enables the deployment and execution of modular software functions on resource-constrained edge devices. However, it poses several challenges due to the edge network's dynamic nature and serverless applications' latency constraints. In this work, we introduce EdgeFn, a lightweight distributed data store for serverless edge computing systems. While serverless computing platforms simplify the development and automated management of software functions, running serverless applications reliably on resource-constrained edge devices poses multiple challenges. These challenges include a lack of flexibility, minimal control over management policies, high data shipping, and cold start latencies. EdgeFn addresses these challenges by providing distributed data storage for serverless applications and allows users to define custom policies that affect the life cycle of serverless functions and their objects. First, we study the challenges of existing serverless systems in adapting to the edge environment. Second, we propose a distributed data store on top of a Distributed Hash Table (DHT) based Peer-to-Peer (P2P) overlay, which achieves data locality by co-locating a function and its data. Third, we implement programmable callbacks for storage operations, which users can leverage to define custom policies for their applications. We also describe several use cases that can be built using these callbacks. Finally, we evaluate EdgeFn's scalability and performance using industry-generated trace workloads and real-world edge applications.
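The following sketch suggests what programmable storage callbacks and DHT-style placement might look like from a user's perspective; the class, callback names, and placement scheme are hypothetical illustrations, not EdgeFn's actual API.

```python
import hashlib
import time

class EdgeStore:
    """Per-node object store with user-defined callbacks (hypothetical interface)."""

    def __init__(self, node_id, peers, on_put=None, on_get=None):
        self.node_id, self.peers = node_id, peers
        self.data = {}                      # local object table
        self.on_put, self.on_get = on_put, on_get

    @staticmethod
    def _h(s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

    def owner(self, key):
        """DHT-style placement: the peer whose hash is closest to the key's hash."""
        kh = self._h(key)
        return min(self.peers, key=lambda p: abs(self._h(p) - kh))

    def put(self, key, value):
        record = {"value": value, "ts": time.time(), "home": self.owner(key)}
        self.data[key] = record
        if self.on_put:
            self.on_put(key, record)        # custom policy: replicate, expire, pin, ...

    def get(self, key):
        record = self.data.get(key)
        if record and self.on_get:
            self.on_get(key, record)        # custom policy: track hotness, prefetch, ...
        return record["value"] if record else None

# Example policy: remember which objects are hot so they stay co-located with the function.
hot = set()
store = EdgeStore("edge-7", ["edge-1", "edge-7", "edge-9"],
                  on_get=lambda k, r: hot.add(k))
store.put("frame:42", b"jpeg bytes")
store.get("frame:42")
print(hot)                                  # {'frame:42'}
```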
- Energy Efficient Deep Spiking Recurrent Neural Networks: A Reservoir Computing-Based Approach. Hamedani, Kian (Virginia Tech, 2020-06-18). Recurrent neural networks (RNNs) have been widely used for supervised pattern recognition and exploring the underlying spatio-temporal correlation. However, due to the vanishing/exploding gradient problem, training a fully connected RNN is in many cases very difficult or even impossible. The difficulties of training traditional RNNs led us to reservoir computing (RC), which has recently attracted a lot of attention due to its simple training methods and fixed weights at its recurrent layer. There are three different categories of RC systems, namely echo state networks (ESNs), liquid state machines (LSMs), and delayed feedback reservoirs (DFRs). In this dissertation, a novel structure of RNNs inspired by dynamic delayed feedback loops is introduced. In the reservoir (recurrent) layer of a DFR, only one neuron is required, which makes DFRs extremely suitable for hardware implementations. The main motivation of this dissertation is to introduce an energy efficient, easy-to-train RNN that achieves high performance on different tasks compared to the state of the art. To improve the energy efficiency of our model, we propose to adopt spiking neurons as the information processing unit of the DFR. Spiking neural networks (SNNs) are the most biologically plausible and energy efficient class of artificial neural networks (ANNs). Traditional analog ANNs have only marginal similarity with brain-like information processing. It is clear that biological neurons communicate through spikes; therefore, artificial SNNs have been introduced to mimic biological neurons. On the other hand, hardware implementations of SNNs have been shown to be extremely energy efficient. Towards achieving this overarching goal, this dissertation presents a spiking DFR (SDFR) with novel encoding schemes and defense mechanisms against adversarial attacks. To verify the effectiveness and performance of the SDFR, it is adopted in three different applications where significant spatio-temporal correlations exist. These three applications are attack detection in smart grids, spectrum sensing of multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) Dynamic Spectrum Sharing (DSS) systems, and video-based face recognition. In this dissertation, the performance of the SDFR is first verified on cyber attack detection in smart grids. Smart grids are a new generation of power grids that guarantee more reliable and efficient transmission and delivery of power to customers. More reliable and efficient power generation and distribution can be realized through the integration of internet, telecommunication, and energy technologies. The convergence of different technologies brings up opportunities, but challenges are also inevitable. One of the major challenges that poses a threat to smart grids is cyber-attacks. A novel method is developed to detect false data injection (FDI) attacks in smart grids. The second novel application of the SDFR is spectrum sensing of MIMO-OFDM DSS systems. DSS is being implemented in the fifth generation of wireless communication systems (5G) to improve spectrum efficiency. In a MIMO-OFDM system, not all subcarriers are utilized simultaneously by the primary user (PU). Therefore, it is essential to sense the idle frequency bands and assign them to the secondary user (SU).
The effectiveness of the SDFR in capturing the spatio-temporal correlation of MIMO-OFDM time-series and predicting the availability of frequency bands in future time slots is studied as well. In the third application, the SDFR is modified to be adopted in video-based face recognition. In this task, the SDFR is leveraged to recognize the identities of different subjects while they rotate their heads at different angles. Another contribution of this dissertation is a novel encoding scheme for spiking neurons inspired by cognitive studies of rats. For the first time, the multiplexing of multiple neural codes is introduced, and it is shown that the robustness and resilience of the spiking neurons are increased against noisy data and adversarial attacks, respectively. Adversarial attacks are small and imperceptible perturbations of the input data which have been shown to be able to fool deep learning (DL) models. So far, many adversarial attack and defense mechanisms have been introduced for DL models. Compromising the security and reliability of artificial intelligence (AI) systems is a major concern of government, industry, and cyber-security researchers, in that insufficient protections can compromise the security and privacy of everyone in society. Finally, a defense mechanism to protect spiking neurons against adversarial attacks is introduced for the first time. In a nutshell, this dissertation presents a novel energy efficient deep spiking recurrent neural network inspired by delayed dynamic loops. The effectiveness of the introduced model is verified in several different applications. At the end, novel encoding and defense mechanisms are introduced which improve the robustness of the model against noise and adversarial attacks.
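As a rough illustration of the delayed feedback reservoir idea (a single nonlinear node whose delay line provides the "virtual" recurrent neurons), the sketch below trains only a linear readout on the reservoir states. The nonlinearity, scaling constants, and task are illustrative assumptions; the spiking-neuron variant and encoding schemes of the dissertation are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Delayed feedback reservoir: one nonlinear node plus a delay line whose taps act
# as N "virtual" neurons; each input sample is time-multiplexed over the taps.
N = 50                       # virtual nodes per input sample
eta, gamma = 0.5, 0.05       # feedback and input scaling (illustrative values)
mask = rng.uniform(-1.0, 1.0, N)

def dfr_states(u):
    """Map a 1-D input sequence to reservoir states of shape (len(u), N)."""
    delay = np.zeros(N)
    states = np.empty((len(u), N))
    for t, x in enumerate(u):
        for i in range(N):
            # Single node driven by its own delayed output and the masked input.
            delay[i] = np.tanh(eta * delay[i - 1] + gamma * mask[i] * x)
        states[t] = delay
    return states

# Tiny demo: ridge-regression readout that predicts the next sample of a sine wave.
u = np.sin(np.linspace(0.0, 20.0 * np.pi, 400))
S = dfr_states(u[:-1])
w = np.linalg.solve(S.T @ S + 1e-3 * np.eye(N), S.T @ u[1:])
print("train MSE:", float(np.mean((S @ w - u[1:]) ** 2)))
```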
- Exploring the Boundaries of Operating System in the Era of Ultra-fast Storage Technologies. Ramanathan, Madhava Krishnan (Virginia Tech, 2023-05-24). Storage hardware is evolving at a rapid pace to keep up with the exponential rise in data consumption. Recently, ultra-fast storage technologies such as nanosecond-scale byte-addressable Non-Volatile Memory (NVM) and microsecond-scale SSDs are being commercialized. However, the OS storage stack has not been evolving fast enough to keep up with this new ultra-fast storage hardware. Hence, the latency due to user-kernel context switches caused by system calls and hardware interrupts is no longer negligible, as presumed in the era of slower, high-latency hard disks. Further, the OS storage stack is not designed with multi-core scalability in mind; with CPU core counts continuously increasing, the OS storage stack, particularly the Virtual Filesystem (VFS) and filesystem layers, is increasingly becoming a scalability bottleneck. Applications bypass the kernel (kernel-bypass storage stack) completely to prevent the storage stack from becoming a performance and scalability bottleneck. But this comes at the cost of programmability, isolation, safety, and reliability. Moreover, scalability bottlenecks in the filesystem cannot be addressed by simply moving the filesystem to userspace. Overall, while designing a kernel-bypass storage stack looks obvious and promising, there are several critical challenges in the aspects of programmability, performance, scalability, safety, and reliability that need to be addressed to bypass the traditional OS storage stack. This thesis proposes a series of kernel-bypass storage techniques designed particularly for fast memory-centric storage. First, this thesis proposes a scalable persistent transactional memory (PTM) programming model to address the programmability and multi-core scalability challenges. Next, this thesis proposes techniques to make the PTM memory-safe and fault tolerant. Further, this thesis also proposes a kernel-bypass programming framework to port legacy DRAM-based in-memory database applications to run on persistent memory-centric storage. Finally, this thesis explores an application-driven approach to address the CPU-side and storage-side bottlenecks in deep learning model training by proposing a kernel-bypass programming framework to move compute closer to the storage. Overall, the techniques proposed in this thesis will be a strong foundation for applications to adopt and exploit emerging ultra-fast storage technologies without being bottlenecked by the traditional OS storage stack.
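The kind of failure atomicity a persistent transactional memory provides can be illustrated with a small undo-logging sketch: the old values are made durable before the new values are applied, so recovery can always roll back a half-applied transaction. File writes with fsync stand in for persistent-memory flushes and fences here, and the functions shown are conceptual, not the PTM interface proposed in the thesis.

```python
import json
import os

LOG = "tx.undo"

def persist(path, text):
    """Write and force to durable media (stands in for a pmem flush + fence)."""
    with open(path, "w") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())

def tx_update(store_path, updates):
    """Failure-atomic multi-key update via undo logging."""
    store = json.load(open(store_path)) if os.path.exists(store_path) else {}
    undo = {k: store.get(k) for k in updates}       # old values for rollback
    persist(LOG, json.dumps(undo))                  # 1. undo log must be durable first
    store.update(updates)
    persist(store_path, json.dumps(store))          # 2. then apply the new values
    os.remove(LOG)                                  # 3. commit = discard the log

def recover(store_path):
    """Roll back any half-applied transaction left over from a crash."""
    if not os.path.exists(LOG):
        return
    store = json.load(open(store_path)) if os.path.exists(store_path) else {}
    for k, old in json.load(open(LOG)).items():
        if old is None:
            store.pop(k, None)
        else:
            store[k] = old
    persist(store_path, json.dumps(store))
    os.remove(LOG)

recover("kv.json")
tx_update("kv.json", {"balance:alice": 90, "balance:bob": 110})
print(json.load(open("kv.json")))
```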
- FiniteFuzz: Finite State Machine Fuzzer For Industrial Control IoT Devices. Kaur, Jaskaran (Virginia Tech, 2023-07-03). Automated software testing techniques have become increasingly popular in recent years, with fuzzing being one of the most prevalent approaches. However, fuzzing Finite State Machines (FSMs) poses a significant challenge due to state and input dependency, resulting in exponential exploration time required to unlock the Finite State Machine. To address this issue, we present a novel approach in this research paper by introducing FINITEFUZZ, a grey-box fuzzer explicitly designed to fuzz Finite State Machines. Unlike black-box fuzzers, FINITEFUZZ employs a mutational technique that utilizes feedback to steer the fuzzing process. FINITEFUZZ takes a random set of states, compares them with the desired FSM, and records the states that increase the coverage of the Finite State Machine. The next seed incorporates the feedback received from all the previous seed inputs. This avoids exploring the same path multiple times and results in linear performance for all possible types of Finite State Machines. Our findings reveal that the use of FINITEFUZZ significantly reduces the exploration time required to uncover each state of the machine, making it a promising solution for generating Finite State Machines. We tested FINITEFUZZ on 4 different types of Finite State Machines, with each scenario resulting in at least a 5X performance improvement in FSM generation. The potential applications of FSMs are vast, and our research suggests that the proposed approach can be used to generate any type of Finite State Machine.
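A toy version of the feedback loop described above can be written in a few lines: mutate a seed input sequence, execute it against the target FSM, and keep it only if it reaches states not yet covered. The example FSM, alphabet, and mutation operators are illustrative assumptions, not FINITEFUZZ itself.

```python
import random

# Target FSM ("device under test"): states 0..4, unlocked by the sequence a, b, a, c;
# any wrong input resets it to state 0.
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 3, (3, "c"): 4}
ALPHABET = ["a", "b", "c", "d"]

def run(seq):
    """Execute an input sequence and report which states it visits."""
    state, visited = 0, {0}
    for sym in seq:
        state = TRANSITIONS.get((state, sym), 0)
        visited.add(state)
    return visited

def mutate(seq):
    seq, op = list(seq), random.choice(["flip", "insert", "trim"])
    if op == "flip" and seq:
        seq[random.randrange(len(seq))] = random.choice(ALPHABET)
    elif op == "insert":
        seq.insert(random.randrange(len(seq) + 1), random.choice(ALPHABET))
    elif op == "trim" and seq:
        del seq[random.randrange(len(seq))]
    return seq

random.seed(1)
corpus, covered = [[random.choice(ALPHABET)]], {0}
for iteration in range(1, 20001):
    candidate = mutate(random.choice(corpus))
    visited = run(candidate)
    if not visited <= covered:          # feedback: keep seeds that reach new states
        covered |= visited
        corpus.append(candidate)
    if len(covered) == 5:
        print("all states reached after", iteration, "executions")
        break
```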
- Implementing a RESTful Software Architecture to Coordinate Heterogeneous Networked Embedded Devices. Davis, Jason Tyler (Virginia Tech, 2021-10-27). Modern embedded systems---autonomous vehicle-to-vehicle communication, smart cities, and military Joint All-Domain Operations---feature increasingly heterogeneous distributed components. As a result, existing communication methods, tightly coupled with specific networking layers and individual applications, can no longer balance the flexibility of modern data distribution with the traditional constraints of embedded systems. To address this problem, the investigation herein presents a domain-specific language, designed around the Representational State Transfer (REST) architecture, most famously used on the web. Our language, called the Communication Language for Embedded Systems (CLES), supports both traditional point-to-point data communication and management and allocation of decentralized distributed processing tasks. To meet the traditional constraints of embedded execution, CLES' novel runtime allocates processing tasks across a heterogeneous network of embedded devices, overcoming limitations from other modern distribution methods: centralized task management and limited operating system integration. CLES was evaluated with performance micro-benchmarks, implementation of distributed stochastic gradient descent, and application to the design of versatile stateless services for vehicle-to-vehicle communication and military Joint All-Domain Command and Control (JDAC). From this evaluation, it was determined that CLES meets the data distribution needs of realistic cyber-physical embedded systems.
- Netswap: Network-based Swapping for Server-Embedded Board Clusters. Errabelly, Sandeep (Virginia Tech, 2023-07-05). Capital equipment costs and energy costs are the major cost drivers in datacenters. Prior works have explored various techniques, like efficient scheduling algorithms and advanced power management techniques, to maximize resource utilization and thereby reduce capital and energy costs. The HEXO project has explored heterogeneous-Instruction Set Architecture (ISA) server-embedded clusters to minimize these costs. HEXO's key idea is to migrate stateful virtual machines from high-performance x86-based servers to low-power, low-cost ARM-based embedded boards, reducing the server's resource congestion and thereby improving throughput and energy efficiency. However, embedded boards generally have significantly lower onboard memory, typically in the range of 100 MB to 4 GB. Due to this limitation, high memory-demand applications cannot be migrated to embedded devices. This limits the scope of applications that can be used with heterogeneous-ISA server-embedded clusters such as HEXO. This thesis proposes Netswap, a mechanism that utilizes the server's free memory as remote memory for the embedded board. Netswap comprises three main components: the swap-out and swap-in mechanism, a bitmap-based Free Memory Manager, and the Netswap Remote Daemon. Experimental studies using micro- and macro-benchmarks reveal that Netswap improves the throughput and energy efficiency of server-embedded clusters by as much as 40% and 20%, respectively, over server-only baselines.
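The bookkeeping performed by a bitmap-based Free Memory Manager can be sketched compactly: one bit per fixed-size slot of the memory region donated by the server, set on swap-out and cleared on swap-in. The slot size, region size, and interface below are illustrative assumptions, not Netswap's in-kernel implementation.

```python
class BitmapAllocator:
    """Track fixed-size slots of a donated server memory region, one bit per slot."""

    SLOT = 4096                                      # bytes per swapped-out page

    def __init__(self, region_bytes):
        self.nslots = region_bytes // self.SLOT
        self.bits = bytearray((self.nslots + 7) // 8)   # 0 = free, 1 = in use

    def _test(self, i):
        return (self.bits[i // 8] >> (i % 8)) & 1

    def alloc(self):
        """Reserve a free slot for a page being swapped out; return its byte offset."""
        for i in range(self.nslots):
            if not self._test(i):
                self.bits[i // 8] |= 1 << (i % 8)
                return i * self.SLOT
        return None                                  # remote region exhausted

    def free(self, offset):
        """Release a slot after its page is swapped back in."""
        i = offset // self.SLOT
        self.bits[i // 8] &= 0xFF ^ (1 << (i % 8))

fmm = BitmapAllocator(64 * 1024 * 1024)              # e.g., 64 MB donated by the server
slot = fmm.alloc()                                   # swap-out path reserves a slot
print("victim page goes to remote offset", slot)
fmm.free(slot)                                       # swap-in path releases it
```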
- Open-Source Parameterized Low-Latency Aggressive Hardware Compressor and Decompressor for Memory Compression. Jearls, James Chandler (Virginia Tech, 2021-06-16). In recent years, memory has shown to be a constraining factor in many workloads. Memory is an expensive necessity in many situations, from embedded devices with a few kilobytes of SRAM to warehouse-scale computers with thousands of terabytes of DRAM. Memory compression has existed in all major operating systems for many years. However, while faster than swapping to a disk, memory decompression adds latency to data read operations. Companies and research groups have investigated hardware compression to mitigate these problems. Still, open-source low-latency hardware compressors and decompressors do not exist; as such, every group that studies hardware compression must re-implement them. Importantly, because the devices that can benefit from memory compression vary so widely, there is no single solution to address all devices' area, latency, power, and bandwidth requirements. This work intends to address the many issues with hardware compressors and decompressors. This work implements hardware accelerators for three popular compression algorithms: LZ77, LZW, and Huffman encoding. Each implementation includes a compressor and decompressor, and all designs are entirely parameterized. There are a total of 22 parameters between the designs in this work. All of the designs are open-source under a permissive license. Finally, configurations of the work can achieve decompression latencies under 500 nanoseconds, much closer than existing works to the 255 nanoseconds required to read an uncompressed 4 KB page. The configurations of this work accomplish this while still achieving compression ratios comparable to software compression algorithms.
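As a software reference for what the hardware pipelines accelerate, the sketch below implements LZ77-style (offset, length, literal) encoding and its decoder. The window and lookahead sizes are arbitrary illustrative choices, not the 22 parameters of the thesis designs.

```python
def lz77_compress(data, window=255, lookahead=15):
    """Encode bytes as (offset, length, literal) triples."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):       # search the sliding window
            length = 0
            while (length < lookahead and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if i + best_len >= len(data):                # keep a real literal in the last triple
            best_len = len(data) - i - 1
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    out = bytearray()
    for off, length, literal in triples:
        for _ in range(length):
            out.append(out[-off])                    # byte-by-byte copy handles overlaps
        out.append(literal)
    return bytes(out)

sample = b"abracadabra abracadabra abracadabra"
encoded = lz77_compress(sample)
assert lz77_decompress(encoded) == sample
print(len(sample), "bytes ->", len(encoded), "triples")
```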
- Optimizing Systems for Deep Learning Applications. Albahar, Hadeel Ahmad (Virginia Tech, 2023-03-01). Modern systems for Machine Learning (ML) workloads support heterogeneous workloads and resources. However, existing resource managers in these systems do not differentiate between heterogeneous GPU resources. Moreover, users are usually unaware of the sufficient and appropriate type and amount of GPU resources to request for their ML jobs. In this thesis, we analyze the performance of ML training and inference jobs and identify ML model and GPU characteristics that impact this performance. We then propose ML-based prediction models to accurately determine appropriate and sufficient resource requirements to ensure improved job latency and GPU utilization in the cluster.
- Punching Holes in the Cloud: Direct Communication between Serverless Functions Using NAT Traversal. Moyer, Daniel William (Virginia Tech, 2021-06-04). A growing use for serverless computing is large parallel data processing applications that take advantage of its on-demand scalability. Because individual serverless compute nodes, which are called functions, run in isolated containers, a major challenge with this paradigm is transferring temporary computation data between functions. Previous works have performed inter-function communication using object storage, which is slow, or in-memory databases, which are expensive. We evaluate the use of direct network connections between functions to overcome these limitations. Although function containers block incoming connections, we are able to bypass this restriction using standard NAT traversal techniques. By using an external server, we implement TCP hole punching to establish direct TCP connections between functions. In addition, we develop a communications framework to manage NAT traversal and data flow for applications using direct network connections. We evaluate this framework with a reduce-by-key application compared to an equivalent version that uses object storage for communication. For a job with 100+ functions, our TCP implementation runs 4.7 times faster at almost half the cost.
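A minimal sketch of the client side of TCP hole punching is shown below, assuming a hypothetical public rendezvous server and message format: both functions register from the local port they will reuse for the peer connection, exchange public endpoints, then simultaneously listen and connect until one direction succeeds. This illustrates the general technique, not the thesis' communications framework, and real deployments need retries, timeouts, and NAT-type handling that are omitted here.

```python
import json
import socket
import threading

RENDEZVOUS = ("rendezvous.example.com", 7000)   # hypothetical public matchmaking server
LOCAL_PORT = 50007

def reusable_socket():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("", LOCAL_PORT))
    return s

def punch(job_id, my_rank, peer_rank):
    # 1. Register from the same local port that will carry the peer connection,
    #    so the NAT mapping created here is the one the peer's SYNs must hit.
    ctrl = reusable_socket()
    ctrl.connect(RENDEZVOUS)
    ctrl.sendall(json.dumps({"job": job_id, "rank": my_rank,
                             "peer": peer_rank}).encode() + b"\n")
    peer = json.loads(ctrl.makefile().readline())   # e.g. {"ip": "...", "port": ...}

    # 2. Simultaneously accept and connect; whichever completes first is the hole.
    result = {}

    def accept_side():
        srv = reusable_socket()
        srv.listen(1)
        conn, _ = srv.accept()
        result.setdefault("sock", conn)

    threading.Thread(target=accept_side, daemon=True).start()
    while "sock" not in result:
        try:
            out = reusable_socket()
            out.connect((peer["ip"], peer["port"]))
            result.setdefault("sock", out)
        except OSError:
            continue                                 # retry until the peer's SYN opens its NAT
    return result["sock"]                            # direct TCP connection between functions
```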
- Rethinking Serverless for Machine Learning Inference. Ellore, Anish Reddy (Virginia Tech, 2023-08-21). In the era of artificial intelligence and machine learning, AI/ML inference tasks have become exceedingly popular. However, executing these workloads on dedicated hardware may not be feasible for many users due to high maintenance costs, varying load patterns, and time to production. Furthermore, ML inference workloads are stateless, and most of them are not extremely latency sensitive. For example, tasks such as fake review removal, abusive language detection, tweet classification, image tagging, and free-tier chat bots do not require real-time inference. All these characteristics make serverless platforms a good fit for deployment, and in this work, we identify the bottlenecks involved in hosting these inference jobs on serverless platforms and optimize them for better performance and resource utilization. Specifically, we identify model loading and model memory duplication as major bottlenecks in serverless inference, and to address these problems, we propose a new approach that rethinks the way we serve FaaS requests. To support this design, we employ a hybrid scaling approach to implement the autoscaling feature of serverless platforms.
- Scalability Analysis and Optimization for Large-Scale Deep Learning. Pumma, Sarunya (Virginia Tech, 2020-02-03). Despite its growing importance, scalable deep learning (DL) remains a difficult challenge. Scalability of large-scale DL is constrained by many factors, including those deriving from data movement and data processing. DL frameworks rely on large volumes of data to be fed to the computation engines for processing. However, current hardware trends showcase that data movement is already one of the slowest components in modern high performance computing systems, and this gap is only going to increase in the future. This includes data movement needed from the filesystem, within the network subsystem, and even within the node itself, all of which limit the scalability of DL frameworks on large systems. Even after data is moved to the computational units, managing this data is not easy. Modern DL frameworks use multiple components---such as graph scheduling, neural network training, gradient synchronization, and input pipeline processing---to process this data in an asynchronous uncoordinated manner, which results in straggler processes and consequently computational imbalance, further limiting scalability. This thesis studies a subset of the large body of data movement and data processing challenges that exist in modern DL frameworks. For the first study, we investigate file I/O constraints that limit the scalability of large-scale DL. We first analyze the Caffe DL framework with Lightning Memory-Mapped Database (LMDB), one of the most widely used file I/O subsystems in DL frameworks, to understand the causes of file I/O inefficiencies. Based on our analysis, we propose LMDBIO---an optimized I/O plugin for scalable DL that addresses the various shortcomings in existing file I/O for DL. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on 9,216 CPUs of the Blues and Bebop supercomputers at Argonne National Laboratory. Our second study deals with the computational imbalance problem in data processing. For most DL systems, the simultaneous and asynchronous execution of multiple data-processing components on shared hardware resources causes these components to contend with one another, leading to severe computational imbalance and degraded scalability. We propose various novel optimizations that minimize resource contention and improve performance by up to 35% for training various neural networks on 24,576 GPUs of the Summit supercomputer at Oak Ridge National Laboratory---the world's largest supercomputer at the time of writing of this thesis.
- Scalable and Productive Data Management for High-Performance Analytics. Youssef, Karim Yasser Mohamed Yousri (Virginia Tech, 2023-11-07). Advancements in data acquisition technologies across different domains, from genome sequencing to satellite and telescope imaging to large-scale physics simulations, are leading to an exponential growth in dataset sizes. Extracting knowledge from this wealth of data enables scientific discoveries at unprecedented scales. However, the sheer volume of the gathered datasets is a bottleneck for knowledge discovery. High-performance computing (HPC) provides a scalable infrastructure to extract knowledge from these massive datasets. However, multiple data management performance gaps exist between big data analytics software and HPC systems. These gaps arise from multiple factors, including the tradeoff between performance and programming productivity, data growth at a faster rate than memory capacity, and the high storage footprints of data analytics workflows. This dissertation bridges these gaps by combining productive data management interfaces with application-specific optimizations of data parallelism, memory operation, and storage management. First, we address the performance-productivity tradeoff by leveraging Spark and optimizing input data partitioning. Our solution optimizes programming productivity while achieving comparable performance to the Message Passing Interface (MPI) for scalable bioinformatics. Second, we address the operating system's kernel limitations for out-of-core data processing by autotuning memory management parameters in userspace. Finally, we address I/O and storage efficiency bottlenecks in data analytics workflows that iteratively and incrementally create and reuse persistent data structures such as graphs, data frames, and key-value datastores.
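For the Spark-based part of the work, the general idea of controlling input partitioning can be sketched with generic PySpark calls; the file path, record filtering, k-mer length, and partitions-per-core heuristic are illustrative assumptions, not the dissertation's tuned pipeline.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("input-partitioning-sketch").getOrCreate()
sc = spark.sparkContext

# Heuristic used here: a few partitions per available core, so map tasks stay
# balanced and no executor idles while a straggler finishes an oversized partition.
target_partitions = sc.defaultParallelism * 3

# minPartitions controls how the input file is split before any shuffle happens.
lines = sc.textFile("hdfs:///data/reads.fastq", minPartitions=target_partitions)
print("input partitions:", lines.getNumPartitions())

# Downstream per-partition work (illustrative k-mer counting on sequence lines only).
K = 21
kmers = (lines.filter(lambda l: l and set(l) <= set("ACGTN"))
              .flatMap(lambda l: [l[i:i + K] for i in range(len(l) - K + 1)])
              .map(lambda k: (k, 1))
              .reduceByKey(lambda a, b: a + b))
print("distinct k-mers:", kmers.count())
```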
- SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training. Khan, Redwan; Yazdani, Ahmad; Fu, Yuqi; Paul, Arnab; Ji, Bo; Jian, Xun; Cheng, Yue; Butt, Ali (Usenix Association, 2023). Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose new challenges for storage system design. DLT is I/O intensive since data samples need to be fetched continuously from remote storage. Accelerators such as GPUs have been extensively used to support these applications. As accelerators become more powerful and more data-hungry, I/O performance lags behind. This creates a crucial performance bottleneck, especially in distributed DLT. At the same time, the exponentially growing dataset sizes make it impossible to store these datasets entirely in memory. While today’s DLT frameworks typically use a random sampling policy that treats all samples equally, recent findings indicate that not all samples are equally important and different data samples contribute differently towards improving the accuracy of a model. This observation creates an opportunity for DLT I/O optimizations by exploiting the data locality enabled by importance sampling. To this end, we design and implement SHADE, a new DLT-aware caching system that detects fine-grained importance variations at a per-sample level and leverages the variance to make informed caching decisions for a distributed DLT job. SHADE adopts a novel, rank-based approach, which captures the relative importance of data samples across different minibatches. SHADE then dynamically updates the importance scores of all samples during training. With these techniques, SHADE manages to significantly improve the cache hit ratio of the DLT job and, thus, improves the job’s training performance. Evaluation with representative computer vision (CV) models shows that SHADE, with a small cache, improves the cache hit ratio by up to 4.5× compared to the LRU caching policy.
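The core caching idea can be sketched as an importance-aware admission and eviction policy: per-sample scores are refreshed from training losses, and a cached sample is evicted only for a more important one. The scoring, eviction rule, and interface below are heavily simplified illustrations, not SHADE's rank-based algorithm.

```python
class ImportanceCache:
    """Keep the samples whose current training importance is highest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}        # sample_id -> latest importance score
        self.store = {}         # sample_id -> cached, decoded sample

    def update_score(self, sample_id, loss):
        # Higher loss => the sample currently contributes more to learning.
        self.scores[sample_id] = loss

    def get(self, sample_id, fetch_fn):
        if sample_id in self.store:
            return self.store[sample_id]             # cache hit
        data = fetch_fn(sample_id)                   # miss: fetch from remote storage
        self._admit(sample_id, data)
        return data

    def _admit(self, sample_id, data):
        if len(self.store) >= self.capacity:
            victim = min(self.store, key=lambda s: self.scores.get(s, 0.0))
            if self.scores.get(victim, 0.0) >= self.scores.get(sample_id, 0.0):
                return                               # new sample not worth caching
            del self.store[victim]                   # evict the least important sample
        self.store[sample_id] = data

# Sketch of use inside a training loop: refresh scores from per-sample losses so
# future minibatches hit the cache for the samples that matter most.
cache = ImportanceCache(capacity=2)
for sid, loss in [("img_1", 2.3), ("img_2", 0.4), ("img_3", 1.7)]:
    cache.update_score(sid, loss)
    cache.get(sid, fetch_fn=lambda s: b"<decoded sample bytes>")
print(sorted(cache.store))                           # ['img_1', 'img_3']
```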
- Towards a Resource Efficient Framework for Distributed Deep Learning Applications. Han, Jingoo (Virginia Tech, 2022-08-24). Distributed deep learning has achieved tremendous success in solving scientific problems in research and discovery over the past years. Deep learning training is quite challenging because it requires training on large-scale, massive datasets, especially with graphics processing units (GPUs) in the latest high-performance computing (HPC) supercomputing systems. HPC architectures bring different performance trends in training throughput compared to existing studies. Multiple GPUs and high-speed interconnects are used for distributed deep learning on HPC systems. Extant distributed deep learning systems are designed for non-HPC systems without considering efficiency, leading to under-utilization of expensive HPC hardware. In addition, increasing resource heterogeneity has a negative effect on resource efficiency in distributed deep learning methods, including federated learning. Thus, it is important to focus on the increasing demand for both high performance and high resource efficiency in distributed deep learning systems, including the latest HPC systems and federated learning systems. In this dissertation, we explore and design novel methods and frameworks to improve the resource efficiency of distributed deep learning training. We address the following five important topics: performance analysis of deep learning on supercomputers, GPU-aware deep learning job scheduling, topology-aware virtual GPU training, heterogeneity-aware adaptive scheduling, and a token-based incentive algorithm. In the first part (Chapter 3), we focus on analyzing the performance trends of distributed deep learning on the latest HPC systems, such as the Summitdev supercomputer at Oak Ridge National Laboratory. We provide insights by conducting a comprehensive performance study on how deep learning workloads affect the performance of HPC systems with large-scale parallel processing capabilities. In the second part (Chapter 4), we design and develop a novel deep learning job scheduler, MARBLE, which considers the efficiency of GPU resources based on the non-linear scalability of GPUs in a single node and improves GPU utilization by sharing GPUs among multiple deep learning training workloads. The third part of this dissertation (Chapter 5) proposes TOPAZ, a topology-aware virtual GPU training system specifically designed for distributed deep learning on recent HPC systems. In the fourth part (Chapter 6), we explore an innovative holistic federated learning scheduler that employs a heterogeneity-aware adaptive selection method for improving resource efficiency and accuracy, coupled with resource usage profiling and accuracy monitoring to achieve multiple goals. In the fifth part of this dissertation (Chapter 7), we focus on how to provide incentives to participants according to their contributions to the performance of the final federated model, with tokens used as a means of paying for the services of contributing participants and the training infrastructure.