Browsing by Author "Jian, Xun"
- Circuit Support for Practical and Performant Batteryless Systems. Williams, Harrison Ridgway (Virginia Tech, 2024-06-03). Tiny, ultra-low-power embedded processors enable sophisticated computing deployments in a myriad of areas previously off limits to computing power, ranging from intelligent medical implants to massive-scale 'smart dust'-type sensing deployments. While today's computing and sensing hardware is well-suited for these next-generation deployments, the batteries powering them are not: the size and weight of today's mobile and Internet-of-Things devices are dominated by their batteries, which also limit systems' lifespans and potential for deployment in sensitive contexts. Academic efforts have demonstrated the feasibility of harvesting energy on demand from the environment as a practical alternative to classical battery power, instead buffering harvested energy in a capacitor to power intermittent bursts of operation. Energy harvesting circuits are miniaturizable, inexpensive, and enable effectively indefinite operation when compared to batteries---but introduce new problems stemming from the lack of a reliable power source. Unfortunately, these problems have so far confined batteryless systems to small-scale research deployments. The central design challenge for effective batteryless operation is efficiently using scarce input power from the energy harvesting frontend. Despite advances in both harvester and processor efficiency, digital systems often consume orders of magnitude more power than can be supplied by harvesting circuits---forcing systems to operate in short bursts punctuated by power failure and a long recharge period. Today's batteryless systems pay a steep price to sustain operation across these common-case power losses: current platforms depend on high-performance non-volatile memory to quickly and efficiently checkpoint program state before power loss, limiting batteryless operation to a small selection of devices which integrate these novel memory technologies. Choosing exactly when to checkpoint to non-volatile memory represents a challenge in itself: the hardware required to detect impending power failure often represents a large proportion of the system's overall energy consumption, forcing designers to choose between the energy overhead of voltage monitoring and the runtime overhead of 'energy-oblivious' checkpointing models. Finally, the choice of buffer capacitor size has a large impact on overall energy efficiency---but the optimal choice depends on runtime energy dynamics which are difficult to predict at design time, leaving designers to make at best educated guesses about future environmental conditions. This work approaches energy harvesting system design from a circuits perspective, answering the following research questions towards practical and performant batteryless operation: 1. Can the emergent properties of today's low-power systems be used to enable efficient intermittent operation on new classes of devices? 2. What compromises can we make in voltage monitor design to minimize power consumption while maintaining just enough functionality for batteryless operation? 3. How can we buffer harvested energy in a way that maximizes energy efficiency despite unpredictable system-level power dynamics? This work answers these questions by producing the following research artifacts: 1. The first non-volatile-memory-invariant system to enable intermittent operation on embedded devices lacking high-performance memory (Chapter 2). 2. The first voltage monitoring circuit designed for batteryless systems to enable energy-aware operation without sacrificing efficiency (Chapter 3). 3. The first highly efficient power-adaptive energy buffer to store harvested energy without compromising on efficiency or performance (Chapter 4).
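  A minimal sketch of the intermittent-computing pattern this abstract describes: work proceeds in bursts, and volatile state is checkpointed to non-volatile storage when a voltage monitor reports that the buffer capacitor is nearly drained. All names, thresholds, and the simulated voltage source are illustrative assumptions, not details from the dissertation.

  ```python
  # Hypothetical sketch of energy-aware intermittent operation: the main loop
  # checkpoints volatile state when a (simulated) voltage monitor reports that
  # the buffer capacitor has drained near the brown-out threshold.
  import random

  BROWNOUT_V = 1.8      # illustrative minimum operating voltage
  CHECKPOINT_V = 2.0    # illustrative "checkpoint soon" threshold

  nonvolatile_store = {}          # stands in for a non-volatile checkpoint region

  def read_capacitor_voltage():
      """Stand-in for a voltage monitor reading of the buffer capacitor."""
      return random.uniform(1.7, 3.3)

  def checkpoint(state):
      """Copy volatile program state to non-volatile storage."""
      nonvolatile_store["state"] = dict(state)

  def restore():
      """Resume from the last checkpoint after a power failure."""
      return dict(nonvolatile_store.get("state", {"i": 0, "acc": 0}))

  state = restore()
  while state["i"] < 1000:
      v = read_capacitor_voltage()
      if v < BROWNOUT_V:
          break                      # power failure: volatile state is lost
      if v < CHECKPOINT_V:
          checkpoint(state)          # just-in-time checkpoint before power loss
      state["acc"] += state["i"]     # the "useful work" of this burst
      state["i"] += 1
  ```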
- The Client Insourcing Refactoring to Facilitate the Re-engineering of Web-Based Applications. An, Kijin (Virginia Tech, 2021-05-19). Developers often need to re-engineer distributed applications to address changes in requirements made only after deployment. Much of the complexity of inspecting and evolving distributed applications lies in their distributed nature, while the majority of mature program analysis and transformation tools work only with centralized software. Inspired by business process re-engineering, in which remote operations can be insourced back in house to be restructured and outsourced anew, this dissertation brings an analogous approach to the re-engineering of distributed applications. Our approach introduces a novel automatic refactoring---Client Insourcing---that creates a semantically equivalent centralized version of a distributed application. This centralized version is then inspected, modified, and redistributed to meet new requirements. This dissertation demonstrates the utility of Client Insourcing in helping meet changed requirements in performance, reliability, and security. We implemented Client Insourcing in the important domain of full-stack JavaScript applications, in which both the client and server parts are written in JavaScript, and applied our implementation to re-engineer mobile web applications. Client Insourcing reduces the complexity of inspecting and evolving distributed applications, thereby facilitating their re-engineering. This dissertation is based on 4 conference papers and 2 doctoral symposium papers, presented at ICWE 2019, SANER 2020, WWW 2020, and ICWE 2021.
- Design and prototyping of Hardware-Accelerated Locality-aware Memory Compression. Srinivas, Raghavendra (Virginia Tech, 2020-09-09). Hardware acceleration is the most sought-after technique in chip design for achieving better performance and power efficiency for critical functions that may be handled inefficiently by traditional OS/software. As technology advances, with 7nm products already on the market offering better power and performance in a smaller area, latency-critical functions that were traditionally handled in software have started moving into the chip as acceleration units. This thesis describes the accelerator architecture, implementation, and prototype for one such function, namely "Locality-Aware Memory Compression," which is part of the "OS-controlled memory compression" scheme that has been actively deployed in today's OSes. In brief, OS-controlled memory compression is a new memory management feature that transparently, dramatically, and adaptively increases effective main memory capacity on demand as software-level memory usage increases beyond physical memory system capacity. OS-controlled memory compression has been adopted across almost all OSes (e.g., Linux, Windows, macOS, AIX) and almost all classes of computing systems (e.g., smartphones, PCs, data centers, and cloud). The OS-controlled memory compression scheme is locality-aware, but even under OS-controlled memory compression today, applications experience long-latency page faults when accessing compressed memory. To solve this performance bottleneck, an acceleration technique has been proposed to manage locality-aware memory compression within hardware, thereby enabling applications to access their OS-compressed memory directly. This accelerator is referred to as HALK throughout this work, which stands for "Hardware-accelerated Locality-aware Memory Compression." The literal meaning of the word HALK in English is 'a hidden place'. As such, this accelerator is neither exposed to the OS nor to the running applications. It is hidden entirely in the memory controller hardware and incurs minimal hardware cost. This thesis develops an FPGA design prototype and gives a proof of concept for the functionality of HALK by running non-trivial micro-benchmarks. This work also provides and analyzes power, performance, and area of HALK for ASIC designs (at a technology node of 7nm) and the selected FPGA prototype design.
- Detecting Persistence Bugs from Non-volatile Memory Programs by Inferring Likely-correctness Conditions. Fu, Xinwei (Virginia Tech, 2022-03-10). Non-volatile main memory (NVM) technologies are revolutionizing the entire computing stack thanks to their storage-and-memory-like characteristics. The ability to persist data in memory provides a new opportunity to build crash-consistent software without paying the storage stack's I/O overhead. A crash-consistent NVM program can recover back to a consistent state from persistent NVM in the event of a software crash or a sudden power loss. In the presence of a volatile cache, data held in the volatile cache is lost after a crash, so NVM programming requires users to manually control the durability and the persistence ordering of NVM writes. To avoid performance overhead, developers have devised customized persistence mechanisms to enforce proper persistence ordering and atomicity guarantees, rendering NVM programs error-prone. The problem statement of this dissertation is how one can effectively detect persistence bugs in NVM programs. Detecting persistence bugs in NVM programs is challenging, however, because of the huge test space and the manual consistency validation required. The thesis of this dissertation is that we can detect persistence bugs in NVM programs in a scalable and automatic manner by inferring likely-correctness conditions from programs. A likely-correctness condition is a possible correctness condition, which is a condition a program must maintain to be crash-consistent. This dissertation proposes to infer two forms of likely-correctness conditions from NVM programs to detect persistence bugs. The first proposed solution infers likely-ordering and likely-atomicity conditions by analyzing program dependencies among NVM accesses. The second proposed solution infers likely-linearization points to understand a program's operation-level behavior. Using these two forms of likely-correctness conditions, we test only those NVM states and thread interleavings that violate the likely-correctness conditions. This significantly reduces the test space that must be examined. We then leverage the durable linearizability model to validate consistency automatically, without manual consistency validation. In this way, we can detect persistence bugs in NVM programs in a scalable and automatic manner. In total, we detect 47 (36 new) persistence correctness bugs and 158 (113 new) persistence performance bugs in 20 single-threaded NVM programs. Additionally, we detect 27 (15 new) persistence correctness bugs in 12 multi-threaded NVM data structures.
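  A toy illustration of the "likely-ordering condition" idea: pairs of NVM stores that always become durable in the same order across observations become candidate ordering conditions, and an execution that violates one is a candidate persistence bug. Note the dissertation infers conditions from program dependencies; this sketch simplifies to dynamic persist traces, and all names are hypothetical.

  ```python
  # Illustrative sketch (not the dissertation's algorithm): infer likely-ordering
  # conditions from traces of persisted-store labels, listed in durability order.
  from itertools import combinations

  def likely_orderings(traces):
      """Return (a, b) pairs that appear, in that order, in every trace."""
      candidates = None
      for trace in traces:
          pos = {store: i for i, store in enumerate(trace)}
          pairs = {(a, b) for a, b in combinations(pos, 2) if pos[a] < pos[b]}
          # keep only orderings observed, in the same order, in every trace so far
          candidates = pairs if candidates is None else {
              (a, b) for (a, b) in candidates
              if a in pos and b in pos and pos[a] < pos[b]
          }
      return candidates or set()

  # Example: a 'valid' flag should persist only after the data it guards.
  good_traces = [["data", "valid"], ["data", "log", "valid"]]
  print(likely_orderings(good_traces))   # {('data', 'valid')}
  ```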
- Exploring Per-Input Filter Selection and Approximation Techniques for Deep Neural Networks. Gaur, Yamini (Virginia Tech, 2019-06-21). We propose a dynamic, input-dependent filter approximation and selection technique to improve the computational efficiency of deep neural networks. The approximation techniques convert the 32-bit floating point representation of filter weights in neural networks into smaller-precision values by reducing the number of bits used to represent the weights. In order to calculate the per-input error between the trained full-precision filter weights and the approximated weights, a metric called Multiplication Error (ME) is used. For convolutional layers, ME is calculated by subtracting the approximated filter weights from the original filter weights, convolving the difference with the input, and calculating the grand sum of the resulting matrix. For fully connected layers, ME is calculated by subtracting the approximated filter weights from the original filter weights, performing matrix multiplication between the difference and the input, and calculating the grand sum of the resulting matrix. ME is computed to identify approximated filters in a layer that result in low inference accuracy; in order to maintain the accuracy of the network, these filter weights are replaced with the original full-precision weights. Prior work has primarily focused on input-independent (static) replacement of filters with low-precision weights, in which all the filter weights in the network are replaced by approximated filter weights. This results in a decrease in inference accuracy, and the decrease is higher for more aggressive approximation techniques. Our proposed technique aims to achieve higher inference accuracy by not approximating filters that generate high ME. Using the proposed per-input filter selection technique, LeNet achieves an accuracy of 95.6% on the MNIST dataset when truncating to 3 bits, a 3.34% drop from the original accuracy of 98.9%. With static filter approximation, on the other hand, LeNet achieves an accuracy of 90.5%, an 8.5% drop from the original accuracy. The aim of our research is to potentially use low-precision weights in deep learning algorithms to achieve high classification accuracy with less computational overhead. We explore various filter approximation techniques and implement a per-input filter selection and approximation technique that selects the filters to approximate at run time.
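  A minimal NumPy sketch of the Multiplication Error metric as described in the abstract. The shapes, random data, and the crude quantizer standing in for bit truncation are illustrative assumptions, not the thesis's implementation.

  ```python
  # Sketch of the Multiplication Error (ME) metric: grand sum of the (weight
  # difference * input) product, via convolution for conv layers and matrix
  # multiplication for fully connected layers.
  import numpy as np
  from scipy.signal import convolve2d

  def truncate_weights(w, bits=3):
      """Crude stand-in for bit truncation: quantize weights to 2**bits levels."""
      scale = (2 ** bits - 1) / (np.abs(w).max() + 1e-12)
      return np.round(w * scale) / scale

  def me_conv(w_full, w_approx, x):
      """Conv layer: convolve the weight difference with the input, then grand-sum."""
      return convolve2d(x, w_full - w_approx, mode="valid").sum()

  def me_fc(w_full, w_approx, x):
      """FC layer: multiply the weight difference by the input, then grand-sum."""
      return ((w_full - w_approx) @ x).sum()

  rng = np.random.default_rng(0)
  w_conv, x_conv = rng.standard_normal((3, 3)), rng.standard_normal((8, 8))
  w_fc, x_fc = rng.standard_normal((4, 6)), rng.standard_normal(6)
  print("conv-layer ME:", me_conv(w_conv, truncate_weights(w_conv), x_conv))
  print("fc-layer ME:", me_fc(w_fc, truncate_weights(w_fc), x_fc))
  ```

  Filters whose ME exceeds a chosen threshold for the current input would keep their full-precision weights; the rest would use the approximated weights.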
- Gate-level Leakage Assessment and Mitigation. Kathuria, Tarun (Virginia Tech, 2019-07-22). Side-channel leakage, caused by imperfect implementation of cryptographic algorithms in hardware, has become a serious security threat for connected devices that generate and process sensitive data. This side-channel leakage can divulge secret information in the form of power consumption or electromagnetic emissions. The side-channel leakage of a cryptographic device is commonly assessed after tape-out on a physical prototype. This thesis presents a methodology called Gate-level Leakage Assessment (GLA), which evaluates the power-based side-channel leakage of an integrated circuit at design time. By combining side-channel leakage assessment with power simulations on the gate-level netlist, GLA is able to pinpoint the leakiest cells in the netlist in addition to assessing the design's overall vulnerability to side-channel leakage. As the power traces obtained from power simulations are noiseless, GLA is able to precisely locate the sources of side-channel leakage with fewer measurements than on a physical prototype. The thesis applies the methodology to the design of an encryption co-processor to analyze sources of side-channel leakage. Once the gate-level leakage sources are identified, this thesis presents a logic-level replacement strategy for the leakage sources that can thwart side-channel leakage. The presented countermeasure selectively replaces gate-level cells with a secure logic style, effectively removing the side-channel leakage with minimal impact on area. The assessment methodology, together with the demonstrated countermeasures, is a turnkey solution for IP module designers and is also applicable to larger system-level designs.
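  To make the "per-cell assessment on simulated traces" idea concrete, here is a hedged sketch that applies Welch's t-test (a widely used side-channel assessment statistic) to simulated per-cell power numbers and flags high-|t| cells for secure-logic replacement. The thesis may use a different statistic and flow; the data, threshold, and cell model here are illustrative only.

  ```python
  # Illustrative per-cell leakage assessment on noiseless simulated power traces.
  # Cells whose |t| exceeds a threshold are flagged as the leakiest candidates
  # for replacement with a secure logic style.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  n_traces, n_cells = 2000, 16
  secret_bit = rng.integers(0, 2, n_traces)           # fixed-vs-random style split

  power = rng.normal(1.0, 0.01, (n_traces, n_cells))  # simulated per-cell power
  power[:, 3] += 0.02 * secret_bit                    # cell 3 leaks the secret

  t_stats = np.array([
      stats.ttest_ind(power[secret_bit == 0, c],
                      power[secret_bit == 1, c],
                      equal_var=False).statistic
      for c in range(n_cells)
  ])
  leaky_cells = np.where(np.abs(t_stats) > 4.5)[0]    # 4.5 is a common pass/fail bound
  print("cells flagged for secure-logic replacement:", leaky_cells)
  ```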
- Hiding Decryption Latency in Intel SGX using Metadata Prediction. Talapkaliyev, Daulet (Virginia Tech, 2020-01-20). Hardware-assisted Trusted Execution Environment technologies have become a crucial component in providing security for cloud-based computing. One such hardware-assisted technology is Intel Software Guard Extensions (SGX). Using additional dedicated hardware and a new set of CPU instructions, SGX is able to provide isolated execution of code within trusted hardware containers called enclaves. By utilizing private encrypted memory and various integrity authentication mechanisms, it can provide confidentiality and integrity guarantees to protected data. Despite the dedicated hardware, these extra layers of security add a significant performance overhead. Decryption of data using secret OTPs, which are generated by modified counter-mode AES encryption blocks, results in a significant latency overhead that contributes to the overall SGX performance loss. This thesis introduces a metadata prediction extension to SGX based on local metadata releveling and prediction mechanisms. Correct prediction of metadata allows OTPs to be speculatively precomputed and used immediately to decrypt incoming ciphertext data. This hides a significant part of the decryption latency and results in faster SGX performance without any changes to the original SGX security guarantees.
- Impact of Increased Cache Misses on Runtime Performance of MPX-enabled Programs. Sharma, Niti (Virginia Tech, 2019-06-10). Low-level languages like C and C++ provide high performance and direct control over memory management, but these languages are prone to memory safety violations. Intel introduced a new ISA extension, Memory Protection Extensions (MPX), a hardware-assisted full-stack solution to protect against memory safety violations. While MPX efficiently prevents memory errors like buffer overflows and out-of-bound memory accesses, it comes at the cost of high performance overheads. Cache locality also worsens in MPX-protected applications. In our research, we analyze whether there is a correlation between the increase in cache misses and the runtime degradation of programs compiled with MPX support. We analyze 15 SPEC CPU benchmark programs for different input sizes on the Windows platform, compiled with Intel's ICC compiler. We find that for the train (medium) and ref (large) input sizes, the average performance overheads are 140% and 144%, respectively. We find that 5 out of the 15 benchmarks have neither runtime overheads nor any change in cache misses at any level. For the remaining 10 benchmarks, however, we find a strong correlation between runtime overheads and cache miss overheads, with correlation coefficients ranging from 0.36 to 0.8 for different input sizes. Based on our findings, we conclude that there is a direct correlation between runtime overheads and the increase in cache misses. We also find that instruction overheads and runtime overheads have a positive correlation, with coefficient values ranging from 0.33 to 0.7 for different input sizes.
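  The correlation analysis itself is straightforward; a minimal sketch follows, using made-up per-benchmark overhead numbers (not the thesis's measurements) and the standard Pearson coefficient.

  ```python
  # Minimal sketch: Pearson correlation between per-benchmark runtime overheads
  # and cache-miss overheads (both relative to a non-MPX build). Values are
  # fabricated for illustration only.
  import numpy as np

  runtime_overhead_pct    = np.array([140, 90, 210, 0, 45, 160, 75, 0, 120, 30])
  cache_miss_overhead_pct = np.array([150, 80, 250, 2, 40, 170, 60, 1, 100, 25])

  r = np.corrcoef(runtime_overhead_pct, cache_miss_overhead_pct)[0, 1]
  print(f"Pearson correlation: {r:.2f}")
  ```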
- Memory Turbo Boost: Architectural Support for Using Unused Memory for Memory Replication to Boost Server Memory Performance. Zhang, Da (Virginia Tech, 2023-06-28). A significant portion of the memory in servers today is often unused. Our large-scale study of HPC systems finds that more than half of the total memory in active nodes running user jobs is unused 88% of the time. Google and Azure Cloud studies also report that unused memory accounts for 40% of the total memory in their servers, on average. Leaving so much memory unused is wasteful. To address this problem, we note that in the context of CPUs, Turbo Boost can turn off unused cores to boost the performance of in-use cores. However, there is no equivalent technology in the context of memory; no matter how much memory is unused, the performance of in-use memory remains the same. This dissertation explores architectural techniques to utilize unused memory to boost the performance of in-use memory, and refers to them collectively as Memory Turbo Boost. This dissertation explores how to turbo boost memory performance through memory replication; specifically, it explores how to efficiently store replicas in the unused memory and explores multiple architectural techniques to use those replicas to enhance memory system performance. Performance simulations show that Memory Turbo Boost can improve node-level performance by 18%, on average, across a wide spectrum of workloads. Our system-wide simulations show that applying Memory Turbo Boost to an HPC system provides a 1.4x average speedup on job turnaround time.
- Nonblocking Memory Refresh. Nguyen, Kate Vy Hoang (Virginia Tech, 2018-08-08). Since its inception half a century ago, DRAM has required dynamic/active refresh operations that block read requests and decrease performance. We propose refreshing DRAM in the background without stalling read accesses to refreshing memory blocks, similar to the static/background refresh in SRAM. Our proposed Nonblocking Refresh works by refreshing a portion of the data in a memory block at a time and uses redundant data, such as Reed-Solomon codes, in the block to compute the block's refreshing/unreadable data to satisfy read requests. For proof of concept, we apply Nonblocking Refresh to server memory systems, where every memory block already contains redundant data to provide hardware failure protection. In this context, Nonblocking Refresh can utilize the server memory system's existing per-block redundant data in the common case when there are no hardware faults to correct, without requiring any dedicated redundant data of its own. Our evaluations show that on average across five server memory systems with different redundancy and failure protection strengths, Nonblocking Refresh improves performance by 16.2% and 30.3% for 16Gb and 32Gb DRAM chips, respectively.
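  A conceptual sketch of the mechanism: while one portion of a block is being refreshed (and is unreadable), a read is served by reconstructing that portion from the rest of the block plus its redundancy. Simple XOR parity is used here as a stand-in for the Reed-Solomon codes mentioned in the abstract; the class and sub-block sizes are illustrative.

  ```python
  # Nonblocking Refresh, conceptually: reconstruct the refreshing sub-block from
  # the other sub-blocks and the block's redundant data instead of stalling.
  from functools import reduce

  def xor_bytes(chunks):
      return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks))

  class MemoryBlock:
      def __init__(self, sub_blocks):
          self.sub_blocks = list(sub_blocks)        # data portions of the block
          self.parity = xor_bytes(self.sub_blocks)  # stand-in redundant data
          self.refreshing = None                    # index currently under refresh

      def read(self, idx):
          if idx != self.refreshing:
              return self.sub_blocks[idx]
          # Serve the read by reconstructing the unreadable sub-block.
          others = [sb for i, sb in enumerate(self.sub_blocks) if i != idx]
          return xor_bytes(others + [self.parity])

  block = MemoryBlock([b"\x11" * 4, b"\x22" * 4, b"\x33" * 4])
  block.refreshing = 1
  assert block.read(1) == b"\x22" * 4      # served without waiting for refresh
  ```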
- On Optimizing and Leveraging Distributed Shared Memory for High Performance, Resource Aggregation, and Cache-coherent Heterogeneous-ISA Processors. Chuang, Ho-Ren (Virginia Tech, 2022-06-28). This dissertation focuses on the problem space of heterogeneous-ISA multiprocessors – an architectural design point that is being studied by the academic research community and increasingly available in commodity systems. Since such architectures usually lack globally coherent shared memory, software-based distributed shared memory (DSM) is often used to provide the illusion of such a memory. The DSM abstraction typically provides this illusion using a reader-replicate, writer-invalidate memory consistency protocol that operates at the granularity of memory pages and is usually implemented as a first-class operating system abstraction. This enables symmetric multiprocessing (SMP) programming frameworks, augmented with a heterogeneous-ISA compiler, to use CPU cores of different ISAs for parallel computations as if they were of the same ISA, improving programmability, especially for legacy SMP applications, which can therefore run unmodified on such hardware. Past DSMs have been plagued by poor performance, in part due to the high latency and low bandwidth of interconnect network infrastructures. The dissertation revisits DSM in light of modern interconnects that reverse this performance trend. The dissertation presents Xfetch, a bulk page prefetching mechanism designed for the DEX DSM system. Xfetch exploits spatial locality, and aggressively and sequentially prefetches pages before potential read faults, improving DSM performance. Our experimental evaluations reveal that Xfetch achieves up to ≈142% speedup over the baseline DEX DSM that does not prefetch page data. SMP programming models often allow primitives that permit weaker memory consistency semantics, where synchronization updates can be delayed, permitting greater parallelism and thereby higher performance. Inspired by such primitives, the dissertation presents a DSM protocol called MWPF that trades off memory consistency for higher performance in select SMP code regions, targeting heterogeneous-ISA multiprocessor systems. MWPF also overcomes performance bottlenecks of past DSM systems for heterogeneous-ISA multiprocessors, such as a significant number of invalidation messages, false page sharing, a large number of read page faults, and large synchronization overheads, by using efficient protocol primitives that delay and batch invalidation messages, aggressively prefetch data pages, and perform cross-domain synchronization with low overhead. Our experimental evaluations reveal that MWPF achieves, on average, 11% speedup over the baseline DSM implementation. The dissertation presents PuzzleHype, a distributed hypervisor that enables a single virtual machine (VM) to use fragmented resources in distributed virtualized settings, such as CPU cores, memory, and devices of different physical hosts, thereby decreasing resource fragmentation and increasing resource utilization. PuzzleHype leverages DSM implemented in host operating systems to present a unified and consistent view of a continuous pseudo-physical address space to guest operating systems. To transparently utilize CPU and I/O resources, PuzzleHype integrates multiple physical CPUs into a single VM by migrating threads, forwarding interrupts, and delegating I/O. Our experimental evaluations reveal that PuzzleHype yields speedups in the range of 173%–355% over baseline over-provisioning scenarios which are otherwise necessary due to resource fragmentation. To enable a distributed hypervisor to adapt to resource and workload changes, the dissertation proposes the concept of CPU borrowing, which allows a VM's virtual CPU (vCPU) to migrate to an available physical CPU (pCPU) and release it when it is no longer necessary, i.e., CPU returning. CPU borrowing can thus be used when a node is over-committed, and CPU returning can be used when the borrowed CPU resource is no longer necessary. To transparently migrate a vCPU at runtime without incurring significant downtime, the dissertation presents a suite of techniques including leveraging thread migration, loading/restoring vCPU KVM states, maintaining a global vCPU location table, and creating a DSM kernel thread for handling on-demand paging. Our experimental evaluations reveal that migrating vCPUs to resource-available nodes achieves a speedup of 1.4x over running the vCPUs on distributed nodes. When a VM spans multiple nodes, its likelihood of failure increases. To mitigate this, the dissertation presents a distributed checkpoint/restart mechanism that allows a distributed VM to tolerate failures. A user interface is introduced for sending/receiving checkpoint/restart commands to a distributed VM. We implement the checkpoint/restart technique in the native KVM tool and extend it to a distributed mode by converting Inter-Process Communication (IPC) into message passing between nodes, pausing/resuming distributed vCPU executions, and loading/restoring runtime states on the correct set of nodes. Our experimental evaluations indicate that the overhead of checkpointing a distributed VM is ≈10% or less compared to the native KVM tool with our checkpoint support. Restarting a distributed VM is faster than native KVM with our restart support because no additional page faults occur during restarting. The dissertation's final contribution is PopHype, a system software stack that allows simulation of cache-coherent, shared-memory heterogeneous-ISA hardware. PopHype includes a Linux operating system that implements DSM as an OS abstraction for processes, i.e., it allows multiple processes running on multiple (ISA-different) machines to share memory. With KVM enabled, this OS becomes a hypervisor that allows multiple, process-based instances of an architecture emulator such as QEMU to execute in a shared address space, allowing multiple QEMU instances to emulate different ISAs in shared memory, i.e., to emulate shared-memory heterogeneous-ISA hardware. PopHype also includes a modified QEMU that uses process-level DSM and an optimized guest OS kernel for improved performance. Our experimental studies confirm PopHype's effectiveness and reveal that PopHype achieves an average speedup of 7.32x over a baseline that runs multiple QEMU instances in shared memory atop a single host OS.
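  A toy model of the bulk sequential prefetching idea behind Xfetch: on a remote read fault, fetch the faulting page plus the next several pages in one transfer, so later sequential accesses hit locally. This is purely illustrative; the real mechanism operates on OS pages inside the DEX DSM over a fast interconnect, and the depth and page representation here are assumptions.

  ```python
  # Toy DSM node: a read miss triggers one bulk transfer that also prefetches the
  # next PREFETCH_DEPTH sequential pages, exploiting spatial locality.
  PREFETCH_DEPTH = 8

  class DsmNode:
      def __init__(self, remote_pages):
          self.remote = remote_pages       # dict: page number -> page data
          self.local = {}                  # locally replicated pages
          self.round_trips = 0

      def read(self, page_no):
          if page_no not in self.local:
              self.round_trips += 1        # one bulk transfer instead of many
              for p in range(page_no, page_no + 1 + PREFETCH_DEPTH):
                  if p in self.remote:
                      self.local[p] = self.remote[p]
          return self.local[page_no]

  node = DsmNode({p: f"page-{p}" for p in range(64)})
  for p in range(32):                      # sequential scan of 32 pages
      node.read(p)
  print("round trips with prefetch:", node.round_trips)   # 4 instead of 32
  ```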
- Open-Source Parameterized Low-Latency Aggressive Hardware Compressor and Decompressor for Memory Compression. Jearls, James Chandler (Virginia Tech, 2021-06-16). In recent years, memory has proven to be a constraining factor in many workloads. Memory is an expensive necessity in many situations, from embedded devices with a few kilobytes of SRAM to warehouse-scale computers with thousands of terabytes of DRAM. Memory compression has existed in all major operating systems for many years. However, while faster than swapping to a disk, memory decompression adds latency to data read operations. Companies and research groups have investigated hardware compression to mitigate these problems. Still, open-source low-latency hardware compressors and decompressors do not exist; as such, every group that studies hardware compression must re-implement them. Importantly, because the devices that can benefit from memory compression vary so widely, there is no single solution that addresses all devices' area, latency, power, and bandwidth requirements. This work intends to address the many issues with hardware compressors and decompressors. It implements hardware accelerators for three popular compression algorithms: LZ77, LZW, and Huffman encoding. Each implementation includes a compressor and a decompressor, and all designs are entirely parameterized, with a total of 22 parameters across the designs in this work. All of the designs are open source under a permissive license. Finally, configurations of the work can achieve decompression latencies under 500 nanoseconds, much closer than existing works to the 255 nanoseconds required to read an uncompressed 4 KB page. The configurations of this work accomplish this while still achieving compression ratios comparable to software compression algorithms.
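  For reference, a compact software model of one of the three algorithms named above (LZW); a behavioral model like this is a common way to cross-check a hardware compressor/decompressor pair. Dictionary sizing and symbol width are simplified relative to a real parameterized design, and this is not the thesis's code.

  ```python
  # Minimal LZW compressor/decompressor usable as a software reference model.
  def lzw_compress(data: bytes) -> list[int]:
      table = {bytes([i]): i for i in range(256)}
      w, out = b"", []
      for byte in data:
          wc = w + bytes([byte])
          if wc in table:
              w = wc
          else:
              out.append(table[w])
              table[wc] = len(table)       # grow the dictionary
              w = bytes([byte])
      if w:
          out.append(table[w])
      return out

  def lzw_decompress(codes: list[int]) -> bytes:
      table = {i: bytes([i]) for i in range(256)}
      w = table[codes[0]]
      out = bytearray(w)
      for code in codes[1:]:
          entry = table[code] if code in table else w + w[:1]   # KwKwK case
          out += entry
          table[len(table)] = w + entry[:1]
          w = entry
      return bytes(out)

  sample = b"TOBEORNOTTOBEORTOBEORNOT"
  assert lzw_decompress(lzw_compress(sample)) == sample
  ```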
- Prototyping Hardware-compressed Memory for Multi-tenant Systems. Liu, Yuqing (Virginia Tech, 2023-10-18). Software memory compression has long been a common practice in operating systems. Prior works have explored hardware memory compression to reduce the load on the CPU by offloading memory compression to hardware. However, prior works on hardware memory compression cannot provide the critical isolation needed in multi-tenant systems like cloud servers. Our evaluation of prior work (TMCC) shows that a tenant can be slowed down by more than 12x due to the lack of isolation. This work, the Compressed Memory Management Unit (CMMU), prototypes hardware compression for multi-tenant systems and provides the critical isolation they need. First, CMMU allows the OS to control individual tenants' usage of physical memory. Second, CMMU compresses a tenant's memory to an OS-specified physical usage target. Finally, CMMU notifies the OS to start swapping memory to storage if it fails to compress the memory to the target. We prototype CMMU with a real compression module on an FPGA board. CMMU runs with a Linux kernel modified to support it. The prototype virtually expands the memory capacity to 4x. CMMU stably supports the modified Linux kernel with multiple tenants and applications. While achieving this, CMMU requires only several extra cycles of overhead beyond the essential data structure accesses. ASIC synthesis results show that CMMU fits within 0.00931 mm² of silicon and operates at 3 GHz while consuming 36.90 mW of power, a negligible cost for modern server systems.
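  A hypothetical sketch of the per-tenant policy the three points above describe: compress a tenant's pages toward its OS-specified physical target, and notify the OS to swap only when compression cannot reach it. The function names and the use of zlib as the compression model are illustrative assumptions, not CMMU's hardware pipeline.

  ```python
  # Per-tenant "compress to target, else ask the OS to swap" control sketch.
  import zlib

  def enforce_target(tenant_pages: list[bytes], target_bytes: int) -> dict:
      compressed = [zlib.compress(p) for p in tenant_pages]   # stand-in compressor
      usage = sum(len(c) for c in compressed)
      if usage <= target_bytes:
          return {"action": "compressed", "physical_usage": usage}
      # Compression alone cannot meet the target: notify the OS to swap the excess.
      return {"action": "notify_os_swap", "excess_bytes": usage - target_bytes}

  pages = [bytes(4096) for _ in range(16)]      # highly compressible zero pages
  print(enforce_target(pages, target_bytes=8 * 4096))
  ```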
- SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training. Khan, Redwan; Yazdani, Ahmad; Fu, Yuqi; Paul, Arnab; Ji, Bo; Jian, Xun; Cheng, Yue; Butt, Ali (Usenix Association, 2023). Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose new challenges for storage system design. DLT is I/O intensive since data samples need to be fetched continuously from remote storage. Accelerators such as GPUs have been extensively used to support these applications. As accelerators become more powerful and more data-hungry, the I/O performance lags behind. This creates a crucial performance bottleneck, especially in distributed DLT. At the same time, exponentially growing dataset sizes make it impossible to store these datasets entirely in memory. While today's DLT frameworks typically use a random sampling policy that treats all samples equally, recent findings indicate that not all samples are equally important and that different data samples contribute differently towards improving the accuracy of a model. This observation creates an opportunity for DLT I/O optimizations by exploiting the data locality enabled by importance sampling. To this end, we design and implement SHADE, a new DLT-aware caching system that detects fine-grained importance variations at the per-sample level and leverages the variance to make informed caching decisions for a distributed DLT job. SHADE adopts a novel, rank-based approach, which captures the relative importance of data samples across different minibatches. SHADE then dynamically updates the importance scores of all samples during training. With these techniques, SHADE manages to significantly improve the cache hit ratio of the DLT job and, thus, the job's training performance. Evaluation with representative computer vision (CV) models shows that SHADE, with a small cache, improves the cache hit ratio by up to 4.5× compared to the LRU caching policy.
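  A simplified sketch of importance-aware caching in the spirit of SHADE: instead of evicting the least recently used sample, evict the cached sample with the lowest dynamically updated importance score. The class, scoring, and eviction rule below are placeholders for illustration, not SHADE's rank-based scheme.

  ```python
  # Importance-aware sample cache: evict the lowest-importance sample on a miss.
  class ImportanceCache:
      def __init__(self, capacity):
          self.capacity = capacity
          self.data = {}        # sample_id -> sample payload
          self.score = {}       # sample_id -> importance score

      def update_score(self, sample_id, new_score):
          if sample_id in self.score:
              self.score[sample_id] = new_score   # refreshed during training

      def get(self, sample_id, fetch_fn, importance):
          if sample_id in self.data:
              return self.data[sample_id]              # cache hit
          if len(self.data) >= self.capacity:          # evict lowest importance
              victim = min(self.score, key=self.score.get)
              del self.data[victim], self.score[victim]
          self.data[sample_id] = fetch_fn(sample_id)   # miss: fetch from remote storage
          self.score[sample_id] = importance
          return self.data[sample_id]

  cache = ImportanceCache(capacity=2)
  fetch = lambda i: f"sample-{i}"
  cache.get(1, fetch, importance=0.9)
  cache.get(2, fetch, importance=0.1)
  cache.get(3, fetch, importance=0.5)     # evicts sample 2 (lowest importance)
  print(sorted(cache.data))               # [1, 3]
  ```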
- Synthesizing a Hybrid Benchmark Suite with BenchPrime. Wu, Xiaolong (Virginia Tech, 2018-10-09). This paper presents BenchPrime, an automated benchmark analysis toolset that is systematic and extensible for analyzing the similarity and diversity of benchmark suites. BenchPrime takes multiple benchmark suites and their evaluation metrics as inputs and generates a hybrid benchmark suite comprising only essential applications. Unlike prior work, BenchPrime uses linear discriminant analysis rather than principal component analysis, and it selects the best clustering algorithm and the optimized number of clusters in an automated and metric-tailored way, thereby achieving high accuracy. In addition, BenchPrime ranks the benchmark suites in terms of their application-set diversity and estimates how unique each benchmark suite is compared to the other suites. As a case study, this work for the first time compares DenBench with MediaBench and MiBench using four different metrics to provide a multi-dimensional understanding of the benchmark suites. For each metric, BenchPrime measures to what degree DenBench applications are irreplaceable by those in MediaBench and MiBench. This provides a means of identifying an essential subset from the three benchmark suites without compromising the application balance of the full set. The experimental results show that the necessity of including DenBench applications varies across the target metrics and that significant redundancy exists among the three benchmark suites.
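  A rough sketch of the kind of pipeline BenchPrime automates: project per-benchmark feature vectors with linear discriminant analysis (using suite labels), then choose the number of clusters by a quality score. This is a generic scikit-learn illustration under assumed synthetic data, not the BenchPrime toolset, and the silhouette criterion is only one possible way to pick the cluster count.

  ```python
  # LDA projection of benchmark feature vectors, then cluster-count selection.
  import numpy as np
  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
  from sklearn.cluster import KMeans
  from sklearn.metrics import silhouette_score

  rng = np.random.default_rng(0)
  features = rng.standard_normal((60, 12))     # 60 benchmarks x 12 metrics (synthetic)
  suite = np.repeat([0, 1, 2], 20)             # e.g. DenBench / MediaBench / MiBench
  features[suite == 1] += 1.5                  # give the suites some separation
  features[suite == 2] -= 1.5

  projected = LinearDiscriminantAnalysis(n_components=2).fit_transform(features, suite)

  best_k, best_score = None, -1.0
  for k in range(2, 8):
      labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(projected)
      score = silhouette_score(projected, labels)
      if score > best_score:
          best_k, best_score = k, score
  print("chosen cluster count:", best_k)
  ```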
- Towards a Resource Efficient Framework for Distributed Deep Learning Applications. Han, Jingoo (Virginia Tech, 2022-08-24). Distributed deep learning has achieved tremendous success in solving scientific problems in research and discovery over the past years. Deep learning training is quite challenging because it requires training on large-scale, massive datasets, especially with graphics processing units (GPUs) in the latest high-performance computing (HPC) supercomputing systems. HPC architectures bring different performance trends in training throughput compared to the existing studies. Multiple GPUs and high-speed interconnects are used for distributed deep learning on HPC systems. Extant distributed deep learning systems are designed for non-HPC systems without considering efficiency, leading to under-utilization of expensive HPC hardware. In addition, increasing resource heterogeneity has a negative effect on resource efficiency in distributed deep learning methods, including federated learning. Thus, it is important to address the increasing demand for both high performance and high resource efficiency in distributed deep learning systems, including the latest HPC systems and federated learning systems. In this dissertation, we explore and design novel methods and frameworks to improve the resource efficiency of distributed deep learning training. We address the following five important topics: performance analysis of deep learning for supercomputers, GPU-aware deep learning job scheduling, topology-aware virtual GPU training, heterogeneity-aware adaptive scheduling, and a token-based incentive algorithm. In the first part (Chapter 3), we focus on analyzing the performance trends of distributed deep learning on the latest HPC systems, such as the Summitdev supercomputer at Oak Ridge National Laboratory. We provide insights by conducting a comprehensive performance study of how deep learning workloads affect the performance of HPC systems with large-scale parallel processing capabilities. In the second part (Chapter 4), we design and develop a novel deep learning job scheduler, MARBLE, which considers the efficiency of GPU resources based on the non-linear scalability of GPUs in a single node and improves GPU utilization by sharing GPUs among multiple deep learning training workloads. The third part of this dissertation (Chapter 5) proposes TOPAZ, a topology-aware virtual GPU training system specifically designed for distributed deep learning on recent HPC systems. In the fourth part (Chapter 6), we explore an innovative holistic federated learning scheduler that employs a heterogeneity-aware adaptive selection method for improving resource efficiency and accuracy, coupled with resource usage profiling and accuracy monitoring to achieve multiple goals. In the fifth part of this dissertation (Chapter 7), we focus on how to provide incentives to participants according to their contribution to the performance of the final federated model, with tokens used as a means of paying for the services of participants and the training infrastructure.
- Towards Using Free Memory to Improve Microarchitecture Performance. Panwar, Gagandeep (Virginia Tech, 2020-05-18). A computer system's memory is designed to accommodate the worst-case workloads with the highest memory requirements; as such, memory is underutilized when a system runs workloads with common-case memory requirements. Through a large-scale study of four production HPC systems, we find that the memory underutilization problem in HPC systems is very severe. As unused memory is wasted memory, we propose exposing a compute node's unused memory to its CPU(s) through a user-transparent CPU-OS codesign. This can enable many new microarchitecture techniques that transparently leverage unused memory locations to help improve microarchitecture performance. We refer to these techniques as Free-memory-aware Microarchitecture Techniques (FMTs). In the context of HPC systems, we present a detailed example of an FMT called Free-memory-aware Replication (FMR). FMR replicates in-use data to unused memory locations to effectively reduce average memory read latency. On average across five HPC benchmark suites, FMR provides 13% performance and 8% system-level energy improvement.
- Utilization-adaptive Memory Architectures. Panwar, Gagandeep (Virginia Tech, 2024-06-14). DRAM contributes significantly to a server system's cost and global warming potential. To make matters worse, DRAM density scaling has not kept up with the scaling in logic and storage technologies. An effective way to reduce DRAM's monetary and environmental cost is to increase its effective utilization and extract the best possible performance in all utilization scenarios. To this end, this dissertation proposes Utilization-adaptive Memory Architectures that enhance the memory controller with the ability to adapt to current memory utilization and implement techniques to boost system performance. These techniques fall under two categories: (i) the techniques under Utilization-adaptive Hardware Memory Replication target the scenario where memory is underutilized and aim to boost performance versus a conventional system without replication, and (ii) the techniques under Utilization-adaptive Hardware Memory Compression target the scenario where memory utilization is high and aim to significantly increase memory capacity while closing the performance gap versus a conventional system that has sufficient memory and does not require compression.