Browsing by Author "Jung, Changhee"
Now showing 1 - 20 of 24
Results Per Page
Sort Options
- Analysis and Enforcement of Properties in Software SystemsWu, Meng (Virginia Tech, 2019-07-02)Due to the lack of effective techniques for detecting and mitigating property violations, existing approaches to ensure the safety and security of software systems are often labor intensive and error prone. Furthermore, they focus primarily on functional correctness of the software code while ignoring micro-architectural details of the underlying processor, such as cache and speculative execution, which may undermine their soundness guarantees. To fill the gap, I propose a set of new methods and tools for ensuring the safety and security of software systems. Broadly speaking, these methods and tools fall into three categories. The first category is concerned with static program analysis. Specifically, I develop a novel abstract interpretation framework that considers both speculative execution and a cache model, and guarantees to be sound for estimating the execution time of a program and detecting side-channel information leaks. The second category is concerned with static program transformation. The goal is to eliminate side channels by equalizing the number of CPU cycles and the number of cache misses along all program paths for all sensitive variables. The third category is concerned with runtime safety enforcement. Given a property that may be violated by a reactive system, the goal is to synthesize an enforcer, called the shield, to correct the erroneous behaviors of the system instantaneously, so that the property is always satisfied by the combined system. I develop techniques to make the shield practical by handling both burst error and real-valued signals. The proposed techniques have been implemented and evaluated on realistic applications to demonstrate their effectiveness and efficiency.
- AnalyzeThis: An Analysis Workflow-Aware Storage SystemSim, Hyogi (Virginia Tech, 2014-12-17)Supercomputing application simulations on hundreds of thousands of cores produce vast amounts of data that need to be analyzed on smaller-scale clusters to glean insights. The process is referred to as an end-to-end workflow. Extant workflow systems are stymied by the storage wall, resulting from both the disk-based parallel file system (PFS) failing to keep pace with the compute and memory subsystems as well as the inefficiencies in end-to-end workflow processing. In the post-petaflop era, supercomputers are provisioned with flash devices, as an intermediary between compute nodes and the PFS, enabling novel paradigms not just for expediting I/O, but also for the in-situ analysis of the simulation output data on the flash device. An array of such active flash elements allows us to fundamentally rethink the way data analysis workflows interact with storage systems. By blending the flash storage array and data analysis together in a seamless fashion, we create an analysis workflow-aware storage system, AnalyzeThis. Our guiding principle is that analysis-awareness be deeply ingrained in each and every layer of the storage system—active flash fabric, analysis object abstraction layer, scheduling layer within the storage, and an easy-to-use file system interface—thereby elevating data analyses as first-class citizens. Together, these concepts transform AnalyzeThis into a potent analytics-aware appliance.
- AutoMatch: Automated Matching of Compute Kernels to Heterogeneous HPC ArchitecturesHelal, Ahmed E.; Feng, Wu-chun; Jung, Changhee; Hanafy, Yasser Y. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-12-13)HPC systems contain a wide variety of heterogeneous computing resources, ranging from general-purpose CPUs to specialized accelerators. Porting sequential applications to such systems for achieving high performance requires significant software and hardware expertise as well as extensive manual analysis of both the target architectures and applications to decide the best performing architecture and implementation technique for each application. To streamline this tedious process, this paper presents AutoMatch, a tool for automated matching of compute kernels to heterogeneous HPC architectures. AutoMatch analyzes the sequential application code and automatically predicts the performance of the best parallel implementation of its compute kernels on different hardware architectures. AutoMatch leverages such prediction results to identify the best device for each kernel from a set of devices including multi-core CPUs and many-core GPUs. In addition, it estimates the relative execution cost between the different architectures to drive a workload distribution scheme, which enables end users to efficiently exploit the available compute resources across multiple heterogeneous architectures. We demonstrate the efficacy of AutoMatch, using a set of open-source HPC applications and benchmarks with different parallelism profiles and memory-access patterns. The empirical evaluation shows that AutoMatch is highly accurate across five different heterogeneous architectures, identifying the best architecture for each workload in 96% of the test cases, and its workload distribution scheme has a comparable performance to a profiling-driven oracle.
- Automated Runtime Analysis and Adaptation for Scalable Heterogeneous ComputingHelal, Ahmed Elmohamadi Mohamed (Virginia Tech, 2020-01-29)In the last decade, there have been tectonic shifts in computer hardware because of reaching the physical limits of the sequential CPU performance. As a consequence, current high-performance computing (HPC) systems integrate a wide variety of compute resources with different capabilities and execution models, ranging from multi-core CPUs to many-core accelerators. While such heterogeneous systems can enable dramatic acceleration of user applications, extracting optimal performance via manual analysis and optimization is a complicated and time-consuming process. This dissertation presents graph-structured program representations to reason about the performance bottlenecks on modern HPC systems and to guide novel automation frameworks for performance analysis and modeling and runtime adaptation. The proposed program representations exploit domain knowledge and capture the inherent computation and communication patterns in user applications, at multiple levels of computational granularity, via compiler analysis and dynamic instrumentation. The empirical results demonstrate that the introduced modeling frameworks accurately estimate the realizable parallel performance and scalability of a given sequential code when ported to heterogeneous HPC systems. As a result, these frameworks enable efficient workload distribution schemes that utilize all the available compute resources in a performance-proportional way. In addition, the proposed runtime adaptation frameworks significantly improve the end-to-end performance of important real-world applications which suffer from limited parallelism and fine-grained data dependencies. Specifically, compared to the state-of-the-art methods, such an adaptive parallel execution achieves up to an order-of-magnitude speedup on the target HPC systems while preserving the inherent data dependencies of user applications.
- BenchPrime: Accurate Benchmark Subsetting with Optimized Clustering Algorithm SelectionLiu, Qingrui; Wu, Xiaolong; Kittinger, Larry; Levy, Markus; Jung, Changhee (Department of Computer Science, Virginia Polytechnic Institute & State University, 2018-08-24)This paper presents BenchPrime, an automated benchmark analysis toolset that is systematic and extensible to analyze the similarity and diversity of benchmark suites. BenchPrime takes multiple benchmark suites and their evaluation metrics as inputs and generates a hybrid benchmark suite comprising only essential applications. Unlike prior work, BenchPrime uses linear discriminant analysis rather than principal component analysis, as well as selects the best clustering algorithm and the optimized number of clusters in an automated and metric-tailored way, thereby achieving high accuracy. In addition, BenchPrime ranks the benchmark suites in terms of their application set diversity and estimates how unique each benchmark suite is compared to other suites. As a case study, this work for the first time compares the DenBench with the MediaBench and MiBench using four different metrics to provide a multi-dimensional understanding of the benchmark suites. For each metric, BenchPrime measures to what degree DenBench applications are irreplaceable with those in MediaBench and MiBench. This provides means for identifying an essential subset from the three benchmark suites without compromising the application balance of the full set. The experimental results show that the necessity of including DenBench applications varies across the target metrics and that significant redundancy exists among the three benchmark suites.
- Circuit Support for Practical and Performant Batteryless SystemsWilliams, Harrison Ridgway (Virginia Tech, 2024-06-03)Tiny, ultra-low-power embedded processors enable sophisticated computing deployments in a myriad of areas previously off limits to computing power, ranging from intelligent medical implants to massive scale 'smart dust'-type sensing deployments. While today's computing and sensing hardware is well-suited for these next generation deployments, the batteries powering them are not: the size and weight of today's mobile and Internet-of-Things devices are dominated by their batteries, which also limit systems' lifespans and potential for deployment in sensitive contexts. Academic efforts have demonstrated the feasibility of harvesting energy on-demand from the environment as a practical alternative to classical battery power, instead buffering harvested energy in a capacitor to power intermittent bursts of operation. Energy harvesting circuits are miniaturizable, inexpensive, and enable effectively indefinite operation when compared to batteries---but introduce new problems stemming from the lack of a reliable power source. Unfortunately, these problems have so far confined batteryless systems to small-scale research deployments. The central design challenge for effective batteryless operation is efficiently using scarce input power from the energy harvesting frontend. Despite advances in both harvester and processor efficiency, digital systems often consume orders of magnitude more power than can be supplied by harvesting circuits---forcing systems to operate in short bursts punctuated by power failure and a long recharge period. Today's batteryless systems pay a steep price to sustain operation across these common-case power losses: current platforms depend on high-performance non-volatile memory to quickly and efficiently checkpoint program state before power loss, limiting batteryless operation to a small selection of devices which integrate these novel memory technologies. Choosing exactly when to checkpoint to non-volatile memory represents a challenge in itself: the hardware required to detect impending power failure often represents a large proportion of the system's overall energy consumption, forcing designers to choose between the energy overhead of voltage monitoring or the runtime overhead of 'energy-oblivious' checkpointing models. Finally, the choice of buffer capacitor size has a large impact on overall energy efficiency---but the optimal choice depends on runtime energy dynamics which are difficult to predict at design time, leaving designers to make at best educated guesses about future environmental conditions. This work approaches energy harvesting system design from a circuits perspective, answering the following research questions towards practical and performant batteryless operation: 1. Can the emergent properties of today's low-power systems be used to enable efficient intermittent operation on new classes of devices? 2. What compromises can we make in voltage monitor design to minimize power consumption while maintaining just enough functionality for batteryless operation? 3. How can we buffer harvested energy in a way that maximizes energy efficiency despite unpredictable system-level power dynamics? This work answers the following questions by producing the following research artifacts: 1. The first non-volatile memory invariant system to enable intermittent operation on embedded devices lacking high-performance memory (Chapter 2). 2. The first voltage monitoring circuit designed for batteryless systems to enable energy-aware operation without sacrificing efficiency (Chapter 3). 3. The first highly efficient power-adaptive energy buffer to store harvested energy without compromising on efficiency or performance (Chapter 4).
- CommAnalyzer: Automated Estimation of Communication Cost on HPC Clusters Using Sequential CodeHelal, Ahmed E.; Jung, Changhee; Feng, Wu-chun; Hanafy, Yasser Y. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2017-08-14)MPI+X is the de facto standard for programming applications on HPC clusters. The performance and scalability on such systems is limited by the communication cost on different number of processes and compute nodes. Therefore, the current communication analysis tools play a critical role in the design and development of HPC applications. However, these tools require the availability of the MPI implementation, which might not exist in the early stage of the development process due to the parallel programming effort and time. This paper presents CommAnalyzer, an automated tool for communication model generation from a sequential code. CommAnalyzer uses novel compiler analysis techniques and graph algorithms to capture the inherent communication characteristics of sequential applications, and to estimate their communication cost on HPC systems. The experiments with real-world, regular and irregular scientific applications demonstrate the utility of CommAnalyzer in estimating the communication cost on HPC clusters with more than 95% accuracy on average.
- A Compiler Framework to Support and Exploit Heterogeneous Overlapping-ISA Multiprocessor PlatformsJelesnianski, Christopher Stanisław (Virginia Tech, 2015-10-28)As the demand for ever increasingly powerful machines continues, new architectures are sought to be the next route of breaking past the brick wall that currently stagnates the performance growth of modern multi-core CPUs. Due to physical limitations, scaling single-core performance any further is no longer possible, giving rise to modern multi-cores. However, the brick wall is now limiting the scaling of general-purpose multi-cores. Heterogeneous-core CPUs have the potential to continue scaling by reducing power consumption through exploitation of specialized and simple cores within the same chip. Heterogeneous-core CPUs join fundamentally different processors each which their own peculiar features, i.e., fast execution time, improved power efficiency, etc; enabling the building of versatile computing systems. To make heterogeneous platforms permeate the computer market, the next hurdle to overcome is the ability to provide a familiar programming model and environment such that developers do not have to focus on platform details. Nevertheless, heterogeneous platforms integrate processors with diverse characteristics and potentially a different Instruction Set Architecture (ISA), which exacerbate the complexity of the software. A brave few have begun to tread down the heterogeneous-ISA path, hoping to prove that this avenue will yield the next generation of super computers. However, many unforeseen obstacles have yet to be discovered. With this new challenge comes the clear need for efficient, developer-friendly, adaptable system software to support the efforts of making heterogeneous-ISA the golden standard for future high-performance and general-purpose computing. To foster rapid development of this technology, it is imperative to put the proper tools into the hands of developers, such as application and architecture profiling engines, in order to realize the best heterogeneous-ISA platform possible with available technology. In addition, it would be in the best interest to create tools to be as "timeless" as possible to expose fundamental concepts industry could benefit from and adopt in future designs. We demonstrate the feasibility of a compiler framework and runtime for an existing heterogeneous-ISA operating system (Popcorn Linux) for automatically scheduling compute blocks within an application on a given heterogeneous-ISA high-performance platform (in our case a platform built with Intel Xeon - Xeon Phi). With the introduced Profiler, Partitioner, and Runtime support, we prove to be able to automatically exploit the heterogeneity in an overlapping-ISA platform, being faster than native execution and other parallelism programming models. Empirically evaluating our compiler framework, we show that application execution on Popcorn Linux can be up to 52% faster than the most performant native execution for Xeon or Xeon Phi. Using our compiler framework relieves the developer from manual scheduling and porting of applications, requiring only a single profiling run per application.
- Compiler-Directed Error Resilience for Reliable ComputingLiu, Qingrui (Virginia Tech, 2018-08-08)Error resilience has become as important as power and performance in modern computing architecture. There are various sources of errors that can paralyze real-world computing systems. Of particular interest to this dissertation are single-event errors. They can be the results of energetic particle strike or abrupt power outage that corrupts the program states leading to system failures. Specifically, energetic particle strike is the major cause of soft error while abrupt power outage can result in memory inconsistency in the nonvolatile memory systems. Unfortunately, existing techniques to handle those single-event errors are either resource consuming (e.g., hardware approaches) or heavy-weight (e.g., software approaches). To address this problem, this dissertation identifies idempotent processing as an alternative recovery technique to handle the system failures in an efficient and low-cost manner. Then, this dissertation first proposes to design and develop a compiler-directed lightweight methodology which leverages idempotent processing and the state-of-the-art sensor-based detection to achieve soft error resilience at low-cost. This dissertation also introduces a lightweight soft error tolerant hardware design that redefines idempotent processing where the idempotent regions can be created, verified and recovered from the processor's point of view. Furthermore, this dissertation proposes a series of compiler optimizations that significantly reduce the hardware and runtime overhead of the idempotent processing. Lastly, this dissertation proposes a failure-atomic system integrated with idempotent processing to resolve another type of single-event error, i.e., failure-induced memory inconsistency in the nonvolatile memory systems.
- Compiler-Directed Failure Atomicity for Nonvolatile MemoryLiu, Qingrui; Izraelevitz, Joseph; Lee, Se Kwon; Scott, Michael L.; Noh, Sam H.; Jung, Changhee (Department of Computer Science, Virginia Polytechnic Institute & State University, 2019-07-15)This paper presents iDO, a compiler-directed approach to failure atomicity with nonvolatile memory. Unlike most prior work, which instruments each store of persistent data for redo or undo logging, the iDO compiler identifies idempotent instruction sequences, whose re-execution is guaranteed to be side effect-free, thereby eliminating the need to log every persistent store. Using an extension of prior work on JUSTDO logging, the compiler then arranges, during recovery from failure, to back up each thread to the beginning of the current idempotent region and re-execute to the end of the current failure-atomic section. This extension transforms JUSTDO logging from a technique of value only on hypothetical future machines with nonvolatile caches into a technique that also significantly outperforms state-of-the art lock-based persistence mechanisms on current hardware during normal execution, while preserving very fast recovery times.
- Design Optimization Techniques for Time-Critical Cyber-Physical SystemsZhao, Yecheng (Virginia Tech, 2020-01-20)Cyber-Physical Systems (CPS) are widely deployed in critical applications which are subject to strict timing constraints. To ensure correct timing behavior, much of the effort has been dedicated to the development of validation and verification methods for CPS (e.g., system models and their timing and schedulability analysis). As CPS is becoming increasingly complex, there is an urgent need for efficient optimization techniques that can aid the design of large-scale systems. Specifically, techniques that can find good design options in a reasonable amount of time while meeting all the timing and other critical requirements are becoming vital. However, the current mindset is to use existing schedulability analysis and optimization techniques for the design optimization of time-critical CPS. This has resulted in two issues in today's CPS design: 1) Existing timing and schedulability analysis are very difficult and inefficient to be integrated into well-established optimization frameworks such as mathematical programming; 2) New system models and timing analysis are being developed in a way that is increasingly unfriendly to optimization. Due to these difficulties, existing practice for optimization mostly relies on meta or ad-hoc heuristics, which suffers either from sub-optimality or limited applicability. In this dissertation, we seek to address these issues and explore two new directions for developing optimization algorithms for time-critical CPS. The first is to develop {em optimization-oriented timing analysis}, that are efficient to formulate in mathematical programming framework. The second is a domain-specific optimization framework. The framework leverages domain-specific knowledge to provide methods that abstract timing analysis into a simple mathematical form. This allows to efficiently handle the complexity of timing analysis in optimization algorithms. The results on a number of case studies show that the proposed approaches have the potential to significantly improve upon scalability (several orders of magnitude faster) and solution quality, while being applicable to various system models, timing analysis techniques, and design optimization problems in time-critical CPS.
- Designing Practical Software Bug Detectors Using Commodity Hardware and Common Programming PatternsZhang, Tong (Virginia Tech, 2020-01-13)Software bugs can cost millions and affect people's daily lives. However, many bug detection tools are not always practical in reality, which hinders their wide adoption. There are three main concerns regarding existing bug detectors: 1) run-time overhead in dynamic bug detectors, 2) space overhead in dynamic bug detectors, and 3) scalability and precision issues in static bug detectors. With those in mind, we propose to: 1) leverage commodity hardware to reduce run-time overhead, 2) reuse metadata maintained by one bug detector to detect other types of bugs, reducing space overhead, and 3) apply programming idioms to static analyses, improving scalability and precision. We demonstrate the effectiveness of three approaches using data race bugs, memory safety bugs, and permission check bugs, respectively. First, we leverage the commodity hardware transactional memory (HTM) selectively to use the dynamic data race detector only if necessary, thereby reducing the overhead from 11.68x to 4.65x. We then present a production-ready data race detector, which only incurs a 2.6% run-time overhead, by using performance monitoring units (PMUs) for online memory access sampling and offline unsampled memory access reconstruction. Second, for memory safety bugs, which are more common than data races, we provide practical temporal memory safety on top of the spatial memory safety of the Intel MPX in a memory-efficient manner without additional hardware support. We achieve this by reusing the existing metadata and checks already available in the Intel MPX-instrumented applications, thereby offering full memory safety at only 36% memory overhead. Finally, we design a scalable and precise function pointer analysis tool leveraging indirect call usage patterns in the Linux kernel. We applied the tool to the detection of permission check bugs; the detector found 14 previously unknown bugs within a limited time budget.
- ELASTIN: Achieving Stagnation-Free Intermittent Computation with Boundary-Free Adaptive ExecutionChoi, Jongouk; Joe, Hyunwoo; Kim, Yongjoo; Jung, Changhee (Department of Computer Science, Virginia Polytechnic Institute & State University, 2019-07-15)This paper presents ELASTIN, a stagnation-free intermittent computing system for energy-harvesting devices that ensures forward progress in the presence of frequent power outages without partitioning program into recoverable regions or tasks. ELASTIN leverages both timer-based checkpointing of volatile registers and copy-on-write mappings of nonvolatile memory pages to restore them in the wake of power failure. During each checkpoint interval, ELASTIN tracks memory writes on a per-page basis and backs up the original page using custom software-controlled memory protection without MMU or TLB. When a new interval starts at each timer expiration, ELASTIN clears the write permission of all the pages written during the previous interval and checkpoints all registers including a program counter as a recovery point. In particular, ELASTIN dynamically reconfigures both the checkpoint interval and the page size to achieve stagnation-free intermittent computation and maximize forward progress across power outages. The experiments on TI’s MSP430 board with energy harvesting traces show that ELASTIN outperforms the state-of-the-art scheme by 3.5X on average (up to orders of magnitude speedup) and guarantees forward progress.
- Empirical Evaluation of Edge Computing for Smart Building Streaming IoT ApplicationsGhaffar, Talha (Virginia Tech, 2019-03-13)Smart buildings are one of the most important emerging applications of Internet of Things (IoT). The astronomical growth in IoT devices, data generated from these devices and ubiquitous connectivity have given rise to a new computing paradigm, referred to as "Edge computing", which argues for data analysis to be performed at the "edge" of the IoT infrastructure, near the data source. The development of efficient Edge computing systems must be based on advanced understanding of performance benefits that Edge computing can offer. The goal of this work is to develop this understanding by examining the end-to-end latency and throughput performance characteristics of Smart building streaming IoT applications when deployed at the resource-constrained infrastructure Edge and to compare it against the performance that can be achieved by utilizing Cloud's data-center resources. This work also presents a real-time streaming application to detect and localize the footstep impacts generated by a building's occupant while walking. We characterize this application's performance for Edge and Cloud computing and utilize a hybrid scheme that (1) offers maximum of around 60% and 65% reduced latency compared to Edge and Cloud respectively for similar throughput performance and (2) enables processing of higher ingestion rates by eliminating network bottleneck.
- Energy and Performance Models Enabling Design Space Exploration using Domain Specific LanguagesUmar, Mariam (Virginia Tech, 2018-05-25)With the advent of exascale architectures maximizing performance while maintaining energy consumption within reasonable limits has become one of the most critical design constraints. This constraint is particularly significant in light of the power budget of 20 MWatts set by the U.S. Department of Energy for exascale supercomputing facilities. Therefore, understanding an application's characteristics, execution pattern, energy footprint, and the interactions of such aspects is critical to improving the application's performance as well as its utilization of the underlying resources. With conventional methods of analyzing performance and energy consumption trends scientists are forced to limit themselves to a manageable number of design parameters. While these modeling techniques have catered to the needs of current high-performance computing systems, the complexity and scale of exascale systems demands that large-scale design-space-exploration techniques are developed to enable comprehensive analysis and evaluations. In this dissertation we present research on performance and energy modeling of current high performance computing and future exascale systems. Our thesis is focused on the design space exploration of current and future architectures, in terms of their reconfigurability, application's sensitivity to hardware characteristics (e.g., system clock, memory bandwidth), application's execution patterns, application's communication behavior, and utilization of resources. Our research is aimed at understanding the methods by which we may maximize performance of exascale systems, minimize energy consumption, and understand the trade offs between the two. We use analytical, statistical, and machine-learning approaches to develop accurate, portable and scalable performance and energy models. We develop application and machine abstractions using Aspen (a domain specific language) to implement and evaluate our modeling techniques. As part of our research we develop and evaluate system-level performance and energy-consumption models that form part of an automated modeling framework, which analyzes application signatures to evaluate sensitivity of reconfigurable hardware components for candidate exascale proxy applications. We also develop statistical and machine-learning based models of the application's execution patterns on heterogeneous platforms. We also propose a communication and computation modeling and mapping framework for exascale proxy architectures and evaluate the framework for an exascale proxy application. These models serve as external and internal extensions to Aspen, which enable proxy exascale architecture implementations and thus facilitate design space exploration of exascale systems.
- Impact of Increased Cache Misses on Runtime Performance of MPX-enabled ProgramsSharma, Niti (Virginia Tech, 2019-06-10)Low level languages like C and C++ provide high performance and direct control over memory management. But these languages are prone to memory safety violations. Intel introduced a new ISA extension-Memory Protection Extension(MPX), a hardware-assisted full-stack solution, to protect against the memory safety violations. While MPX efficiently prevents memory errors like buffer overflows and out of bound memory accesses, it comes at the cost of high performance overheads. Also, the cache locality worsens in MPX protected applications. In our research, we analyze if there is a correlation between increase in cache misses and runtime degradation in programs compiled with MPX support. We analyze 15 SPEC CPU benchmark programs for different input sizes on Windows platform, compiled with Intel's ICC compiler. We find that for input sizes train(medium) and ref(large), the average performance overheads are 140% and 144% respectively. We find that 5 out of 15 benchmarks do not have any runtime overheads and also, do not have any change in cache misses at any level. However for rest of the 10 benchmarks, we find a strong correlation between runtime overheads and cache misses overheads, with the correlation coefficients ranging from 0.8 to 0.36 for different input sizes. Based on our findings, we conclude that there is a direct correlation between runtime overheads and increase in cache misses. We also find that instructions overheads and runtime overheads have a positive correlation, with the coefficient values ranging from 0.7 to 0.33 for different input sizes.
- Low-Level Static Analysis for Memory Usage and Control Flow RecoveryBockenek, Joshua Alexander (Virginia Tech, 2023-03-07)Formal characterization of the memory used by a program is an important basis for security analyses, compositional verification, and identification of noninterference. However, soundly proving memory usage requires operating on the assembly level due to the semantic gap between high-level languages and the code that processors actually execute. Automated methods, such as model checking, would not be able to handle many interesting functions due to the undecidability of memory usage. Fully-interactive methods do not scale well either. Sound control flow recovery (CFR) is also important for binary decompilation, verification, patching, and security analysis. It lifts raw unstructured data into a form that allows reasoning over behavior and semantics. However, doing so requires interpreting the behavior of the program when indirect or dynamic control flow exists, creating a recursive dependency. This dissertation tackles the first property with two contributions that perform proof generation combined with interactive theorem proving in a semi-automated manner: an untrusted tool extracts as much information as it can from the functions under test and then generates all the necessary proofs to be completed in a theorem prover. The first, Floyd-style approach still requires significant manual effort but provides good flexibility and ensures no paths are analyzed more than once. In contrast, the second, Hoare-style approach sacrifices some flexibility and avoidance of repeated path evaluation in order to achieve much greater automation. However, neither approach can handle the dynamic control flow caused by indirect branching. The second property is handled by the second set of contributions of this dissertation. These two contributions provide fully-automated methods of recovering control flow from binaries even in the presence of indirect branching. When such dynamic control flow cannot be overapproximatively resolved, it is clearly noted in the resultant output. In the first approach to control flow recovery, a structured memory representation allows for general analysis of control flow in the presence of indirection, gaining scalability by utilizing context-free function analysis. It supports various aliasing conditions via the usage of nondeterminism, with multiple output states potentially being produced from a given input state. The second approach adds function context and abstract interpretation-inspired modeling of the C++ exception handling (EH) application binary interface (ABI), allowing for the discovery of previously-unknown paths while maintaining or increasing automation.
- Memory Turbo Boost: Architectural Support for Using Unused Memory for Memory Replication to Boost Server Memory PerformanceZhang, Da (Virginia Tech, 2023-06-28)A significant portion of the memory in servers today is often unused. Our large-scale study of HPC systems finds that more than half of the total memory in active nodes running user jobs are unused for 88% of the time. Google and Azure Cloud studies also report unused memory accounts for 40% of the total memory in their servers, on average. Leaving so much memory unused is wasteful. To address this problem, we note that in the context of CPUs, Turbo Boost can turn off the unused cores to boost the performance of in-use cores. However, there is no equivalent technology in the context of memory; no matter how much memory is unused, the performance of in-use memory remains the same. This dissertation explores architectural techniques to utilize the unused memory to boost the performance of in-use memory and refer to them collectively as Memory Turbo Boost. This dissertation explores how to turbo boost memory performance through memory replication; specifically, it explores how to efficiently store the replicas in the unused memory and explores multiple architectural techniques to utilize the replicas to enhance memory system performance. Performance simulations show that Memory Turbo Boost can improve node-level performance by 18%, on average across a wide spectrum of workloads. Our system-wide simulations show applying Memory Turbo Boost to an HPC system provides 1.4x average speedup on job turnaround time.
- Popcorn Linux: A Compiler and Runtime for Execution Migration Between Heterogeneous-ISA ArchitecturesLyerly, Robert Frantz (Virginia Tech, 2019-04-25)In recent years there has been a proliferation of parallel and heterogeneous architectures. As chip designers have hit fundamental limits in traditional processor scaling, they have begun rethinking processor architecture from the ground up. In addition to creating new classes of processors, chip designers have revisited CPU microarchitecture in order to target different computing contexts. CPUs have been optimized for low-power smartphones and extended for high-performance computing in order to achieve better performance energy efficiency for heavy computational tasks. Although heterogeneity adds significant complexity to both hardware and software, recent works have shown tremendous power and performance benefits obtainable through specialization. It is clear that emerging systems will be increasingly heterogeneous. Many of these emerging systems couple together cores of different instruction set architectures (ISA), due to both market forces and the potential performance and power benefits in optimizing application execution. However, differently from symmetric multiprocessors or even asymmetric single-ISA multiprocessors, natively compiled applications cannot freely migrate between heterogeneous-ISA processors. This is due to the fact that applications are compiled to an instruction set architecture-specific format which is incompatible on other instruction set architectures. This creates serious limitations, as execution migration is a fundamental mechanism used by schedulers to reach performance or fairness goals, allows applications to migrate between heterogeneous-ISA CPUs in order to accelerate parallel applications or even leverage ISA-heterogeneity for security benefits. This dissertation describes system software for automatically migrating natively compiled applications across heterogeneous-ISA processors. This dissertation describes the implementation and evaluation of a complete software stack on commodity scale heterogeneous-ISA CPUs, emulating datacenters with heterogeneous-ISA systems or future systems that tightly integrate heterogeneous-ISA CPUs via point-to-point interconnect. This dissertation describes a compiler which builds applications for heterogeneous-ISA execution migration. The compiler generates machine code for every architecture in the system and lays out the application's code and data in a common format. In addition, the compiler generates metadata used by a state transformation runtime to dynamically transform thread execution state between ISA-specific formats, allowing application threads to migrate between different ISAs. The compiler and runtime is evaluated in conjunction with a replicated-kernel operating system, which provides thread migration and distributed shared virtual memory across heterogeneous-ISA processors. This redesigned software stack is evaluated on a setup containing and ARM and an x86 processor interconnected via point-to-point interconnect over PCIe. This dissertation shows that sub-millisecond state transformation is achievable. Additionally, it shows that for a datacenter-like workload using benchmarks from the NAS Parallel Benchmark suite, the system can trade some performance for up to a 66% reduction in energy and up to an 11% reduction in energy-delay product. This dissertation then describes an exploration into using hardware transactional memory (HTM) to maximize scheduling flexibility. Because applications can only migrate between ISAs at program locations with specific properties, there may be a significant delay between when the scheduler wishes to migrate an application and when the application can respond to the migration request. In order to reduce this migration response time, this dissertation describes compiler instrumentation which uses HTM to allow the scheduler to force applications to roll back to the most recently encountered program location suitable for migration. This is evaluated both in terms of overhead and responsiveness to migration requests. In addition to showing the viability of the infrastructure for optimizing workload placement in a heterogeneous-ISA datacenter, this dissertation also demonstrates utilizing the infrastructure to accelerate multithreaded applications. This dissertation describes a new OpenMP runtime named libopenpop that is optimized for executing applications in heterogeneous- ISA systems with distributed shared virtual memory. The runtime utilizes synchronization primitives that enable scale-out execution across rack-scale systems and new work distribution mechanisms that predict the best partitioning of parallel work across CPUs with diverse architectural characteristics. libopenpop demonstrates sizable improvements over a na¨ıve OpenMP implementation – a 38x improvement in multi-server barrier latency, a 5.4x improvement in multi-server data reductions and a geometric mean speedup of 4.04x for scalable applications in an 8-node x86-64 cluster. For a heterogeneous system composed of a highly-clocked x86 server and a highly-parallel ARM server, libopenpop delivers up to a 4.7x speedup and a geometric mean speedup of 41% across benchmarks from several benchmark suites versus the best single-node homogeneous execution. Finally, this dissertation describes leveraging the compiler and state transformation runtime to provide enhanced security for applications. Because the compiler provides detailed information about the stack layout of applications, it can be leveraged to defend against exploits such as stack smashing attacks and return-oriented programming attacks. This dissertation describes Chameleon, a runtime which uses the compiler and state transformation infrastructure to continuously re-randomize the stack layout and code of vulnerable applications to thwart attackers. Chameleon attaches to applications using existing operating system interfaces and periodically switches the application to new randomized stack layouts and code by rewriting the stack. Chameleon enhances security with little overhead – it disrupts a geometric mean 76.32% of code gadgets in benchmark binaries, randomizes stack element locations with geometric mean 3 potential randomized locations, and has 1.1% overhead when re-randomizing every 50 milliseconds, making it extremely difficult for attackers to exploit target applications.
- Runtime Verification and Debugging of Concurrent SoftwareZhang, Lu (Virginia Tech, 2016-07-29)Our reliance on software has been growing fast over the past decades as the pervasive use of computer and software penetrated not only our daily life but also many critical applications. As the computational power of multi-core processors and other parallel hardware keeps increasing, concurrent software that exploit these parallel computing hardware become crucial for achieving high performance. However, developing correct and efficient concurrent software is a difficult task for programmers due to the inherent nondeterminism in their executions. As a result, concurrency related software bugs are among the most troublesome in practice and have caused severe problems in recent years. In this dissertation, I propose a series of new and fully automated methods for verifying and debugging concurrent software. They cover the detection, prevention, classification, and repair of some important types of bugs in the implementation of concurrent data structures and client-side web applications. These methods can be adopted at various stages of the software development life cycle, to help programmers write concurrent software correctly as well as efficiently.