Browsing by Author "Lee, Dongyoon"
Now showing 1 - 20 of 31
Results Per Page
Sort Options
- Advanced Time Integration Methods with Applications to Simulation, Inverse Problems, and Uncertainty QuantificationNarayanamurthi, Mahesh (Virginia Tech, 2020-01-29)Simulation and optimization of complex physical systems are an integral part of modern science and engineering. The systems of interest in many fields have a multiphysics nature, with complex interactions between physical, chemical and in some cases even biological processes. This dissertation seeks to advance forward and adjoint numerical time integration methodologies for the simulation and optimization of semi-discretized multiphysics partial differential equations (PDEs), and to estimate and control numerical errors via a goal-oriented a posteriori error framework. We extend exponential propagation iterative methods of Runge-Kutta type (EPIRK) by [Tokman, JCP 2011], to build EPIRK-W and EPIRK-K time integration methods that admit approximate Jacobians in the matrix-exponential like operations. EPIRK-W methods extend the W-method theory by [Steihaug and Wofbrandt, Math. Comp. 1979] to preserve their order of accuracy under arbitrary Jacobian approximations. EPIRK-K methods extend the theory of K-methods by [Tranquilli and Sandu, JCP 2014] to EPIRK and use a Krylov-subspace based approximation of Jacobians to gain computational efficiency. New families of partitioned exponential methods for multiphysics problems are developed using the classical order condition theory via particular variants of T-trees and corresponding B-series. The new partitioned methods are found to perform better than traditional unpartitioned exponential methods for some problems in mild-medium stiffness regimes. Subsequently, partitioned stiff exponential Runge-Kutta (PEXPRK) methods -- that extend stiffly accurate exponential Runge-Kutta methods from [Hochbruck and Ostermann, SINUM 2005] to a multiphysics context -- are constructed and analyzed. PEXPRK methods show full convergence under various splittings of a diffusion-reaction system. We address the problem of estimation of numerical errors in a multiphysics discretization by developing a goal-oriented a posteriori error framework. Discrete adjoints of GARK methods are derived from their forward formulation [Sandu and Guenther, SINUM 2015]. Based on these, we build a posteriori estimators for both spatial and temporal discretization errors. We validate the estimators on a number of reaction-diffusion systems and use it to simultaneously refine spatial and temporal grids.
- An Application-Attuned Framework for Optimizing HPC Storage SystemsPaul, Arnab Kumar (Virginia Tech, 2020-08-19)High performance computing (HPC) is routinely employed in diverse domains such as life sciences, and Geology, to simulate and understand the behavior of complex phenomena. Big data driven scientific simulations are resource intensive and require both computing and I/O capabilities at scale. There is a crucial need for revisiting the HPC I/O subsystem to better optimize for and manage the increased pressure on the underlying storage systems from big data processing. Extant HPC storage systems are designed and tuned for a specific set of applications targeting a range of workload characteristics, but they lack the flexibility in adapting to the ever-changing application behaviors. The complex nature of modern HPC storage systems along with the ever-changing application behaviors present unique opportunities and engineering challenges. In this dissertation, we design and develop a framework for optimizing HPC storage systems by making them application-attuned. We select three different kinds of HPC storage systems - in-memory data analytics frameworks, parallel file systems and object storage. We first analyze the HPC application I/O behavior by studying real-world I/O traces. Next we optimize parallelism for applications running in-memory, then we design data management techniques for HPC storage systems, and finally focus on low-level I/O load balance for improving the efficiency of modern HPC storage systems.
- Bug Finding Methods for Multithreaded Student Programming ProjectsNaciri, William Malik (Virginia Tech, 2017-08-04)The fork-join framework project is one of the more challenging programming assignments in the computer science curriculum at Virginia Tech. Students in Computer Systems must manage a pool of threads to facilitate the shared execution of dynamically created tasks. This project is difficult because students must overcome the challenges of concurrent programming and conform to the project's specific semantic requirements. When working on the project, many students received inconsistent test results and were left confused when debugging. The suggested debugging tool, Helgrind, is a general-purpose thread error detector. It is limited in its ability to help fix bugs because it lacks knowledge of the specific semantic requirements of the fork-join framework. Thus, there is a need for a special-purpose tool tailored for this project. We implemented Willgrind, a debugging tool that checks the behavior of fork-join frameworks implemented by students through dynamic program analysis. Using the Valgrind framework for instrumentation, checking statements are inserted into the code to detect deadlock, ordering violations, and semantic violations at run-time. Additionally, we extended Willgrind with happens-before based checking in WillgrindPlus. This tool checks for ordering violations that do not manifest themselves in a given execution but could in others. In a user study, we provided the tools to 85 students in the Spring 2017 semester and collected over 2,000 submissions. The results indicate that the tools are effective at identifying bugs and useful for fixing bugs. This research makes multithreaded programming easier for students and demonstrates that special-purpose debugging tools can be beneficial in computer science education.
- Constraint-Based Thread-Modular Abstract InterpretationKusano, Markus Jan Urban (Virginia Tech, 2018-07-25)In this dissertation, I present a set of novel constraint-based thread-modular abstract-interpretation techniques for static analysis of concurrent programs. Specifically, I integrate a lightweight constraint solver into a thread-modular abstract interpreter to reason about inter-thread interference more accurately. Then, I show how to extend the new analyzer from programs running on sequentially consistent memory to programs running on weak memory. Finally, I show how to perform incremental abstract interpretation, with and without the previously mentioned constraint solver, by analyzing only regions of the program impacted by a program modification. I also demonstrate, through experiments, that these new constraint-based static analyzers are significantly more accurate than prior abstract interpretation-based static analyzers, with lower runtime overhead, and that the incremental technique can drastically reduce runtime overhead in the presence of small program modifications.
- Data Filtering and Modeling for Smart Manufacturing NetworkLi, Yifu (Virginia Tech, 2020-08-13)A smart manufacturing network connects machines via sensing, communication, and actuation networks. The data generated from the networks are used in data-driven modeling and decision-making to improve quality, productivity, and flexibility while reducing the cost. This dissertation focuses on improving the data-driven modeling of the quality-process relationship in smart manufacturing networks. The quality-process variable relationships are important to understand for guiding the quality improvement by optimizing the process variables. However, several challenges emerge. First, the big data sets generated from the manufacturing network may be information-poor for modeling, which may lead to high data transmission and computational loads and redundant data storage. Second, the data generated from connected machines often contain inexplicit similarities due to similar product designs and manufacturing processes. Modeling such inexplicit similarities remains challenging. Third, it is unclear how to select representative data sets for modeling in a manufacturing network setting, considering inexplicit similarities. In this dissertation, a data filtering method is proposed to select a relatively small and informative data subset. Multi-task learning is combined with latent variable decomposition to model multiple connected manufacturing processes that are similar-but-non-identical. A data filtering and modeling framework is also proposed to filter the manufacturing data for manufacturing network modeling adaptively. The proposed methodologies have been validated through simulation and the applications to real manufacturing case studies.
- Designing Practical Software Bug Detectors Using Commodity Hardware and Common Programming PatternsZhang, Tong (Virginia Tech, 2020-01-13)Software bugs can cost millions and affect people's daily lives. However, many bug detection tools are not always practical in reality, which hinders their wide adoption. There are three main concerns regarding existing bug detectors: 1) run-time overhead in dynamic bug detectors, 2) space overhead in dynamic bug detectors, and 3) scalability and precision issues in static bug detectors. With those in mind, we propose to: 1) leverage commodity hardware to reduce run-time overhead, 2) reuse metadata maintained by one bug detector to detect other types of bugs, reducing space overhead, and 3) apply programming idioms to static analyses, improving scalability and precision. We demonstrate the effectiveness of three approaches using data race bugs, memory safety bugs, and permission check bugs, respectively. First, we leverage the commodity hardware transactional memory (HTM) selectively to use the dynamic data race detector only if necessary, thereby reducing the overhead from 11.68x to 4.65x. We then present a production-ready data race detector, which only incurs a 2.6% run-time overhead, by using performance monitoring units (PMUs) for online memory access sampling and offline unsampled memory access reconstruction. Second, for memory safety bugs, which are more common than data races, we provide practical temporal memory safety on top of the spatial memory safety of the Intel MPX in a memory-efficient manner without additional hardware support. We achieve this by reusing the existing metadata and checks already available in the Intel MPX-instrumented applications, thereby offering full memory safety at only 36% memory overhead. Finally, we design a scalable and precise function pointer analysis tool leveraging indirect call usage patterns in the Linux kernel. We applied the tool to the detection of permission check bugs; the detector found 14 previously unknown bugs within a limited time budget.
- Detecting Persistence Bugs from Non-volatile Memory Programs by Inferring Likely-correctness ConditionsFu, Xinwei (Virginia Tech, 2022-03-10)Non-volatile main memory (NVM) technologies are revolutionizing the entire computing stack thanks to their storage-and-memory-like characteristics. The ability to persist data in memory provides a new opportunity to build crash-consistent software without paying a storage stack I/O overhead. A crash-consistent NVM program can recover back to a consistent state from a persistent NVM in the event of a software crash or a sudden power loss. In the presence of a volatile cache, data held in a volatile cache is lost after a crash. So NVM programming requires users to manually control the durability and the persistence ordering of NVM writes. To avoid performance overhead, developers have devised customized persistence mechanisms to enforce proper persistence ordering and atomicity guarantees, rendering NVM programs error-prone. The problem statement of this dissertation is how one can effectively detect persistence bugs from NVM programs. However, detecting persistence bugs in NVM programs is challenging because of the huge test space and the manual consistency validation required. The thesis of this dissertation is that we can detect persistence bugs from NVM programs in a scalable and automatic manner by inferring likely-correctness conditions from programs. A likely-correctness condition is a possible correctness condition, which is a condition a program must maintain to make the program crash-consistent. This dissertation proposes to infer two forms of likely-correctness conditions from NVM programs to detect persistence bugs. The first proposed solution is to infer likely-ordering and likely-atomicity conditions by analyzing program dependencies among NVM accesses. The second proposed solution is to infer likely-linearization points to understand a program's operation-level behavior. Using these two forms of likely-correctness conditions, we test only those NVM states and thread interleavings that violate the likely-correctness conditions. This significantly re- duces the test space required to examine. We then leverage the durable linearizability model to validate consistency automatically without manual consistency validation. In this way, we can detect persistence bugs from NVM programs in a scalable and automatic manner. In total, we detect 47 (36 new) persistence correctness bugs and 158 (113 new) persistence performance bugs from 20 single-threaded NVM programs. Additionally, we detect 27 (15 new) persistence correctness bugs from 12 multi-threaded NVM data structures.
- Efficient Symbolic Execution of Concurrent SoftwareGuo, Shengjian (Virginia Tech, 2019-04-26)Concurrent software has been widely utilizing in computer systems owing to the highly efficient computation. However, testing and verifying concurrent software remain challenging tasks. This matter is not only because of the non-deterministic thread interferences which are hard to reason about but also because of the large state space due to the simultaneous path and interleaving explosions. That is, the number of program paths in each thread may be exponential in the number of branch conditions, and also, the number of thread interleavings may be exponential in the number of concurrent operations. This dissertation presents a set of new methods, built upon symbolic execution, a program analysis technique that systematically explores program state space, for testing concurrent programs. By modeling both functional and non-functional properties of the programs as assertions, these new methods efficiently analyze the viable behaviors of the given concurrent programs. The first method is assertion guided symbolic execution, a state space reduction technique that identifies and eliminates redundant executions w.r.t the explored interleavings. The second method is incremental symbolic execution, which generates test inputs only for the influenced program behaviors by the small code changes between two program versions. The third method is SYMPLC, a technique with domain-specific reduction strategies for generating tests for the multitasking Programmable Logic Controller (PLC) programs written in languages specified by the IEC 61131-3 standard. The last method is adversarial symbolic execution, a technique for detecting concurrency related side-channel information leaks by analyzing the cache timing behaviors of a concurrent program in symbolic execution. This dissertation evaluates the proposed methods on a diverse set of both synthesized programs and real-world applications. The experimental results show that these techniques can significantly outperform state-of-the-art symbolic execution tools for concurrent software.
- Empirical Evaluation of Edge Computing for Smart Building Streaming IoT ApplicationsGhaffar, Talha (Virginia Tech, 2019-03-13)Smart buildings are one of the most important emerging applications of Internet of Things (IoT). The astronomical growth in IoT devices, data generated from these devices and ubiquitous connectivity have given rise to a new computing paradigm, referred to as "Edge computing", which argues for data analysis to be performed at the "edge" of the IoT infrastructure, near the data source. The development of efficient Edge computing systems must be based on advanced understanding of performance benefits that Edge computing can offer. The goal of this work is to develop this understanding by examining the end-to-end latency and throughput performance characteristics of Smart building streaming IoT applications when deployed at the resource-constrained infrastructure Edge and to compare it against the performance that can be achieved by utilizing Cloud's data-center resources. This work also presents a real-time streaming application to detect and localize the footstep impacts generated by a building's occupant while walking. We characterize this application's performance for Edge and Cloud computing and utilize a hybrid scheme that (1) offers maximum of around 60% and 65% reduced latency compared to Edge and Cloud respectively for similar throughput performance and (2) enables processing of higher ingestion rates by eliminating network bottleneck.
- Exploring the Process and Challenges of Programming with Regular ExpressionsMichael, Louis Guy IV (Virginia Tech, 2019-06-27)Regular expressions (regexes) are a powerful mechanism for solving string-matching problems and are supported by all modern programming languages. While regular expressions are highly expressive, they are also often perceived to be highly complex and hard to read. While existing studies have focused on improving the readability of regular expressions, little is known about any other difficulties that developers face when programming with regular expressions. In this paper, we aim to provide a deeper understanding of the process of programming regular expressions by studying: (1) how developers make decisions through the process, (2) what difficulties they face, and (3) how aware they are about serious risks involved in programming regexes. We surveyed 158 professional developers from a diversity of backgrounds, and we conducted a series of interviews to learn more details about the difficulties and solutions that participants face in this process. This mixed methods approach revealed that some of the difficulties of regexes come in the shape of: inability to effectively search for them; fully validate them; and document them. Developers also reported cascading impacts of poor readability, lack of universal portability, and struggling with overall problem comprehension. The majority of our studied developers were unaware of critical security risks that can occur when using regexes, and those that were aware of potential problems felt that they lacked the ability to identify problematic regexes. Our findings provide multiple implications for future work, including development of semantic regex search engines for regex reuse, and improved input generators for regex validation.
- Hybrid Analysis Tools for Computer Systems EducationWilliamson, Eric Robert (Virginia Tech, 2018-07-09)To learn about computer operating systems, students at Virginia Tech implement a command-line shell in their Computer Systems course. Successfully implementing the shell requires a deep understanding of operating system abstractions and interactions. Students often struggle with the project because subtle errors can take hours to debug. In this work, we developed two hybrid domain-specific analysis tools to pinpoint the root causes of student errors: EshMD and ShellTrace. The EshMD tool models common errors in the shell and checks the student code against those models. To accomplish this, it monitors the specific calls the program is making and correlates those with expected changes in its environment. Students' errors are shown directly in the source code. The concept of EshMD can be applied to other programming projects by observing and modeling common bugs during implementation. The ShellTrace tool dynamically creates a specification from a reference solution based on how the reference solution makes use of operating system resources and then uses this specification to check that a student solution is functionally identical. The ShellTrace concept can be applied to other programs that exhibit similar resource dependencies. We deployed these tools in an undergraduate computer systems class and evaluated our tools based on the number of bugs detected and the students' perceptions of usefulness. We found that the tools detected a significant number of bugs and that the majority of students that made use of the tools found them valuable in debugging their submissions.
- Impact of Increased Cache Misses on Runtime Performance of MPX-enabled ProgramsSharma, Niti (Virginia Tech, 2019-06-10)Low level languages like C and C++ provide high performance and direct control over memory management. But these languages are prone to memory safety violations. Intel introduced a new ISA extension-Memory Protection Extension(MPX), a hardware-assisted full-stack solution, to protect against the memory safety violations. While MPX efficiently prevents memory errors like buffer overflows and out of bound memory accesses, it comes at the cost of high performance overheads. Also, the cache locality worsens in MPX protected applications. In our research, we analyze if there is a correlation between increase in cache misses and runtime degradation in programs compiled with MPX support. We analyze 15 SPEC CPU benchmark programs for different input sizes on Windows platform, compiled with Intel's ICC compiler. We find that for input sizes train(medium) and ref(large), the average performance overheads are 140% and 144% respectively. We find that 5 out of 15 benchmarks do not have any runtime overheads and also, do not have any change in cache misses at any level. However for rest of the 10 benchmarks, we find a strong correlation between runtime overheads and cache misses overheads, with the correlation coefficients ranging from 0.8 to 0.36 for different input sizes. Based on our findings, we conclude that there is a direct correlation between runtime overheads and increase in cache misses. We also find that instructions overheads and runtime overheads have a positive correlation, with the coefficient values ranging from 0.7 to 0.33 for different input sizes.
- Improving Performance of Highly-Programmable Concurrent Applications by Leveraging Parallel Nesting and Weaker Isolation LevelsNiles, Duane Francis Jr. (Virginia Tech, 2015-07-15)The recent development of multi-core computer architectures has largely affected the creation of everyday applications, requiring the adoption of concurrent programming to significantly utilize the divided processing power of computers. Applications must be split into sections able to execute in parallel, without any of these sections conflicting with one another, thereby necessitating some form of synchronization to be declared. The most commonly used methodology is lock-based synchronization; although, to improve performance the most, developers must typically form complex, low-level implementations for large applications, which can easily create potential errors or hindrances. An abstraction from database systems, known as transactions, is a rising concurrency control design aimed to circumvent the challenges with programmability, composability, and scalability in lock-based synchronization. Transactions execute their operations speculatively and are capable of being restarted (or rolled back) when there exist conflicts between concurrent actions. As such issues can occur later in the lifespans of transactions, entire rollbacks are not that effective for performance. One particular method, known as nesting, was created to counter that drawback. Nesting is the act of enclosing transactions within other transactions, essentially dividing the work into pieces called sub-transactions. These sub-transactions can roll back without affecting the entire main transaction, although general nesting models only allow one sub-transaction to perform work at a time. The first main contribution in this thesis is SPCN, an algorithm that parallelizes nested transactions while automatically processing any potential conflicts that may arise, eliminating the burden of additional processing from the application developers. Two versions of SPCN exist: Strict, which enforces the sub-transactions' work to be made visible in a serialized order; and Relaxed, which allows sub-transactions to distribute their information immediately as they finish (therefore invalidation may occur after-the-fact and must be handled). Despite the additional logic required by SPCN, it outperforms traditional closed nesting by 1.78x at the lowest and 3.78x at the highest in the experiments run. Another method to alter transactional execution and boost performance is to relax the rules of visibility for parallel operations (known as their isolation). Depending on the application, correctness is not broken even if some transactions see external work that may later be undone due to a rollback, or if an object is written while another transaction is using an older instance of its data. With lock-based synchronization, developers would have to explicitly design their application with varying amounts of locks, and different lock organizations or hierarchies, to change the strictness of the execution. With transactional systems, the processing performed by the system itself can be set to utilize different rulings, which can change the performance of an application without requiring it to be largely redesigned. This notion leads to the second contribution in this thesis: AsR, or As-Serializable transactions. Serializability is the general form of isolation or strictness for transactions in many applications. In terms of execution, its definition is equivalent to only one transaction running at a time in a given system. Many transactional systems use their own internal form of locking to create Serializable executions, but it is typically too strict for many applications. AsR transactions allow the internal processing to be relaxed while additional meta-data is used external to the system, without requiring any interaction from the developer or any changes to the given application. AsR transactions offer multiple orders of magnitude more in throughput in highly-contentious scenarios, due to their capability to outlast traditional levels of isolation.
- Improving TCP Data Transportation for Internet of ThingsKhan, Jamal Ahmad (Virginia Tech, 2018-08-31)Internet of Things (IoT) is the idea that every device around us is connected and these devices continually collect and communicate data for analysis at a large scale in order to enable better end user experience, resource utilization and device performance. Therefore, data is central to the concept of IoT and the amount being collected is growing at an unprecedented rate. Current networking systems and hardware are not fully equipped to handle influx of data at this scale which is a serious problem because it can lead to erroneous interpretation of the data resulting in low resource utilization and bad end user experience defeating the purpose of IoT. This thesis aims at improving data transportation for IoT. In IoT systems, devices are connected to one or more cloud services over the internet via an access link. The cloud processes the data sent by the devices and sends back appropriate instructions. Hence, the performance of the two ends of the network ie the access networks and datacenter network, directly impacts the performance of IoT. The first portion of the our research targets improvement of the access networks by improving access link (router) design. Among the important design aspects of routers is the size of their output buffer queue. %Selecting an appropriate size of this buffer is crucial because it impacts two key metrics of an IoT system: 1) access link utilization and 2) latency. We have developed a probabilistic model to calculate the size of the output buffer that ensures high link utilization and low latency for packets. We have eliminated limiting assumptions of prior art that do not hold true for IoT. Our results show that for TCP only traffic, buffer size calculated by the state of the art schemes results in at least 60% higher queuing delay compared to our scheme while achieving almost similar access link utilization, loss-rate, and goodput. For UDP only traffic, our scheme achieves at least 91% link utilization with very low queuing delays and aggregate goodput that is approx. 90% of link capacity. Finally, for mixed traffic scenarios our scheme achieves higher link utilization than TCP only and UDP only scenarios as well as low delays, low loss-rates and aggregate goodput that is approx 94% of link capacity. The second portion of the thesis focuses on datacenter networks. Applications that control IoT devices reside here. Performance of these applications is affected by the choice of TCP used for data communication between Virtual Machines (VM). However, cloud users have little to no knowledge about the network between the VMs and hence, lack a systematic method to select a TCP variant. We have focused on characterizing TCP Cubic, Reno, Vegas and DCTCP from the perspective of cloud tenants while treating the network as a black box. We have conducted experiments on the transport layer and the application layer. The observations from our transport layer experiments show TCP Vegas outperforms the other variants in terms of throughput, RTT, and stability. Application layer experiments show that Vegas has the worst response time while all other variants perform similarly. The results also show that different inter-request delay distributions have no effect on the throughput, RTT, or response time.
- Memory Turbo Boost: Architectural Support for Using Unused Memory for Memory Replication to Boost Server Memory PerformanceZhang, Da (Virginia Tech, 2023-06-28)A significant portion of the memory in servers today is often unused. Our large-scale study of HPC systems finds that more than half of the total memory in active nodes running user jobs are unused for 88% of the time. Google and Azure Cloud studies also report unused memory accounts for 40% of the total memory in their servers, on average. Leaving so much memory unused is wasteful. To address this problem, we note that in the context of CPUs, Turbo Boost can turn off the unused cores to boost the performance of in-use cores. However, there is no equivalent technology in the context of memory; no matter how much memory is unused, the performance of in-use memory remains the same. This dissertation explores architectural techniques to utilize the unused memory to boost the performance of in-use memory and refer to them collectively as Memory Turbo Boost. This dissertation explores how to turbo boost memory performance through memory replication; specifically, it explores how to efficiently store the replicas in the unused memory and explores multiple architectural techniques to utilize the replicas to enhance memory system performance. Performance simulations show that Memory Turbo Boost can improve node-level performance by 18%, on average across a wide spectrum of workloads. Our system-wide simulations show applying Memory Turbo Boost to an HPC system provides 1.4x average speedup on job turnaround time.
- Modeling and Runtime Systems for Coordinated Power-Performance ManagementLi, Bo (Virginia Tech, 2019-01-28)Emergent systems in high-performance computing (HPC) expect maximal efficiency to achieve the goal of power budget under 20-40 megawatts for 1 exaflop set by the Department of Energy. To optimize efficiency, emergent systems provide multiple power-performance control techniques to throttle different system components and scale of concurrency. In this dissertation, we focus on three throttling techniques: CPU dynamic voltage and frequency scaling (DVFS), dynamic memory throttling (DMT), and dynamic concurrency throttling (DCT). We first conduct an empirical analysis of the performance and energy trade-offs of different architectures under the throttling techniques. We show the impact on performance and energy consumption on Intel x86 systems with accelerators of Intel Xeon Phi and a Nvidia general-purpose graphics processing unit (GPGPU). We show the trade-offs and potentials for improving efficiency. Furthermore, we propose a parallel performance model for coordinating DVFS, DMT, and DCT simultaneously. We present a multivariate linear regression-based approach to approximate the impact of DVFS, DMT, and DCT on performance for performance prediction. Validation using 19 HPC applications/kernels on two architectures (i.e., Intel x86 and IBM BG/Q) shows up to 7% and 17% prediction error correspondingly. Thereafter, we develop the metrics for capturing the performance impact of DVFS, DMT, and DCT. We apply the artificial neural network model to approximate the nonlinear effects on performance impact and present a runtime control strategy accordingly for power capping. Our validation using 37 HPC applications/kernels shows up to a 20% performance improvement under a given power budget compared with the Intel RAPL-based method.
- Multicore Scalability Through Asynchronous WorkMathew, Ajit (Virginia Tech, 2020-01-13)With the end of Moore's Law, computer architects have turned to multicore architecture to provide high performance. Unfortunately, to achieve higher performance, multicores require programs to be parallelized which is an untamed problem. Amdahl's law tells that the maximum theoretical speedup of a program is dictated by the size of the non-parallelizable section of a program. Hence to achieve higher performance, programmers need to reduce the size of sequential code in the program. This thesis explores asynchronous work as a means to reduce sequential portions of program. Using asynchronous work, a programmer can remove tasks which do not affect data consistency from the critical path and can be performed using background thread. Using this idea, the thesis introduces two systems. First, a synchronization mechanism, Multi-Version Read-Log-Update(MV-RLU), which extends Read-Log-Update (RLU) through multi-versioning. At the core of MV-RLU design is a concurrent garbage collection algorithm which reclaims obsolete versions asynchronously reducing blocking of threads. Second, a concurrent and highly scalable index-structure called Hydralist for multi-core. The key idea behind design of Hydralist is that an index-structure can be divided into two component (search layer and data layer) and updates to data layer can be done synchronously while updates to search layer can be propagated asynchronously using background threads.
- Mutex Locking versus Hardware Transactional Memory: An Experimental EvaluationMoore, Sean Ryan (Virginia Tech, 2015-09-16)It has historically been the case that CPUs have run programs ever faster without significant intervention on the behalf of the programmer. However, this "free lunch" has largely ended due to the end of exponentially increasing core frequency and the current slow increase in instruction-level parallelism but continues to a degree in cache size improvements. But since Moore's law still largely continues "lunch", i.e. program performance, can still be bought at the price of rewriting code for multiple cores, which is enabled by the trend Moore's law describes. Multicore architectures cannot aid performance for problems whose solutions are necessarily sequential in nature and writing efficient and correct concurrent programs is not easy in all cases when using synchronization methods like fine-grained mutex locks. Transactional memory, and its implementation as hardware transactional memory, allow programmers to write concurrent applications without the attendant complexity of programming with mutex locks. This allows programmers to focus on optimizing the application for performance. Given that transactions can run two segments of code in parallel that a mutex lock would force to run sequentially and that transactions can abort, causing a program to do the same work more than once, whether transactions perform better or worse than mutex locks is dependent on the program's execution profile and the coarseness or fineness at which mutex locks are used. In this thesis the GNU C Library's futex implementation of mutex locks and Intel's Restricted Transactional Memory have been compared and the behavior of those transactions has been analyzed. This analysis includes a pathological behavior permitted by the GNU C Library's hardware transactional memory implementation of mutex locks. The tradeoffs between fine-grained and global locking implementations have been discussed, compared, and used in the context of fallback locks for hardware transactions. This thesis provides evidence to the effect that fine-grained locking is not critical for program performance and that in many cases global locking and hardware transactions can provide nearly equivalent performance without the programming difficulties. This work has shown that across the 23 applications examined, with relation to their original locking implementation, a global locking scheme without elision has a 0.96x speedup, Intel's Restricted Transactional Memory (RTM) with the application's original locks as a fallback has a 1.01x speedup and with global lock fallback RTM has a speedup of 0.97x. This work is supported in part by NAVSEA/NEEC under grant 3003279297. Any opinions, findings, and conclusions or recommendations expressed in this thesis are those of the author and do not necessarily reflect the views of NAVSEA.
- On Reducing the Trusted Computing Base in Binary VerificationAn, Xiaoxin (Virginia Tech, 2022-06-15)The translation of binary code to higher-level models has wide applications, including decompilation, binary analysis, and binary rewriting. This calls for high reliability of the underlying trusted computing base (TCB) of the translation methodology. A key challenge is to reduce the TCB by validating its soundness. Both the definition of soundness and the validation method heavily depend on the context: what is in the TCB and how to prove it. This dissertation presents three research contributions. The first two contributions include reducing the TCB in binary verification, and the last contribution includes a binary verification process that leverages a reduced TCB. The first contribution targets the validation of OCaml-to-PVS translation -- commonly used to translate instruction-set-architecture (ISA) specifications to PVS -- where the destination language is non-executable. We present a methodology called OPEV to validate the translation between OCaml and PVS, supporting non-executable semantics. The validation includes generating large-scale tests for OCaml implementations, generating test lemmas for PVS, and generating proofs that automatically discharge these lemmas. OPEV incorporates an intermediate type system that captures a large subset of OCaml types, employing a variety of rules to generate test cases for each type. To prove the PVS lemmas, we develop automatic proof strategies and discharge the test lemmas using PVS Proof-Lite, a powerful proof scripting utility of the PVS verification system. We demonstrate our approach in two case studies that include 259 functions selected from the Sail and Lem libraries. For each function, we generate thousands of test lemmas, all of which are automatically discharged. The dissertation's second contribution targets the soundness validation of a disassembly process where the source language does not have well-defined semantics. Disassembly is a crucial step in binary security, reverse engineering, and binary verification. Various studies in these fields use disassembly tools and hypothesize that the reconstructed disassembly is correct. However, disassembly is an undecidable problem. State-of-the-art disassemblers suffer from issues ranging from incorrectly recovered instructions to incorrectly assessing which addresses belong to instructions and which to data. We present DSV, a systematic and automated approach to validate whether the output of a disassembler is sound with respect to the input binary. No source code, debugging information, or annotations are required. DSV defines soundness using a transition relation defined over concrete machine states: a binary is sound if, for all addresses in the binary that can be reached from the binary's entry point, the bytes of the (disassembled) instruction located at an address are the same as the actual bytes read from the binary. Since computing this transition relation is undecidable, DSV uses over-approximation by preventing false positives (i.e., the existence of an incorrectly disassembled reachable instruction but deemed unreachable) and allowing, but minimizing, false negatives. We apply DSV to 102 binaries of GNU Coreutils with eight different state-of-the-art disassemblers from academia and industry. DSV is able to find soundness issues in the output of all disassemblers. The dissertation's third contribution is WinCheck: a concolic model checker that detects memory-related properties of closed-source binaries. Bugs related to memory accesses are still a major issue for security vulnerabilities. Even a single buffer overflow or use-after-free in a large program may be the cause of a software crash, a data leak, or a hijacking of the control flow. Typical static formal verification tools aim to detect these issues at the source code level. WinCheck is a model-checker that is directly applicable to closed-source and stripped Windows executables. A key characteristic of WinCheck is that it performs its execution as symbolically as possible while leaving any information related to pointers concrete. This produces a model checker tailored to pointer-related properties, such as buffer overflows, use-after-free, null-pointer dereferences, and reading from uninitialized memory. The technique thus provides a novel trade-off between ease of use, accuracy, applicability, and scalability. We apply WinCheck to ten closed-source binaries available in a Windows 10 distribution, as well as the Windows version of the entire Coreutils library. We conclude that the approach taken is precise -- provides only a few false negatives -- but may not explore the entire state space due to unresolved indirect jumps.
- On the Impact and Defeat of Regular Expression Denial of ServiceDavis, James Collins (Virginia Tech, 2020-05-28)Regular expressions (regexes) are a widely-used yet little-studied software component. Engineers use regexes to match domain-specific languages of strings. Unfortunately, many regex engine implementations perform these matches with worst-case polynomial or exponential time complexity in the length of the string. Because they are commonly used in user-facing contexts, super-linear regexes are a potential denial of service vector known as Regular expression Denial of Service (ReDoS). Part I gives the necessary background to understand this problem. In Part II of this dissertation, I present the first large-scale empirical studies of super-linear regex use. Guided by case studies of ReDoS issues in practice (Chapter 3), I report that the risk of ReDoS affects up to 10% of the regexes used in practice (Chapter 4), and that these findings generalize to software written in eight popular programming languages (Chapter 5). ReDoS appears to be a widespread vulnerability, motivating the consideration of defenses. In Part III I present the first systematic comparison of ReDoS defenses. Based on the necessary conditions for ReDoS, a ReDoS defense can be erected at the application level, the regex engine level, or the framework/runtime level. In my experiments I report that application-level defenses are difficult and error prone to implement (Chapter 6), that finding a compatible higher-performing regex engine is unlikely (Chapter 7), that optimizing an existing regex engine using memoization incurs (perhaps acceptable) space overheads (Chapter 8), and that incorporating resource caps into the framework or runtime is feasible but faces barriers to adoption (Chapter 9). In Part IV of this dissertation, we reflect on our findings. By leveraging empirical software engineering techniques, we have exposed the scope of potential ReDoS vulnerabilities, and given strong motivation for a solution. To assist practitioners, we have conducted a systematic evaluation of the solution space. We hope that our findings assist in the elimination of ReDoS, and more generally that we have provided a case study in the value of data-driven software engineering.