Browsing by Author "Ravindran, Binoy"
Now showing 1 - 20 of 114
Results Per Page
Sort Options
- Adelie: Continuous Address Space Layout Re-randomization for Linux DriversNikolaev, Ruslan; Nadeem, Hassan; Stone, Cathlyn; Ravindran, Binoy (ACM, 2022-02-28)While address space layout randomization (ASLR) has been extensively studied for user-space programs, the corresponding OS kernel’s KASLR support remains very limited, making the kernel vulnerable to just-in-time (JIT) return-oriented programming (ROP) attacks. Furthermore, commodity OSs such as Linux restrict their KASLR range to 32 bits due to architectural constraints (e.g., x86-64 only supports 32-bit immediate operands for most instructions), which makes them vulnerable to even unsophisticated brute-force ROP attacks due to low entropy. Most in-kernel pointers remain static, exacerbating the problem when pointers are leaked. Adelie, our kernel defense mechanism, overcomes KASLR limitations, increases KASLR entropy, and makes successful ROP attacks on the Linux kernel much harder to achieve. First, Adelie enables the position-independent code (PIC) model so that the kernel and its modules can be placed anywhere in the 64-bit virtual address space, at any distance apart from each other. Second, Adelie implements stack re-randomization and address encryption on modules. Finally, Adelie enables efficient continuous KASLR for modules by using the PIC model to make it (almost) impossible to inject ROP gadgets through these modules regardless of gadget’s origin. Since device drivers (typically compiled as modules) are often developed by third parties and are typically less tested than core OS parts, they are also often more vulnerable. By fully re-randomizing device drivers, the last two contributions together prevent most JIT ROP attacks since vulnerable modules are very likely to be a starting point of an attack. Furthermore, some OS instances in virtualized environments are specifically designated to run device drivers, where drivers are the primary target of JIT ROP attacks. Using a GCC plugin that we developed, we automatically modify different kinds of kernel modules. Since the prior art tackles only user-space programs, we solve many challenges unique to the kernel code. Our evaluation shows high efficiency of Adelie’s approach: the overhead of the PIC model is completely negligible and re-randomization cost remains reasonable for typical use cases.
- Aggregate VM: Why Reduce or Evict VM's Resources When You Can Borrow Them From Other Nodes?Chuang, Ho-Ren; Manaouil, Karim; Xing, Tong; Barbalace, Antonio; Olivier, Pierre; Heerekar, Balvansh; Ravindran, Binoy (ACM, 2023-05-08)Hardware resource fragmentation is a common issue in data centers. Traditional solutions based on migration or overcommitment are unacceptably slow, and modern commercial or research solutions like Spot VM may reduce or evict VM’s resources anytime.We propose an alternative solution that does not suffer from these drawbacks, the Aggregate VM.We introduce a new distributed hypervisor design, the resource-borrowing hypervisor, which creates Aggregate VMs: distributed VMs that temporarily aggregate fragmented resources belonging to different host machines, which require mobility of virtual CPUs, memory and IO devices.We implement a prototype, FragVisor, which runs guest software transparently.We also propose minimal modifications to the guest OS that can enable significant performance gains. We evaluate FragVisor over a set of microbenchmarks and IaaS-style real applications. Although Aggregate VMs are not a perfect fit for every type of applications, some workloads enjoy significant speedups compared to overcommitted scenarios (up to 3.9x with 4 distributed vCPUs).We further demonstrate that FragVisor is faster than a state-of-the-art competitor, GiantVM (up to 2.5x).
- Applying Source Level Auto-Vectorization to Aparapi JavaAlbert, Frank Curtis (Virginia Tech, 2014-06-19)Ever since chip manufacturers hit the power wall preventing them from increasing processor clock speed, there has been an increased push towards parallelism for performance improvements. This parallelism comes in the form of both data parallel single instruction multiple data (SIMD) instructions, as well as parallel compute cores in both central processing units (CPUs) and graphics processing units (GPUs). While these hardware enhancements offer potential performance enhancements, programs must be re-written to take advantage of them in order to see any performance improvement Some lower level languages that compile directly to machine code already take advantage of the data parallel SIMD instructions, but often higher level interpreted languages do not. Java, one of the most popular programming languages in the world, still does not include support for these SIMD instructions. In this thesis, we present a vector library that implements all of the major SIMD instructions in functions that are accessible to Java through JNI function calls. This brings the benefits of general purpose SIMD functionality to Java. This thesis also works with the data parallel Aparapi Java extension to bring these SIMD performance improvements to programmers who use the extension without any additional effort on their part. Aparapi already provides programmers with an API that allows programmers to declare certain sections of their code parallel. These parallel sections are then run on OpenCL capable hardware with a fallback path in the Java thread pool to ensure code reliability. This work takes advantage of the knowledge of independence of the parallel sections of code to automatically modify the Java thread pool fallback path to include the vectorization library through the use of an auto-vectorization tool created for this work. When the code is not vectorizable the auto-vectorizer tool is still able to offer performance improvements over the default fallback path through an improved looped implementation that executes the same code but with less overhead. Experiments conducted by this work illustrate that for all 10 benchmarks tested the auto-vectorization tool was able to produce an implementation that was able to beat the default Aparapi fallback path. In addition it was found that this improved fallback path even outperformed the GPU implementation for several of the benchmarks tested.
- Architectural Enhancements to Increase Trust in Cyber-Physical Systems Containing Untrusted Software and HardwareFarag, Mohammed Morsy Naeem (Virginia Tech, 2012-09-17)Embedded electronics are widely employed in cyber-physical systems (CPSes), which tightly integrate and coordinate computational and physical elements. CPSes are extensively deployed in security-critical applications and nationwide infrastructure. Perimeter security approaches to preventing malware infiltration of CPSes are challenged by the complexity of modern embedded systems incorporating numerous heterogeneous and updatable components. Global supply chains and third-party hardware components, tools, and software limit the reach of design verification techniques and introduce security concerns about deliberate Trojan inclusions. As a consequence, skilled attacks against CPSes have demonstrated that these systems can be surreptitiously compromised. Existing run-time security approaches are not adequate to counter such threats because of either the impact on performance and cost, lack of scalability and generality, trust needed in global third parties, or significant changes required to the design flow. We present a protection scheme called Run-time Enhancement of Trusted Computing (RETC) to enhance trust in CPSes containing untrusted software and hardware. RETC is complementary to design-time verification approaches and serves as a last line of defense against the rising number of inexorable threats against CPSes. We target systems built using reconfigurable hardware to meet the flexibility and high-performance requirements of modern security protections. Security policies are derived from the system physical characteristics and component operational specifications and translated into synthesizable hardware integrated into specific interfaces on a per-module or per-function basis. The policy-based approach addresses many security challenges by decoupling policies from system-specific implementations and optimizations, and minimizes changes required to the design flow. Interface guards enable in-line monitoring and enforcement of critical system computations at run-time. Trust is only required in a small set of simple, self-contained, and verifiable guard components. Hardware trust anchors simultaneously addresses the performance, flexibility, developer productivity, and security requirements of contemporary CPSes. We apply RETC to several CPSes having common security challenges including: secure reconfiguration control in reconfigurable cognitive radio platforms, tolerating hardware Trojan threats in third-party IP cores, and preserving stability in process control systems. High-level architectures demonstrated with prototypes are presented for the selected applications. Implementation results illustrate the RETC efficiency in terms of the performance and overheads of the hardware trust anchors. Testbenches associated with the addressed threat models are generated and experimentally validated on reconfigurable platform to establish the protection scheme efficacy in thwarting the selected threats. This new approach significantly enhances trust in CPSes containing untrusted components without sacrificing cost and performance.
- Automatic Co-Synthesis of Hardware and Software Safety Monitors for Embedded SystemsRezvani, Behnaz (Virginia Tech, 2024-09-19)Embedded systems have become pervasive and increasingly complex, especially in modern applications such as self-driving vehicles, where safety requires both accurate functionality and real-time guarantees. However, the complexity and the integration of artificial intelligence and machine learning algorithms in autonomous systems challenge conventional test-based verification methods. Given the continuous evolution and deployment of these systems, verification must keep pace to ensure their reliability and safety. Runtime verification is a promising approach for validating system behaviors during execution using monitors derived from formal system specifications. The adoption of runtime monitoring has historically been limited to experts, primarily due to the esoteric formal notations and verification processes. To overcome this barrier, this dissertation presents GROOT, a novel methodology and framework designed to automate synthesis of hardware and/or software monitors from pseudo- English statements. The automatic steps include translating English properties to formalisms, converting the formalisms into monitor automata, and formally verifying the monitors. GROOT addresses the distinction between functional and timing requirements inherent in real-time embedded systems by providing distinct pseudo-English languages and synthesis flows. This dual approach allows customized verification processes for each category. To make the monitor structure simple, monitor inputs and responses are handled in separate external modules, allowing formal analysis methods to be used. The synthesized monitors can assist system development and be retained in fielded systems. Their lightweight nature enables the deployment of multiple monitors, each focusing on specific circumstances independently and concurrently. Monitor implementations can range from sequential software to parallel hardware, allowing for flexibility in meeting various system constraints. By eliminating the need for manual code generation and verification, GROOT allows practitioners to synthesize monitors without requiring a formal methods background.
- Automatic Scheduling of Compute Kernels Across Heterogeneous ArchitecturesLyerly, Robert Frantz (Virginia Tech, 2014-05-07)The world of high-performance computing has shifted from increasing single-core performance to extracting performance from heterogeneous multi- and many-core processors due to the power, memory and instruction-level parallelism walls. All trends point towards increased processor heterogeneity as a means for increasing application performance, from smartphones to servers. These various architectures are designed for different types of applications — traditional "big" CPUs (like the Intel Xeon) are optimized for low latency while other architectures (such as the NVidia Tesla K20x) are optimized for high-throughput. These architectures have different tradeoffs and different performance profiles, meaning fantastic performance gains for the right types of applications. However applications that are ill-suited for a given architecture may experience significant slowdown; therefore, it is imperative that applications are scheduled onto the correct processor. In order to perform this scheduling, applications must be analyzed to determine their execution characteristics. Traditionally this application-to-hardware mapping was determined statically by the programmer. However, this requires intimate knowledge of the application and underlying architecture, and precludes load-balancing by the system. We demonstrate and empirically evaluate a system for automatically scheduling compute kernels by extracting program characteristics and applying machine learning techniques. We develop a machine learning process that is system-agnostic, and works for a variety of contexts (e.g. embedded, desktop/workstation, server). Finally, we perform scheduling in a workload-aware and workload-adaptive manner for these compute kernels.
- Automatic Source Code Transformation To Pass Compiler OptimizationKahla, Moustafa Mohamed (Virginia Tech, 2024-01-03)Loop vectorization is a powerful optimization technique that can significantly boost the runtime of loops. This optimization depends on functional equivalence between the original and optimized code versions, a requirement typically established through the compiler's static analysis. When this condition is not met, the compiler will miss the optimization. The process of manually rewriting the source code to pass an already missed compiler optimization is time-consuming, given the multitude of potential code variations, and demands a high level of expertise, making it impractical in many scenarios. In this work, we propose a novel framework that aims to take the code blocks that the compiler failed to optimize and transform them to another code block that passes the compiler optimization. We develop an algorithm to efficiently search for a code structure that automatically passes the compiler optimization (weakly verified through a correctness test). We focus on loop-vectorize optimization inside OpenMP directives, where the introduction of parallelism adds complexity to the compiler's vectorization task and is shown to hinder optimizations. Furthermore, we introduce a modified version of TSVC, a loop vectorization benchmark in which all original loops are executed within OpenMP directives. Our evaluation shows that our framework enables " loop-vectorize" optimizations that the compiler failed to pass, resulting in a speedup up to 340× in the blocks optimized. Furthermore, applying our tool to HPC benchmark applications, where those applications are already built with optimization and performance in mind, demonstrates that our technique successfully enables extended compiler optimization, thereby accelerating the execution time of the optimized blocks in 15 loops and the entire execution time of the three applications by up to 1.58 times.
- Brief announcement: Crystalline: Fast and memory efficient wait-free reclamationNikolaev, Ruslan; Ravindran, Binoy (2021-10-01)We present a new wait-free memory reclamation scheme, Crystalline, that simultaneously addresses the challenges of high performance, high memory efficiency, and wait-freedom. Crystalline guarantees complete wait-freedom even when threads are dynamically recycled, asynchronously reclaims memory in the sense that any thread can reclaim memory retired by any other thread, and ensures (an almost) balanced reclamation workload across all threads. The latter two properties result in Crystalline’s high performance and high memory efficiency, a difficult trade-off for most existing schemes. Our evaluations show that Crystalline exhibits outstanding scalability and memory efficiency, and achieves superior throughput than state-of-the-art reclamation schemes as the number of threads grows.
- ByteSTM: Java Software Transactional Memory at the Virtual Machine LevelMahmoud Mohamedin, Mohamed Ahmed (Virginia Tech, 2012-02-08)As chip vendors are increasingly manufacturing a new generation of multi-processor chips called multicores, improving software performance requires exposing greater concurrency in software. Since code that must be run sequentially is often due to the need for synchronization, the synchronization abstraction has a significant effect on program performance. Lock-based synchronization — the most widely used synchronization method — suffers from programability, scalability, and composability challenges. Transactional memory (TM) is an emerging synchronization abstraction that promises to alleviate the difficulties with lock-based synchronization. With TM, code that read/write shared memory objects is organized as transactions, which speculatively execute. When two transactions conflict (e.g., read/write, write/write), one of them is aborted, while the other commits, yielding (the illusion of) atomicity. Aborted transactions are re-started, after rolling-back changes made to objects. In addition to a simple programming model, TM provides performance comparable to lock-based synchronization. Software transactional memory (STM) implements TM entirely in software, without any special hardware support, and is usually implemented as a library, or supported by a compiler or by a virtual machine. In this thesis, we present ByteSTM, a virtual machine-level Java STM implementation. ByteSTM implements two STM algorithms, TL2 and RingSTM, and transparently supports implicit transactions. Program bytecode is automatically modified to support transactions: memory load/store bytecode instructions automatically switch to transactional mode when a transaction starts, and switch back to normal mode when the transaction successfully commits. Being implemented at the VM-level, it accesses memory directly and uses absolute memory addresses to uniformly handle memory. Moreover, it avoids Java garbage collection (which has a negative impact on STM performance), by manually allocating and recycling memory for transactional metadata. ByteSTM uses field-based granularity, and uses the thread header to store transactional metadata, instead of the slower Java ThreadLocal abstraction. We conducted experimental studies comparing ByteSTM with other state-of-the-art Java STMs including Deuce, ObjectFabric, Multiverse, DSTM2, and JVSTM on a set of micro- benchmarks and macro-benchmarks. Our results reveal that, ByteSTM's transactional throughput improvement over competitors ranges from 20% to 75% on micro-benchmarks and from 36% to 100% on macro-benchmarks.
- Characterization of Sparsity-aware Optimization Paths for Graph Traversal on FPGAGondhalekar, Atharva (Virginia Tech, 2023-05-25)Breath-first search (BFS) is a fundamental building block in many graph-based applications, but it is difficult to optimize for a field-programmable gate array (FPGA) due to its irregular memory-access patterns. Prior work, based on hardware description languages (HDLs) and high-level synthesis (HLS), address the memory-access bottleneck of BFS by using techniques such as data alignment and compute-unit replication on FPGAs. The efficacy of such optimizations depends on factors such as the sparsity of target graph datasets. Optimizations intended for sparse graphs may not work as effectively for dense graphs on an FPGA and vice versa. This thesis presents two sets of FPGA optimization strategies for BFS, one for near-hypersparse graphs and the other designed for sparse to moderately dense graphs. For near-hypersparse graphs, a queue-based kernel with maximal use of local memory on FPGA is implemented. For denser graphs, an array-based kernel with compute-unit replication is implemented. Across a diverse collection of graphs, our OpenCL optimization strategies for near-hypersparse graphs delivers a 5.7x to 22.3x speedup over a state-of-the-art OpenCL implementation, when evaluated on an Intel Stratix~10 FPGA. The optimization strategies for sparse to moderately dense graphs deliver 1.1x to 2.3x speedup over a state-of-the-art OpenCL implementation on the same FPGA. Finally, this work uses graph metrics such as average degree and Gini coefficient to observe the impact of graph properties on the performance of the proposed optimization strategies.
- Collaborative Scheduling and Synchronization of Distributable Real-Time ThreadsFahmy, Sherif Fadel (Virginia Tech, 2010-05-05)In this dissertation, we consider the problem of scheduling and synchronization of distributable real-time threads --- Real-Time CORBA's first-class abstraction for programming real-time, multi-node sequential behaviors. Distributable real-time threads can be scheduled, broadly, using two paradigms: node independent scheduling, in which nodes independently construct thread schedules, based on node-level decomposition of distributable thread (or DT) scheduling parameters, and collaborative scheduling, in which nodes collaborate to construct system-wide thread schedules, which may or may not involve scheduling parameter decomposition. While significant literature exists on node independent scheduling, little is known about collaborative scheduling and its concomitant tradeoffs. We design three collaborative scheduling algorithms, called ACUA, QBUA, and DQBUA. ACUA uses theory of consensus and QBUA uses theory of quorums for distributable thread schedule construction. DQBUA extends QBUA with lock-based, local and distributed concurrency control. The algorithms consider a model where distributable threads arrive arbitrarily, have time/utility function time constraints, access resources in an arbitrary way (e.g., arbitrary lock acquire/release order, arbitrary nestings), and are subject to arbitrary node crash failures and message losses. We analytically establish several properties of the algorithms including probabilistic end-to-end termination time satisfactions, timeliness optimality during underloads, bounded exception handling time, and correctness of the algorithms in partially synchronous systems. We implement distributable real-time threads in the Linux kernel as a first-class programming and scheduling abstraction. The resulting kernel, called ChronOS, provides application interfaces for creating and manipulating distributable threads, as well as kernel interfaces and mechanisms for scheduling them (using both independent and collaborative approaches). ChronOS also has failure detector mechanisms for detecting and recovering from distributable thread failures. We implement the proposed scheduling algorithms and their competitors in ChronOS and compare their behavior. Our studies reveal that the collaborative scheduling algorithms are superior to independent scheduling algorithms for certain thread sets, in particular, when thread sections have significantly varying execution time. This variability, especially if the variability is not consistent among the threads, may cause each node to make conflicting decisions in the absence of global information. We observe that collaborative schedulers outperform independent schedulers (e.g., EDF augmented with PIP) in terms of accrued utility by as much as 75%. We identify distributed dependencies as one of the major sources of overhead in collaborative scheduling. In particular, the cost of distributed lock-based concurrency control (e.g., lock management, distributed deadlock detection/resolution) can significantly reduce the problem space for which collaborative scheduling is beneficial. To mitigate this, we consider the use of software transactional memory (or STM), an optimistic, non-blocking synchronization alternative to lock-based concurrency control which has been extensively studied in non real-time contexts. We consider distributable real-time threads with STM concurrency control, and develop techniques for analyzing and bounding their end-to-end response times on distributed single-processor and distributed multiprocessor systems. We also develop contention management techniques, a key component of STM, which are driven by threads' real-time scheduling parameters, and establish their tradeoffs against non-real-time contention managers.
- A Compiler Framework to Support and Exploit Heterogeneous Overlapping-ISA Multiprocessor PlatformsJelesnianski, Christopher Stanisław (Virginia Tech, 2015-10-28)As the demand for ever increasingly powerful machines continues, new architectures are sought to be the next route of breaking past the brick wall that currently stagnates the performance growth of modern multi-core CPUs. Due to physical limitations, scaling single-core performance any further is no longer possible, giving rise to modern multi-cores. However, the brick wall is now limiting the scaling of general-purpose multi-cores. Heterogeneous-core CPUs have the potential to continue scaling by reducing power consumption through exploitation of specialized and simple cores within the same chip. Heterogeneous-core CPUs join fundamentally different processors each which their own peculiar features, i.e., fast execution time, improved power efficiency, etc; enabling the building of versatile computing systems. To make heterogeneous platforms permeate the computer market, the next hurdle to overcome is the ability to provide a familiar programming model and environment such that developers do not have to focus on platform details. Nevertheless, heterogeneous platforms integrate processors with diverse characteristics and potentially a different Instruction Set Architecture (ISA), which exacerbate the complexity of the software. A brave few have begun to tread down the heterogeneous-ISA path, hoping to prove that this avenue will yield the next generation of super computers. However, many unforeseen obstacles have yet to be discovered. With this new challenge comes the clear need for efficient, developer-friendly, adaptable system software to support the efforts of making heterogeneous-ISA the golden standard for future high-performance and general-purpose computing. To foster rapid development of this technology, it is imperative to put the proper tools into the hands of developers, such as application and architecture profiling engines, in order to realize the best heterogeneous-ISA platform possible with available technology. In addition, it would be in the best interest to create tools to be as "timeless" as possible to expose fundamental concepts industry could benefit from and adopt in future designs. We demonstrate the feasibility of a compiler framework and runtime for an existing heterogeneous-ISA operating system (Popcorn Linux) for automatically scheduling compute blocks within an application on a given heterogeneous-ISA high-performance platform (in our case a platform built with Intel Xeon - Xeon Phi). With the introduced Profiler, Partitioner, and Runtime support, we prove to be able to automatically exploit the heterogeneity in an overlapping-ISA platform, being faster than native execution and other parallelism programming models. Empirically evaluating our compiler framework, we show that application execution on Popcorn Linux can be up to 52% faster than the most performant native execution for Xeon or Xeon Phi. Using our compiler framework relieves the developer from manual scheduling and porting of applications, requiring only a single profiling run per application.
- Compiler-Directed Error Resilience for Reliable ComputingLiu, Qingrui (Virginia Tech, 2018-08-08)Error resilience has become as important as power and performance in modern computing architecture. There are various sources of errors that can paralyze real-world computing systems. Of particular interest to this dissertation are single-event errors. They can be the results of energetic particle strike or abrupt power outage that corrupts the program states leading to system failures. Specifically, energetic particle strike is the major cause of soft error while abrupt power outage can result in memory inconsistency in the nonvolatile memory systems. Unfortunately, existing techniques to handle those single-event errors are either resource consuming (e.g., hardware approaches) or heavy-weight (e.g., software approaches). To address this problem, this dissertation identifies idempotent processing as an alternative recovery technique to handle the system failures in an efficient and low-cost manner. Then, this dissertation first proposes to design and develop a compiler-directed lightweight methodology which leverages idempotent processing and the state-of-the-art sensor-based detection to achieve soft error resilience at low-cost. This dissertation also introduces a lightweight soft error tolerant hardware design that redefines idempotent processing where the idempotent regions can be created, verified and recovered from the processor's point of view. Furthermore, this dissertation proposes a series of compiler optimizations that significantly reduce the hardware and runtime overhead of the idempotent processing. Lastly, this dissertation proposes a failure-atomic system integrated with idempotent processing to resolve another type of single-event error, i.e., failure-induced memory inconsistency in the nonvolatile memory systems.
- Constraint Solving for Diagnosing Concurrency BugsKhoshnood, Sepideh (Virginia Tech, 2015-05-28)Programmers often have to spend a significant amount of time inspecting the software code and execution traces to identify the root cause of a software bug. For a multithreaded program, debugging is even more challenging due to the subtle interactions between concurrent threads and the often astronomical number of possible interleavings. In this work, we propose a logical constraint-based symbolic analysis method to aid in the diagnosis of concurrency bugs and find their root causes, which can be later used to recommend repairs. In our method, the diagnosis process is formulated as a set of constraint solving problems. By leveraging the power of constraint satisfiability (SAT) solvers and a bounded model checker, we perform a semantic analysis of the sequential computation as well as the thread interactions. The analysis is ideally suited for handling software with small to medium code size but complex concurrency control, such as device drivers, synchronization protocols, and concurrent data structures. We have implemented our method in a software tool and demonstrated its effectiveness in diagnosing subtle concurrency bugs in multithreaded C programs.
- CRIU-RTX: Remote Thread eXecution using Checkpoint/Restore in UserspaceNoor Mohamed, Mohamed Husain (Virginia Tech, 2023-07-21)Scaling up application performance on single high-end machines is increasingly becoming difficult due to scalability challenges of processor interconnects, cache coherence protocols, and memory bandwidth. Significant prior work has addressed this problem by scaling-out application threads across multiple nodes to exploit resources outside the single machine boundary. Prior works have also leveraged heterogeneous instruction set architecture (ISA) systems to improve application performance as well as energy-efficiency, a major cost driver in datacenters, by augmenting high-end servers with power-efficient embedded boards. Existing works, however, suffer from deployability challenges due to dependencies on the operating system or programming models that require non-trivial application modifications. We introduce CRIU-RTX, a userspace framework to scale-out multi-threaded applications across multiple nodes. Integrated with HetMigrate, a prior work on migrating processes across heterogeneous-ISA systems, CRIU-RTX can suspend a subset of threads in a process and resume their execution on different nodes, including, but not limited to heterogeneous-ISA nodes. CRIU-RTX implements distributed shared memory in userspace, thereby allowing application threads to access distributed memory transparently without any operating system dependency. Our experimental evaluations show 21% to 43% performance gains while scaling-out applications across x86-64 servers, and energy efficiency gains of up to 18% while scaling-out across a cluster of x86-64 server and ARM64 embedded boards. Since CRIU-RTX does not depend on operating system modifications, it can be easily deployed on a diverse set of machines, including, but not limited to ISA-different machines running the stock Linux operating system.
- Cross-ISA Execution Migration of Unikernels: Build Toolchain, Memory Alignment, and VM State Transfer TechniquesMehrab, A K M Fazla (Virginia Tech, 2018-12-12)The data centers are composed of resource-rich expensive server machines. A server, overloadeded with workloads, offloads some jobs to other servers; otherwise, its throughput becomes low. On the other hand, low-end embedded computers are low-power, and cheap OS-capable devices. We propose a system to use these embedded devices besides the servers and migrate some jobs from the server to the boards to increase the throughput when overloaded. The datacenters usually run workloads inside virtual machines (VM), but these embedded boards are not capable of running full-fledged VMs. In this thesis, we propose to use lightweight VMs, called unikernel, which can run on these low-end embedded devices. Another problem is that the most efficient versions of these boards have different instruction set architectures than the servers have. The ISA-difference between the servers and the embedded boards and the migration of the entire unikernel between them makes the migration a non-trivial problem. This thesis proposes a way to provide the unikernels with migration capabilities so that it becomes possible to offload workloads from the server to the embedded boards. This thesis describes a toolchain development process for building migratable unikernel for the native applications. This thesis also describes the alignment of the memory components between unikernels for different ISAs, so that the memory referencing remains valid and consistent after migration. Moreover, this thesis represents an efficient VM state transfer method so that the workloads experience higher execution time and minimum downtime due to the migration.
- Design and Evaluation of an Embedded Real-time Micro-kernelSingh, Kuljeet (Virginia Tech, 2002-10-17)This thesis presents the design and evaluation of an operating system kernel specially designed for dataflow software. Dataflow is a style of software architecture that is well suited for control and "signal flow" applications. This architecture involves many small processes and lots of inter-process communication, which impose too much overhead on traditional RTOSes. This thesis describes design and implementation of the Dataflow Architecture Real-time Kernel (DARK). DARK is a reconfigurable, multithreaded and preemptive operating system kernel that introduces a special data-driven scheduling strategy for dataflow applications. It uses the underlying hardware for high-speed context switching between the kernel and applications, which is five times faster than the ordinary context switch. The features of the kernel can be configured according to performance requirements without change to the applications. Along with the performance evaluation of DARK, the performance comparison results of DARK with two commercial RTOSes: MicroC/OS-II and Analog Devices VDK++ are also provided.
- Design and Implementation of a Network Server in LibrettOSSung, Mincheol (Virginia Tech, 2018-12-13)Traditional network stacks in monolithic kernels have reliability and security concerns. Any fault in a network stack affects the entire system owing to lack of isolation in the monolithic kernel. Moreover, the large code size of the network stack enlarges the attack surface of the system. A multiserver OS design solves this problem. In contrast to the traditional network stack, a multiserver OS pushes the network stack into the network server as a user process, which performs three enhancements: (i) allows the network server to run in user mode while having its own address space and isolating any fault occurring in the network server; (ii) minimizes the attack surface of the system because the trusted computing base contracts; (iii) enables failure recovery, which is an important feature supported by a multiserver OS. This thesis proposes a network server for LibrettOS, an operating system based on rumprun unikernels and the Xen Hypervisor developed by Virginia Tech. The proposed network server is a service domain providing an L2 frame forwarding service for application domains and based on rumprun such that the existing device drivers of NetBSD can be leveraged with little modification. In this model, the TCP/IP stack runs directly in the address space of applications. This allows retaining the client state even if the network server crashes and makes it possible to recover from a network server failure. We leverage the Xen PCI passthrough to access a NIC (Network Interface Controller) from the network server. Our experimental evaluation demonstrates that the performance of the network server is good and comparable with Linux and NetBSD. We also demonstrate the successful recovery after a failure.
- Design Optimization Techniques for Time-Critical Cyber-Physical SystemsZhao, Yecheng (Virginia Tech, 2020-01-20)Cyber-Physical Systems (CPS) are widely deployed in critical applications which are subject to strict timing constraints. To ensure correct timing behavior, much of the effort has been dedicated to the development of validation and verification methods for CPS (e.g., system models and their timing and schedulability analysis). As CPS is becoming increasingly complex, there is an urgent need for efficient optimization techniques that can aid the design of large-scale systems. Specifically, techniques that can find good design options in a reasonable amount of time while meeting all the timing and other critical requirements are becoming vital. However, the current mindset is to use existing schedulability analysis and optimization techniques for the design optimization of time-critical CPS. This has resulted in two issues in today's CPS design: 1) Existing timing and schedulability analysis are very difficult and inefficient to be integrated into well-established optimization frameworks such as mathematical programming; 2) New system models and timing analysis are being developed in a way that is increasingly unfriendly to optimization. Due to these difficulties, existing practice for optimization mostly relies on meta or ad-hoc heuristics, which suffers either from sub-optimality or limited applicability. In this dissertation, we seek to address these issues and explore two new directions for developing optimization algorithms for time-critical CPS. The first is to develop {em optimization-oriented timing analysis}, that are efficient to formulate in mathematical programming framework. The second is a domain-specific optimization framework. The framework leverages domain-specific knowledge to provide methods that abstract timing analysis into a simple mathematical form. This allows to efficiently handle the complexity of timing analysis in optimization algorithms. The results on a number of case studies show that the proposed approaches have the potential to significantly improve upon scalability (several orders of magnitude faster) and solution quality, while being applicable to various system models, timing analysis techniques, and design optimization problems in time-critical CPS.
- Designing, Modeling, and Optimizing Transactional Data StructuresHassan, Ahmed Mohamed Elsayed (Virginia Tech, 2015-09-25)Transactional memory (TM) has emerged as a promising synchronization abstraction for multi-core architectures. Unlike traditional lock-based approaches, TM shifts the burden of implementing threads synchronization from the programmer to an underlying framework using hardware (HTM) and/or software (STM) components. Although TM can be leveraged to implement transactional data structures (i.e., those where multiple operations are allowed to execute atomically, all-or-nothing, according to the transaction paradigm), its intensive speculation may result in significantly lower performance than the optimized concurrent data structures. This poor performance motivates the need to find other, more effective, alternatives for designing transactional data structures without losing the simple programming abstraction proposed by TM. To do so, we identified three major challenges that need to be addressed to design efficient transactional data structures. The first challenge is composability, namely allowing an atomic execution of two or more data structure operations in the same way as TM provides, but without its high overheads. The second challenge is integration, which enables the execution of data structure operations within generic transactions that may contain other memory- based operations. The last challenge is modeling, which encompasses the necessity of defining a unified formal methodology to reason about the correctness of transactional data structures. In this dissertation, we propose different approaches to address the above challenges. First, we address the composability challenge by introducing an optimistic methodology to effi- ciently convert concurrent data structures into transactional ones. Second, we address the integration challenge by injecting the semantic operations of those transactional data struc- ture into TM frameworks, and by presenting two novel STM algorithms in order to enhance the overall performance of those frameworks. Finally, we address the modeling challenge by presenting two models for concurrent and transactional data structures designs. • Our first main contribution in this dissertation is Optimistic transactional boosting (OTB), a methodology to design transactional versions of the highly concurrent optimistic (i.e., lazy) data structures. An earlier (pessimistic) boosting proposal added a layer of abstract locks on top of existing concurrent data structures. Instead, we propose an optimistic boosting methodology, which allows greater data structure-specific optimizations, easier integration with TM frameworks, and lower restrictions on the operations than the original (more pessimistic) boosting methodology. Based on the proposed OTB methodology, we implement the transactional version of two list-based data structures (i.e., set and priority queue). Then, we present TxCF-Tree, a balanced tree whose design is optimized to support transactional accesses. The core optimizations of TxCF-Tree's operations are: providing a traversal phase that does not use any lock and/or speculation and deferring the lock acquisition or physical modification to the transaction's commit phase; isolating the structural operations (such as re-balancing) in an interference-less housekeeping thread; and minimizing the interference between structural operations and the critical path of semantic operations (i.e., additions and removals on the tree). • Our second main contribution is to integrate OTB with both STM and HTM algorithms. For STM, we extend the design of both DEUCE, a Java STM framework, and RSTM, a C++ STM framework, to support the integration with OTB. Using our extension, programmers can include both OTB data structure operations and traditional memory reads/writes in the same transaction. Results show that OTB performance is closer to the optimal lazy (non-transactional) data structures than the original boosting algorithm. On the HTM side, we introduce a methodology to inject semantic operations into the well-known hybrid transactional memory algorithms (e.g., HTM-GL, HyNOrec, and NOre- cRH). In addition, we enhance the proposed semantically-enabled HTM algorithms with a lightweight adaptation mechanism that allows bypassing the HTM paths if the overhead of the semantic operations causes repeated HTM aborts. Experiments on micro- and macro- benchmarks confirm that our proposals outperform the other TM solutions in almost all the tested workloads. • Our third main contribution is to enhance the performance of TM frameworks in gen- eral by introducing two novel STM algorithms. Remote Transaction Commit (RTC) is a mechanism for executing commit phases of STM transactions in dedicated server cores. RTC shows significant improvements compared to its corresponding validation based STM algorithm (up to 4x better) as it decreases the overhead of spin locking during commit, in terms of cache misses, blocking of lock holders, and CAS operations. Remote Inval- idation (RInval) applies the same idea of RTC on invalidation based STM algorithms. Furthermore, it allows more concurrency by executing commit and invalidation routines concurrently in different servers. RInval performs up to 10x better than its corresponding invalidation based STM algorithm (InvalSTM), and up to 2x better than its corresponding validation-based algorithm (NOrec). • Our fourth and final main contribution is to provide a theoretical model for concurrent and transactional data structures. We exploit the similarities of the OTB-based data structures and provide a unified model to reason about the correctness of those designs. Specifically, we extend a recent approach that models data structures with concurrent readers and a single writer (called SWMR), and we propose two novel models that additionally allow multiple writers and transactional execution. Those models are more practical because they cover a wider set of data structures than the original SWMR model.