Browsing by Author "Olivier, Pierre"
- Aggregate VM: Why Reduce or Evict VM's Resources When You Can Borrow Them From Other Nodes?
  Chuang, Ho-Ren; Manaouil, Karim; Xing, Tong; Barbalace, Antonio; Olivier, Pierre; Heerekar, Balvansh; Ravindran, Binoy (ACM, 2023-05-08)
  Hardware resource fragmentation is a common issue in data centers. Traditional solutions based on migration or overcommitment are unacceptably slow, and modern commercial or research solutions like Spot VM may reduce or evict a VM's resources at any time. We propose an alternative solution that does not suffer from these drawbacks: the Aggregate VM. We introduce a new distributed hypervisor design, the resource-borrowing hypervisor, which creates Aggregate VMs: distributed VMs that temporarily aggregate fragmented resources belonging to different host machines. This requires mobility of virtual CPUs, memory, and I/O devices. We implement a prototype, FragVisor, which runs guest software transparently. We also propose minimal modifications to the guest OS that can enable significant performance gains. We evaluate FragVisor over a set of microbenchmarks and IaaS-style real applications. Although Aggregate VMs are not a perfect fit for every type of application, some workloads enjoy significant speedups compared to overcommitted scenarios (up to 3.9x with 4 distributed vCPUs). We further demonstrate that FragVisor is faster than a state-of-the-art competitor, GiantVM (up to 2.5x).
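To illustrate the resource-borrowing idea at a high level, here is a minimal sketch of a placement decision that prefers borrowing spare pCPUs from a remote node over overcommitting the local one. Everything in it, including the node_t structure and the find_donor() helper, is a hypothetical illustration, not FragVisor's actual interface.

```c
/* Illustrative sketch of the resource-borrowing idea behind Aggregate VMs.
 * All types and helpers here are hypothetical, not FragVisor's real API. */
#include <stdio.h>

typedef struct {
    int id;
    int free_pcpus;   /* physical CPUs currently idle on this node */
} node_t;

/* Pick a donor node with enough spare pCPUs to satisfy the request. */
static node_t *find_donor(node_t *nodes, int n, int needed) {
    for (int i = 0; i < n; i++)
        if (nodes[i].free_pcpus >= needed)
            return &nodes[i];
    return NULL;
}

/* Placement policy: prefer borrowing fragmented remote resources over
 * overcommitting (time-sharing) the local node's pCPUs. */
static void place_vcpus(node_t *local, node_t *nodes, int n, int vcpus) {
    if (local->free_pcpus >= vcpus) {
        printf("run all %d vCPUs locally\n", vcpus);
        return;
    }
    int deficit = vcpus - local->free_pcpus;
    node_t *donor = find_donor(nodes, n, deficit);
    if (donor)
        printf("borrow %d pCPUs from node %d for the Aggregate VM\n",
               deficit, donor->id);
    else
        printf("fall back to overcommitting node %d\n", local->id);
}

int main(void) {
    node_t nodes[] = { {0, 2}, {1, 4}, {2, 1} };
    place_vcpus(&nodes[0], &nodes[1], 2, 4); /* needs 2 more pCPUs */
    return 0;
}
```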
- Offloading Datacenter Jobs to RISC-V Hardware for Improved Performance and Power Efficiency
  Heerekar, Balvansh; Philippidis, Cesar; Chuang, Ho-Ren; Olivier, Pierre; Barbalace, Antonio; Ravindran, Binoy (ACM, 2024-09-16)
  The end of Moore's Law has brought significant changes to the architecture of servers used in data centers, which increasingly incorporate new ISAs beyond x86-64 as well as diverse accelerators. Further, single-board computers have become increasingly efficient and can run certain Linux applications at significantly lower equipment and energy costs compared to traditional servers. Past research has demonstrated that offloading applications at runtime from x86-based servers to ARM-based single-board computers can increase throughput and energy efficiency. The RISC-V architecture has recently gained significant commercial interest, and OS-capable single-board computers with RISC-V cores are increasingly available at the commodity scale. In this paper we propose a system that offloads jobs from an x86 server to a RISC-V single-board computer at runtime, with the goals of improving job throughput and energy savings. Towards this, we port the Popcorn Linux multi-ISA toolchain and runtime framework to RISC-V, enabling the live migration of applications between an x86 Xeon server and a SiFive HiFive RISC-V board. We further propose a scheduling policy, Lowest Slowdown First (LSF), that drives the offloading of long-running and stateful datacenter background jobs from the server to the board to alleviate workload congestion on the server. LSF relies on monitoring jobs' performance on the server, predicting the slowdown they would suffer if running on the board, and migrating the jobs with the lowest estimated slowdown. Our evaluation shows that LSF yields up to a 20% increase in throughput while also gaining 16% more energy efficiency for compute-intensive workloads.
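As a rough illustration of the LSF policy described above, the sketch below selects, among monitored jobs, the one with the lowest predicted slowdown for migration to the board. The job_t fields and the IPC-ratio slowdown model are invented stand-ins for the paper's monitoring and prediction machinery.

```c
/* Hedged sketch of a Lowest Slowdown First (LSF) selection step.
 * The slowdown model below is a made-up placeholder, not the paper's. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *name;
    double server_ipc;   /* measured instructions/cycle on the x86 server */
    double board_ipc;    /* estimated instructions/cycle on the RISC-V board */
} job_t;

/* Predicted slowdown if the job moves to the board (>1.0 means slower). */
static double predict_slowdown(const job_t *j) {
    return j->server_ipc / j->board_ipc;
}

/* LSF: migrate the job that would suffer the least from offloading. */
static const job_t *pick_lsf(const job_t *jobs, size_t n) {
    const job_t *best = NULL;
    double best_sd = 0.0;
    for (size_t i = 0; i < n; i++) {
        double sd = predict_slowdown(&jobs[i]);
        if (!best || sd < best_sd) { best = &jobs[i]; best_sd = sd; }
    }
    return best;
}

int main(void) {
    job_t jobs[] = {
        { "compression", 1.8, 0.9 },  /* predicted slowdown 2.0 */
        { "key-value",   1.2, 0.8 },  /* predicted slowdown 1.5 */
    };
    printf("offload: %s\n", pick_lsf(jobs, 2)->name); /* key-value */
    return 0;
}
```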
- On Optimizing and Leveraging Distributed Shared Memory for High Performance, Resource Aggregation, and Cache-coherent Heterogeneous-ISA Processors
  Chuang, Ho-Ren (Virginia Tech, 2022-06-28)
  This dissertation focuses on the problem space of heterogeneous-ISA multiprocessors, an architectural design point that is being studied by the academic research community and is increasingly available in commodity systems. Since such architectures usually lack globally coherent shared memory, software-based distributed shared memory (DSM) is often used to provide the illusion of such a memory. The DSM abstraction typically provides this illusion using a reader-replicate, writer-invalidate memory consistency protocol that operates at the granularity of memory pages and is usually implemented as a first-class operating system abstraction. This enables symmetric multiprocessing (SMP) programming frameworks, augmented with a heterogeneous-ISA compiler, to use CPU cores of different ISAs for parallel computations as if they were of the same ISA, improving programmability, especially for legacy SMP applications, which can therefore run unmodified on such hardware. Past DSMs have been plagued by poor performance, in part due to the high latency and low bandwidth of interconnect network infrastructures. The dissertation revisits DSM in light of modern interconnects that reverse this performance trend.
  The dissertation presents Xfetch, a bulk page prefetching mechanism designed for the DEX DSM system. Xfetch exploits spatial locality and aggressively and sequentially prefetches pages before potential read faults, improving DSM performance. Our experimental evaluations reveal that Xfetch achieves up to ≈142% speedup over the baseline DEX DSM, which does not prefetch page data.
  SMP programming models often allow primitives that permit weaker memory consistency semantics, where synchronization updates can be delayed, permitting greater parallelism and thereby higher performance. Inspired by such primitives, the dissertation presents a DSM protocol called MWPF that trades off memory consistency for higher performance in select SMP code regions, targeting heterogeneous-ISA multiprocessor systems. MWPF also overcomes performance bottlenecks of past DSM systems for heterogeneous-ISA multiprocessors, such as large numbers of invalidation messages, false page sharing, frequent read page faults, and high synchronization overheads, by using efficient protocol primitives that delay and batch invalidation messages, aggressively prefetch data pages, and perform cross-domain synchronization with low overhead. Our experimental evaluations reveal that MWPF achieves, on average, 11% speedup over the baseline DSM implementation.
  The dissertation presents PuzzleHype, a distributed hypervisor that enables a single virtual machine (VM) to use fragmented resources in distributed virtualized settings, such as CPU cores, memory, and devices of different physical hosts, thereby decreasing resource fragmentation and increasing resource utilization. PuzzleHype leverages DSM implemented in host operating systems to present a unified and consistent view of a contiguous pseudo-physical address space to guest operating systems. To transparently utilize CPU and I/O resources, PuzzleHype integrates multiple physical CPUs into a single VM by migrating threads, forwarding interrupts, and delegating I/O. Our experimental evaluations reveal that PuzzleHype yields speedups in the range of 173%–355% over baseline over-provisioning scenarios that are otherwise necessary due to resource fragmentation.
  To enable a distributed hypervisor to adapt to resource and workload changes, the dissertation proposes the concept of CPU borrowing, which allows a VM's virtual CPU (vCPU) to migrate to an available physical CPU (pCPU) and release it when it is no longer necessary, i.e., CPU returning. CPU borrowing can thus be used when a node is over-committed, and CPU returning can be used when the borrowed CPU resource is no longer necessary. To transparently migrate a vCPU at runtime without incurring significant downtime, the dissertation presents a suite of techniques including leveraging thread migration, loading/restoring vCPU states in KVM, maintaining a global vCPU location table, and creating a DSM kernel thread for handling on-demand paging. Our experimental evaluations reveal that migrating vCPUs to resource-available nodes achieves a speedup of 1.4x over running the vCPUs on distributed nodes.
  When a VM spans multiple nodes, its likelihood of failure increases. To mitigate this, the dissertation presents a distributed checkpoint/restart mechanism that allows a distributed VM to tolerate failures. A user interface is introduced for sending checkpoint/restart commands to, and receiving responses from, a distributed VM. We implement the checkpoint/restart technique in the native KVM tool and extend it to a distributed mode by converting Inter-Process Communication (IPC) into message passing between nodes, pausing/resuming distributed vCPU executions, and loading/restoring runtime states on the correct set of nodes. Our experimental evaluations indicate that the overhead of checkpointing a distributed VM is within ≈10% of that of the native KVM tool with our checkpoint support. Restarting a distributed VM with our restart support is faster than with native KVM because no additional page faults occur during the restart.
  The dissertation's final contribution is PopHype, a system software stack that allows simulation of cache-coherent, shared-memory heterogeneous-ISA hardware. PopHype includes a Linux operating system that implements DSM as an OS abstraction for processes, i.e., it allows multiple processes running on multiple (ISA-different) machines to share memory. With KVM enabled, this OS becomes a hypervisor that allows multiple, process-based instances of an architecture emulator such as QEMU to execute in a shared address space, allowing multiple QEMU instances to emulate different ISAs in shared memory, i.e., to emulate shared-memory heterogeneous-ISA hardware. PopHype also includes a modified QEMU that uses process-level DSM and an optimized guest OS kernel for improved performance. Our experimental studies confirm PopHype's effectiveness and reveal that PopHype achieves an average speedup of 7.32x over a baseline that runs multiple QEMU instances in shared memory atop a single host OS.
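To make the Xfetch idea concrete, the sketch below shows how a DSM read-fault handler might fetch a small window of sequentially following pages together with the faulting page, exploiting spatial locality so that later sequential reads hit locally. The dsm_fetch_remote() call and PREFETCH_WINDOW constant are hypothetical placeholders; the real DEX/Xfetch logic lives in the kernel's fault path.

```c
/* Illustrative sketch of bulk page prefetching on a DSM read fault,
 * in the spirit of Xfetch. dsm_fetch_remote() and PREFETCH_WINDOW are
 * hypothetical placeholders, not the DEX DSM's actual interface. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <inttypes.h>

#define PAGE_SIZE       4096
#define PREFETCH_WINDOW 8      /* pages fetched ahead of the fault */

/* Stand-in for the network/RDMA fetch of one remote page. */
static void dsm_fetch_remote(uintptr_t page_addr) {
    printf("fetch page 0x%" PRIxPTR "\n", page_addr);
}

static bool page_is_local(uintptr_t page_addr) {
    (void)page_addr;
    return false;  /* pretend nothing is cached locally yet */
}

/* Read-fault handler: fetch the faulting page plus the next pages
 * sequentially, so upcoming sequential reads do not fault again. */
static void dsm_read_fault(uintptr_t fault_addr) {
    uintptr_t base = fault_addr & ~((uintptr_t)PAGE_SIZE - 1);
    for (int i = 0; i <= PREFETCH_WINDOW; i++) {
        uintptr_t page = base + (uintptr_t)i * PAGE_SIZE;
        if (!page_is_local(page))
            dsm_fetch_remote(page);
    }
}

int main(void) {
    dsm_read_fault(0x10002234);  /* fetches the faulting page + 8 more */
    return 0;
}
```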
- SlimGuard: Design and Implementation of a Memory Efficient and Secure Heap Allocator
  Liu, Beichen (Virginia Tech, 2020-01-03)
  Attacks on the heap are an increasingly severe threat. State-of-the-art secure dynamic memory allocators can offer protection; however, their memory consumption is high, making them suboptimal in many situations. We introduce SlimGuard, a secure allocator whose design is driven by memory efficiency. Among other features, SlimGuard uses an efficient fine-grained size-class indexing mechanism and implements a novel dynamic canary scheme. It offers low memory overhead due to its size classes optimized for canary usage, its on-demand metadata allocation, and the combination of randomized allocations and over-provisioning into a single memory-efficient security feature. SlimGuard protects against widespread heap-related attacks such as overflows, over-reads, double/invalid free, and use-after-free. Evaluation over a wide range of applications shows that it offers a significant reduction in memory consumption compared to the state-of-the-art secure allocator (up to 2x in macro-benchmarks), while offering similar or better security guarantees and good performance.
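As a rough sketch of a dynamic canary scheme in the spirit of SlimGuard's, the code below plants a per-allocation canary byte right after each payload and validates it on free, catching small heap overflows. The layout and canary derivation are illustrative guesses, not SlimGuard's actual design.

```c
/* Minimal sketch of a heap canary scheme in the spirit of SlimGuard's
 * dynamic canaries. Layout and canary derivation are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

static uint8_t canary_secret = 0xA5;  /* would be randomized at startup */

/* Derive a per-allocation canary byte from the chunk address. */
static uint8_t canary_for(void *p) {
    return canary_secret ^ (uint8_t)((uintptr_t)p >> 4);
}

static void *guarded_malloc(size_t n) {
    uint8_t *p = malloc(n + 1);      /* one extra byte for the canary */
    if (!p) return NULL;
    p[n] = canary_for(p);            /* plant canary right after payload */
    return p;
}

static void guarded_free(void *ptr, size_t n) {
    uint8_t *p = ptr;
    if (p[n] != canary_for(p)) {     /* an overflow clobbered the canary */
        fprintf(stderr, "heap overflow detected, aborting\n");
        abort();
    }
    free(p);
}

int main(void) {
    char *buf = guarded_malloc(16);
    memset(buf, 'x', 16);            /* stays in bounds: canary intact */
    guarded_free(buf, 16);
    puts("ok");
    return 0;
}
```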
- Xar-Trek: Run-time Execution Migration among FPGAs and Heterogeneous-ISA CPUs
  Horta, Edson; Chuang, Ho-Ren; VSathish, Naarayanan Rao; Philippidis, Cesar; Barbalace, Antonio; Olivier, Pierre; Ravindran, Binoy (ACM, 2021-12-06)
  Datacenter servers are increasingly heterogeneous: from x86 host CPUs, to ARM or RISC-V CPUs in NICs/SSDs, to FPGAs. Previous works have demonstrated that migrating application execution at run-time across heterogeneous-ISA CPUs can yield significant performance and energy gains, with relatively little programmer effort. However, FPGAs have often been overlooked in that context: hardware acceleration using FPGAs involves statically implementing select application functions, which prohibits dynamic and transparent migration. We present Xar-Trek, a new compiler and run-time software framework that overcomes this limitation. Xar-Trek compiles an application for several CPU ISAs and select application functions for acceleration on an FPGA, allowing execution migration between heterogeneous-ISA CPUs and FPGAs at run-time. Xar-Trek's run-time monitors server workloads and migrates application functions to an FPGA or to heterogeneous-ISA CPUs based on a scheduling policy. We develop a heuristic policy that uses application workload profiles to make scheduling decisions. Our evaluations, conducted on a system with x86-64 server CPUs, ARM64 server CPUs, and an Alveo accelerator card, reveal performance gains of 1%–88% over no-migration baselines.
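To give a flavor of a profile-driven scheduling heuristic like the one described above, the sketch below picks a target (x86, ARM, or FPGA) for a migratable function by weighting its profiled runtimes with each target's current load. The profile numbers and the load model are made up for illustration and are not Xar-Trek's actual policy.

```c
/* Hedged sketch of a profile-driven placement heuristic in the spirit of
 * Xar-Trek's scheduler. Profiles and the load model are invented. */
#include <stdio.h>

enum target { TGT_X86, TGT_ARM, TGT_FPGA, TGT_COUNT };
static const char *names[TGT_COUNT] = { "x86-64", "ARM64", "FPGA" };

typedef struct {
    double runtime[TGT_COUNT];  /* profiled runtime of the function (s) */
} profile_t;

/* Pick the target with the lowest load-adjusted expected runtime. */
static enum target place(const profile_t *p, const double load[TGT_COUNT]) {
    enum target best = TGT_X86;
    for (int t = 1; t < TGT_COUNT; t++) {
        double cur  = p->runtime[t] * (1.0 + load[t]);
        double prev = p->runtime[best] * (1.0 + load[best]);
        if (cur < prev)
            best = (enum target)t;
    }
    return best;
}

int main(void) {
    profile_t kernel = { .runtime = { 4.0, 9.0, 1.0 } };
    double load[TGT_COUNT] = { 0.8, 0.1, 0.0 };  /* busy x86 server */
    printf("migrate function to %s\n", names[place(&kernel, load)]);
    return 0;
}
```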