On Optimizing and Leveraging Distributed Shared Memory for High Performance, Resource Aggregation, and Cache-coherent Heterogeneous-ISA Processors

dc.contributor.authorChuang, Ho-Renen
dc.contributor.committeechairRavindran, Binoyen
dc.contributor.committeememberWang, Hainingen
dc.contributor.committeememberOlivier, Pierreen
dc.contributor.committeememberZeng, Haiboen
dc.contributor.committeememberJian, Xunen
dc.contributor.departmentElectrical and Computer Engineeringen
dc.date.accessioned2022-06-29T08:00:44Zen
dc.date.available2022-06-29T08:00:44Zen
dc.date.issued2022-06-28en
dc.description.abstractThis dissertation focuses on the problem space of heterogeneous-ISA multiprocessors – an architectural design point that is being studied by the academic research community and increasingly available in commodity systems. Since such architectures usually lack globally coherent shared memory, software-based distributed shared memory (DSM) is often used to provide the illusion of such a memory. The DSM abstraction typically provides this illusion using a reader-replicate, writer-invalidate memory consistency protocol that operates at the granularity of memory pages and is usually implemented as a first-class operating system abstraction. This enables symmetric multiprocessing (SMP) programming frameworks, augmented with a heterogeneous-ISA compiler, to use CPU cores of different ISAs for parallel computations as if they are of the same ISA, improving programmability, especially for legacy SMP applications which therefore can run unmodified on such hardware. Past DSMs have been plagued by poor performance, in part due to the high latency and low bandwidth of interconnect network infrastructures. The dissertation revisits DSM in light of modern interconnects that reverse this performance trend. The dissertation presents Xfetch, a bulk page prefetching mechanism designed for the DEX DSM system. Xfetch exploits spatial locality, and aggressively and sequentially prefetches pages before potential read faults, improving DSM performance. Our experimental evaluations reveal that Xfetch achieves up to ≈142% speedup over the baseline DEX DSM that does not prefetch page data. SMP programming models often allow primitives that permit weaker memory consistency semantics, where synchronization updates can be delayed, permitting greater parallelism and thereby higher performance. Inspired by such primitives, the dissertation presents a DSM protocol called MWPF that trades-off memory consistency for higher performance in select SMP code regions, targeting heterogeneous-ISA multiprocessor systems. MWPF also overcomes performance bottlenecks of past DSM systems for heterogeneous-ISA multiprocessors such as due to significant number of invalidation messages, false page sharing, large number of read page faults, and large synchronization overheads by using efficient protocol primitives that delay and batch invalidation messages, aggressively prefetch data pages, and perform cross-domain synchronization with low overhead. Our experimental evaluations reveal that MWPF achieves, on average, 11% speedup over the baseline DSM implementation. The dissertation presents PuzzleHype, a distributed hypervisor that enables a single virtual machine (VM) to use fragmented resources in distributed virtualized settings such as CPU cores, memory, and devices of different physical hosts, and thereby decrease resource fragmentation and increase resource utilization. PuzzleHype leverages DSM implemented in host operating systems to present an unified and consistent view of a continuous pseudo-physical address space to guest operating systems. To transparently utilize CPU and I/O resources, PuzzleHype integrates multiple physical CPUs into a single VM by migrating threads, forwarding interrupts, and by delegating I/O. Our experimental evaluations reveal that PuzzleHype yields speedups in the range of 355%–173% over baseline over-provisioning scenarios which are otherwise necessary due to resource fragmentation. To enable a distributed hypervisor to adapt to resource and workload changes, the dissertation proposes the concept of CPU borrowing that allows a VM's virtual CPU (vCPU) to migrate to an available physical CPU (pCPU) and release it when it is no longer necessary, i.e., CPU returning. CPU borrowing can thus be used when a node is over-committed, and CPU returning can be used when the borrowed CPU resource is no longer necessary. To transparently migrate a vCPU at runtime without incurring a significant downtime, the dissertation presents a suite of techniques including leveraging thread migration, loading/restoring vCPU in KVM states, maintaining a global vCPU location table, and creating a DSM kernel thread for handling on-demand paging. Our experimental evaluations reveal that migrating vCPUs to resource-available nodes achieves a speedup of 1.4x over running the vCPUs on distributed nodes. When a VM spans multiple nodes, it is likelihood for failure increases. To mitigate this, the dissertation presents a distributed checkpoint/restart mechanism that allows a distributed VM to tolerate failures. A user interface is introduced for sending/receiving checkpoint/restart commands to a distributed VM. We implement the checkpoint/restart technique in the native KVM tool, and extend it to a distributed mode by converting Inter-Process Communication (IPC) into message passing between nodes, pausing/resuming distributed vCPU executions, and loading/restoring runtime states on the correct set of nodes. Our experimental evaluations indicate that the overhead of checkpointing a distributed VM is ≈10% or less than that of the native KVM tool with our checkpoint support. Restarting a distributed VM is faster than native KVM with our restart support because no additional page faults occur during restarting. The dissertation's final contribution is PopHype, a system software stack that allows simulation of cache-coherent, shared memory heterogeneous-ISA hardware. PopHype includes a Linux operating system that implements DSM as an OS abstraction for processes, i.e., allows multiple processes running on multiple (ISA-different) machines to share memory. With KVM-enabled, this OS becomes a hypervisor that allows multiple, process-based instances of an architecture emulator such as QEMU to execute in a shared address space, allowing multiple QEMU instances to emulate different ISAs in shared memory, i.e., emulate shared memory heterogeneous-ISA hardware. PopHype also includes a modified QEMU to use process-level DSM and an optimized guest OS kernel for improved performance. Our experimental studies confirm PopHype's effectiveness, and reveal that PopHype achieves an average speedup of 7.32x over a baseline that runs multiple QEMU instances in shared memory atop a single host OS.en
dc.description.abstractgeneralComputing devices are ubiquitous around us. Each of these devices is powered by specialized chips called processors. These processors take in instructions, process them, and produce output. Such processing is what enables us, humans, to send messages to our loved ones, take photographs, as well as carry out various business functions such as using spreadsheet software. The kinds of instructions these processors execute are classified into so-called Instruction Set Architectures or ISAs. Chip designers build processors adopting different ISAs for various applications ranging from computing on mobile phones to cloud computing data centers used by large technology companies. Within a data center, there are typically hundreds of thousands of computing devices that serve an organization's purpose to serve millions or even billions of users. Programming these computers individually to serve a collective goal is an arduous task requiring hundreds of software engineering experts. To simplify programming these computers on a large scale, this thesis envisions an abstraction where tens of devices appear as one computing unit to the programmer, allowing them to program multiple computers as if they are one. This allows for better resource utilization in the sense that the power of multiple computing devices can be pooled together without the need to acquire newer, larger, and more-expensive computers. Furthermore, such pooling allows the software to leverage multiple different ISAs on different computers instead of a single ISA on one computer. This thesis also envisions a way for software to run on multiple computers with potentially different ISAs without exposing the difficulty of managing them to the software engineers.en
dc.description.degreeDoctor of Philosophyen
dc.format.mediumETDen
dc.identifier.othervt_gsexam:34694en
dc.identifier.urihttp://hdl.handle.net/10919/110968en
dc.language.isoenen
dc.publisherVirginia Techen
dc.rightsIn Copyrighten
dc.rights.urihttp://rightsstatements.org/vocab/InC/1.0/en
dc.subjectDistributed Shared Memoryen
dc.subjectVirtualizationen
dc.subjectScale-outen
dc.subjectNon-cache-coherent Domainsen
dc.titleOn Optimizing and Leveraging Distributed Shared Memory for High Performance, Resource Aggregation, and Cache-coherent Heterogeneous-ISA Processorsen
dc.typeDissertationen
thesis.degree.disciplineComputer Engineeringen
thesis.degree.grantorVirginia Polytechnic Institute and State Universityen
thesis.degree.leveldoctoralen
thesis.degree.nameDoctor of Philosophyen

Files

Original bundle
Now showing 1 - 2 of 2
Loading...
Thumbnail Image
Name:
Chuang_H_D_2022.pdf
Size:
3.48 MB
Format:
Adobe Portable Document Format
Loading...
Thumbnail Image
Name:
Chuang_H_D_2022_support_1.pdf
Size:
42.62 KB
Format:
Adobe Portable Document Format
Description:
Supporting documents