Programming Abstractions for Scalable and High-Performance Memory Disaggregation

Date

2025-10-22

Publisher

Virginia Tech

Abstract

Memory-intensive applications, such as in-memory analytics, databases, and caching, are driving memory demand in datacenters. Unfortunately, DRAM scaling is challenging as the technology approaches its physical limits. Even as demand continues to rise, memory remains an underutilized resource within datacenters while accounting for a significant portion of the total infrastructure cost. Consequently, memory capacity and performance have become significant bottlenecks, affecting both overall system cost and efficiency.

To address this inefficiency, memory disaggregation has emerged as a promising solution. Memory disaggregation physically separates compute from memory, so that each resource can be provisioned and scaled independently. Applications run on compute nodes that have a small amount of local memory and access the disaggregated memory over a fast network or fabric technology, such as Remote Direct Memory Access (RDMA) or the more recent Compute Express Link (CXL). This architecture is promising because it can utilize compute and memory resources efficiently.

Despite its promise, memory disaggregation must overcome several challenges before it can be widely adopted. First, the scalability of the disaggregated architecture depends on the scalability of the components driving it. RDMA network interconnects are a key enabler of practical memory disaggregation; hence, a scalable RDMA communication framework is imperative to achieving system scalability. However, RDMA performance does not scale with the cluster size, pointing towards a fundamental trade-off between performance and scalability in RDMA-capable networks. Second, memory disaggregation is a paradigm shift from the conventional model of programming with machine-local memory, because application state may be distributed across disaggregated memory. A lack of programming abstractions that offer easy-to-use interfaces would limit the applicability of memory disaggregation. Finally, apart from the efficiency gains, performance is crucial to the adoption of memory disaggregation; hence, the system must deliver good performance in this architecture.

This dissertation aims to overcome the challenges listed above. I will first describe my prior work, FLOCK, which achieves scalability and high performance in RDMA-capable networks. FLOCK exposes a connection handle abstraction and uses it to transparently coalesce network messages at the application to efficiently utilize the network bandwidth and reduce CPU cycles spent on network I/O. FLOCK introduces a symbiotic load-control scheduling policy that enables a server to dynamically control the maximum network load it wants to handle, which is key to achieving scalability.
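As an illustration of the coalescing idea only, the following minimal Python sketch batches small messages behind a connection handle so that several application sends share one network operation. The class name `CoalescingHandle`, the `max_batch` threshold, and the `send_batch` callback are all hypothetical and are not taken from FLOCK; the sketch also omits the load-control scheduling policy.

```python
from typing import Callable, List


class CoalescingHandle:
    """Hypothetical connection handle that batches small messages.

    Not FLOCK's API: `send_batch` stands in for a single network send.
    """

    def __init__(self, send_batch: Callable[[List[bytes]], None],
                 max_batch: int = 4) -> None:
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._pending: List[bytes] = []

    def send(self, msg: bytes) -> None:
        # Buffer the message; emit one network send once the batch is full.
        self._pending.append(msg)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        # Hand the whole batch to the network in one operation.
        if self._pending:
            self._send_batch(self._pending)
            self._pending = []


# Usage: ten application messages become three network sends.
wire: List[List[bytes]] = []          # records each simulated network send
handle = CoalescingHandle(wire.append, max_batch=4)
for i in range(10):
    handle.send(f"req-{i}".encode())
handle.flush()                        # push out the final partial batch
```

In this sketch the application calls `send` as if each message went out immediately, while the handle decides when a network operation actually occurs.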

This dissertation also proposes a programmable, scalable, and high-performance disaggregated memory system, SONIC. In this work, we offer user-friendly and general programming abstractions that make programming for disaggregated memory similar to working with machine-local memory. These abstractions can be applied to a wide variety of applications, demonstrating their generality and allowing many applications to benefit from disaggregated memory. The key idea in SONIC is using transactions that enable applications to access disaggregated memory with location transparency. We present multiple commit protocols for efficient transaction processing in our system. Accessing disaggregated memory over the network is slower than accessing local memory; hence, our design incorporates several optimizations to reduce network round trips, demonstrating that general-purpose programming abstractions can deliver good performance in the disaggregated architecture.
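To make the idea of transactional, location-transparent access concrete, here is a minimal single-threaded Python sketch. It assumes an optimistic, version-validating commit, which is a generic technique chosen for illustration; the abstract does not say which commit protocols SONIC uses. `RemoteMemory`, `Transaction`, and their methods are hypothetical names, and a plain dictionary stands in for a memory node.

```python
from typing import Any, Dict, Tuple


class RemoteMemory:
    """Hypothetical stand-in for a memory node: key -> (version, value)."""

    def __init__(self) -> None:
        self.cells: Dict[str, Tuple[int, Any]] = {}

    def read(self, key: str) -> Tuple[int, Any]:
        return self.cells.get(key, (0, None))


class Transaction:
    """Buffers reads and writes locally, then validates at commit time."""

    def __init__(self, mem: RemoteMemory) -> None:
        self.mem = mem
        self.read_set: Dict[str, int] = {}    # key -> version first observed
        self.write_set: Dict[str, Any] = {}   # key -> buffered new value

    def get(self, key: str) -> Any:
        if key in self.write_set:             # read your own buffered write
            return self.write_set[key]
        version, value = self.mem.read(key)
        self.read_set.setdefault(key, version)
        return value

    def put(self, key: str, value: Any) -> None:
        self.write_set[key] = value           # no remote access until commit

    def commit(self) -> bool:
        # Abort if anything this transaction read has changed since.
        for key, seen in self.read_set.items():
            if self.mem.read(key)[0] != seen:
                return False
        # Apply buffered writes, bumping each version. A real system would
        # need to make validation and write-back atomic; this sketch is
        # single-threaded and ignores that.
        for key, value in self.write_set.items():
            version, _ = self.mem.read(key)
            self.mem.cells[key] = (version + 1, value)
        return True


# Usage: two transactions read "x"; the second to commit is rejected.
mem = RemoteMemory()
t1, t2 = Transaction(mem), Transaction(mem)
t1.get("x"); t2.get("x")
t1.put("x", "from-t1"); t2.put("x", "from-t2")
first_ok = t1.commit()
second_ok = t2.commit()
```

The application code only calls `get`, `put`, and `commit`, and never names the node that holds a key, which is the sense of location transparency illustrated here. Buffering writes until commit is one simple way to reduce round trips.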

Finally, this dissertation presents directions for future research on memory disaggregation. I will discuss my ongoing work on extending SONIC to leverage application knowledge for better remote memory management that can improve application performance while efficiently utilizing system resources (compute, memory, and network). Furthermore, I will discuss the challenges and interesting problems that memory disaggregation brings, such as fault tolerance, and how to address them.

Keywords

Datacenter networking, Remote Direct Memory Access (RDMA), Distributed Systems, Memory Disaggregation, Scalability
