Browsing by Author "Albahar, Hadeel"
- Building A Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage
  Xu, Peng; Zhao, Nannan; Wan, Jiguang; Liu, Wei; Chen, Shuning; Zhou, Yuanhui; Albahar, Hadeel; Liu, Hanyang; Tang, Liu; Tan, Zhihu (ACM, 2022-05-25)
  The explosive growth of modern web-scale applications has made cost-effectiveness a primary design goal for their underlying databases. As the backbone of modern databases, LSM-tree based key-value stores (LSM stores) face limited storage options: they are designed either for local storage, which is relatively small, expensive, and fast, or for cloud storage, which offers larger capacity at lower cost but is slower. Designing an LSM store that integrates local storage with cloud storage services is a promising way to balance cost and performance. However, such a design faces challenges such as data reorganization, metadata overhead, and reliability issues. In this paper, we propose RocksMash, a fast and efficient LSM store that uses local storage to hold frequently accessed data and metadata while using cloud storage to hold the rest of the data for cost-effectiveness. To improve metadata space efficiency and read performance, RocksMash uses an LSM-aware persistent cache that stores metadata compactly and stores popular data blocks using compaction-aware layouts. Moreover, RocksMash uses an extended write-ahead log for fast parallel data recovery. We implemented RocksMash by embedding these designs into RocksDB. The evaluation results show that RocksMash improves performance by up to 1.7x compared to state-of-the-art schemes while delivering high reliability, cost-effectiveness, and fast recovery.
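The tiered read path the abstract describes can be sketched in a few lines: hot blocks are served from local storage, cold blocks are fetched from the cloud tier and promoted so repeated reads stay local. This is a minimal illustration only; the class and method names (`TieredStore`, `put_cold`, `get`) are hypothetical and not RocksMash's actual API.

```python
# Hypothetical sketch of a two-tier LSM read path in the spirit of
# RocksMash: hot data blocks live on fast local storage, cold blocks
# in a slower, cheaper cloud store. All names here are illustrative.

class TieredStore:
    def __init__(self):
        self.local_cache = {}   # block_id -> bytes (fast local storage)
        self.cloud = {}         # block_id -> bytes (slow, cheap cloud store)
        self.hits = 0
        self.misses = 0

    def put_cold(self, block_id, data):
        """Cold data is written to the cloud tier."""
        self.cloud[block_id] = data

    def get(self, block_id):
        """Serve from the local tier when possible; otherwise fetch from
        the cloud tier and promote the block so repeated reads stay local."""
        if block_id in self.local_cache:
            self.hits += 1
            return self.local_cache[block_id]
        self.misses += 1
        data = self.cloud[block_id]
        self.local_cache[block_id] = data  # promote the popular block
        return data

store = TieredStore()
store.put_cold("sst-001/blk-0", b"key-value payload")
store.get("sst-001/blk-0")   # miss: fetched from cloud, then promoted
store.get("sst-001/blk-0")   # hit: served from local storage
print(store.hits, store.misses)  # 1 1
```

A real implementation would bound the local tier's size and pick victims with an LSM-aware policy (the paper's compaction-aware layouts), but the promote-on-miss structure above is the core of the cost/performance trade-off.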
- An End-to-End High-performance Deduplication Scheme for Docker Registries and Docker Container Storage Systems
  Zhao, Nannan; Lin, Muhui; Albahar, Hadeel; Paul, Arnab K.; Huan, Zhijie; Abraham, Subil; Chen, Keren; Tarasov, Vasily; Skourtis, Dimitrios; Anwar, Ali; Butt, Ali R. (ACM, 2024)
  The wide adoption of Docker containers for supporting agile and elastic enterprise applications has led to a broad proliferation of container images. The associated storage performance and capacity requirements place high pressure on the infrastructure of container registries, which store and distribute images, and of container storage systems on the Docker client side, which manage image layers and store ephemeral data generated at container runtime. The storage demand is worsened by the large amount of duplicate data in images. Moreover, container storage systems that use Copy-on-Write (CoW) file systems as storage drivers exacerbate the redundancy. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the growing storage requirements of container registries and improve the space efficiency of container storage systems. However, existing deduplication techniques significantly degrade the performance of both registries and container storage systems because of data reconstruction overhead as well as the deduplication cost. We propose DupHunter, an end-to-end deduplication scheme that deduplicates layers for both Docker registries and container storage systems while maintaining high image distribution speed and container I/O performance. DupHunter is divided into three tiers: a Docker registry tier, a middle tier, and a client tier. Specifically, we first build a high-performance deduplication engine at the Docker registry tier that not only natively deduplicates layers for space savings but also reduces layer restore overhead. Then, we use deduplication offloading at the middle tier, which utilizes the deduplication engine to eliminate redundant files from the client tier and thus avoids introducing deduplication overhead on the Docker client side. To further reduce the data duplicates caused by CoW and improve container I/O performance, we use a container-aware backing file system at the client tier that preallocates space for each container and ensures that files in a container and its modifications are placed and redirected closer on the disk to maintain locality. Under real workloads, DupHunter reduces storage space by up to 6.9× and reduces GET layer latency by up to 2.8× compared to the state of the art. Moreover, DupHunter improves container I/O performance by up to 93% for reads and 64% for writes.
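The space savings come from file-level deduplication: each unique file body is stored once, keyed by its content hash, and a layer becomes a list of (path, hash) references that can be reassembled on a GET. The sketch below illustrates that idea only; the names (`DedupStore`, `add_layer`, `restore_layer`) are hypothetical, not DupHunter's actual engine.

```python
# Illustrative sketch of file-level layer deduplication in the spirit of
# DupHunter: identical file bodies across layers are stored once in a
# content-addressed blob store. Names here are hypothetical.
import hashlib

class DedupStore:
    def __init__(self):
        self.blobs = {}    # content hash -> file bytes (each stored once)
        self.layers = {}   # layer id -> [(path, content hash), ...]

    def add_layer(self, layer_id, files):
        """Ingest a layer as {path: bytes}, deduplicating file bodies."""
        refs = []
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # skip if already stored
            refs.append((path, digest))
        self.layers[layer_id] = refs

    def restore_layer(self, layer_id):
        """Reassemble a layer's files from the deduplicated blob store."""
        return {path: self.blobs[h] for path, h in self.layers[layer_id]}

store = DedupStore()
store.add_layer("base:v1", {"/bin/sh": b"shell", "/etc/os-release": b"ubuntu"})
store.add_layer("app:v1", {"/bin/sh": b"shell", "/app/main": b"binary"})
# Four file references across the two layers, but /bin/sh is stored once:
print(len(store.blobs))  # 3
```

The restore step is exactly the "layer restore overhead" the abstract mentions: a deduplicated registry must rebuild layer tarballs from blob references on every GET, which is why DupHunter's registry tier is designed to keep that reconstruction cheap.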
- Integration and Implementation (INT) CS5604 Fall 2019
  Agarwal, Rahul; Albahar, Hadeel; Roth, Eric; Sen, Malabika; Yu, Lixing (Virginia Tech, 2019-12-11)
  The first major goal of this project is to build a state-of-the-art information storage, retrieval, and analysis system that utilizes the latest technology and industry methods. This system is leveraged to accomplish the second major goal: supporting modern search and browse capabilities for two major content collections: (1) 200,000 ETDs (electronic theses and dissertations), and (2) 14 million settlement documents from the lawsuit wherein 39 U.S. states sued the major tobacco companies. The backbone of the information system is a Docker container cluster running with Rancher and Kubernetes. Information retrieval and visualization are accomplished with containers for Elasticsearch and Kibana, respectively. In addition to traditional searching and browsing, the system supports full-text and metadata searching. Search results include facets as a modern means of browsing among related documents. The system exercises text analysis and machine learning to reveal new properties of collection data, which assist in the generation of the available facets. Recommendations are also presented with search results, based on associations among documents and on logged user activity. The information system is co-designed by six teams of Virginia Tech graduate students, all members of the same computer science class, CS 5604. Although the project is an academic exercise, the teams work and interact as though they are groups within a company developing a product. The teams on this project are: Collection Management ETDs (CME), Collection Management Tobacco Settlement Documents (CMT), Elasticsearch (ELS), Front-end and Kibana (FEK), Integration and Implementation (INT), and Text Analysis and Machine Learning (TML).
  This submission focuses on the work of the Integration (INT) team, which creates and administers Docker containers for each team in addition to administering the cluster infrastructure. Each container is a customized application environment specific to the needs of the corresponding team; for example, the ELS team's container environment includes Elasticsearch with its internal associated database. INT also administers the integration of the Ceph data storage system into the CS Department Cloud and provides support for interactions between containers and Ceph. During the formative stages of development, INT also guides team evaluations of prospective container components. Beyond the formative stages, INT is responsible for deploying containers in a development environment according to mutual specifications agreed upon with each team. The development process is fluid: INT handles team requests for new containers and updates to existing containers in a continuous integration process until the first system testing environment is completed. During the development stage, INT also collaborates with the CME and CMT teams on the data pipeline subsystems for the ingestion and processing of new collection documents. With the testing environment established, the focus of the INT team shifts toward gathering system performance data and making any systemic adjustments necessary based on analysis of the testing results. Finally, INT provides a production distribution that includes all embedded Docker containers and sub-embedded Git source code repositories. INT archives this distribution on Docker Hub and deploys it on the Virginia Tech CS Cloud.
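The combination of full-text search, metadata filtering, and facets described above maps naturally onto a single Elasticsearch request body: a boolean query pairs a `match` clause (full text) with `filter` clauses (metadata), and a `terms` aggregation produces the facet counts. The sketch below is illustrative only; the index fields (`abstract`, `year`, `department`) are assumptions, not the project's actual schema.

```python
# A minimal sketch of the kind of Elasticsearch request body that backs
# combined full-text, metadata, and faceted search. Field names are
# hypothetical; the real CS5604 mapping may differ.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"abstract": "machine learning"}}  # full-text search
            ],
            "filter": [
                {"term": {"year": 2019}}                     # metadata filter
            ],
        }
    },
    "aggs": {
        # Facet counts used for browsing among related documents
        "by_department": {"terms": {"field": "department", "size": 10}}
    },
}
```

Kibana visualizations and the front-end facet lists can both be driven from the `aggs` section of responses to queries shaped like this one.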