Towards SLO-aware Resource Scheduling for Serverless Inference Workloads

Tripathy, Abhijit

dc.contributor.author	Tripathy, Abhijit	en
dc.contributor.committeechair	Butt, Ali R.	en
dc.contributor.committeemember	Rafique, Muhammad Mustafa	en
dc.contributor.committeemember	Nikolopoulos, Dimitrios S.	en
dc.contributor.department	Computer Science and Applications	en
dc.date.accessioned	2023-08-09T08:00:21Z	en
dc.date.available	2023-08-09T08:00:21Z	en
dc.date.issued	2023-08-08	en
dc.description.abstract	The rapid advancement of Machine Learning (ML) and Deep Learning (DL) has revolutionized various domains, necessitating efficient and cost-effective ML inference capabilities. Function-as-a-Service (FaaS) has emerged as a promising approach for hosting ML inference services, providing a serverless computing environment that streamlines development cycles and offers scalability and simplified infrastructure management. However, existing autoscaling strategies employed by popular FaaS platforms often overlook critical factors such as response time and tail latency. Additionally, Python's Global Interpreter Lock (GIL) poses challenges for parallel computing in high-request traffic scenarios. This thesis addresses the need for efficient and cost-effective Machine Learning (ML) inference capabilities by exploring batching and autoscaling strategies for Serverless Inference instances. The study proposes a prototype FaaS framework that provides adaptive request batching, reactive autoscaling policies, and SLO monitoring, thus allowing Serverless Inference workloads to meet their SLO targets even during peak traffic. The proposed approach aims to optimize resource utilization, mitigate tail latency, and improve overall system performance.	en
dc.description.abstractgeneral	Machine Learning (ML) and Deep Learning (DL) are advanced techniques that allow computers to learn from data and make predictions or decisions without being explicitly programmed. This has led to significant advancements in various fields. Inference refers to the process of applying a trained ML model to new data to make predictions or extract insights. In the context of ML, there is a growing need for efficient and cost-effective inference capabilities. A new approach called Function-as-a-Service (FaaS) has emerged that can address this need. FaaS is a way of abstracting the server infrastructure away from the developers. This means developers can focus on writing the ML code without worrying about managing the underlying infrastructure. FaaS offers benefits such as scalability, simplified infrastructure management, and faster development cycles. However, existing FaaS platforms face challenges in ensuring fast response times and handling high levels of incoming requests. This thesis aims to address these challenges by proposing a prototype FaaS framework. The framework incorporates adaptive request batching, reactive autoscaling policies, and Service-Level Objectives (SLOs) monitoring. Request batching allows the framework to process multiple requests together, improving efficiency. Autoscaling policies ensure the system dynamically adjusts its resources based on the incoming workload. Monitoring SLOs helps track and meet performance targets, even during peak traffic. By optimizing resource utilization, reducing delays in processing requests, and improving overall system performance, the proposed approach seeks to provide efficient and cost-effective ML inference capabilities in a serverless environment.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:38236	en
dc.identifier.uri	http://hdl.handle.net/10919/116005	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.subject	Machine Learning	en
dc.subject	Deep Learning	en
dc.subject	Serverless Inference	en
dc.subject	Autoscaling	en
dc.subject	Load Balancing	en
dc.subject	Response Time	en
dc.subject	Tail Latency	en
dc.title	Towards SLO-aware Resource Scheduling for Serverless Inference Workloads	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Tripathy_A_T_2023.pdf
Size: 674.45 KB
Format: Adobe Portable Document Format

Collections

Masters Theses