Rethinking Serverless for Machine Learning Inference
dc.contributor.author | Ellore, Anish Reddy | en |
dc.contributor.committeechair | Butt, Ali | en |
dc.contributor.committeemember | Hu, Liting | en |
dc.contributor.committeemember | Williams, Daniel John | en |
dc.contributor.department | Computer Science and Applications | en |
dc.date.accessioned | 2023-08-22T08:00:17Z | en |
dc.date.available | 2023-08-22T08:00:17Z | en |
dc.date.issued | 2023-08-21 | en |
dc.description.abstract | In the era of artificial intelligence and machine learning, AI/ML inference tasks have become exceedingly popular. However, executing these workloads on dedicated hardware may not be feasible for many users due to high maintenance costs, varying load patterns, and time to production. Furthermore, ML inference workloads are stateless, and most are not extremely latency-sensitive. For example, tasks such as fake-review removal, abusive-language detection, tweet classification, image tagging, and free-tier chat bots do not require real-time inference. These characteristics make serverless platforms a good fit for deployment. In this work, we identify the bottlenecks involved in hosting inference jobs on serverless platforms and optimize serverless for better performance and resource utilization. Specifically, we identify model loading and model memory duplication as the major bottlenecks in serverless inference, and to address them, we propose a new approach that rethinks the way FaaS requests are served. To support this design, we employ a hybrid scaling approach to implement the autoscaling feature of serverless. | en |
dc.description.abstractgeneral | Most modern software applications leverage the power of machine learning to incorporate intelligent features. For instance, platforms like Yelp employ machine learning algorithms to detect fake reviews, intelligent chatbots such as ChatGPT provide interactive conversations, and Netflix relies on machine learning to recommend personalized content to its users. Creating these machine learning services involves several stages, including data collection, model training using the collected data, and serving the trained model to deploy the service. This final stage, known as inference, is crucial for delivering real-time predictions or responses to user queries. In our research, we select serverless computing as the infrastructure for deploying these popular inference workloads. Serverless, also referred to as Function as a Service (FaaS), is an execution paradigm in cloud computing that allows users to run their code efficiently by providing scalability, elasticity, and fine-grained billing. In this work, we identify model loading and model memory duplication as the major bottlenecks in serverless inference. To solve these problems, we propose a new approach that rethinks the way FaaS requests are served. To support this design, we use a hybrid scaling approach to implement the autoscaling feature of serverless. | en |
dc.description.degree | Master of Science | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:38235 | en |
dc.identifier.uri | http://hdl.handle.net/10919/116068 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | Serverless | en |
dc.subject | FaaS | en |
dc.subject | Machine Learning Inference | en |
dc.subject | Model Serving | en |
dc.subject | Container | en |
dc.title | Rethinking Serverless for Machine Learning Inference | en |
dc.type | Thesis | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | masters | en |
thesis.degree.name | Master of Science | en |
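The abstracts above name model loading and model memory duplication as the major bottlenecks in serverless inference. As a minimal illustrative sketch only (not the thesis's implementation; the names load_model, MODEL_PATH, and the handler signatures are hypothetical placeholders), the Python snippet below contrasts a naive handler that reloads the model on every invocation with a warm handler that caches it at module scope, showing why per-request model loading dominates latency:

# Illustrative sketch (not the thesis's approach): per-invocation model
# loading vs. caching the model at module scope in a FaaS-style handler.
# load_model, MODEL_PATH, and the handler signatures are hypothetical.

import time

MODEL_PATH = "/models/classifier.bin"  # hypothetical artifact location

def load_model(path):
    """Placeholder for an expensive deserialization step (e.g. pulling
    hundreds of MB of weights from storage into memory)."""
    time.sleep(2.0)           # simulate a multi-second load
    return {"weights": path}  # stand-in for an in-memory model object

# Naive handler: pays the load cost on every request.
def handler_cold(event, context=None):
    model = load_model(MODEL_PATH)   # repeated per invocation
    return {"prediction": str(model["weights"])}

# Warm handler: load once per container, reuse across requests.
_MODEL = None

def handler_warm(event, context=None):
    global _MODEL
    if _MODEL is None:               # only the first request pays the cost
        _MODEL = load_model(MODEL_PATH)
    return {"prediction": str(_MODEL["weights"])}

if __name__ == "__main__":
    for name, fn in [("cold", handler_cold), ("warm", handler_warm)]:
        start = time.time()
        for _ in range(3):
            fn({"input": "example"})
        print(f"{name}: 3 requests took {time.time() - start:.1f}s")

Note that even the cached variant keeps one copy of the model per container, so scaling out replicas duplicates the model in memory; that duplication is the second bottleneck the abstracts identify.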