Rethinking Serverless for Machine Learning Inference

Ellore, Anish Reddy

Files

Ellore_A_T_2023.pdf (861.81 KB)

Downloads: 171

Date

2023-08-21

Authors

Ellore, Anish Reddy

Publisher

Virginia Tech

Abstract

In the era of artificial intelligence and machine learning, AI/ML inference tasks have become exceedingly popular. However, executing these workloads on dedicated hardware may not be feasible for many users because of high maintenance costs, varying load patterns, and long time to production. Furthermore, ML inference workloads are stateless, and most of them are not extremely latency sensitive. For example, tasks such as fake review removal, abusive language detection, tweet classification, image tagging, and free-tier chatbots do not require real-time inference. These characteristics make serverless platforms a good fit for deployment. In this work, we identify the bottlenecks involved in hosting these inference jobs on serverless platforms and optimize serverless for better performance and resource utilization. Specifically, we identify model loading and model memory duplication as major bottlenecks in serverless inference. To address these problems, we propose a new approach that rethinks the way we serve FaaS requests. To support this design, we employ a hybrid scaling approach to implement the autoscaling feature of serverless.

Keywords

Serverless, FaaS, Machine Learning Inference, Model Serving, Container

Persistent link

http://hdl.handle.net/10919/116068

Collections

Masters Theses
