Rethinking Serverless for Machine Learning Inference

Date

2023-08-21

Publisher

Virginia Tech

Abstract

In the era of artificial intelligence and machine learning, AI/ML inference tasks have become exceedingly popular. However, executing these workloads on dedicated hardware may not be feasible for many users due to high maintenance costs, varying load patterns, and time to production. Furthermore, ML inference workloads are stateless, and most of them are not extremely latency-sensitive. For example, tasks such as fake review removal, abusive language detection, tweet classification, image tagging, and free-tier chatbots do not require real-time inference. All these characteristics make serverless platforms a good fit for deployment. In this work, we identify the bottlenecks involved in hosting these inference jobs on serverless platforms and optimize serverless for better performance and resource utilization. Specifically, we identify model loading and model memory duplication as major bottlenecks in serverless inference. To address these problems, we propose a new approach that rethinks the way we serve FaaS requests. To support this design, we employ a hybrid scaling approach to implement the autoscaling feature of serverless platforms.
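
To make the model-loading bottleneck concrete, the sketch below shows the common warm-container caching pattern on FaaS platforms: the model is loaded once per container on first invocation and reused by later requests. This is a minimal illustration, not the approach proposed in this work; the handler and event names are hypothetical and PyTorch is assumed. Note that this pattern only amortizes loading within a single container, and each container still holds its own copy of the weights, which is exactly the memory duplication problem the abstract identifies.

    import torch

    _MODEL = None  # populated once per warm container, reused across invocations

    def _load_model():
        """Load (or reuse) the model; the expensive path runs once per container."""
        global _MODEL
        if _MODEL is None:
            # Placeholder loader: a real deployment would deserialize trained
            # weights from object storage or a local snapshot here.
            _MODEL = torch.nn.Linear(768, 2)  # stand-in for a real classifier
            _MODEL.eval()
        return _MODEL

    def handler(event):
        """Hypothetical FaaS entry point: classify the request payload."""
        model = _load_model()
        features = torch.tensor(event["features"], dtype=torch.float32)
        with torch.no_grad():
            logits = model(features)
        return {"label": int(logits.argmax().item())}

On a cold start, _load_model() pays the full loading cost; on subsequent invocations routed to the same warm container, the cached _MODEL is returned immediately.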

Keywords

Serverless, FaaS, Machine Learning Inference, Model Serving, Container
