GPU-based Private Information Retrieval for On-Device Machine Learning Inference
dc.contributor.author | Lam, Maximilian | en |
dc.contributor.author | Johnson, Jeff | en |
dc.contributor.author | Xiong, Wenjie | en |
dc.contributor.author | Maeng, Kiwan | en |
dc.contributor.author | Gupta, Udit | en |
dc.contributor.author | Li, Yang | en |
dc.contributor.author | Lai, Liangzhen | en |
dc.contributor.author | Leontiadis, Ilias | en |
dc.contributor.author | Rhu, Minsoo | en |
dc.contributor.author | Lee, Hsien-Hsin S. | en |
dc.contributor.author | Reddi, Vijay Janapa | en |
dc.contributor.author | Wei, Gu-Yeon | en |
dc.contributor.author | Brooks, David | en |
dc.contributor.author | Suh, Edward | en |
dc.date.accessioned | 2024-05-02T12:35:30Z | en |
dc.date.available | 2024-05-02T12:35:30Z | en |
dc.date.issued | 2024-04-27 | en |
dc.date.updated | 2024-05-01T07:49:05Z | en |
dc.description.abstract | On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing it to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1–10 GB, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to use directly for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an additional throughput improvement of over 5× at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second—a >100× throughput improvement over a CPU-based baseline—while maintaining model accuracy. | en |
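Note: the abstract relies on the notion of private information retrieval (PIR) for embedding lookups. As a rough illustration of the underlying idea only — a toy two-server XOR-based PIR lookup of a single embedding row, not the paper's GPU-accelerated construction — the following Python sketch may help; the table shape, dtype, and function names are all hypothetical.

    import numpy as np

    # Illustrative two-server XOR-based PIR lookup of one embedding row.
    # This is NOT the paper's GPU-accelerated scheme; sizes and names here
    # are hypothetical, chosen only to show the basic PIR idea.

    NUM_ROWS, EMB_DIM = 1024, 64          # hypothetical embedding table shape
    rng = np.random.default_rng(0)

    # Both (non-colluding) servers hold the same table of integer embeddings.
    table = rng.integers(0, 2**16, size=(NUM_ROWS, EMB_DIM), dtype=np.uint64)

    def client_queries(index: int):
        """Split the one-hot selection vector into two XOR shares.
        Each share alone is a uniformly random bit vector, so neither
        server learns which row the client wants."""
        share_a = rng.integers(0, 2, size=NUM_ROWS, dtype=np.uint8)
        share_b = share_a.copy()
        share_b[index] ^= 1               # shares differ only at the target row
        return share_a, share_b

    def server_answer(share):
        """Each server XORs together the rows selected by its share."""
        answer = np.zeros(EMB_DIM, dtype=np.uint64)
        for row, bit in zip(table, share):
            if bit:
                answer ^= row
        return answer

    def client_reconstruct(ans_a, ans_b):
        """Rows selected by both shares cancel; only the target row remains."""
        return ans_a ^ ans_b

    idx = 42
    qa, qb = client_queries(idx)
    embedding = client_reconstruct(server_answer(qa), server_answer(qb))
    assert np.array_equal(embedding, table[idx])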
dc.description.version | Published version | en |
dc.format.mimetype | application/pdf | en |
dc.identifier.doi | https://doi.org/10.1145/3617232.3624855 | en |
dc.identifier.uri | https://hdl.handle.net/10919/118736 | en |
dc.language.iso | en | en |
dc.publisher | ACM | en |
dc.rights | In Copyright | en |
dc.rights.holder | The author(s) | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.title | GPU-based Private Information Retrieval for On-Device Machine Learning Inference | en |
dc.type | Article - Refereed | en |
dc.type.dcmitype | Text | en |