GPU-based Private Information Retrieval for On-Device Machine Learning Inference

dc.contributor.author: Lam, Maximilian
dc.contributor.author: Johnson, Jeff
dc.contributor.author: Xiong, Wenjie
dc.contributor.author: Maeng, Kiwan
dc.contributor.author: Gupta, Udit
dc.contributor.author: Li, Yang
dc.contributor.author: Lai, Liangzhen
dc.contributor.author: Leontiadis, Ilias
dc.contributor.author: Rhu, Minsoo
dc.contributor.author: Lee, Hsien-Hsin S.
dc.contributor.author: Reddi, Vijay Janapa
dc.contributor.author: Wei, Gu-Yeon
dc.contributor.author: Brooks, David
dc.contributor.author: Suh, Edward
dc.date.accessioned: 2024-05-02T12:35:30Z
dc.date.available: 2024-05-02T12:35:30Z
dc.date.issued: 2024-04-27
dc.date.updated: 2024-05-01T07:49:05Z
dc.description.abstract: On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing it to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1–10 GB, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to use directly for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over 5× additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second (a more than 100× throughput improvement over a CPU-based baseline) while maintaining model accuracy.
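
As a concrete illustration of the private retrieval step the abstract describes, the minimal sketch below performs a toy two-server XOR-based PIR lookup of a single embedding-table row. It is for intuition only and is not the paper's scheme: the two non-colluding replica servers, the table sizes, and the helper names client_query and server_answer are all illustrative assumptions, and the paper's actual PIR construction and GPU kernels are not reproduced here. It does show why PIR is GPU-friendly: each server's work is a bulk linear scan over the whole table, a highly data-parallel workload.

    # Toy two-server XOR-based PIR sketch (illustrative only; not the paper's
    # actual scheme). Assumes two non-colluding servers, each holding a full
    # replica of the embedding table; names and sizes are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)

    NUM_ROWS, DIM = 1 << 16, 64  # toy embedding table: 65,536 rows x 64 dims
    table = rng.integers(0, 256, size=(NUM_ROWS, DIM), dtype=np.uint8)

    def client_query(index):
        # Split a one-hot selector for `index` into two XOR shares. Each share
        # alone is a uniformly random bit vector, so neither server learns the
        # queried index (as long as the servers do not collude).
        share_a = rng.integers(0, 2, size=NUM_ROWS, dtype=np.uint8)
        share_b = share_a.copy()
        share_b[index] ^= 1  # the two shares XOR to the one-hot vector
        return share_a, share_b

    def server_answer(db, share):
        # XOR together the rows selected by the share: a GF(2) matrix-vector
        # product, i.e. a bulk linear scan over the entire table.
        return np.bitwise_xor.reduce(db[share.astype(bool)], axis=0)

    idx = 12345
    qa, qb = client_query(idx)
    # Rows selected by both servers cancel under XOR; only row `idx` survives.
    row = server_answer(table, qa) ^ server_answer(table, qb)
    assert np.array_equal(row, table[idx])  # embedding retrieved privately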
dc.description.version: Published version
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1145/3617232.3624855
dc.identifier.uri: https://hdl.handle.net/10919/118736
dc.language.iso: en
dc.publisher: ACM
dc.rights: In Copyright
dc.rights.holder: The author(s)
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.title: GPU-based Private Information Retrieval for On-Device Machine Learning Inference
dc.type: Article - Refereed
dc.type.dcmitype: Text

Files

Original bundle
Name: 3617232.3624855.pdf
Size: 41.64 MB
Format: Adobe Portable Document Format
Description: Published version

License bundle
Name: license.txt
Size: 1.5 KB
Description: Item-specific license agreed upon to submission