GPU-based Private Information Retrieval for On-Device Machine Learning Inference

dc.contributor.author: Lam, Maximilian
dc.contributor.author: Johnson, Jeff
dc.contributor.author: Xiong, Wenjie
dc.contributor.author: Maeng, Kiwan
dc.contributor.author: Gupta, Udit
dc.contributor.author: Li, Yang
dc.contributor.author: Lai, Liangzhen
dc.contributor.author: Leontiadis, Ilias
dc.contributor.author: Rhu, Minsoo
dc.contributor.author: Lee, Hsien-Hsin S.
dc.contributor.author: Reddi, Vijay Janapa
dc.contributor.author: Wei, Gu-Yeon
dc.contributor.author: Brooks, David
dc.contributor.author: Suh, Edward
dc.date.accessioned: 2024-05-02T12:35:30Z
dc.date.available: 2024-05-02T12:35:30Z
dc.date.issued: 2024-04-27
dc.date.updated: 2024-05-01T07:49:05Z
dc.description.abstract: On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables each on the order of 1-10 GBs of data, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over 5× additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second, a >100× throughput improvement over a CPU-based baseline, while maintaining model accuracy.
dc.description.version: Published version
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1145/3617232.3624855
dc.identifier.uri: https://hdl.handle.net/10919/118736
dc.language.iso: en
dc.publisher: ACM
dc.rights: In Copyright
dc.rights.holder: The author(s)
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.title: GPU-based Private Information Retrieval for On-Device Machine Learning Inference
dc.type: Article - Refereed
dc.type.dcmitype: Text
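
The abstract above describes privately fetching embedding-table rows with private information retrieval (PIR). As a rough illustration only, and not the paper's GPU-accelerated protocol or its PIR-ML co-design, the sketch below shows the simplest flavor of the idea: a two-server, additive-secret-sharing PIR lookup over a small integer-quantized toy table. The modulus, table sizes, and function names are assumptions made for this sketch.

# Toy sketch only: two-server additive-secret-sharing PIR for one embedding row.
# This is NOT the scheme from the paper; it only illustrates how a client can
# fetch table[index] without either (non-colluding) server learning the index.
import numpy as np

P = 2_147_483_647                      # prime modulus for the additive shares (assumed)
rng = np.random.default_rng(0)

num_rows, dim = 1024, 16               # toy integer-quantized embedding table,
table = rng.integers(0, 1000, size=(num_rows, dim), dtype=np.int64)  # held by both servers

def client_make_queries(index, n):
    # Secret-share a one-hot selection vector; each share alone is uniformly random.
    one_hot = np.zeros(n, dtype=np.int64)
    one_hot[index] = 1
    share1 = rng.integers(0, P, size=n, dtype=np.int64)
    share0 = (one_hot - share1) % P
    return share0, share1

def server_answer(query_share, table):
    # Each server computes a (mod-P) vector-matrix product with its query share.
    return (query_share @ table) % P

def client_reconstruct(ans0, ans1):
    # Adding the two answers cancels the shared randomness, leaving the selected row.
    return (ans0 + ans1) % P

q0, q1 = client_make_queries(index=42, n=num_rows)
row = client_reconstruct(server_answer(q0, table), server_answer(q1, table))
assert np.array_equal(row, table[42])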

Files

Original bundle
Name: 3617232.3624855.pdf
Size: 41.64 MB
Format: Adobe Portable Document Format
Description: Published version

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission