GPU-based Private Information Retrieval for On-Device Machine Learning Inference
dc.contributor.author | Lam, Maximilian | en |
dc.contributor.author | Johnson, Jeff | en |
dc.contributor.author | Xiong, Wenjie | en |
dc.contributor.author | Maeng, Kiwan | en |
dc.contributor.author | Gupta, Udit | en |
dc.contributor.author | Li, Yang | en |
dc.contributor.author | Lai, Liangzhen | en |
dc.contributor.author | Leontiadis, Ilias | en |
dc.contributor.author | Rhu, Minsoo | en |
dc.contributor.author | Lee, Hsien-Hsin S. | en |
dc.contributor.author | Reddi, Vijay Janapa | en |
dc.contributor.author | Wei, Gu-Yeon | en |
dc.contributor.author | Brooks, David | en |
dc.contributor.author | Suh, Edward | en |
dc.date.accessioned | 2024-05-02T12:35:30Z | en |
dc.date.available | 2024-05-02T12:35:30Z | en |
dc.date.issued | 2024-04-27 | en |
dc.date.updated | 2024-05-01T07:49:05Z | en |
dc.description.abstract | On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing it to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1–10 GB, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to use directly for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an additional throughput improvement of over 5× at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second—a >100× throughput improvement over a CPU-based baseline—while maintaining model accuracy. | en |
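Note: the abstract relies on the notion of private information retrieval (PIR) for embedding lookups. As a rough illustration of the underlying idea only — a toy two-server XOR-based PIR lookup of a single embedding row, not the paper's GPU-accelerated construction — the following Python sketch may help; the table shape, dtype, and function names are all hypothetical.

    import numpy as np

    # Illustrative two-server XOR-based PIR lookup of one embedding row.
    # This is NOT the paper's GPU-accelerated scheme; sizes and names here
    # are hypothetical, chosen only to show the basic PIR idea.

    NUM_ROWS, EMB_DIM = 1024, 64          # hypothetical embedding table shape
    rng = np.random.default_rng(0)

    # Both (non-colluding) servers hold the same table of integer embeddings.
    table = rng.integers(0, 2**16, size=(NUM_ROWS, EMB_DIM), dtype=np.uint64)

    def client_queries(index: int):
        """Split the one-hot selection vector into two XOR shares.
        Each share alone is a uniformly random bit vector, so neither
        server learns which row the client wants."""
        share_a = rng.integers(0, 2, size=NUM_ROWS, dtype=np.uint8)
        share_b = share_a.copy()
        share_b[index] ^= 1               # shares differ only at the target row
        return share_a, share_b

    def server_answer(share):
        """Each server XORs together the rows selected by its share."""
        answer = np.zeros(EMB_DIM, dtype=np.uint64)
        for row, bit in zip(table, share):
            if bit:
                answer ^= row
        return answer

    def client_reconstruct(ans_a, ans_b):
        """Rows selected by both shares cancel; only the target row remains."""
        return ans_a ^ ans_b

    idx = 42
    qa, qb = client_queries(idx)
    embedding = client_reconstruct(server_answer(qa), server_answer(qb))
    assert np.array_equal(embedding, table[idx])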
dc.description.version | Published version | en |
dc.format.mimetype | application/pdf | en |
dc.identifier.doi | https://doi.org/10.1145/3617232.3624855 | en |
dc.identifier.uri | https://hdl.handle.net/10919/118736 | en |
dc.language.iso | en | en |
dc.publisher | ACM | en |
dc.rights | In Copyright | en |
dc.rights.holder | The author(s) | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.title | GPU-based Private Information Retrieval for On-Device Machine Learning Inference | en |
dc.type | Article - Refereed | en |
dc.type.dcmitype | Text | en |