SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

dc.contributor.author: Khan, Redwan
dc.contributor.author: Yazdani, Ahmad
dc.contributor.author: Fu, Yuqi
dc.contributor.author: Paul, Arnab
dc.contributor.author: Ji, Bo
dc.contributor.author: Jian, Xun
dc.contributor.author: Cheng, Yue
dc.contributor.author: Butt, Ali
dc.date.accessioned: 2024-02-19T14:22:29Z
dc.date.available: 2024-02-19T14:22:29Z
dc.date.issued: 2023
dc.description.abstract: Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose new challenges for storage system design. DLT is I/O intensive since data samples need to be fetched continuously from remote storage. Accelerators such as GPUs have been extensively used to support these applications. As accelerators become more powerful and more data-hungry, the I/O performance lags behind. This creates a crucial performance bottleneck, especially in distributed DLT. At the same time, the exponentially growing dataset sizes make it impossible to store these datasets entirely in memory. While today's DLT frameworks typically use a random sampling policy that treats all samples equally, recent findings indicate that not all samples are equally important: different data samples contribute differently towards improving the accuracy of a model. This observation creates an opportunity for DLT I/O optimizations by exploiting the data locality enabled by importance sampling. To this end, we design and implement SHADE, a new DLT-aware caching system that detects fine-grained importance variations at the per-sample level and leverages the variance to make informed caching decisions for a distributed DLT job. SHADE adopts a novel, rank-based approach that captures the relative importance of data samples across different minibatches. SHADE then dynamically updates the importance scores of all samples during training. With these techniques, SHADE significantly improves the cache hit ratio of the DLT job and, thus, the job's training performance. Evaluation with representative computer vision (CV) models shows that SHADE, with a small cache, improves the cache hit ratio by up to 4.5× compared to the LRU caching policy.
dc.description.version: Accepted version
dc.format.extent: Pages 135-151
dc.format.extent: 17 page(s)
dc.format.mimetype: application/pdf
dc.identifier.orcid: Ji, Bo [0000-0003-0149-7509]
dc.identifier.orcid: Jian, Xun [0000-0002-7120-7426]
dc.identifier.orcid: Butt, Ali [0000-0002-0871-7263]
dc.identifier.uri: https://hdl.handle.net/10919/118017
dc.language.iso: en
dc.publisher: USENIX Association
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.title: SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
dc.title.serial: USENIX FAST 2023
dc.type: Article - Refereed
dc.type.dcmitype: Text
dc.type.other: Article
dcterms.dateAccepted: 2022-12-09
pubs.organisational-group: /Virginia Tech
pubs.organisational-group: /Virginia Tech/Engineering
pubs.organisational-group: /Virginia Tech/Engineering/Computer Science
pubs.organisational-group: /Virginia Tech/All T&R Faculty
pubs.organisational-group: /Virginia Tech/Engineering/COE T&R Faculty
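
The abstract above sketches SHADE's mechanism: per-sample importance scores, derived from loss ranks within each minibatch and updated dynamically during training, drive cache admission and eviction. As a rough illustration of that idea only (a minimal sketch, not SHADE's actual code: the class ImportanceCache, its method names, and the normalized-rank scoring below are assumptions made for this example), in Python:

# Minimal sketch of importance-aware caching in the spirit of SHADE.
# Hypothetical names and scoring; not the SHADE artifact's API.
from typing import Dict, Hashable


class ImportanceCache:
    """Fixed-capacity cache that evicts the lowest-importance sample.

    Scores are rank-derived: within each minibatch, samples are ranked
    by loss, and a sample's score is its normalized rank (loosely
    following the rank-based relative-importance idea in the abstract).
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: Dict[Hashable, bytes] = {}
        self.score: Dict[Hashable, float] = {}

    def update_scores(self, batch_losses: Dict[Hashable, float]) -> None:
        # Rank samples within the minibatch by loss; higher loss -> higher rank.
        ranked = sorted(batch_losses, key=batch_losses.get)
        denom = max(len(ranked) - 1, 1)
        for rank, key in enumerate(ranked):
            self.score[key] = rank / denom  # normalized rank in [0, 1]

    def put(self, key: Hashable, value: bytes) -> None:
        if key in self.data:
            self.data[key] = value
            return
        if len(self.data) >= self.capacity:
            # Candidate victim: the cached sample with the lowest score.
            victim = min(self.data, key=lambda k: self.score.get(k, 0.0))
            # Admit the newcomer only if it is at least as important.
            if self.score.get(key, 0.0) < self.score.get(victim, 0.0):
                return
            del self.data[victim]
        self.data[key] = value

    def get(self, key: Hashable):
        return self.data.get(key)


if __name__ == "__main__":
    cache = ImportanceCache(capacity=2)
    # Hypothetical per-sample losses from one minibatch.
    cache.update_scores({"s1": 0.2, "s2": 1.5, "s3": 0.9})
    for k in ("s1", "s2", "s3"):
        cache.put(k, b"sample-bytes")
    print(sorted(cache.data))  # keeps the two highest-scoring samples: ['s2', 's3']

In a real distributed DLT job, batch_losses would come from the training loop's per-sample loss values and put/get would front fetches from remote storage; SHADE itself additionally coordinates caching decisions across workers, which this single-node sketch omits.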

Files

Original bundle
Name: fast23-khan.pdf
Size: 856.85 KB
Format: Adobe Portable Document Format
Description: Published version

License bundle
Name: license.txt
Size: 1.5 KB
Format: Plain Text