Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Field | Value | Language
dc.contributor.author | Yu, Menglu | en
dc.contributor.author | Liu, Jia | en
dc.contributor.author | Wu, Chuan | en
dc.contributor.author | Ji, Bo | en
dc.contributor.author | Bentley, Elizabeth | en
dc.date.accessioned | 2024-02-19T14:04:46Z | en
dc.date.available | 2024-02-19T14:04:46Z | en
dc.date.issued | 2021-08-13 | en
dc.description.abstract | Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a key question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and observations on the worker-parameter server locality configurations, we transform the problem into a mixed packing and covering integer program, which enables approximation algorithm design; iii) We propose a meticulously designed approximation algorithm based on randomized rounding and rigorously analyze its performance. Collectively, our results contribute to the state of the art of distributed ML system optimization and algorithm design. | en
dc.description.version | Published version | en
dc.format.extent | Pages 1951-1969 | en
dc.format.extent | 19 page(s) | en
dc.format.mimetype | application/pdf | en
dc.identifier.doi | https://doi.org/10.1109/TNSE.2021.3104513 | en
dc.identifier.eissn | 2327-4697 | en
dc.identifier.issn | 2327-4697 | en
dc.identifier.issue | 4 | en
dc.identifier.orcid | Ji, Bo [0000-0003-0149-7509] | en
dc.identifier.uri | https://hdl.handle.net/10919/118012 | en
dc.identifier.volume | 9 | en
dc.language.iso | en | en
dc.publisher | IEEE | en
dc.rights | Public Domain (U.S.) | en
dc.rights.uri | http://creativecommons.org/publicdomain/mark/1.0/ | en
dc.subject | Servers | en
dc.subject | Training | en
dc.subject | Optimization | en
dc.subject | Scheduling algorithms | en
dc.subject | Resource management | en
dc.subject | Approximation algorithms | en
dc.subject | Heuristic algorithms | en
dc.subject | Online resource scheduling | en
dc.subject | distributed machine learning | en
dc.subject | approximation algorithm | en
dc.title | Toward Efficient Online Scheduling for Distributed Machine Learning Systems | en
dc.title.serial | IEEE Transactions on Network Science and Engineering | en
dc.type | Article - Refereed | en
dc.type.dcmitype | Text | en
dc.type.other | Article | en
pubs.organisational-group | /Virginia Tech | en
pubs.organisational-group | /Virginia Tech/Engineering | en
pubs.organisational-group | /Virginia Tech/Engineering/Computer Science | en
pubs.organisational-group | /Virginia Tech/All T&R Faculty | en
pubs.organisational-group | /Virginia Tech/Engineering/COE T&R Faculty | en

Files

Original bundle
Name: Toward_Efficient_Online_Scheduling_for_Distributed_Machine_Learning_Systems.pdf
Size: 2.02 MB
Format: Adobe Portable Document Format
Description: Published version
License bundle
Name: license.txt
Size: 1.5 KB
Format: Plain Text
Description: