Robust Machine Learning Against Faults in Micro-Controllers and Stragglers in Distributed Training on the Cloud

dc.contributor.author: Nampally, Srilalith
dc.contributor.committeechair: Xiong, Wenjie
dc.contributor.committeechair: Matthews, Gretchen L.
dc.contributor.committeemember: Jin, Ming
dc.contributor.department: Electrical and Computer Engineering
dc.date.accessioned: 2025-05-24T08:02:09Z
dc.date.available: 2025-05-24T08:02:09Z
dc.date.issued: 2025-05-23
dc.description.abstract: Machine learning has become a critical part of many industries in the past decade. Optimally deploying ML models onto smaller devices, and efficiently training more powerful ML models in parallel across different distributed system topologies, have both drawn interest. This thesis studies the robustness of ML models in two scenarios: when deployed on portable micro-controller units and while being trained on distributed GPUs. The thesis first investigates the robustness of ML inference on micro-controllers. The vulnerabilities of TinyML models on micro-controllers are showcased using voltage-based fault injection attacks, and a comprehensive guide to quantizing ML models for embedded deployment is provided. Experimental results show that it is possible to force misclassifications of model inference outputs, and defenses are suggested for protecting against such physical vulnerabilities of a micro-controller running TinyML models. The thesis then considers faults in distributed training of ML models on the cloud and discusses the effects and risks of stragglers. It applies two linear coding algorithms, gradient coding and compression coding, to make distributed ML training fault tolerant, and shows that these algorithms can be applied to GPU-based training. The experiments show that fault-tolerant linear coding on GPUs tolerates a bounded number of stragglers at the cost of additional training time. The thesis finally discusses the possibility of applying linear coding algorithms to more complicated distributed training paradigms.
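As a concrete illustration of the quantization step the abstract refers to, the following is a minimal sketch of symmetric per-tensor post-training int8 quantization in Python with NumPy. It is not the thesis's actual deployment pipeline; the layer shape, random float32 weights, and scale rule are assumptions made for the example.

import numpy as np

# Minimal sketch: symmetric per-tensor post-training int8 quantization.
# The layer shape and random weights below are illustrative only.

def quantize_int8(w):
    # Map the largest-magnitude weight to +/-127 and round everything else.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float32 tensor for comparison or simulation.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(64, 32)).astype(np.float32)  # one dense layer's weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("storage:", q.nbytes, "bytes (int8) vs", w.nbytes, "bytes (float32)")
print("max absolute quantization error:", float(np.max(np.abs(w - w_hat))))

The int8 tensor occupies a quarter of the float32 storage, which is the kind of saving that makes deploying TinyML models within micro-controller memory budgets feasible.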
dc.description.abstractgeneral: Machine learning powers everything from your phone's voice assistant to smart home gadgets and various cloud services. This thesis studies two scenarios: running these models on tiny, battery-powered devices, and training them across many graphics processors on different systems. For ML inference on low-power devices, we show that attackers can deliberately inject faults into a device's power supply to trick a small "TinyML" model into making the wrong prediction, and we suggest a lightweight, on-chip defense that corrects these errors at the cost of higher memory usage. For training larger ML models on a cluster of GPUs, some machines inevitably lag behind ("stragglers"), slowing everything down; this slowdown can be very expensive when using high-end devices. By adding redundancy to the way computations are shared, we can design a protocol that tolerates these slowdowns without losing any accuracy and without restarting the training process. The experiments in this thesis confirm that fault injection on micro-controllers can force misclassifications, and that training across imperfect GPU networks can be made both reliable and efficient.
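To make the idea of "adding redundancy to the way computations are shared" concrete, below is a minimal sketch of gradient coding with three workers tolerating one straggler, written in Python with NumPy. The toy least-squares problem, the partition sizes, and the decoding via a least-squares solve are illustrative assumptions; the encoding matrix is one valid choice for n = 3 workers and s = 1 straggler, not the thesis's implementation.

import numpy as np

# Minimal sketch of gradient coding: n = 3 workers, s = 1 straggler.
# The toy regression problem and partition sizes are assumptions only.

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)  # current model parameters

parts = np.array_split(np.arange(30), 3)  # three data partitions

def partial_grad(k):
    # Gradient of 0.5 * ||X_k w - y_k||^2 on partition k.
    Xk, yk = X[parts[k]], y[parts[k]]
    return Xk.T @ (Xk @ w - yk)

# Each worker holds two partitions and sends one coded combination of
# their partial gradients:
B = np.array([
    [0.5, 1.0,  0.0],   # worker 0 sends g0/2 + g1
    [0.0, 1.0, -1.0],   # worker 1 sends g1 - g2
    [0.5, 0.0,  1.0],   # worker 2 sends g0/2 + g2
])
g = np.stack([partial_grad(k) for k in range(3)])  # true partial gradients
coded = B @ g                                      # messages the workers would send

# Suppose worker 1 straggles; decode from workers {0, 2} by finding a with
# a @ B[alive] = [1, 1, 1], so that a @ coded[alive] equals the full gradient.
alive = [0, 2]
a, *_ = np.linalg.lstsq(B[alive].T, np.ones(3), rcond=None)
recovered = a @ coded[alive]
assert np.allclose(recovered, g.sum(axis=0))
print("full gradient recovered despite one straggler")

Any two of the three coded messages suffice to reconstruct the exact full-batch gradient; the price is the redundant gradient computation each worker performs, which mirrors the extra training time the abstract reports for fault-tolerant linear coding on GPUs.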
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:44092
dc.identifier.uri: https://hdl.handle.net/10919/134212
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Robust ML
dc.subject: Tiny ML
dc.subject: Distributed Training
dc.subject: Coding Theory
dc.title: Robust Machine Learning Against Faults in Micro-Controllers and Stragglers in Distributed Training on the Cloud
dc.type: Thesis
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science

Files

Original bundle
Name: Nampally_S_T_2025.pdf
Size: 3.4 MB
Format: Adobe Portable Document Format
