Robust Machine Learning Against Faults in Micro-Controllers and Stragglers in Distributed Training on the Cloud

dc.contributor.author: Nampally, Srilalith
dc.contributor.committeechair: Xiong, Wenjie
dc.contributor.committeechair: Matthews, Gretchen L.
dc.contributor.committeemember: Jin, Ming
dc.contributor.department: Electrical and Computer Engineering
dc.date.accessioned: 2025-05-24T08:02:09Z
dc.date.available: 2025-05-24T08:02:09Z
dc.date.issued: 2025-05-23
dc.description.abstract: Machine learning has become a critical part of many industries in the past decade. Optimally deploying ML models onto smaller devices, and efficiently training more powerful ML models in parallel across different distributed system topologies, have both drawn interest. This thesis studies the robustness of ML models in two scenarios: when deployed on portable micro-controller units and while being trained on distributed GPUs. The thesis first investigates the robustness of ML inference on micro-controllers. The vulnerabilities of TinyML models on micro-controllers are showcased using voltage-based fault injection attacks, and a comprehensive guide to quantizing ML models for embedded deployment is provided. Experimental results show that it is possible to force misclassifications of model inference outputs, and defenses are suggested for protecting against such physical vulnerabilities of a micro-controller running TinyML models. The thesis then considers faults in distributed training of ML models on the cloud and discusses the effects and risks of stragglers. It applies two linear coding algorithms, gradient coding and compression coding, to make distributed ML training fault tolerant, and shows that these algorithms can be applied to GPU-based training. The experiments show that fault-tolerant linear coding on GPUs tolerates a bounded number of stragglers at the cost of additional training time. The thesis finally discusses the possibility of applying linear coding algorithms to more complicated distributed training paradigms.
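As a concrete illustration of the quantization step the abstract refers to, the following is a minimal sketch of symmetric per-tensor post-training int8 quantization in Python with NumPy. It is not the thesis's actual deployment pipeline; the layer shape, random float32 weights, and scale rule are assumptions made for the example.

import numpy as np

# Minimal sketch: symmetric per-tensor post-training int8 quantization.
# The layer shape and random weights below are illustrative only.

def quantize_int8(w):
    # Map the largest-magnitude weight to +/-127 and round everything else.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float32 tensor for comparison or simulation.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(64, 32)).astype(np.float32)  # one dense layer's weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("storage:", q.nbytes, "bytes (int8) vs", w.nbytes, "bytes (float32)")
print("max absolute quantization error:", float(np.max(np.abs(w - w_hat))))

The int8 tensor occupies a quarter of the float32 storage, which is the kind of saving that makes deploying TinyML models within micro-controller memory budgets feasible.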
dc.description.abstractgeneral: Machine learning powers everything from your phone's voice assistant to smart home gadgets and various cloud services. This thesis studies two scenarios: running these models on tiny, battery-powered devices, and training them across many graphics processors on different systems. For ML inference on low-power devices, we show that attackers can deliberately inject faults into a device's power supply to trick a small "TinyML" model into making the wrong prediction, and we suggest a lightweight, on-chip defense that corrects these errors at the cost of higher memory usage. For training larger ML models on a cluster of GPUs, some machines inevitably lag behind ("stragglers"), slowing everything down; this slowdown can be very expensive when using high-end devices. By adding redundancy to the way computations are shared, we can design a protocol that tolerates these slowdowns without losing any accuracy and without restarting the training process. The experiments in this thesis confirm that fault injection on micro-controllers can force misclassifications, and that training across imperfect GPU networks can be made both reliable and efficient.
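To make the idea of "adding redundancy to the way computations are shared" concrete, below is a minimal sketch of gradient coding with three workers tolerating one straggler, written in Python with NumPy. The toy least-squares problem, the partition sizes, and the decoding via a least-squares solve are illustrative assumptions; the encoding matrix is one valid choice for n = 3 workers and s = 1 straggler, not the thesis's implementation.

import numpy as np

# Minimal sketch of gradient coding: n = 3 workers, s = 1 straggler.
# The toy regression problem and partition sizes are assumptions only.

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)  # current model parameters

parts = np.array_split(np.arange(30), 3)  # three data partitions

def partial_grad(k):
    # Gradient of 0.5 * ||X_k w - y_k||^2 on partition k.
    Xk, yk = X[parts[k]], y[parts[k]]
    return Xk.T @ (Xk @ w - yk)

# Each worker holds two partitions and sends one coded combination of
# their partial gradients:
B = np.array([
    [0.5, 1.0,  0.0],   # worker 0 sends g0/2 + g1
    [0.0, 1.0, -1.0],   # worker 1 sends g1 - g2
    [0.5, 0.0,  1.0],   # worker 2 sends g0/2 + g2
])
g = np.stack([partial_grad(k) for k in range(3)])  # true partial gradients
coded = B @ g                                      # messages the workers would send

# Suppose worker 1 straggles; decode from workers {0, 2} by finding a with
# a @ B[alive] = [1, 1, 1], so that a @ coded[alive] equals the full gradient.
alive = [0, 2]
a, *_ = np.linalg.lstsq(B[alive].T, np.ones(3), rcond=None)
recovered = a @ coded[alive]
assert np.allclose(recovered, g.sum(axis=0))
print("full gradient recovered despite one straggler")

Any two of the three coded messages suffice to reconstruct the exact full-batch gradient; the price is the redundant gradient computation each worker performs, which mirrors the extra training time the abstract reports for fault-tolerant linear coding on GPUs.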
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:44092
dc.identifier.uri: https://hdl.handle.net/10919/134212
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Robust ML
dc.subject: Tiny ML
dc.subject: Distributed Training
dc.subject: Coding Theory
dc.title: Robust Machine Learning Against Faults in Micro-Controllers and Stragglers in Distributed Training on the Cloud
dc.type: Thesis
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science

Files

Original bundle
Name: Nampally_S_T_2025.pdf
Size: 3.4 MB
Format: Adobe Portable Document Format
