Robust Machine Learning Against Faults in Micro-Controllers and Stragglers in Distributed Training on the Cloud

Date

2025-05-23

Publisher

Virginia Tech

Abstract

Machine learning has become a critical part of many industries over the past decade, and both optimally deploying ML models onto smaller devices and efficiently training more powerful ML models in parallel across different distributed system topologies have drawn interest. This thesis studies the robustness of ML models in two scenarios: when deployed on portable micro-controller units and when trained on distributed GPUs. It first investigates the robustness of ML inference on micro-controllers, showcasing the vulnerabilities of Tiny ML models through voltage-based fault injection attacks, and provides a comprehensive guide to quantizing ML models for embedded-system deployment. Experimental results show that it is possible to force misclassifications in model inference outputs, and defenses are suggested against such physical vulnerabilities of a micro-controller running Tiny ML models. The thesis then considers faults in distributed training of ML models on the cloud and discusses the effects and risks of stragglers. It applies two linear coding algorithms, gradient coding and compression coding, to make distributed ML training fault tolerant, and shows that such linear coding algorithms can be applied on GPUs. Experiments show that fault-tolerant linear coding on GPUs does provide tolerance to a bounded number of stragglers, at the cost of longer training time. Finally, the thesis discusses the possibility of applying linear coding algorithms to more complicated distributed training paradigms.
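The gradient coding approach mentioned above can be illustrated with a minimal sketch. This is not the thesis's implementation; it uses the classic cyclic encoding for 3 workers tolerating 1 straggler (after Tandon et al.), with toy NumPy gradients standing in for real per-partition gradients:

```python
import numpy as np

# Gradient coding sketch: n = 3 workers, tolerate s = 1 straggler.
# Row i of B holds the coefficients worker i applies to the partition
# gradients [g1, g2, g3] before sending one coded message to the master.
B = np.array([
    [0.5, 1.0,  0.0],   # worker 1 sends g1/2 + g2
    [0.0, 1.0, -1.0],   # worker 2 sends g2 - g3
    [0.5, 0.0,  1.0],   # worker 3 sends g1/2 + g3
])

rng = np.random.default_rng(0)
g = rng.normal(size=(3, 4))   # toy partition gradients of dimension 4
coded = B @ g                 # what each worker transmits

def decode(alive):
    """Recover sum(g) from the coded messages of any 2 surviving workers."""
    # Find decoding coefficients a with a^T B[alive] = [1, 1, 1],
    # so the combined message equals the full gradient sum.
    a, *_ = np.linalg.lstsq(B[alive].T, np.ones(3), rcond=None)
    return a @ coded[alive]

full = g.sum(axis=0)
for straggler in range(3):
    alive = [i for i in range(3) if i != straggler]
    assert np.allclose(decode(alive), full)  # any single straggler is tolerated
```

Each data partition is replicated across two workers, which is where the extra compute (and, at scale, training time) comes from: redundancy buys the ability to ignore the slowest worker in every iteration.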

Keywords

Robust ML, Tiny ML, Distributed Training, Coding Theory
