Scalable Systems for Machine Learning
Abstract
Federated Learning (FL) enables collaborative model training without centralized data collection, thereby preserving data privacy and reducing data transfer costs. However, deploying FL in resource-constrained distributed environments like Edge and IoT applications introduces significant challenges related to cost, scalability, and efficiency. Traditional cloud-based FL aggregator solutions are resource-inefficient and expensive when applied at the Edge, leading to low scalability and high latency. Additionally, client-side resource heterogeneity results in issues such as stragglers, dropouts, and performance variations, complicating effective client participation. This thesis explores these challenges and presents methodologies and frameworks that enhance the efficiency and scalability of FL systems in resource-constrained environments. First, we present an adaptive FL aggregator designed specifically for Edge environments, which enables users to manage the trade-off between cost and efficiency. This adaptive aggregator addresses the inefficiencies of cloud-based solutions by improving scalability and reducing latency. Second, we develop FLOAT, a framework that enhances FL client resource awareness by dynamically optimizing resource utilization to meet training deadlines and by mitigating stragglers and dropouts through various optimization techniques. FLOAT employs multi-objective Reinforcement Learning with Human Feedback (RLHF) to automate the selection and configuration of these techniques, tailoring them to individual client resource conditions. Third, we design IP-FL, which treats incentivization and personalization in FL as interrelated challenges and solves them with an incentive mechanism that fosters personalized learning. IP-FL allows clients to indicate their cluster membership preferences based on data distribution and incentive-driven feedback without involving the aggregator, thereby preserving privacy.
This approach enhances the appeal of personalized models for self-aware clients with high-quality data, leading to their active and consistent participation. Lastly, we propose FLStore, a serverless framework for efficient FL non-training workloads and storage. FLStore unifies the data and compute planes on a serverless cache, enabling locality-aware execution via tailored caching policies to reduce latency and costs compared to cloud-based in-memory and object stores. FLStore integrates seamlessly with existing FL frameworks with minimal modifications, while also being fault-tolerant and highly scalable. Our work aims to contribute toward the development of efficient and scalable machine learning systems suitable for widespread deployment in Edge and IoT applications, addressing the critical challenges of cost, scalability, and efficiency in resource-constrained distributed learning environments.
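To make the aggregator's role concrete, the core computation an FL aggregator performs each round is a weighted average of client updates (FedAvg-style). The sketch below is a generic illustration, not the thesis's adaptive aggregator; the function name and data layout are assumptions.

```python
import numpy as np

def fedavg(client_updates):
    """Aggregate client model weights, weighted by local sample counts.

    client_updates: list of (weights: np.ndarray, num_samples: int) pairs.
    Returns the sample-weighted average of the client weight vectors.
    """
    total = sum(n for _, n in client_updates)
    return sum(w * (n / total) for w, n in client_updates)

# Example: three clients holding 10, 30, and 60 local samples
updates = [(np.array([1.0, 2.0]), 10),
           (np.array([3.0, 4.0]), 30),
           (np.array([5.0, 6.0]), 60)]
global_w = fedavg(updates)  # data-rich clients pull the average toward them
```

An Edge-resident aggregator runs exactly this reduction, so its placement and scaling (rather than the arithmetic itself) determine cost and latency.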
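IP-FL's client-side cluster preference can be pictured as each client comparing its local label distribution against candidate cluster distributions and choosing the closest match, with no data leaving the device. This is a hypothetical sketch under an L1-distance criterion; IP-FL's actual preference signal and metric may differ.

```python
import numpy as np

def pick_cluster(local_dist, cluster_dists):
    """Choose the cluster whose label distribution is nearest the client's.

    local_dist: the client's own label distribution (sums to 1).
    cluster_dists: candidate cluster label distributions.
    Returns the index of the closest cluster by L1 distance.
    """
    dists = [np.abs(local_dist - c).sum() for c in cluster_dists]
    return int(np.argmin(dists))

# A client skewed toward class 0 prefers the similarly skewed cluster
local = np.array([0.7, 0.3])
clusters = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]
choice = pick_cluster(local, clusters)
```

Because the comparison happens locally, the aggregator only ever sees the chosen cluster index, not the distribution that produced it.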
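The locality-aware execution FLStore describes, running a non-training workload where the cached artifact already resides instead of fetching it from remote storage, can be sketched with a small LRU cache. The class and method names here are illustrative assumptions, not FLStore's API.

```python
from collections import OrderedDict

class LocalityAwareCache:
    """Illustrative LRU cache that co-locates compute with cached FL artifacts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # key -> cached artifact (e.g., a round's weights)

    def put(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)          # mark as most recently used
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used entry

    def run(self, key, fn, loader):
        """Run fn on the artifact in place if cached; otherwise load, cache, run."""
        if key in self.store:
            self.store.move_to_end(key)
            return fn(self.store[key])       # fast path: no remote fetch
        value = loader(key)                  # slow path: e.g., object-store read
        self.put(key, value)
        return fn(value)

# A cached round is processed without touching the (stub) remote loader
cache = LocalityAwareCache(capacity=2)
cache.put("round-1", [1, 2, 3])
result = cache.run("round-1", sum, loader=lambda k: [])
```

The cost and latency savings come from the fast path: repeated non-training workloads (debugging, auditing, metric computation) over the same rounds never re-pay the remote-storage round trip.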