Hierarchical Bayesian Dataset Selection
Abstract
Despite the profound impact of deep learning across various domains, supervised model training critically depends on access to large, high-quality datasets, which are often challenging to identify. To address this, we introduce Hierarchical Bayesian Dataset Selection (HBDS), the first dataset selection algorithm built on hierarchical Bayesian modeling and designed for collaborative data-sharing ecosystems. The method decomposes the contributions of dataset groups and of individual datasets to local model performance, using Bayesian updates computed from small data samples. Experiments on two benchmark datasets show that HBDS is not only computationally lightweight but also more interpretable than existing data selection methods: its learned posterior distributions expose the relationships among candidate datasets. HBDS outperforms traditional non-hierarchical methods by correctly identifying all relevant datasets and reaching optimal accuracy in fewer computational steps, even when initial model accuracy is low. Specifically, HBDS surpasses its non-hierarchical counterpart by 1.8% on DIGIT-FIVE and 0.7% on DOMAINNET on average, and in resource-constrained settings it achieves 6.9% higher accuracy. These results confirm HBDS's effectiveness in identifying datasets that improve the accuracy and efficiency of deep learning models when collaborative data utilization is essential.
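To make the hierarchical decomposition concrete, the sketch below illustrates one way such Bayesian updates could be organized: observed accuracy gains from small samples of each candidate dataset are pooled at the group level and then refined per dataset with conjugate Normal-Normal updates. The variable names, the Gaussian likelihood, and the specific variance parameters are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

# Minimal sketch of hierarchical Bayesian dataset scoring (assumed model):
#   gain y_ij ~ N(theta_ij, sigma2)   observed gain from a small sample of dataset j in group i
#   theta_ij ~ N(mu_i, tau2)          dataset-level effect, centered on its group effect
#   mu_i     ~ N(mu0, kappa2)         group-level effect

def posterior_normal(prior_mean, prior_var, obs, obs_var):
    """Conjugate Normal-Normal posterior for an unknown mean given observations."""
    n = len(obs)
    if n == 0:
        return prior_mean, prior_var
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(obs) / obs_var)
    return post_mean, post_var

def hierarchical_update(groups, mu0=0.0, kappa2=1.0, tau2=0.25, sigma2=0.05):
    """One hierarchical update pass; groups maps group id -> {dataset id: list of gains}."""
    scores = {}
    for g, datasets in groups.items():
        # Group-level posterior from all gains observed within the group.
        all_gains = np.concatenate([np.asarray(v, dtype=float) for v in datasets.values()])
        mu_g, _ = posterior_normal(mu0, kappa2, all_gains, tau2 + sigma2)
        for d, gains in datasets.items():
            # Dataset-level posterior, shrunk toward the group posterior mean.
            scores[(g, d)] = posterior_normal(mu_g, tau2, np.asarray(gains, dtype=float), sigma2)
    return scores

# Hypothetical example: two domain groups with per-dataset accuracy gains from small samples.
gains = {
    "digits": {"mnist": [0.04, 0.05], "svhn": [0.01]},
    "sketch": {"quickdraw": [-0.01, 0.00]},
}
for (g, d), (m, v) in sorted(hierarchical_update(gains).items(), key=lambda kv: -kv[1][0]):
    print(f"{g}/{d}: posterior gain {m:.3f} ± {np.sqrt(v):.3f}")
```

Under this assumed formulation, datasets with few observations borrow strength from their group, which is what lets selection proceed from small samples; the actual HBDS priors, likelihood, and update schedule are specified in the full report.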