Learning with Constraint-Based Weak Supervision
Arachie, Chidubem Gibson
MetadataShow full item record
Recent adaptations of machine learning models in many businesses has underscored the need for quality training data. Typically, training supervised machine learning systems involves using large amounts of human-annotated data. Labeling data is expensive and can be a limiting factor in using machine learning models. To enable continued integration of machine learning systems in businesses and also easy access by users, researchers have proposed several alternatives to supervised learning. Weak supervision is one such alternative. Weak supervision or weakly supervised learning involves using noisy labels (weak signals of the data) from multiple sources to train machine learning systems. A weak supervision model aggregates multiple noisy label sources called weak signals in order to produce probabilistic labels for the data. The main allure of weak supervision is that it provides a cheap yet effective substitute for supervised learning without need for labeled data. The key challenge in training weakly supervised machine learning models is that the weak supervision leaves ambiguity about the possible true labelings of the data. In this dissertation, we aim to address the challenge associated with training weakly supervised learning models by developing new weak supervision methods. Our work focuses on learning with constraint-based weak supervision algorithms. Firstly, we will propose an adversarial labeling approach for weak supervision. In this method, the adversary chooses the labels for the data and a model learns by minimising the error made by the adversarial model. Secondly, we will propose a simple constrained based approach that minimises a quadratic objective function in order to solve for the labels of the data. Next we explain the notion of data consistency for weak supervision and propose a data consistent method for weakly supervised learning. This approach combines weak supervision labels with features of the training data to make the learned labels consistent with the data. Lastly, we use this data consistent approach to propose a general approach for improving the performance of weak supervision models. In this method, we combine weak supervision with active learning in order to generate a model that outperforms each individual approach using only a handful of labeled data. For each algorithm we propose, we report extensive empirical validation of it by testing it on standard text and image classification datasets. We compare each approach against baseline and state-of-the-art methods and show that in most cases we match or outperform the methods we compare against. We report significant gains of our method on both binary and multi-class classification tasks.
General Audience Abstract
Machine learning models learn to make predictions from data. In supervised learning, a machine learning model is fed data and corresponding labels for the data so that the model can learn to predict labels for new unseen test data. Curation of large fully supervised datasets is expensive and time consuming since it involves subject matter experts providing labels for each individual data example. The cost of collecting labels has become one of the major roadblocks for training machine learning models. An alternative to supervised training of machine learning models is weak supervision. Weak supervision or weakly supervised learning trains with cheap, and easy to define signals that noisily label the data. We refer to these signals as weak signals. A weak supervision model combines various weak signals to produce training labels for the data. The key challenge in weak supervision is how to combine the different weak signals while navigating misleading correlations in their errors. In this dissertation, we propose several algorithms for weakly supervised learning. We classify our methods as constraint-based weak supervision since weak supervision is provided as constraints to our algorithms. We use experiments on different text and image classification datasets to show that our methods are effective and outperform competing methods that we compare against. Lastly, we propose a general framework for improving the performance of weak supervision models by incorporating a few labeled data. With this method we are able to close the gap to supervised learning without the need for labeling all the data examples.
- Doctoral Dissertations