Label-Efficient Visual Understanding with Consistency Constraints

dc.contributor.author: Zou, Yuliang
dc.contributor.committeechair: Huang, Jia-Bin
dc.contributor.committeemember: Tokekar, Pratap
dc.contributor.committeemember: Abbott, A. Lynn
dc.contributor.committeemember: Dhillon, Harpreet Singh
dc.contributor.committeemember: Huang, Bert
dc.contributor.department: Electrical and Computer Engineering
dc.date.accessioned: 2022-05-25T08:00:21Z
dc.date.available: 2022-05-25T08:00:21Z
dc.date.issued: 2022-05-24
dc.description.abstract: Modern deep neural networks are proficient at solving a wide range of visual recognition and understanding tasks, as long as a sufficiently large labeled dataset is available at training time. However, progress on these tasks is limited by the amount of manual annotation available, and annotating visual data is usually time-consuming and error-prone, making human labeling difficult to scale. Fortunately, large-scale, diverse unlabeled visual data are easy to collect from the Internet, and large amounts of annotated synthetic visual data can be acquired from game engines with little effort. In this dissertation, we explore how to utilize unlabeled data and synthetic labeled data for various visual tasks, aiming to replace or reduce direct supervision from manual annotations. The key idea is to encourage deep neural networks to produce consistent predictions across different transformations (e.g., geometric, temporal, photometric). We organize the dissertation as follows. In Part I, we propose to use consistency across different geometric formulations and a cycle consistency over time to tackle low-level scene geometry perception tasks in a self-supervised learning setting. In Part II, we tackle high-level semantic understanding tasks in a semi-supervised learning setting, with the constraint that different augmented views of the same visual input maintain consistent semantic information. In Part III, we tackle the cross-domain image segmentation problem. By encouraging an adaptive segmentation model to output consistent results for a diverse set of strongly augmented synthetic data, the model learns to perform test-time adaptation on unseen target domains with a single forward pass, without model training or optimization at inference time.
dc.description.abstractgeneral: Recently, deep learning has emerged as one of the most powerful tools for solving various visual understanding tasks. However, the development of deep learning methods is significantly limited by the amount of manually labeled data, and annotating visual data is usually time-consuming and error-prone, making human labeling hard to scale. Fortunately, large-scale, diverse raw visual data are easy to collect from the Internet (e.g., from search engines, YouTube, or Instagram), and large amounts of annotated synthetic visual data can be acquired from game engines with little effort. In this dissertation, we explore how to utilize raw visual data and synthetic data for various visual tasks, aiming to replace or reduce direct supervision from manual annotations. The key idea is to encourage deep neural networks to produce consistent predictions for the same visual input across different transformations (e.g., geometric, temporal, photometric). We organize the dissertation as follows. In Part I, we propose using consistency across different geometric formulations and a forward-backward cycle consistency over time to tackle low-level scene geometry perception tasks, using unlabeled visual data only. In Part II, we tackle high-level semantic understanding tasks using a small amount of labeled data jointly with a large amount of unlabeled data, with the constraint that different augmented views of the same visual input maintain consistent semantic information. In Part III, we tackle the cross-domain image segmentation problem. By encouraging an adaptive segmentation model to output consistent results for a diverse set of strongly augmented synthetic data, the model learns to perform test-time adaptation on unseen target domains.
dc.description.degree: Doctor of Philosophy
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:34390
dc.identifier.uri: http://hdl.handle.net/10919/110313
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: Creative Commons Attribution-NonCommercial 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/
dc.subject: Label-Efficient
dc.subject: Consistency Regularization
dc.subject: Visual Understanding
dc.subject: Self-Supervised Learning
dc.subject: Semi-Supervised Learning
dc.subject: Pseudo Labeling
dc.subject: Test-Time Adaptation
dc.subject: BatchNorm Calibration
dc.subject: Cross-Domain Generalization
dc.title: Label-Efficient Visual Understanding with Consistency Constraints
dc.type: Dissertation
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: doctoral
thesis.degree.name: Doctor of Philosophy
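
The abstracts above describe the consistency constraint only in prose. As a concrete illustration of the semi-supervised setting of Part II, here is a minimal PyTorch sketch in the spirit of the "Consistency Regularization" and "Pseudo Labeling" keywords; it is not the dissertation's exact method, and model, weak_aug, strong_aug, and the confidence threshold tau are illustrative placeholders.

    import torch
    import torch.nn.functional as F

    def consistency_loss(model, images, weak_aug, strong_aug, tau=0.95):
        """Enforce consistent predictions across two augmented views of the
        same unlabeled images: pseudo-label the weakly augmented view, then
        train the strongly augmented view to match, keeping only confident
        pseudo-labels."""
        with torch.no_grad():
            probs = F.softmax(model(weak_aug(images)), dim=1)
            conf, pseudo = probs.max(dim=1)
            mask = (conf >= tau).float()  # confidence gate on pseudo-labels
        logits = model(strong_aug(images))
        loss = F.cross_entropy(logits, pseudo, reduction="none")
        return (mask * loss).mean()

The thresholded mask keeps the gradient signal only where the weak-view prediction is confident, which is the usual way such consistency objectives avoid reinforcing noisy pseudo-labels early in training.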
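The keywords also list "Test-Time Adaptation" and "BatchNorm Calibration." A hypothetical sketch of that general recipe follows: re-estimating BatchNorm running statistics on unlabeled target-domain batches with no gradient updates. The exact procedure in Part III may differ, and target_loader is an assumed data loader over target-domain images.

    import torch

    @torch.no_grad()
    def calibrate_batchnorm(model, target_loader, device="cuda"):
        """Re-estimate BatchNorm running statistics on unlabeled
        target-domain data; no parameters are trained."""
        for m in model.modules():
            if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
                m.reset_running_stats()
                m.momentum = None  # None => cumulative moving average
        model.train()  # BN accumulates batch statistics only in train mode
        for images in target_loader:
            model(images.to(device))
        model.eval()
        return model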

Files

Original bundle
Name: Zou_Y_D_2022.pdf
Size: 41.16 MB
Format: Adobe Portable Document Format