Label-Efficient Visual Understanding with Consistency Constraints
dc.contributor.author | Zou, Yuliang | en |
dc.contributor.committeechair | Huang, Jia-Bin | en |
dc.contributor.committeemember | Tokekar, Pratap | en |
dc.contributor.committeemember | Abbott, A. Lynn | en |
dc.contributor.committeemember | Dhillon, Harpreet Singh | en |
dc.contributor.committeemember | Huang, Bert | en |
dc.contributor.department | Electrical and Computer Engineering | en |
dc.date.accessioned | 2022-05-25T08:00:21Z | en |
dc.date.available | 2022-05-25T08:00:21Z | en |
dc.date.issued | 2022-05-24 | en |
dc.description.abstract | Modern deep neural networks are proficient at solving various visual recognition and understanding tasks, as long as a sufficiently large labeled dataset is available at training time. However, progress on these visual tasks is limited by the number of manual annotations available. Annotating visual data is usually time-consuming and error-prone, which makes it challenging to scale up human labeling for many visual tasks. Fortunately, it is easy to collect large-scale, diverse unlabeled visual data from the Internet, and we can effortlessly acquire a large amount of annotated synthetic visual data from game engines. In this dissertation, we explore how to utilize unlabeled data and synthetic labeled data for various visual tasks, aiming to replace or reduce direct supervision from manual annotations. The key idea is to encourage deep neural networks to produce consistent predictions across different transformations (e.g., geometric, temporal, photometric). We organize the dissertation as follows. In Part I, we propose to use consistency across different geometric formulations and a cycle consistency over time to tackle low-level scene geometry perception tasks in a self-supervised learning setting. In Part II, we tackle high-level semantic understanding tasks in a semi-supervised learning setting, with the constraint that different augmented views of the same visual input maintain consistent semantic information. In Part III, we tackle the cross-domain image segmentation problem. By encouraging an adaptive segmentation model to output consistent results for a diverse set of strongly augmented synthetic data, the model learns to perform test-time adaptation on unseen target domains with a single forward pass, without model training or optimization at inference time. | en |
dc.description.abstractgeneral | Recently, deep learning has emerged as one of the most powerful tools for solving various visual understanding tasks. However, the development of deep learning methods is significantly limited by the amount of manually labeled data. Annotating visual data is usually time-consuming and error-prone, making the human labeling process hard to scale. Fortunately, it is easy to collect large-scale, diverse raw visual data from the Internet (e.g., search engines, YouTube, Instagram), and we can effortlessly acquire a large amount of annotated synthetic visual data from game engines. In this dissertation, we explore how to utilize raw visual data and synthetic data for various visual tasks, aiming to replace or reduce direct supervision from manual annotations. The key idea is to encourage deep neural networks to produce consistent predictions for the same visual input across different transformations (e.g., geometric, temporal, photometric). We organize the dissertation as follows. In Part I, we propose using consistency across different geometric formulations and a forward-backward cycle consistency over time to tackle low-level scene geometry perception tasks, using unlabeled visual data only. In Part II, we tackle high-level semantic understanding tasks using a small amount of labeled data and a large amount of unlabeled data jointly, with the constraint that different augmented views of the same visual input maintain consistent semantic information. In Part III, we tackle the cross-domain image segmentation problem. By encouraging an adaptive segmentation model to output consistent results for a diverse set of strongly augmented synthetic data, the model learns to perform test-time adaptation on unseen target domains. | en |
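The augmentation-consistency idea summarized in the abstracts above (and reflected in the "Pseudo Labeling" subject keyword) can be illustrated with a minimal sketch. This is not the dissertation's actual method, only a simplified, FixMatch-style pseudo-labeling loss: confident predictions on a weakly augmented view act as targets for the strongly augmented view. The function name and threshold value are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_weak, logits_strong, threshold=0.95):
    """Simplified consistency loss between two augmented views (sketch).

    Predictions on the weakly augmented view serve as pseudo-labels
    for the strongly augmented view; only samples whose maximum
    predicted probability reaches `threshold` contribute to the loss.
    """
    probs_weak = softmax(logits_weak)
    pseudo = probs_weak.argmax(axis=-1)          # hard pseudo-labels
    mask = probs_weak.max(axis=-1) >= threshold  # keep confident samples only
    # Cross-entropy of strong-view predictions against the pseudo-labels.
    log_probs_strong = np.log(softmax(logits_strong))
    ce = -log_probs_strong[np.arange(len(pseudo)), pseudo]
    return float((ce * mask).sum() / max(mask.sum(), 1))
```

Minimizing this loss pushes the network toward consistent semantic predictions across augmentations, which is the constraint the semi-supervised setting in Part II relies on.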
dc.description.degree | Doctor of Philosophy | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:34390 | en |
dc.identifier.uri | http://hdl.handle.net/10919/110313 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons Attribution-NonCommercial 4.0 International | en |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/4.0/ | en |
dc.subject | Label-Efficient | en |
dc.subject | Consistency Regularization | en |
dc.subject | Visual Understanding | en |
dc.subject | Self-Supervised Learning | en |
dc.subject | Semi-Supervised Learning | en |
dc.subject | Pseudo Labeling | en |
dc.subject | Test-Time Adaptation | en |
dc.subject | BatchNorm Calibration | en |
dc.subject | Cross-Domain Generalization | en |
dc.title | Label-Efficient Visual Understanding with Consistency Constraints | en |
dc.type | Dissertation | en |
thesis.degree.discipline | Computer Engineering | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | doctoral | en |
thesis.degree.name | Doctor of Philosophy | en |