Going Deeper with Images and Natural Language

Virginia Tech

One aim of artificial intelligence (AI) is to develop an intelligent agent that can perceive and understand the complex visual environment around us. More ambitiously, it should be able to interact with us about its surroundings in natural language. Thanks to progress in deep learning, we have seen major breakthroughs towards this goal over the last few years. Developments have been especially rapid in visual recognition, where machines can now categorize images into multiple classes and detect various objects within an image, with an ability that is competitive with or even surpasses that of humans. Meanwhile, we have witnessed similar strides in natural language processing (NLP). It is now common to see computers perform text classification, machine translation, and similar tasks almost perfectly. However, despite much inspiring progress, most of these achievements remain within a single domain and do not handle inter-domain situations. The interaction between the visual and textual areas is still quite limited, although there has been progress in image captioning, visual question answering, and related tasks.

In this dissertation, we design models and algorithms that enable us to build in-depth connections between images and natural language, which help us better understand their inner structures. In particular, we first study how to make machines generate image descriptions that are indistinguishable from those written by humans, which as a result also achieves better quantitative evaluation performance. Second, we devise a novel algorithm for measuring review congruence, which takes an image and review text as input and quantifies the relevance of each sentence to the image. The whole model is trained without any supervised ground-truth labels. Finally, we propose a new AI task called Image Aspect Mining, which detects visual aspects in images and identifies aspect-level ratings within the review context.

On the theoretical side, this research contributes to multiple research areas: Computer Vision (CV), Natural Language Processing (NLP), the interaction between CV and NLP, and Deep Learning. Regarding impact, these techniques will benefit users such as the visually impaired, customers reading reviews, merchants, and AI researchers in general.

Image Captioning, Quasi-Supervised Learning, Image Aspect Mining, GANs, Deep learning (Machine learning)