Unsupervised Learning of Spatiotemporal Features by Video Completion

dc.contributor.author: Nallabolu, Adithya Reddy
dc.contributor.committeechair: Kochersberger, Kevin B.
dc.contributor.committeechair: Huang, Jia-Bin
dc.contributor.committeemember: Dhillon, Harpreet Singh
dc.contributor.department: Electrical and Computer Engineering
dc.date.accessioned: 2017-10-19T08:00:43Z
dc.date.available: 2017-10-19T08:00:43Z
dc.date.issued: 2017-10-18
dc.description.abstract: In this work, we present an unsupervised representation learning approach for learning rich spatiotemporal features from videos without supervision from semantic labels. We propose to learn spatiotemporal features by training a 3D convolutional neural network (CNN) using video completion as a surrogate task. Using a large collection of unlabeled videos, we train the CNN to predict the missing pixels of a spatiotemporal hole given the remaining parts of the video by minimizing a per-pixel reconstruction loss. To achieve good reconstruction results on color videos, the CNN needs a certain level of understanding of the scene dynamics and must predict plausible, temporally coherent content. We further explore jointly reconstructing both color frames and flow fields. By exploiting the statistical temporal structure of videos, we show that the learned representations capture meaningful spatiotemporal structures from raw videos. We validate the effectiveness of our approach for CNN pre-training on action recognition and action similarity labeling problems. Our quantitative results demonstrate that our method compares favorably against learning without external data and against existing unsupervised learning approaches.
dc.description.abstractgeneral: Current supervised representation learning methods leverage large datasets of millions of labeled examples to learn semantically meaningful visual representations, at the cost of thousands of tedious human hours spent manually labeling these datasets. But do we need semantically labeled images to learn good visual representations? Humans learn visual representations with little or no semantic supervision, yet existing approaches are mostly supervised. In this work, we propose an unsupervised visual representation learning algorithm that learns useful spatiotemporal features by formulating a video completion problem. To predict the missing pixels of a video, the model needs a high-level semantic understanding of the scene and of the motion patterns of people and objects. We demonstrate that the video completion task effectively learns semantically meaningful spatiotemporal features from raw natural videos without semantic labels. The learned representations provide a good network weight initialization for applications with few training examples. We show a significant performance gain over training the model from scratch and demonstrate improved performance on action recognition and action similarity labeling tasks compared with competitive unsupervised learning algorithms.
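The abstract's training objective can be illustrated with a minimal sketch: mask out a spatiotemporal hole in a video clip and score a prediction by per-pixel reconstruction loss over the hole only. The array shapes, the hole coordinates, and the helper names below are illustrative assumptions, not values or code from the thesis.

```python
import numpy as np

def make_hole_mask(shape, t0, t1, y0, y1, x0, x1):
    """Boolean mask that is True inside the spatiotemporal hole.

    `shape` is (T, H, W, C); the hole spans frames [t0, t1) and the
    spatial box [y0, y1) x [x0, x1) in every channel.
    """
    mask = np.zeros(shape, dtype=bool)
    mask[t0:t1, y0:y1, x0:x1, :] = True
    return mask

def reconstruction_loss(pred, target, mask):
    """Per-pixel mean squared error, averaged over the missing pixels only."""
    return ((pred - target) ** 2)[mask].mean()

rng = np.random.default_rng(0)
video = rng.random((16, 32, 32, 3))                 # a (T, H, W, C) clip
mask = make_hole_mask(video.shape, 4, 12, 8, 24, 8, 24)

# The network would receive the clip with the hole zeroed out ...
network_input = np.where(mask, 0.0, video)

# ... and be trained to fill in the missing content. A perfect
# prediction drives the loss to zero; the trivial zero-filled guess
# leaves a positive loss that training would push down.
perfect_loss = reconstruction_loss(video, video, mask)
naive_loss = reconstruction_loss(network_input, video, mask)
```

In the thesis a 3D CNN plays the role of the predictor; this sketch only shows how the hole mask restricts the loss to the missing region, so the visible context is never penalized.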
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:12668
dc.identifier.uri: http://hdl.handle.net/10919/79702
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Representation Learning
dc.subject: Supervised
dc.subject: Unsupervised
dc.title: Unsupervised Learning of Spatiotemporal Features by Video Completion
dc.type: Thesis
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science

Files

Original bundle
Name: Nallabolu_A_T_2017.pdf
Size: 17.03 MB
Format: Adobe Portable Document Format
