CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study

Djoufack Basso, Larissa

CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study

dc.contributor.author	Djoufack Basso, Larissa	en
dc.contributor.committeechair	Lu, Chang Tien	en
dc.contributor.committeemember	Cho, Jin-Hee	en
dc.contributor.committeemember	Chen, Ing Ray	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2022-06-22T08:00:45Z	en
dc.date.available	2022-06-22T08:00:45Z	en
dc.date.issued	2022-06-21	en
dc.description.abstract	Satellite imagery research used to be an expensive research topic for companies and organizations due to the limited data and compute resources. As the computing power and storage capacity grows exponentially, a large amount of aerial and satellite images are generated and analyzed everyday for various applications. Current technological advancement and extensive data collection by numerous Internet of Things (IOT) devices and platforms have amplified labeled natural images. Such data availability catalyzed the development and performance of current state-of-the-art image classification and cross-modal models. Despite the abundance of publicly available remote sensing images, very few remote sensing (RS) images are labeled and even fewer are multi-captioned.These scarcities limit the scope of fine tuned state of the art models to at most 38 classes, based on the PatternNet data, one of the largest publicly available labeled RS data. Recent state-of-the art image-to-image retrieval and detection models in RS have shown great results. Because the text-to-image retrieval of RS images is still emerging, it still faces some challenges in the retrieval of those images.These challenges are based on the inaccurate retrieval of image categories that were not present in the training dataset and the retrieval of images from descriptive input. Motivated by those shortcomings in current cross-modal remote sensing image retrieval, we proposed CLIP-RS, a cross-modal remote sensing image retrieval platform. Our proposed framework CLIP-RS is a framework that combines a fine-tuned implementation of a recent state of the art cross-modal and text-based image retrieval model, Contrastive Language Image Pre-training (CLIP) and FAISS (Facebook AI similarity search), a library for efficient similarity search. Our implementation is deployed on a Web App for inference task on text-to-image and image-to-image retrieval of RS images collected via the Mapbox GL JS API. We used the free tier option of the Mapbox GL JS API and took advantage of its raster tiles option to locate the retrieved results on a local map, a combination of the downloaded raster tiles. Other options offered on our platform are: image similarity search, locating an image in the map, view images' geocoordinates and addresses.In this work we also proposed two remote sensing fine-tuned models and conducted a comparative analysis of our proposed models with a different fine-tuned model as well as the zeroshot CLIP model on remote sensing data.	en
dc.description.abstractgeneral	Satellite imagery research used to be an expensive research topic for companies and organizations due to the limited data and compute resources. As the computing power and storage capacity grows exponentially, a large amount of aerial and satellite images are generated and analyzed everyday for various applications. Current technological advancement and extensive data collection by numerous Internet of Things (IOT) devices and platforms have amplified labeled natural images. Such data availability catalyzed the devel- opment and performance of current state-of-the-art image classification and cross-modal models. Despite the abundance of publicly available remote sens- ing images, very few remote sensing (RS) images are labeled and even fewer are multi-captioned.These scarcities limit the scope of fine tuned state of the art models to at most 38 classes, based on the PatternNet data,one of the largest publicly avail- able labeled RS data.Recent state-of-the art image-to-image retrieval and detection models in RS have shown great results. Because the text-to-image retrieval of RS images is still emerging, it still faces some challenges in the re- trieval of those images.These challenges are based on the inaccurate retrieval of image categories that were not present in the training dataset and the re- trieval of images from descriptive input. Motivated by those shortcomings in current cross-modal remote sensing image retrieval, we proposed CLIP-RS, a cross-modal remote sensing image retrieval platform.Cross-modal retrieval focuses on data retrieval across different modalities and in the context of this work, we focus on textual and imagery modalities.Our proposed frame- work CLIP-RS is a framework that combines a fine-tuned implementation of a recent state of the art cross-modal and text-based image retrieval model, Contrastive Language Image Pre-training (CLIP) and FAISS (Facebook AI similarity search), a library for efficient similarity search. In deep learning, the concept of fine tuning consists of using weights from a different model or algorithm into a similar model with different domain specific application. Our implementation is deployed on a Web Application for inference tasks on text-to-image and image-to-image retrieval of RS images collected via the Mapbox GL JS API. We used the free tier option of the Mapbox GL JS API and took advantage of its raster tiles option to locate the retrieved results on a local map, a combination of the downloaded raster tiles. Other options offered on our platform are: image similarity search, locating an image in the map, view images' geocoordinates and addresses.In this work we also pro- posed two remote sensing fine-tuned models and conducted a comparative analysis of our proposed models with a different fine-tuned model as well as the zeroshot CLIP model on remote sensing data. Detection models in RS have shown great results. Because the text-to-image retrieval of RS images is still emerging, it still faces some challenges in the re- trieval of those images.These challenges are based on the inaccurate retrieval of image categories that were not present in the training dataset and the re- trieval of images from descriptive input. Motivated by those shortcomings in current cross-modal remote sensing image retrieval, we proposed CLIP-RS, a cross-modal remote sensing image retrieval platform.Cross-modal retrieval focuses on data retrieval across different modalities and in the context of this work, we focus on textual and imagery modalities.Our proposed frame- work CLIP-RS is a framework that combines a fine-tuned implementation of a recent state of the art cross-modal and text-based image retrieval model, Contrastive Language Image Pre-training (CLIP) and FAISS (Facebook AI similarity search), a library for efficient similarity search. In deep learning, the concept of fine tuning consists of using weights from a different model or algorithm into a similar model with different domain specific application. Our implementation is deployed on a Web Application for inference tasks on text-to-image and image-to-image retrieval of RS images collected via the Mapbox GL JS API. We used the free tier option of the Mapbox GL JS API and took advantage of its raster tiles option to locate the retrieved results on a local map, a combination of the downloaded raster tiles. Other options offered on our platform are: image similarity search, locating an image in the map, view images' geocoordinates and addresses.In this work we also pro- posed two remote sensing fine-tuned models and conducted a comparative analysis of our proposed models with a different fine-tuned model as well as the zeroshot CLIP model on remote sensing data.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:35188	en
dc.identifier.uri	http://hdl.handle.net/10919/110853	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Remote Sensing Image Retrieval	en
dc.subject	Textual input	en
dc.subject	Spatial Database Indexing and Retrieval	en
dc.subject	Contrastive Learning	en
dc.subject	Cross-modal	en
dc.title	CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Now showing 1 - 3 of 3

Name:: Djoufack_Basso_L_T_2022.pdf
Size:: 15.87 MB
Format:: Adobe Portable Document Format

Download

Name:: Djoufack_Basso_L_T_2022_support_3.docx
Size:: 10.21 KB
Format:: Microsoft Word XML
Description:: Supporting documents

Download

Name:: Djoufack_Basso_L_T_2022_support_1.pdf
Size:: 15.87 MB
Format:: Adobe Portable Document Format
Description:: Supporting documents

Download

Collections

Masters Theses