CTRimages

Abstract

CTRnet (Crisis, Tragedy, and Recovery network) is an NSF-funded project that crawls the Internet for content about tragic events and builds digital libraries of information on those crises. CTRnet downloads webpages related to these events so that the information is preserved. For example, CTRnet has saved over 440 gigabytes of webpages for the Hurricane Sandy event alone.

Our group was tasked with creating a script that walks through the downloaded webpages, finds relevant images, and downloads them. We also researched gallery modules in order to build a Drupal gallery for the downloaded images.

Description
parse_images.py – a Python script that finds all URLs inside HTML image tags and creates one text document with the URLs and another with the ALT tags.
bannedUrls.txt – a list of URLs from which no images will be downloaded.
ctrfilter – a Bash script that runs the parsing script on all .html and .htm files in the current directory and its subdirectories.
filter_images.py – a Python script that filters the collected URLs based on banned URLs, image dimensions, ALT tags, and file types, and downloads the remaining images into a specified folder.
CS4624_Documentation.docx – documentation for the project.
ImageProperties.xlsx – an Excel spreadsheet with information on all images found on the set of webpages we were provided.
FinalPresentation.pptx – the final presentation given in class.
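As an illustration of the parsing step, the following is a minimal sketch, not the project's actual parse_images.py, that extracts image URLs and ALT text from a single HTML file using only the Python standard library; the output file names image_urls.txt and image_alts.txt are placeholders chosen for this example.

# Hypothetical sketch: collect <img> src and alt attributes from one HTML file.
from html.parser import HTMLParser
import sys

class ImgTagParser(HTMLParser):
    """Collects the src and alt attributes of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.urls = []
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            if "src" in attr_map:
                self.urls.append(attr_map["src"])
            if "alt" in attr_map:
                self.alts.append(attr_map["alt"])

if __name__ == "__main__":
    # Usage: python parse_images_sketch.py page.html
    parser = ImgTagParser()
    with open(sys.argv[1], encoding="utf-8", errors="ignore") as f:
        parser.feed(f.read())
    # Append the collected URLs and ALT strings to two text files,
    # mirroring the two-output design described above.
    with open("image_urls.txt", "a", encoding="utf-8") as out:
        out.write("\n".join(parser.urls) + "\n")
    with open("image_alts.txt", "a", encoding="utf-8") as out:
        out.write("\n".join(parser.alts) + "\n")

A wrapper such as ctrfilter can then invoke a script like this once per .html or .htm file, after which the filtering step decides which of the collected URLs are actually downloaded.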
Keywords
python script, image parsing, image filtering, CTR, Drupal gallery, Crisis, Tragedy, and Recovery Network Project
Citation