Tweet URL Extraction Crawling

In this report and its supplemental code, we document our work on the Tweet URL extraction project for CS4624 (Multimedia/Hypertext/Information Access) during the spring 2018 semester at Virginia Tech. The purpose of the project is to aid our client, Liuqing Li, with his research in archiving digital content, part of the Global Event and Trend Archive Research (GETAR) project supported by NSF (IIS-1619028 and 1619371). The project requires tweet collections to be processed to find the links most relevant to their respective events, so that those links can be integrated into the digital library. The client has more than 1,400 tweet collections with over two billion tweets, and our team developed a solution that uses machine learning to deliver event-related representative URLs.

Our client requested that we use a fast scripting language to build middleware connecting a large tweet collection to an event-focused URL crawler. To make sure we had a representative data set, much of our development centered on a specific tweet collection, which focuses on the school shooting that occurred at Marshall County High School in Kentucky, USA on January 23, 2018. The event-focused crawler will take the links we provide and crawl them so that the pages can be collected and archived in a digital library/archive system.

Our deliverables consist of the following programs: extract.py, model.py, create_model.py, and conversion.py. Using the client’s tweet collection as input, extract.py scans the comma-separated values (CSV) files and extracts the links from tweets that contain them. Because Twitter enforces a character limit on each tweet, all links are initially shortened. The extract.py script converts each link to a full URL and then saves the results to a file. The links at this stage are separate from the client’s tweet collection and are ready to be made into testing and training data.
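
The extraction step described above can be illustrated with a minimal sketch. This is not the code in extract.py; it only shows one plausible way to pull links out of a CSV tweet collection and expand them using the Python standard library. The column name `text`, the URL pattern, and the function names `extract_links`, `expand_url`, and `extract_and_expand` are all assumptions made for illustration, since the layout of the client's CSV files is not given here.

```python
import csv
import http.client
import re
import urllib.request

# Assumption: tweet text lives in a column named "text"; the real layout may differ.
TEXT_COLUMN = "text"
# Simple illustrative pattern; it stops at whitespace, quotes, and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_links(csv_file, text_column=TEXT_COLUMN):
    """Yield every URL found in the tweet-text column of an open CSV file."""
    for row in csv.DictReader(csv_file):
        for url in URL_PATTERN.findall(row.get(text_column) or ""):
            # Drop trailing punctuation that belongs to the sentence, not the link.
            yield url.rstrip(".,;:!?)")


def expand_url(short_url, timeout=10):
    """Follow redirects with a HEAD request; return the final URL or None on failure."""
    request = urllib.request.Request(short_url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError, http.client.HTTPException):
        return None


def extract_and_expand(csv_file, resolve=expand_url):
    """Return the unique expanded URLs in first-seen order."""
    seen = {}
    for short_url in extract_links(csv_file):
        full_url = resolve(short_url)
        if full_url and full_url not in seen:
            seen[full_url] = None
    return list(seen)
```

The `resolve` parameter lets a caller substitute a cached or offline resolver for the network call, which matters at the scale of billions of tweets. Some servers reject HEAD requests, so a production version would likely need a GET fallback.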

All of the crucial functionality in our programs is provided by open-source libraries, so the software required no funding to develop. Further development of our software could create a powerful solution for our client. We believe parts of our work could be reused and improved upon, such as the extractor, the model, and the data we used for testing and training.

Keywords

Machine Learning, Data Mining, Web Crawling, Tweet Collections, GETAR, Twitter, Events Archive, Web Scraping, Article Filtering

Persistent link

http://hdl.handle.net/10919/83215

Collections

CS4624: Multimedia, Hypertext, and Information Access
