Tweet URL Extraction Crawling

dc.contributor.author: Bridges, Chris
dc.contributor.author: Chun, David
dc.contributor.author: Tat, Carter
dc.date.accessioned: 2018-05-11T03:11:41Z
dc.date.available: 2018-05-11T03:11:41Z
dc.date.issued: 2018-05-02
dc.description.abstract: In this report and the supplemental code, we document our work on the Tweet URL Extraction project for CS4624 (Multimedia/Hypertext/Information Access) during the spring 2018 semester at Virginia Tech. The purpose of this project is to aid our client, Liuqing Li, with his research in archiving digital content as part of the Global Event and Trend Archive Research (GETAR) project supported by NSF (IIS-1619028 and IIS-1619371). The project requires tweet collections to be processed to find the links most relevant to their respective events, which can then be integrated into the digital library. The client has more than 1,400 tweet collections containing over two billion tweets, and our team built a solution that uses machine learning to deliver event-related, representative URLs.

Our client requested that we use a fast scripting language to build middleware connecting a large tweet collection to an event-focused URL crawler. To ensure we had a representative data set during development, much of our work centered on one specific tweet collection, which covers the school shooting that occurred at Marshall County High School in Kentucky, USA, on January 23, 2018. The event-focused crawler takes the links we provide and crawls them so they can be collected and archived in a digital library/archive system.

Our deliverables comprise the following programs: extract.py, model.py, create_model.py, and conversion.py. Using the client's tweet collection as input, extract.py scans the comma-separated values (CSV) files and extracts the links from tweets that contain them. Because Twitter enforces a character limit on each tweet, all links are initially shortened; extract.py resolves each link to a full URL and then saves the results to a file (a minimal sketch of this step appears after the metadata fields below). The links at this stage are separate from the client's tweet collection and are ready to be made into testing and training data.

All of the crucial functionality in our programs is supported by open-source libraries, so development required no funding. Further development of our software could produce a powerful solution for our client; we believe several components, such as the extractor, the model, and the testing and training data, could be reused and improved upon.
dc.description.notes: Description of files:
- TweetURLExtraction_Report (PDF and DOCX): A report detailing the work done on our project throughout the semester.
- TweetURLExtraction_Final_Presentation (PDF and PPTX): Slides from a presentation given in class detailing the results of our project.
- supp_code_and_data.zip: Contains the referenced code and data used throughout the project:
  - extract.py: Extracts links from a tweet collection and resolves them to full URLs.
  - create_model.py: Allows users to generate their own model with training data they provide; generates a model and vectorizer which can be used in conversion.py.
  - conversion.py: Predicts whether the given links are relevant or not, using a trained model and vectorizer.
  - model.py: Trains a classifier model that can be used to predict whether links are relevant or not (a sketch of the training and prediction steps appears after the metadata fields below).
  - kentucky school shooting.json: A JSON file containing an example of the sort of JSON needed as input for extract.py.
  - model.pickle: The model used to classify articles as good or bad; trained by model.py using the training dataset, good.txt and bad.txt.
  - vectorizer.pickle: The vectorizer used to convert articles to vectors; generated by model.py using the same training dataset.
  - good.txt: A list of URLs labeled as relevant, used in the training process of model.py.
  - bad.txt: A list of URLs labeled as not relevant, used in the training process of model.py.
  - stop_words.txt: A list of words that are ignored when generating the TF-IDF scores in model.py.
  - randomurls.py: Generates a list of random URLs, useful for creating additional non-relevant links for training purposes.
  - url_info.csv: An example of extracted text from each URL in a tweet collection.
dc.description.sponsorship: NSF: IIS-1619028
dc.description.sponsorship: NSF: IIS-1619371
dc.identifier.uri: http://hdl.handle.net/10919/83215
dc.language.iso: en_US
dc.publisher: Virginia Tech
dc.rights: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/
dc.subject: Machine learning
dc.subject: Data Mining
dc.subject: Web Crawling
dc.subject: Tweet Collections
dc.subject: GETAR
dc.subject: Twitter
dc.subject: Events Archive
dc.subject: Web Scraping
dc.subject: Article Filtering
dc.title: Tweet URL Extraction Crawling
dc.type: Dataset
dc.type: Presentation
dc.type: Report
dc.type: Software
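
Two of the processing steps described above lend themselves to short illustrations. First, a minimal sketch of the extract.py step, assuming the tweet CSV has a "text" column; the column name, the URL pattern, and the use of the requests library are assumptions for illustration, not details taken from the project's code:

import csv
import re
import requests

URL_PATTERN = re.compile(r"https?://\S+")

def resolve(short_url, timeout=5):
    # Twitter-shortened (t.co) links redirect to the full article URL;
    # follow the redirect chain and report where it ends.
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return resp.url
    except requests.RequestException:
        return None  # dead or unreachable link

def extract_urls(csv_path, out_path, text_column="text"):
    # Scan a tweet CSV, pull shortened links out of each tweet's text,
    # resolve them, and write one full URL per line.
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            for short in URL_PATTERN.findall(row.get(text_column, "")):
                full = resolve(short)
                if full:
                    dst.write(full + "\n")

Second, a minimal sketch of the model.py and conversion.py steps. The record above does not name the classifier or library used, so the scikit-learn TF-IDF vectorizer and naive Bayes classifier here are assumptions; only the pickle file names mirror the deliverables listed in the notes:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def train(good_texts, bad_texts, stop_words):
    # Label article text fetched from the relevant (good.txt) and
    # non-relevant (bad.txt) URLs: 1 = relevant, 0 = not relevant.
    texts = good_texts + bad_texts
    labels = [1] * len(good_texts) + [0] * len(bad_texts)
    # TF-IDF vectorizer, ignoring the words listed in stop_words.txt.
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
    # Persist both artifacts, mirroring model.pickle and vectorizer.pickle.
    with open("model.pickle", "wb") as f:
        pickle.dump(model, f)
    with open("vectorizer.pickle", "wb") as f:
        pickle.dump(vectorizer, f)

def is_relevant(article_text):
    # conversion.py-style prediction: load the pickled model and
    # vectorizer, then classify a candidate article's text.
    with open("model.pickle", "rb") as f:
        model = pickle.load(f)
    with open("vectorizer.pickle", "rb") as f:
        vectorizer = pickle.load(f)
    return bool(model.predict(vectorizer.transform([article_text]))[0])

In this sketch, train() would be fed article text fetched from the URLs in good.txt and bad.txt, and is_relevant() reproduces the conversion.py behavior of labeling a candidate article with the saved model and vectorizer.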

Files

Original bundle (5 files):

Name                                         Size       Format
supp_code_and_data.zip                       14.84 MB   ZIP archive
TweetURLExtraction_Final_Presentation.pdf    301.83 KB  Adobe Portable Document Format
TweetURLExtraction_Final_Presentation.pptx   785.49 KB  Microsoft PowerPoint XML
TweetURLExtraction_Final_Report.docx         1.11 MB    Microsoft Word XML
TweetURLExtraction_Final_Report.pdf          2.62 MB    Adobe Portable Document Format

License bundle (1 file):

Name         Size    Format
license.txt  1.5 KB  Item-specific license agreed upon at submission