dc.contributor.author: Bridges, Chris
dc.contributor.author: Chun, David
dc.contributor.author: Tat, Carter
dc.description.abstract: In the report and supplemental code, we document our work on the Tweet URL extraction project for CS4624 (Multimedia/Hypertext/Information Access) during the spring 2018 semester at Virginia Tech. The purpose of this project is to aid our client, Liuqing Li, with his research in archiving digital content, part of the Global Event and Trend Archive Research (GETAR) project supported by NSF (IIS-1619028 and 1619371). The project requires tweet collections to be processed to find the links most relevant to their respective events, which can then be integrated into the digital library. The client has more than 1,400 tweet collections with over two billion tweets, and our team found a solution that uses machine learning to deliver event-related representative URLs. Our client requested that we use a fast scripting language to build middleware connecting a large tweet collection to an event-focused URL crawler. To make sure we had a representative data set, much of our development centered on a specific tweet collection, which focuses on the school shooting that occurred at Marshall High School in Kentucky, USA on January 23, 2018. The event-focused crawler will take the links we provide and crawl them so that they can be collected and archived in a digital library/archive system. Our deliverables contain several programs. Using the client's tweet collection as input, the extractor scans the comma-separated values (CSV) files and extracts the links from tweets containing links. Because Twitter enforces a character limit on each tweet, all links are initially shortened. The extractor converts each link to a full URL, then saves them to a file. The links at this stage are separate from the client's tweet collection and are ready to be made into testing and training data. All of the crucial functionality in our program is supported by open-source libraries, so our program did not require any funds to develop.
Further development of our software could create a powerful solution for our client. We believe certain parts of our code could be reused and improved upon, such as the extractor, the model, and the data we used for testing and training. (en_US)
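As a rough illustration of the link-extraction step described in the abstract, the sketch below scans CSV tweet rows and pulls out the links. It is not the project's code: the function name `extract_links`, the column name `text`, and the regular expression are assumptions made for this example. Resolving shortened links to full URLs needs network access and is left out.

```python
import csv
import io
import re

# Assumed pattern: an http(s) link runs until the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_links(csv_file, text_column="text"):
    """Yield every link found in one column of a CSV file of tweets.

    `text_column` is a hypothetical column name; a real tweet collection
    may label its tweet text differently.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        for link in URL_PATTERN.findall(row.get(text_column) or ""):
            # Trim punctuation that often trails a link in running text.
            yield link.rstrip(".,;:!?)\"'")


# Small in-memory stand-in for a tweet collection file.
sample = io.StringIO(
    "id,text\n"
    "1,Breaking news https://t.co/abc123\n"
    "2,No link here\n"
    '3,"Two links: https://t.co/x1, and http://t.co/y2."\n'
)
links = list(extract_links(sample))
print(links)
```

Tweets without links contribute nothing to the output, so the result is a flat list of shortened links ready for a later resolution step.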
dc.description.sponsorship: NSF: IIS-1619028 (en_US)
dc.description.sponsorship: NSF: IIS-1619371 (en_US)
dc.publisher: Virginia Tech (en_US)
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States (*)
dc.subject: Machine Learning (en_US)
dc.subject: Data Mining (en_US)
dc.subject: Web Crawling (en_US)
dc.subject: Tweet Collections (en_US)
dc.subject: Events Archive (en_US)
dc.subject: Web Scraping (en_US)
dc.subject: Article Filtering (en_US)
dc.title: Tweet URL Extraction Crawling (en_US)
dc.description.notes: Description of Files (en_US):
TweetURLExtraction_Report (PDF and DOCX): A report detailing the work done on our project throughout the semester.
TweetURLExtraction_Final_Presentation (PDF and PPTX): Slides from a presentation given in class detailing the results of our project.
The referenced code and data used throughout the project.
A program that extracts links from a Tweet collection and resolves them to full URLs.
A program that allows users to generate their own model with training data they provide. It generates a model and vectorizer that can be used for prediction.
A program that predicts whether the given links are relevant or not. It uses a trained model and vectorizer to make these predictions.
A program that trains a classifier model that can be used to predict whether links are relevant or not.
kentucky school shooting.json: A JSON file containing an example of the sort of JSON needed as input.
model.pickle: The model used to classify articles as good or bad. Trained using the training dataset, good.txt and bad.txt.
vectorizer.pickle: The vectorizer used to convert articles to vectors. Generated using the training dataset, good.txt and bad.txt.
good.txt: A list of URLs that were labeled as relevant. Used in the training process.
bad.txt: A list of URLs that were labeled as not relevant. Used in the training process.
stop_words.txt: A list of words that are ignored when generating the TF-IDF scores.
A program that generates a list of random URLs. Useful for creating additional non-relevant links for training purposes.
url_info.csv: An example of extracted text from each URL in a Tweet collection.
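To make the role of the stop-word list and the TF-IDF scores concrete, here is a minimal sketch using only the Python standard library. It assumes the textbook weighting tf × log(N / df); the record does not say which TF-IDF variant or library the project's vectorizer uses, and the function name `tf_idf` and the sample documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents, stop_words):
    """Return one {term: score} dict per document, scored as tf * log(N / df)."""
    # Lowercase, split on whitespace, and drop stop words before scoring.
    tokenized = [
        [word for word in doc.lower().split() if word not in stop_words]
        for doc in documents
    ]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical article snippets and stop words, for illustration only.
docs = [
    "the vigil at the school",
    "the school reopened",
    "sports scores today",
]
stop = {"the", "at"}
scores = tf_idf(docs, stop)
print(scores[0])
```

Terms that appear in fewer documents receive higher scores, and stop words never enter the score table at all, which is the effect a file such as stop_words.txt would have on the vectors passed to a classifier.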

License: Attribution-NonCommercial-ShareAlike 3.0 United States