Tweet URL Extraction Crawling

dc.contributor.author: Bridges, Chris
dc.contributor.author: Chun, David
dc.contributor.author: Tat, Carter
dc.date.accessioned: 2018-05-11T03:11:41Z
dc.date.available: 2018-05-11T03:11:41Z
dc.date.issued: 2018-05-02
dc.description.abstract: In this report and the supplemental code, we document our work on the Tweet URL Extraction project for CS4624 (Multimedia/Hypertext/Information Access) during the spring 2018 semester at Virginia Tech. The purpose of this project is to aid our client, Liuqing Li, with his research in archiving digital content as part of the Global Event and Trend Archive Research (GETAR) project supported by NSF (IIS-1619028 and IIS-1619371). The project requires tweet collections to be processed to find the links most relevant to their respective events, which can then be integrated into the digital library. The client has more than 1,400 tweet collections containing over two billion tweets, and our team built a solution that uses machine learning to deliver event-related, representative URLs.

Our client requested that we use a fast scripting language to build middleware connecting a large tweet collection to an event-focused URL crawler. To ensure we had a representative data set during development, much of our work centered on one specific tweet collection, which covers the school shooting that occurred at Marshall County High School in Kentucky, USA, on January 23, 2018. The event-focused crawler takes the links we provide and crawls them so they can be collected and archived in a digital library/archive system.

Our deliverables comprise the following programs: extract.py, model.py, create_model.py, and conversion.py. Using the client's tweet collection as input, extract.py scans the comma-separated values (CSV) files and extracts the links from tweets that contain them. Because Twitter enforces a character limit on each tweet, all links are initially shortened; extract.py resolves each link to a full URL and then saves the results to a file (a minimal sketch of this step appears after the metadata fields below). The links at this stage are separate from the client's tweet collection and are ready to be made into testing and training data.

All of the crucial functionality in our programs is supported by open-source libraries, so development required no funding. Further development of our software could produce a powerful solution for our client; we believe several components, such as the extractor, the model, and the testing and training data, could be reused and improved upon.
dc.description.notes: Description of files:
- TweetURLExtraction_Report (PDF and DOCX): A report detailing the work done on our project throughout the semester.
- TweetURLExtraction_Final_Presentation (PDF and PPTX): Slides from a presentation given in class detailing the results of our project.
- supp_code_and_data.zip: Contains the referenced code and data used throughout the project:
  - extract.py: Extracts links from a tweet collection and resolves them to full URLs.
  - create_model.py: Allows users to generate their own model with training data they provide; generates a model and vectorizer which can be used in conversion.py.
  - conversion.py: Predicts whether the given links are relevant or not, using a trained model and vectorizer.
  - model.py: Trains a classifier model that can be used to predict whether links are relevant or not (a sketch of the training and prediction steps appears after the metadata fields below).
  - kentucky school shooting.json: A JSON file containing an example of the sort of JSON needed as input for extract.py.
  - model.pickle: The model used to classify articles as good or bad; trained by model.py using the training dataset, good.txt and bad.txt.
  - vectorizer.pickle: The vectorizer used to convert articles to vectors; generated by model.py using the same training dataset.
  - good.txt: A list of URLs labeled as relevant, used in the training process of model.py.
  - bad.txt: A list of URLs labeled as not relevant, used in the training process of model.py.
  - stop_words.txt: A list of words that are ignored when generating the TF-IDF scores in model.py.
  - randomurls.py: Generates a list of random URLs, useful for creating additional non-relevant links for training purposes.
  - url_info.csv: An example of extracted text from each URL in a tweet collection.
dc.description.sponsorship: NSF: IIS-1619028
dc.description.sponsorship: NSF: IIS-1619371
dc.identifier.uri: http://hdl.handle.net/10919/83215
dc.language.iso: en_US
dc.publisher: Virginia Tech
dc.rights: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/
dc.subject: Machine learning
dc.subject: Data Mining
dc.subject: Web Crawling
dc.subject: Tweet Collections
dc.subject: GETAR
dc.subject: Twitter
dc.subject: Events Archive
dc.subject: Web Scraping
dc.subject: Article Filtering
dc.title: Tweet URL Extraction Crawling
dc.type: Dataset
dc.type: Presentation
dc.type: Report
dc.type: Software
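
Two of the processing steps described above lend themselves to short illustrations. First, a minimal sketch of the extract.py step, assuming the tweet CSV has a "text" column; the column name, the URL pattern, and the use of the requests library are assumptions for illustration, not details taken from the project's code:

import csv
import re
import requests

URL_PATTERN = re.compile(r"https?://\S+")

def resolve(short_url, timeout=5):
    # Twitter-shortened (t.co) links redirect to the full article URL;
    # follow the redirect chain and report where it ends.
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return resp.url
    except requests.RequestException:
        return None  # dead or unreachable link

def extract_urls(csv_path, out_path, text_column="text"):
    # Scan a tweet CSV, pull shortened links out of each tweet's text,
    # resolve them, and write one full URL per line.
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            for short in URL_PATTERN.findall(row.get(text_column, "")):
                full = resolve(short)
                if full:
                    dst.write(full + "\n")

Second, a minimal sketch of the model.py and conversion.py steps. The record above does not name the classifier or library used, so the scikit-learn TF-IDF vectorizer and naive Bayes classifier here are assumptions; only the pickle file names mirror the deliverables listed in the notes:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def train(good_texts, bad_texts, stop_words):
    # Label article text fetched from the relevant (good.txt) and
    # non-relevant (bad.txt) URLs: 1 = relevant, 0 = not relevant.
    texts = good_texts + bad_texts
    labels = [1] * len(good_texts) + [0] * len(bad_texts)
    # TF-IDF vectorizer, ignoring the words listed in stop_words.txt.
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
    # Persist both artifacts, mirroring model.pickle and vectorizer.pickle.
    with open("model.pickle", "wb") as f:
        pickle.dump(model, f)
    with open("vectorizer.pickle", "wb") as f:
        pickle.dump(vectorizer, f)

def is_relevant(article_text):
    # conversion.py-style prediction: load the pickled model and
    # vectorizer, then classify a candidate article's text.
    with open("model.pickle", "rb") as f:
        model = pickle.load(f)
    with open("vectorizer.pickle", "rb") as f:
        vectorizer = pickle.load(f)
    return bool(model.predict(vectorizer.transform([article_text]))[0])

In this sketch, train() would be fed article text fetched from the URLs in good.txt and bad.txt, and is_relevant() reproduces the conversion.py behavior of labeling a candidate article with the saved model and vectorizer.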

Files

Original bundle (5 files):

Name                                         Size       Format
supp_code_and_data.zip                       14.84 MB   ZIP archive
TweetURLExtraction_Final_Presentation.pdf    301.83 KB  Adobe Portable Document Format
TweetURLExtraction_Final_Presentation.pptx   785.49 KB  Microsoft PowerPoint XML
TweetURLExtraction_Final_Report.docx         1.11 MB    Microsoft Word XML
TweetURLExtraction_Final_Report.pdf          2.62 MB    Adobe Portable Document Format

License bundle (1 file):

Name         Size    Format
license.txt  1.5 KB  Item-specific license agreed upon at submission