Global Event Crawler and Seed Generator for GETAR

Manchester, Emma; Srinivasan, Ravi; Crenshaw, Sean; Masterson, Alec; Grinnan, Harrison

dc.contributor.author	Manchester, Emma	en
dc.contributor.author	Srinivasan, Ravi	en
dc.contributor.author	Crenshaw, Sean	en
dc.contributor.author	Masterson, Alec	en
dc.contributor.author	Grinnan, Harrison	en
dc.date.accessioned	2017-05-13T00:34:44Z	en
dc.date.available	2017-05-13T00:34:44Z	en
dc.date.issued	2017-04-28	en
dc.description.abstract	Global Event and Trend Archive Research (GETAR) is a research project at Virginia Tech, studying the years from 1997 to 2020, which seeks to investigate and catalog events as they happen in support of future research. It will devise interactive and integrated digital library and archive systems coupled with linked and expert-curated web page and tweet collections. This historical record enables research on trends as history develops and captures valuable primary sources that would otherwise not be archived. An important capability of this project is the ability to predict which sources and stories will be most important in the future in order to prioritize those stories for archiving. It is in that space that our project will be most important. In support of GETAR, this project will build a powerful tool to scrape the news to identify important global events. It will generate seeds that contain relevant information such as a link, the topic, person, organization, source, etc. The seeds can then be used by others working on GETAR to collect webpages and tweets using tools like the Event Focused Crawler and Twitter Search. To achieve this goal, the Global Event Detector (GED) will crawl Reddit to determine possible important news stories. These stories will be grouped, and the top groupings will be displayed on a website as well as on a display in Torgersen Hall. This project will serve future research for the GETAR project, as well as those seeking real-time updates on events currently trending. The final deliverables discussed in this report include the code that scrapes Reddit and processes the data, and the webpage that visualizes the data.	en
dc.description.notes	Additional directories necessary to run the code can be found on our team's server: Host: 128.173.49.98 Port: 3306 Code Files: poller.py - The poller script is responsible for scraping the most popular news stories off of Reddit and storing the gathered information in the raw database. article.py - This file contains a definition for the NewsArticle object. The NewsArticle object holds information about a news article. Certain fields of the NewsArticle object are populated from the raw database. articleCluster.py - This file contains a definition for a Cluster object. A Cluster object contains information relevant to each cluster, and its fields are stored in the 2 cluster databases. processNews.py - This file parses article content, clusters articles, and extracts seeds from article content. driver.sh - A bash wrapper script that calls each Python script sequentially, every 12 hours. .htaccess - This is the file that manages all of the redirects for the entire website. config.php - This file serves as the configuration file for the entire website. global.php - This file makes config.php globally available so that all of the other files in our website can access the variables defined there, and it autoloads any objects that we may have defined in our model. siteController.php - This is the sole controller for our website; it defines the actions that we need to access and manipulate information in our database so that it can be displayed in our visualizations. home.tpl - This is the template webpage file that the server displays when someone visits the homepage. public/ - This is a directory that contains all publicly accessible files used for our website. GEDcode.zip - The zip file that contains all of the code for our project. GEDreport.docx - The editable Microsoft Word document containing the content of our final report. GEDreport.pdf - The PDF file containing the content of our final report. 
GEDpresentation.pptx - The editable Microsoft PowerPoint file containing the content of our final presentation. GEDpresentation.pdf - The PDF file containing the content of our final presentation. Details: enwiki_dbow —> distributed bag of words GoogleNews-vectors-negative300.bin —> pre-trained news vectors stanford-ner-2016-10-31 —> Stanford Named Entity Recognizer (SNER) w2v —> word to vector	en
dc.description.sponsorship	NSF Grant No. IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/77620	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	GETAR	en
dc.subject	CS 4624	en
dc.subject	Global Event Detector	en
dc.subject	D3.js	en
dc.subject	SNER	en
dc.subject	NLTK	en
dc.subject	Cluster	en
dc.subject	News	en
dc.subject	Reddit	en
dc.subject	digital library	en
dc.subject	webpage	en
dc.subject	tweet	en
dc.title	Global Event Crawler and Seed Generator for GETAR	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en