dc.contributor.author: Manchester, Emma
dc.contributor.author: Srinivasan, Ravi
dc.contributor.author: Crenshaw, Sean
dc.contributor.author: Masterson, Alec
dc.contributor.author: Grinnan, Harrison
dc.date.accessioned: 2017-05-13T00:34:44Z
dc.date.available: 2017-05-13T00:34:44Z
dc.date.issued: 2017-04-28
dc.identifier.uri: http://hdl.handle.net/10919/77620
dc.description.abstract: Global Event and Trend Archive Research (GETAR) is a research project at Virginia Tech, studying the years from 1997 to 2020, which seeks to investigate and catalog events as they happen in support of future research. It will devise interactive and integrated digital library and archive systems coupled with linked and expert-curated web page and tweet collections. This historical record enables research on trends as history develops and captures valuable primary sources that would otherwise not be archived. An important capability of GETAR is the ability to predict which sources and stories will be most important in the future in order to prioritize those stories for archiving. That is where our project contributes. In support of GETAR, this project will build a tool that scrapes the news to identify important global events. It will generate seeds that contain relevant information such as a link, the topic, person, organization, and source. The seeds can then be used by others working on GETAR to collect webpages and tweets using tools like the Event Focused Crawler and Twitter Search. To achieve this goal, the Global Event Detector (GED) will crawl Reddit to determine possible important news stories. These stories will be grouped, and the top groupings will be displayed on a website as well as on a display in Torgersen Hall. This project will serve future research for the GETAR project, as well as those seeking real-time updates on currently trending events. The final deliverables discussed in this report include the code that scrapes Reddit and processes the data, and the webpage that visualizes the data. [en_US]
dc.description.sponsorship: NSF Grant No. IIS-1619028 [en_US]
dc.language.iso: en_US [en_US]
dc.publisher: Virginia Tech [en_US]
dc.subject: GETAR [en_US]
dc.subject: CS 4624 [en_US]
dc.subject: Global Event Detector [en_US]
dc.subject: D3.js [en_US]
dc.subject: SNER [en_US]
dc.subject: NLTK [en_US]
dc.subject: Cluster [en_US]
dc.subject: News [en_US]
dc.subject: Reddit [en_US]
dc.subject: digital library [en_US]
dc.subject: webpage [en_US]
dc.subject: tweet [en_US]
dc.title: Global Event Crawler and Seed Generator for GETAR [en_US]
dc.type: Presentation [en_US]
dc.type: Report [en_US]
dc.type: Software [en_US]
dc.description.notes: Additional directories necessary to run the code can be found on our team's server:
Host: 128.173.49.98
Port: 3306
Code Files:
poller.py - The poller script scrapes the most popular news stories from Reddit and stores the gathered information in the raw database.
article.py - This file contains the definition of the NewsArticle object, which holds information about a news article. Certain fields of the NewsArticle object are populated from the raw database.
articleCluster.py - This file contains the definition of the Cluster object. A Cluster object contains the information relevant to one cluster, and its fields are stored in the two cluster databases.
processNews.py - This file parses article content, clusters articles, and extracts seeds from article content.
driver.sh - A bash wrapper script that calls each Python script sequentially, every 12 hours.
.htaccess - This file manages all of the redirects for the entire website.
config.php - This file serves as the configuration file for the entire website.
global.php - This file makes config.php globally available so that all other files in the website can access the variables defined there, and it autoloads any objects defined in our model.
siteController.php - This is the sole controller for our website. It defines the actions needed to access and manipulate information in our database so that it can be displayed in our visualizations.
home.tpl - This is the template webpage file that the server displays when someone visits the homepage.
public/ - This directory contains all publicly accessible files used for our website.
GEDcode.zip - The zip file that contains all of the code for our project.
GEDreport.docx - The editable Microsoft Word document containing the content of our final report.
GEDreport.pdf - The PDF file containing the content of our final report.
GEDpresentation.pptx - The editable Microsoft PowerPoint file containing the content of our final presentation.
GEDpresentation.pdf - The PDF file containing the content of our final presentation.
Details:
enwiki_dbow - distributed bag of words
GoogleNews-vectors-negative300.bin - pre-trained news vectors
stanford-ner-2016-10-31 - Stanford Named Entity Recognizer (SNER)
w2v - word to vector [en_US]

