Global Event Crawler and Seed Generator for GETAR

dc.contributor.author: Manchester, Emma [en]
dc.contributor.author: Srinivasan, Ravi [en]
dc.contributor.author: Crenshaw, Sean [en]
dc.contributor.author: Masterson, Alec [en]
dc.contributor.author: Grinnan, Harrison [en]
dc.date.accessioned: 2017-05-13T00:34:44Z [en]
dc.date.available: 2017-05-13T00:34:44Z [en]
dc.date.issued: 2017-04-28 [en]
dc.description.abstract: Global Event and Trend Archive Research (GETAR) is a research project at Virginia Tech, studying the years from 1997 to 2020, which seeks to investigate and catalog events as they happen in support of future research. It will devise interactive and integrated digital library and archive systems coupled with linked and expert-curated web page and tweet collections. This historical record enables research on trends as history develops and captures valuable primary sources that would otherwise not be archived. An important capability of this project is predicting which sources and stories will matter most in the future, so that those stories can be prioritized for archiving. That prioritization is where our project contributes most. In support of GETAR, this project will build a tool that scrapes the news to identify important global events. It will generate seeds that contain relevant information such as a link, the topic, person, organization, and source. The seeds can then be used by others working on GETAR to collect webpages and tweets using tools like the Event Focused Crawler and Twitter Search. To achieve this goal, the Global Event Detector (GED) will crawl Reddit to determine possible important news stories. These stories will be grouped, and the top groupings will be displayed on a website as well as on a display in Torgersen Hall. This project will serve future research for the GETAR project, as well as those seeking real-time updates on currently trending events. The final deliverables discussed in this report include code that scrapes Reddit and processes the data, and the webpage that visualizes the data. [en]
dc.description.notes: Additional directories necessary to run the code can be found on our team's server:
Host: 128.173.49.98
Port: 3306
Code Files:
poller.py - The poller script is responsible for scraping the most popular news stories off of Reddit and storing the information gathered in the raw database.
article.py - This file contains a definition for the NewsArticle object. The NewsArticle object holds information about a news article. Certain fields of the NewsArticle object are populated from the raw database.
articleCluster.py - This file contains a definition for a Cluster object. A Cluster object contains information relevant to each cluster, and its fields get stored in the two cluster databases.
processNews.py - This file parses article content, clusters articles, and extracts seeds from article content.
driver.sh - A bash wrapper script that calls each Python script sequentially, every 12 hours.
.htaccess - This is the file that manages all of the redirects for the entire website.
config.php - This file serves as the configuration file for the entire website.
global.php - For our website, the global.php file globally defines our config.php so that all of the other files in our website can access the variables defined there, and it autoloads any objects that we may have defined in our model.
siteController.php - This is the sole controller for our website. It defines the actions needed to access and manipulate information in our database so that it can be displayed in our visualizations.
home.tpl - This is the template webpage file that the server displays when someone visits the homepage.
public/ - This is a directory that contains all publicly accessible files used for our website.
GEDcode.zip - The zip file that contains all of the code for our project.
GEDreport.docx - The editable Microsoft Word document containing the content of our final report.
GEDreport.pdf - The PDF file containing the content of our final report.
GEDpresentation.pptx - The editable Microsoft PowerPoint file containing the content of our final presentation.
GEDpresentation.pdf - The PDF file containing the content of our final presentation.
Details:
enwiki_dbow -> distributed bag of words
GoogleNews-vectors-negative300.bin -> pre-trained news vectors
stanford-ner-2016-10-31 -> Stanford Named Entity Recognizer (SNER)
w2v -> word to vector [en]
dc.description.sponsorship: NSF Grant No. IIS-1619028 [en]
dc.identifier.uri: http://hdl.handle.net/10919/77620 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: GETAR [en]
dc.subject: CS 4624 [en]
dc.subject: Global Event Detector [en]
dc.subject: D3.js [en]
dc.subject: SNER [en]
dc.subject: NLTK [en]
dc.subject: Cluster [en]
dc.subject: News [en]
dc.subject: Reddit [en]
dc.subject: digital library [en]
dc.subject: webpage [en]
dc.subject: tweet [en]
dc.title: Global Event Crawler and Seed Generator for GETAR [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]
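
The pipeline summarized in the abstract and notes of this record (collect popular stories, group them, emit seeds from the top groups) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: the field names on NewsArticle and Cluster, the helper functions, and the example URLs are hypothetical, and a simple title word-overlap (Jaccard) grouping stands in for whatever clustering processNews.py actually performs. Person and organization fields are left empty because no named entity recognition is run here.

```python
from dataclasses import dataclass, field


@dataclass
class NewsArticle:
    # Hypothetical fields; the real article.py may define different ones.
    url: str
    title: str
    source: str
    score: int = 0  # stand-in for a popularity signal such as Reddit upvotes


@dataclass
class Cluster:
    # Hypothetical container; the real articleCluster.py may differ.
    articles: list = field(default_factory=list)

    def score(self):
        """Total popularity of all articles in the cluster."""
        return sum(a.score for a in self.articles)


def tokens(title):
    """Lowercased words longer than three characters, as a crude topic signature."""
    words = (w.strip(".,:;!?\"'()").lower() for w in title.split())
    return {w for w in words if len(w) > 3}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles, threshold=0.3):
    """Greedy grouping: join the first cluster whose first title is similar enough."""
    clusters = []
    for art in articles:
        for c in clusters:
            if jaccard(tokens(art.title), tokens(c.articles[0].title)) >= threshold:
                c.articles.append(art)
                break
        else:
            clusters.append(Cluster([art]))
    # Most popular clusters first, mirroring "top groupings" in the abstract.
    return sorted(clusters, key=Cluster.score, reverse=True)


def make_seed(cluster):
    """Build a seed dict from the highest-scoring article in a cluster."""
    top = max(cluster.articles, key=lambda a: a.score)
    return {
        "link": top.url,
        "topic": top.title,
        "source": top.source,
        "person": [],        # would come from named entity recognition
        "organization": [],  # would come from named entity recognition
    }


sample = [
    NewsArticle("http://example.com/a", "Earthquake strikes coastal region", "example.com", 120),
    NewsArticle("http://example.org/b", "Strong earthquake strikes coastal towns", "example.org", 300),
    NewsArticle("http://example.net/c", "Central bank raises interest rates", "example.net", 50),
]
seeds = [make_seed(c) for c in cluster_articles(sample)]
```

With the sample data, the two earthquake stories share enough title words to land in one group, and that group's seed links to the higher-scoring article. The threshold of 0.3 is an arbitrary illustrative value.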

Files

Original bundle
Name:
GEDcode.zip
Size:
1.13 MB
Format:
Name:
GEDpresentation.pptx
Size:
8.42 MB
Format:
Microsoft Powerpoint XML
Name:
GEDpresentation.pdf
Size:
674.55 KB
Format:
Adobe Portable Document Format
Name:
GEDreport.docx
Size:
2.53 MB
Format:
Microsoft Word XML
Name:
GEDreport.pdf
Size:
3.68 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission
Description: