Global Event Crawler and Seed Generator for GETAR
dc.contributor.author | Manchester, Emma | en |
dc.contributor.author | Srinivasan, Ravi | en |
dc.contributor.author | Crenshaw, Sean | en |
dc.contributor.author | Masterson, Alec | en |
dc.contributor.author | Grinnan, Harrison | en |
dc.date.accessioned | 2017-05-13T00:34:44Z | en |
dc.date.available | 2017-05-13T00:34:44Z | en |
dc.date.issued | 2017-04-28 | en |
dc.description.abstract | Global Event and Trend Archive Research (GETAR) is a research project at Virginia Tech, studying the years from 1997 to 2020, which seeks to investigate and catalog events as they happen in support of future research. It will devise interactive and integrated digital library and archive systems coupled with linked and expert-curated web page and tweet collections. This historical record enables research on trends as history develops and captures valuable primary sources that would otherwise not be archived. An important capability of this project is the ability to predict which sources and stories will be most important in the future in order to prioritize those stories for archiving. It is in that space that our project will be most important. In support of GETAR, this project will build a powerful tool to scrape the news to identify important global events. It will generate seeds that contain relevant information such as a link, the topic, person, organization, source, etc. The seeds can then be used by others working on GETAR to collect webpages and tweets using tools like the Event Focused Crawler and Twitter Search. To achieve this goal, the Global Event Detector (GED) will crawl Reddit to determine possible important news stories. These stories will be grouped, and the top groupings will be displayed on a website as well as on a display in Torgersen Hall. This project will serve future research for the GETAR project, as well as those seeking real-time updates on events currently trending. The final deliverables discussed in this report include code that scrapes Reddit and processes the data, and the webpage that visualizes the data. | en |
dc.description.notes | Additional directories necessary to run the code that can be found on our team's server: Host: 128.173.49.98 Port: 3306 Code Files: poller.py - The poller script is responsible for scraping the most popular news stories off of Reddit and storing the information gathered in the raw database. article.py - This file contains a definition for the NewsArticle object. The NewsArticle object holds information about a news article. Certain fields of the NewsArticle object are populated from the raw database. articleCluster.py - This file contains a definition for a Cluster object. A Cluster object contains information relevant to each cluster, and its fields are stored in the two cluster databases. processNews.py - This file parses article content, clusters articles, and extracts seeds from article content. driver.sh - A bash wrapper script that calls each Python script sequentially, every 12 hours. .htaccess - This is the file that manages all of the redirects for the entire website. config.php - This file serves as the configuration file for the entire website. global.php - This file makes config.php available globally so that all of the other files in our website can access the variables defined there, and it autoloads any objects that we may have defined in our model. siteController.php - This is the sole controller for our website; it defines the actions that we need to access and manipulate information in our database so that it can be displayed in our visualizations. home.tpl - This is the template webpage file that the server displays when someone visits the homepage. public/ - This is a directory that contains all publicly accessible files used for our website. GEDcode.zip - The zip file that contains all of the code for our project. GEDreport.docx - The editable Microsoft Word document containing the content of our final report. GEDreport.pdf - The PDF file containing the content of our final report. 
GEDpresentation.pptx - The editable Microsoft PowerPoint file containing the content of our final presentation. GEDpresentation.pdf - The PDF file containing the content of our final presentation. Details: enwiki_dbow —> distributed bag of words GoogleNews-vectors-negative300.bin —> pre-trained news vectors stanford-ner-2016-10-31 —> Stanford Named Entity Recognizer (SNER) w2v —> word to vector | en |
dc.description.sponsorship | NSF Grant No. IIS-1619028 | en |
dc.identifier.uri | http://hdl.handle.net/10919/77620 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | GETAR | en |
dc.subject | CS 4624 | en |
dc.subject | Global Event Detector | en |
dc.subject | D3.js | en |
dc.subject | SNER | en |
dc.subject | NLTK | en |
dc.subject | Cluster | en |
dc.subject | News | en |
dc.subject | digital library | en |
dc.subject | webpage | en |
dc.subject | tweet | en |
dc.title | Global Event Crawler and Seed Generator for GETAR | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Software | en |
Files
Original bundle
1 - 5 of 5
License bundle
1 - 1 of 1
- Name: license.txt
- Size: 1.5 KB
- Format: Item-specific license agreed upon to submission