Global Event Crawler and Seed Generator for GETAR

Date

2017-04-28

Publisher

Virginia Tech

Abstract

Global Event and Trend Archive Research (GETAR) is a research project at Virginia Tech that covers the years 1997 to 2020 and seeks to investigate and catalog events as they happen, in support of future research. It will devise interactive, integrated digital library and archive systems coupled with linked, expert-curated web page and tweet collections. This historical record enables research on trends as history develops and captures valuable primary sources that would otherwise not be archived. An important capability of GETAR is predicting which sources and stories will matter most in the future, so that those stories can be prioritized for archiving. Our project contributes to that capability.

In support of GETAR, this project will build a tool that scrapes the news to identify important global events. It will generate seeds containing relevant information such as a link, topic, person, organization, and source. Others working on GETAR can then use the seeds to collect webpages and tweets with tools like the Event Focused Crawler and Twitter Search. To achieve this goal, the Global Event Detector (GED) will crawl Reddit to find potentially important news stories. These stories will be grouped, and the top groupings will be shown on a website as well as on a display in Torgersen Hall.
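To make the idea of a seed and of story grouping concrete, the following is a minimal sketch, not the project's actual implementation. The `Seed` fields mirror the information listed above; the field names, the stop-word list, the Jaccard word-overlap measure, and the 0.3 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Seed:
    """Hypothetical seed record; field names are illustrative assumptions."""
    link: str
    topic: str          # here, simply the story headline
    source: str
    people: list = field(default_factory=list)
    organizations: list = field(default_factory=list)


# Small illustrative stop-word list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "for", "as"}


def tokens(title):
    """Lowercase a headline, strip punctuation, and drop stop words."""
    words = (w.strip(".,:;!?'\"()").lower() for w in title.split())
    return {w for w in words if w and w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_seeds(seeds, threshold=0.3):
    """Greedily place each seed in the first group whose first member
    has a similar headline; otherwise start a new group.
    Returns the groups sorted from largest to smallest."""
    groups = []
    for seed in seeds:
        seed_tokens = tokens(seed.topic)
        for group in groups:
            if jaccard(seed_tokens, tokens(group[0].topic)) >= threshold:
                group.append(seed)
                break
        else:
            groups.append([seed])
    return sorted(groups, key=len, reverse=True)
```

Under these assumptions, two headlines about the same earthquake would share most of their words and land in one group, while an unrelated headline would start its own; the largest groups stand in for the "top groupings" shown on the website.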

This project will serve future research for the GETAR project, as well as those seeking real-time updates on currently trending events.

The final deliverables discussed in this report include the code that scrapes Reddit and processes the data, and the webpage that visualizes the data.

Keywords

GETAR, CS 4624, Global Event Detector, D3.js, SNER, NLTK, Cluster, News, Reddit, digital library, webpage, tweet
