A Cost-Effective Semi-Automated Approach for Comprehensive Event Extraction

Saraf, Parang

dc.contributor.author	Saraf, Parang	en
dc.contributor.committeechair	Ramakrishnan, Naren	en
dc.contributor.committeemember	House, Leanna L.	en
dc.contributor.committeemember	Corley, Courtney	en
dc.contributor.committeemember	North, Christopher L.	en
dc.contributor.committeemember	Lu, Chang-Tien	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2018-04-27T08:00:25Z	en
dc.date.available	2018-04-27T08:00:25Z	en
dc.date.issued	2018-04-26	en
dc.description.abstract	Automated event extraction from free text remains an open problem, particularly when the goal is to identify all relevant events. Manual extraction is currently the only alternative for comprehensive and reliable extraction. What is needed, therefore, is a system that can comprehensively extract events reported in news articles (high recall) and is also scalable enough to handle a large number of articles. In this dissertation, we explore various methods to develop an event extraction system that can mitigate these challenges. We primarily investigate three major problems related to event extraction. (i) What are the strengths and weaknesses of automated event extractors? A thorough understanding of what can be automated with high success and what leads to common pitfalls is crucial before we can develop a superior event extraction system. (ii) How can we build a hybrid event extraction system that bridges the gap between manual and automated event extraction? Hybrid extraction is a semi-automated approach that uses an ecosystem of machine learning models along with a carefully designed user interface for extracting events. Since this method is semi-automated, it also requires a meticulous understanding of user behavior in order to identify tasks that humans can perform with ease while diverting the more tedious tasks to the machine learning methods. (iii) Finally, we explore methods for displaying extracted events that could simplify the analytical and inference generation processes for an analyst. We particularly aim to develop visualizations that allow analysts to perform macro- and micro-level analysis of significant societal events.	en
dc.description.abstractgeneral	News articles provide information about who did what to whom, when, where, and why. Extracting this structured information from news articles can allow scientific evaluation of widely believed information. However, curating these databases of structured information is not a trivial task. Currently there are two main approaches: manual and automated. Manual curation is not scalable due to labor costs: adding more humans to perform analysis is prohibitively expensive and time consuming. The alternative approach is ‘Automated Extraction’, wherein machine learning algorithms extract events on their own without any human assistance. Even though this approach can easily scale to work with a large number of articles, it frequently misclassifies events. In this dissertation, we present EMBERS AutoGSR, a framework for comprehensively extracting ‘protest’ events reported in news articles using Hybrid Event Extraction. In the hybrid approach, we use an ecosystem of Filtering, Ranking, and Recommendation models to determine if an article is reporting a protest and, if so, proceed to identify and encode specific characteristics of the event, such as who protested, when, where, and why. These extracted events are then displayed on an interactive web-based interface that allows manual validation. This manual validation, in turn, helps the automated event extractors learn and evolve from user feedback and error correction. The interface is carefully designed to minimize the manual effort required for user validation, thereby making it feasible and viable to work with a large number of articles. EMBERS AutoGSR operated 24x7 for a year from October 2015 through September 2016, during which it extracted protest events from news articles that were collected from 19 countries across 8 languages. These extracted events were validated by 12 subject matter experts. The system was evaluated by an independent third party, MITRE Corporation, which compared EMBERS AutoGSR events with events that were manually extracted by its team of political scientists. AutoGSR achieved a recall of 0.82 out of 1 and reduced the manual effort required for event extraction by 72%, thereby making the system extremely reliable and scalable.	en
dc.description.degree	Ph. D.	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:15211	en
dc.identifier.uri	http://hdl.handle.net/10919/82926	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Event Extraction	en
dc.subject	Visual Analytics	en
dc.subject	News Analytics	en
dc.subject	Civil Unrest	en
dc.title	A Cost-Effective Semi-Automated Approach for Comprehensive Event Extraction	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Ph. D.	en

Files

Original bundle

Name: Saraf_P_D_2018.pdf
Size: 34.81 MB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations