Hybrid Summarization of Dakota Access Pipeline Protests (NoDAPL)

Chen, Xiaoyu; Wang, Haitao; Mehrotra, Maanav; Chhikara, Naman; Sun, Di

dc.contributor.author	Chen, Xiaoyu	en
dc.contributor.author	Wang, Haitao	en
dc.contributor.author	Mehrotra, Maanav	en
dc.contributor.author	Chhikara, Naman	en
dc.contributor.author	Sun, Di	en
dc.date.accessioned	2018-12-14T16:28:48Z	en
dc.date.available	2018-12-14T16:28:48Z	en
dc.date.issued	2018-12-14	en
dc.description.abstract	The Dakota Access Pipeline protests (known by the hashtag #NoDAPL) are grassroots movements that began in April 2016 in reaction to the approved construction of Energy Transfer Partners’ Dakota Access Pipeline in the northern United States. The NoDAPL movements have produced many Facebook messages, tweets, blogs, and news articles, which reflect different aspects of the NoDAPL events. The volume of related information keeps growing rapidly, which makes it difficult to understand the events efficiently. It is therefore valuable to generate short summaries automatically, or at least semi-automatically, from the big data available online. Motivated by this need, the objective of this project is to propose a novel automatic summarization approach that efficiently and effectively summarizes the topics hidden in large online text collections. Although automatic summarization has been investigated for roughly 60 years since the publication of Luhn’s seminal 1958 paper, several challenges remain in summarizing large online text sets, such as a large proportion of noisy texts, highly redundant information, and multiple latent topics. We therefore propose an automatic framework that requires minimal human effort to summarize large online text sets (~11,000 documents on NoDAPL) according to latent topics, with irrelevant information removed. This framework provides a hybrid model that combines the advantages of latent Dirichlet allocation (LDA) based extractive methods and deep-learning based abstractive methods. Unlike semi-automatic summarization approaches such as template-based summarization, the proposed method does not require practitioners to have a deep understanding of the events in order to create a template or to fill it in using regular expressions. The only human effort needed during the procedure is to manually label a few (say, 100) documents as relevant or irrelevant.
We evaluate the quality of the automatically generated summary with both extrinsic and intrinsic measurements. In the extrinsic subjective evaluation, we design a set of guideline questions and conduct a task-based measurement. Results show that 91.3% of sentences are within the scope of the guideline, and 69.6% of the outlined questions can be answered by reading the generated summary. The intrinsic ROUGE measurements show that entity coverage totals 2.6% and that the ROUGE-L and ROUGE-SU4 scores are 0.148 and 0.065, respectively. Overall, the proposed hybrid model achieves decent performance on summarizing NoDAPL events. Future work includes testing the approach on more textual datasets on interesting topics, and investigating a topic modeling-supervised classification approach to minimize human effort in automatic summarization. We would also like to investigate a deep learning-based recommender system for better sentence re-ranking.	en
dc.description.notes	This submission includes: 1) Final_Report_NoDAPL_Team8_Submitted.docx: final report in .docx format; 2) Final_Report_NoDAPL_Team8_Submitted.pdf: final report in .pdf format; 3) Final_Presentation_NoDAPL_Team8_Submitted.pptx: final presentation in .pptx format; 4) Final_Presentation_NoDAPL_Team8_Submitted.pdf: final presentation in .pdf format; 5) Source_Code_Hybrid_Summarization_of_NoDAPL_CS4984CS5984_2018.zip: final collection of source code in .zip format with README.md	en
dc.description.sponsorship	National Science Foundation	en
dc.description.sponsorship	NSF: IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/86401	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/us/	en
dc.subject	Text Summarization	en
dc.subject	Hybrid Model	en
dc.subject	NoDAPL	en
dc.subject	Deep learning (Machine learning)	en
dc.subject	Natural Language Processing	en
dc.title	Hybrid Summarization of Dakota Access Pipeline Protests (NoDAPL)	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en