Big Data Text Summarization for the NeverAgain Movement

Arora, Anuj; Miller, Chreston; Fan, Jixiang; Liu, Shuai; Han, Yi

dc.contributor.author	Arora, Anuj	en
dc.contributor.author	Miller, Chreston	en
dc.contributor.author	Fan, Jixiang	en
dc.contributor.author	Liu, Shuai	en
dc.contributor.author	Han, Yi	en
dc.date.accessioned	2018-12-12T21:06:49Z	en
dc.date.available	2018-12-12T21:06:49Z	en
dc.date.issued	2018-12-10	en
dc.description.abstract	When browsing social media websites such as Twitter and Facebook, have you ever seen hashtags like #NeverAgain and #EnoughIsEnough? Do you know what they mean? Never Again is an American student-led political movement that advocates gun control to prevent gun violence. In the United States, gun control has long been debated. According to data from the Gun Violence Archive (http://www.shootingtracker.com/), the U.S. saw a total of 346 mass shootings in 2017. Supporters of gun control claim that the proliferation of firearms is the direct spark for a series of social unrest factors such as robbery, sexual crimes, and theft, while others believe that gun culture is an integral part of their freedom. For the Never Again gun control movement, we set out to generate a human-readable summary using deep learning methods, so that one can study incidents of gun violence that shocked the world, such as the 2017 Las Vegas shooting, and assess the impact of gun proliferation. Our project includes three steps: pre-processing, topic modeling, and abstractive summarization using deep learning. We began with a large collection of news articles associated with the #NeverAgain movement. The raw news articles needed to be pre-processed in multiple ways. An ArchiveSpark script was used to convert the WARC and CDX files to readable, parseable JSON. However, we found that at least forty percent of the data was noise, so we applied a series of restrictive word filters to remove it. After noise removal, we identified the most frequent words to get a preliminary idea of whether we were filtering noise properly. We used the Natural Language Toolkit’s (NLTK) Named Entity chunker to extract named entities, which are phrases that name important nouns (people, places, organizations, etc.) in a sentence. For topic modeling, we classified sentences into different buckets, or topics, which identified distinct themes in the collection.
Dictionary creation and document vectorization were necessary because the Latent Dirichlet Allocation (LDA) algorithm used for topic modeling does not take the normalized and tokenized word corpus directly; each article in the collection had to be converted into a vector. We chose the Bag of Words (BOW) approach, a simplifying representation used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Topic modeling also requires choosing the number of topics in advance, which means one must guess how many topics are present in a collection. There is no foolproof way of replacing human logic to weave keywords into topics with semantic meaning. To address this, we tried the coherence score approach. The coherence score is an attempt to mimic the human readability of a topic: the higher the coherence score, the more "coherent" the topics are considered. The last step of topic modeling is LDA itself, a generative statistical model that allows sets of observations to be explained by unobserved groups, which in turn explain why some parts of the data are similar. Unlike some other algorithms, LDA is probabilistic, which makes it better at handling mixtures of topics within documents. In addition, LDA identifies topics coherently, whereas the topics from other algorithms are more disjoint. After we had our topics (three in total), we filtered the article collection based on them. The result was three distinct collections of articles, to each of which we could apply an abstractive summarization algorithm to produce a coherent summary. To produce these summaries, we chose a Pointer-Generator Network (PGN), a deep learning approach designed to create abstractive summaries.
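As an illustration of the Bag of Words step described above, the following is a minimal sketch in plain Python and is not the project's actual code. The helper names (tokenize, build_dictionary, to_bow) and the two sample sentences are hypothetical, and the simple regular-expression tokenizer stands in for the normalization and tokenization the report performs. It builds a word-to-id dictionary and converts each text into a list of (word id, count) pairs.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def build_dictionary(token_lists):
    """Assign each distinct word an integer id, in order of first appearance."""
    word_to_id = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in word_to_id:
                word_to_id[token] = len(word_to_id)
    return word_to_id


def to_bow(tokens, word_to_id):
    """Return sorted (word_id, count) pairs: word order is dropped, multiplicity is kept."""
    counts = Counter(word_to_id[t] for t in tokens if t in word_to_id)
    return sorted(counts.items())


# Hypothetical sample texts, not taken from the article collection.
articles = [
    "Students marched for gun control",
    "Gun control debate: students and lawmakers debate gun laws",
]

token_lists = [tokenize(article) for article in articles]
word_to_id = build_dictionary(token_lists)
bow_vectors = [to_bow(tokens, word_to_id) for tokens in token_lists]

for article, vector in zip(articles, bow_vectors):
    print(article, "->", vector)
```

Because only counts are stored, any reordering of the same words yields the same vector, which is the property the Bag of Words model trades grammar and word order for.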
We created a summary for each identified topic and then performed post-processing to connect the three related topic summaries into a single summary that flowed. The result reflected the main themes of the article collection and informed the reader of its contents in fewer than two pages.	en
dc.description.notes	Description of files of this collection: - NeverAgain_report_in_PDF_format.pdf: The final report of the project in PDF format. - NeverAgain_Report_Latex_Material.zip: A zip file containing the source material of the LaTeX version of the final report. - NeverAgain_ ZIP_file_of_source_code.zip: A zip file containing all the source code of the project. - NeverAgain_presentation_in_powerpoint: The final presentation in Microsoft PowerPoint format. - NeverAgain_presentation_in_pdf: The final presentation in PDF format.	en
dc.description.sponsorship	NSF IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/86357	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution-NonCommercial 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/us/	en
dc.subject	Text Summarization	en
dc.subject	Abstractive summary	en
dc.subject	Webpage Collection	en
dc.subject	Natural language processing	en
dc.subject	NLP	en
dc.subject	Natural language generation	en
dc.subject	NLG	en
dc.subject	Deep learning (Machine learning)	en
dc.subject	Information extraction	en
dc.subject	Topic analysis	en
dc.subject	Template	en
dc.subject	Named entity recognition	en
dc.subject	NER	en
dc.title	Big Data Text Summarization for the NeverAgain Movement	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en