Big Data Text Summarization for the NeverAgain Movement

dc.contributor.author: Arora, Anuj (en)
dc.contributor.author: Miller, Chreston (en)
dc.contributor.author: Fan, Jixiang (en)
dc.contributor.author: Liu, Shuai (en)
dc.contributor.author: Han, Yi (en)
dc.date.accessioned: 2018-12-12T21:06:49Z (en)
dc.date.available: 2018-12-12T21:06:49Z (en)
dc.date.issued: 2018-12-10 (en)
dc.description.abstract: When you browse social media sites such as Twitter and Facebook, have you ever seen hashtags like #NeverAgain and #EnoughIsEnough? Do you know what they mean? Never Again is an American student-led political movement that advocates gun control to prevent gun violence. Gun control has long been debated in the United States. According to data from the Gun Violence Archive (http://www.shootingtracker.com/), the U.S. saw a total of 346 mass shootings in 2017. Supporters of gun control claim that the proliferation of firearms directly fuels social problems such as robbery, sexual crimes, and theft, while opponents believe that gun culture is an integral part of their freedom. For the Never Again gun control movement, we set out to generate a human-readable summary using deep learning methods, so that one can study incidents of gun violence that shocked the world, such as the 2017 Las Vegas shooting, and assess the impact of gun proliferation. Our project includes three steps: pre-processing, topic modeling, and abstractive summarization using deep learning. We began with a large collection of news articles associated with the #NeverAgain movement. The raw news articles needed to be pre-processed in multiple ways. An ArchiveSpark script was used to convert the WARC and CDX files to readable, parseable JSON. However, we found that at least forty percent of the data was noise. A series of restrictive word filters was applied to remove noise. After noise removal, we identified the most frequent words to get a preliminary idea of whether we were filtering noise properly. We used the Natural Language Toolkit's (NLTK) named entity chunker to generate named entities, which are phrases that form important nouns (people, places, organizations, etc.) in a sentence. For topic modeling, we classified sentences into different buckets, or topics, which identified distinct themes in the collection.
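The noise filtering and word-frequency check described above can be illustrated with a minimal sketch. It is not the project's actual code: the filter terms, the stop-word list, and the sample documents are hypothetical and chosen only for illustration, and the sketch uses only the Python standard library.

```python
import re
from collections import Counter

# Hypothetical filter terms and stop words (assumptions, not from the report).
NOISE_TERMS = {"cookie", "subscribe", "javascript"}
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "after", "as"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def is_noise(text):
    """Flag a document as noise if it contains any filter term."""
    return any(token in NOISE_TERMS for token in tokenize(text))


def most_frequent_words(documents, top_n=5):
    """Count non-stop-word tokens across all documents and return the top ones."""
    counts = Counter(
        token
        for doc in documents
        for token in tokenize(doc)
        if token not in STOP_WORDS
    )
    return counts.most_common(top_n)


# Hypothetical sample documents; the second one is web-page boilerplate.
documents = [
    "Students march for gun control after the shooting",
    "Please enable JavaScript and accept our cookie policy",
    "Gun control debate continues as students rally",
]

kept = [doc for doc in documents if not is_noise(doc)]
top_words = most_frequent_words(kept)
```

If the most frequent words after filtering are on-topic terms rather than web-page boilerplate, that is a rough sign that the filters are working as intended.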
During dictionary creation and document vectorization, we found that the Latent Dirichlet allocation (LDA) algorithm used for topic modeling does not take the normalized, tokenized word corpus directly. Each article in the collection had to be converted into a vector. We chose the Bag of Words (BOW) approach, a simplifying representation used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order, but keeping multiplicity. Topic modeling also requires choosing the number of topics in advance, which means one must guess how many topics are present in a collection. There is no foolproof way to replace human judgment in weaving keywords into topics with semantic meaning. To address this, we tried the coherence score approach. A coherence score attempts to mimic the human readability of a topic: the higher the coherence score, the more “coherent” the topics are considered. The last step of topic modeling is LDA itself, a generative statistical model that allows sets of observations to be explained by unobserved groups, which in turn explain why some parts of the data are similar. Unlike some other algorithms, LDA is probabilistic, which makes it better at handling documents that mix several topics. In addition, LDA identifies topics coherently, whereas the topics from other algorithms are more disjoint. After we had our topics (three in total), we filtered the article collection based on these topics. The result was three distinct collections of articles to which we could apply an abstractive summarization algorithm. To produce the summaries, we chose a Pointer-Generator Network (PGN), a deep learning approach designed to create abstractive summaries.
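The dictionary creation and bag-of-words vectorization described above can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the project's implementation: the function names `build_dictionary` and `doc_to_bow` and the sample tokens are hypothetical. Each document becomes a list of (token id, count) pairs, so word order is discarded while multiplicity is kept.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    dictionary = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in dictionary:
                dictionary[token] = len(dictionary)
    return dictionary


def doc_to_bow(tokens, dictionary):
    """Convert one tokenized document to a sorted list of (token_id, count) pairs."""
    counts = Counter(dictionary[t] for t in tokens if t in dictionary)
    return sorted(counts.items())


# Hypothetical, already normalized and tokenized documents.
tokenized_docs = [
    ["gun", "control", "gun", "law"],
    ["students", "march", "gun", "control"],
]

dictionary = build_dictionary(tokenized_docs)
corpus = [doc_to_bow(doc, dictionary) for doc in tokenized_docs]
```

In this sketch the repeated token "gun" in the first document is stored once with a count of 2, which is the multiplicity that the bag-of-words model preserves. A sparse (id, count) layout of this kind is a common input format for topic-modeling code, although the record does not state which library the project used.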
We created a summary for each identified topic and performed post-processing to connect the three related topics into a single summary that flowed. The result was a summary that reflected the main themes of the article collection and informed the reader of its contents in fewer than two pages. (en)
dc.description.notes: Description of files of this collection: - NeverAgain_report_in_PDF_format.pdf: The final report of the project in PDF format. - NeverAgain_Report_Latex_Material.zip: A zip file containing the source material of the LaTeX version of the final report. - NeverAgain_ ZIP_file_of_source_code.zip: A zip file containing all the source code of the project. - NeverAgain_presentation_in_powerpoint: The final presentation in Microsoft PowerPoint format. - NeverAgain_presentation_in_pdf: The final presentation in PDF format. (en)
dc.description.sponsorship: NSF IIS-1619028 (en)
dc.identifier.uri: http://hdl.handle.net/10919/86357 (en)
dc.language.iso: en_US (en)
dc.publisher: Virginia Tech (en)
dc.rights: Creative Commons Attribution-NonCommercial 3.0 United States (en)
dc.rights.uri: http://creativecommons.org/licenses/by-nc/3.0/us/ (en)
dc.subject: Text Summarization (en)
dc.subject: Abstractive summary (en)
dc.subject: Webpage Collection (en)
dc.subject: Natural language processing (en)
dc.subject: NLP (en)
dc.subject: Natural language generation (en)
dc.subject: NLG (en)
dc.subject: Deep learning (Machine learning) (en)
dc.subject: Information extraction (en)
dc.subject: Topic analysis (en)
dc.subject: Template (en)
dc.subject: Named entity recognition (en)
dc.subject: NER (en)
dc.title: Big Data Text Summarization for the NeverAgain Movement (en)
dc.type: Presentation (en)
dc.type: Report (en)
dc.type: Software (en)

Files

Original bundle
Name: ZIP_file_of_source_code.zip
Size: 56.19 KB

Name: NeverAgain_Report_Latex_Material.zip
Size: 21.46 MB

Name: NeverAgain_report_in_PDF_format.pdf
Size: 29.99 MB
Format: Adobe Portable Document Format

Name: NeverAgain_Final_Presentation_in_pdf.pdf
Size: 1.17 MB
Format: Adobe Portable Document Format

Name: NeverAgain_Final_Presentation_in_powerpoint.pptx
Size: 2.15 MB
Format: Microsoft Powerpoint XML

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission