Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse

Hamilton, Leah; Robb, Esther; Fitzpatrick, April; Goel, Akshay; Nandigam, Ramya


dc.contributor.author	Hamilton, Leah	en
dc.contributor.author	Robb, Esther	en
dc.contributor.author	Fitzpatrick, April	en
dc.contributor.author	Goel, Akshay	en
dc.contributor.author	Nandigam, Ramya	en
dc.date.accessioned	2018-12-14T15:26:38Z	en
dc.date.available	2018-12-14T15:26:38Z	en
dc.date.issued	2018-12-13	en
dc.description.abstract	Summarization is often a time-consuming task for humans. Automated methods can summarize a larger volume of source material in a shorter amount of time, but creating a good summary with these methods remains challenging. This submission contains all work related to a semester-long project in CS 4984/5984 to generate the best possible summary of a collection of 10,829 web pages about the Facebook-Cambridge Analytica data breach, with some early prototyping done on 500 web pages about the 2017 Solar Eclipse. A final report, a final presentation, and several archives of code, input data, and results are included. The work implements basic natural language processing techniques such as word frequency, lemmatization, and part-of-speech tagging, working up to a complete human-readable summary at the end of the course. Extractive, abstractive, and combination methods were used to generate the final summaries, all of which are included, and their results are compared. The summary subjectively evaluated as best was a purely extractive summary built by concatenating summaries of document categories. This method was coherent and thorough, but involved manual tuning to select categories and still had some redundancy. All attempted methods are described and the less successful summaries are also included. This report presents a framework for how to summarize complex document collections with multiple relevant topics. The summary itself identifies the information that was most widely covered about the Facebook-Cambridge Analytica data breach and is a reasonable introduction to the topic.	en
dc.description.notes	Files Included: FacebookBreachSummarization_FinalReport.pdf - The final report covering the approaches, results, and lessons learned as a part of this project. Includes the resulting summaries as appendices, the results from earlier natural language approaches as tables and figures, and a breakdown of the file structures of included code archives. FacebookBreachSummarization_FinalReport.docx - An editable version of the final report. May not display correctly on all systems. FacebookBreachSummarization_FinalPresentation.pdf - The slides used to give the final presentation for CS 4984/5984. An overview of important results and takeaways that was intended to be presented in 10 minutes. FacebookBreachSummarization_FinalPresentation.pptx - An editable version of the final presentation. May not display correctly on all systems. FacebookBreachSummarization_CodeAndResults.zip - The majority of the code used to obtain the results covered in the report and presentation, along with input datafiles and results where the results are saved to a file instead of written to the console. If looking to replicate results, start here. See the report for details on the file structure. FacebookBreachSummarization_FastAbsRLFork.zip - A clone of the original fast_abs_rl repository (https://github.com/ChenRocks/fast_abs_rl), modified to work with our corpus. One method of generating abstractive summaries that was tested in this project. FacebookBreachSummarization_FilesForAbsSum.zip - A directory containing cleaned article text from the Facebook corpus saved as .story files for use with the abstractive summarizers. FacebookBreachSummarization_PySparkExtPkgs.zip - A zipped archive of the packages needed to run the PySpark code included in this project which are not a part of base Python. For ease of running the included PySpark code on Python 2.7.
FacebookBreackSummarization_PretrainedPGN.zip - A copy of the pretrained pointer-generator network model which is also available through a Google Drive link on https://github.com/abisee/pointer-generator. Works with TensorFlow 1.2.1. Can be used to rapidly generate single-document abstractive summaries without having to train a new model.	en
dc.description.sponsorship	Global Event and Trend Archive Research (GETAR) project	en
dc.description.sponsorship	NSF: IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/86395	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution-NonCommercial 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/us/	en
dc.subject	natural language processing	en
dc.subject	summarization	en
dc.subject	deep learning	en
dc.subject	computer science	en
dc.subject	news articles	en
dc.subject	data breach	en
dc.subject	Facebook	en
dc.subject	solar eclipse	en
dc.subject	Cambridge Analytica	en
dc.subject	abstractive summarization	en
dc.subject	extractive summarization	en
dc.title	Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse	en
dc.type	Dataset	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en