dc.contributor.author: Hamilton, Leah
dc.contributor.author: Robb, Esther
dc.contributor.author: Fitzpatrick, April
dc.contributor.author: Goel, Akshay
dc.contributor.author: Nandigam, Ramya
dc.description.abstract: Summarization is often a time-consuming task for humans. Automated methods can summarize a larger volume of source material in a shorter amount of time, but creating a good summary with these methods remains challenging. This submission contains all work related to a semester-long project in CS 4984/5984 to generate the best possible summary of a collection of 10,829 web pages about the Facebook-Cambridge Analytica data breach, with some early prototyping done on 500 web pages about the 2017 Solar Eclipse. A final report, a final presentation, and several archives of code, input data, and results are included. The work implements basic natural language processing techniques such as word frequency, lemmatization, and part-of-speech tagging, working up to a complete human-readable summary at the end of the course. Extractive, abstractive, and combination methods were used to generate the final summaries, all of which are included and the results compared. The summary subjectively evaluated as best was a purely extractive summary built from concatenating summaries of document categories. This method was coherent and thorough, but involved manual tuning to select categories and still had some redundancy. All attempted methods are described and the less successful summaries are also included. This report presents a framework for how to summarize complex document collections with multiple relevant topics. The summary itself identifies information which was most covered about the Facebook-Cambridge Analytica data breach and is a reasonable introduction to the topic. [en_US]
dc.description.sponsorship: Global Event and Trend Archive Research (GETAR) project [en_US]
dc.description.sponsorship: NSF: IIS-1619028 [en_US]
dc.publisher: Virginia Tech [en_US]
dc.rights: Attribution-NonCommercial 3.0 United States [*]
dc.subject: natural language processing [en_US]
dc.subject: deep learning [en_US]
dc.subject: computer science [en_US]
dc.subject: news articles [en_US]
dc.subject: data breach [en_US]
dc.subject: solar eclipse [en_US]
dc.subject: Cambridge Analytica [en_US]
dc.subject: abstractive summarization [en_US]
dc.subject: extractive summarization [en_US]
dc.title: Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse [en_US]
dc.description.notes: Files Included:
FacebookBreachSummarization_FinalReport.pdf - The final report covering the approaches, results, and lessons learned as a part of this project. Includes the resulting summaries as appendices, the results from earlier natural language approaches as tables and figures, and a breakdown of the file structures of the included code archives.
FacebookBreachSummarization_FinalReport.docx - An editable version of the final report. May not display correctly on all systems.
FacebookBreachSummarization_FinalPresentation.pdf - The slides used to give the final presentation for CS 4984/5984. An overview of important results and takeaways, intended to be presented in 10 minutes.
FacebookBreachSummarization_FinalPresentation.pptx - An editable version of the final presentation. May not display correctly on all systems.
- The majority of the code used to obtain the results covered in the report and presentation, along with input data files and results, where the results are saved to a file instead of written to the console. If looking to replicate results, start here. See the report for details on the file structure.
- A clone of the original fast_abs_rl repository, modified to work with our corpus. One method of generating abstractive summaries that was tested in this project.
- A directory containing cleaned article text from the Facebook corpus, saved as .story files for use with the abstractive summarizers.
- A zipped archive of the packages needed to run the PySpark code included in this project which are not a part of base Python. Provided for ease of running the included PySpark code on Python 2.7.
- A copy of the pretrained pointer-generator network model, which is also available through a Google Drive link. Works with TensorFlow 1.2.1. Can be used to rapidly generate single-document abstractive summaries without having to train a new model. [en_US]
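
The abstract describes the best-rated summary as purely extractive, built by concatenating summaries of document categories. The sketch below illustrates that general idea only; it is not the project's code. The sentence scoring (average word frequency), the stopword list, and the names `summarize` and `summarize_by_category` are all assumptions made for illustration, and the categories are supplied by hand, in line with the manual category selection the abstract mentions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "with", "is", "was", "for", "on"}


def split_sentences(text):
    """Split text into sentences at whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order.

    A sentence's score is the mean corpus frequency of its words.
    """
    sentences = split_sentences(text)
    freqs = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(ranked))


def summarize_by_category(categories, max_sentences=1):
    """Summarize each category's documents, then concatenate the results.

    `categories` maps a hand-chosen category name to a list of document texts.
    """
    return " ".join(summarize(" ".join(docs), max_sentences)
                    for docs in categories.values())
```

Calling `summarize_by_category({"timeline": [...], "response": [...]})` with lists of cleaned article texts returns one string with the category summaries in dictionary order. Summarizing each category separately keeps every chosen topic represented, though sentences that overlap across categories can still introduce redundancy.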

License: Attribution-NonCommercial 3.0 United States