dc.contributor.author: Hamilton, Leah [en]
dc.contributor.author: Robb, Esther [en]
dc.contributor.author: Fitzpatrick, April [en]
dc.contributor.author: Goel, Akshay [en]
dc.contributor.author: Nandigam, Ramya [en]
dc.description.abstract: Summarization is often a time-consuming task for humans. Automated methods can summarize a larger volume of source material in less time, but creating a good summary with these methods remains challenging. This submission contains all work related to a semester-long project in CS 4984/5984 to generate the best possible summary of a collection of 10,829 web pages about the Facebook-Cambridge Analytica data breach, with some early prototyping done on 500 web pages about the 2017 Solar Eclipse. A final report, a final presentation, and several archives of code, input data, and results are included. The work begins with basic natural language processing techniques such as word frequency, lemmatization, and part-of-speech tagging, and works up to a complete human-readable summary at the end of the course. Extractive, abstractive, and combination methods were used to generate the final summaries, all of which are included, and the results are compared. The summary subjectively evaluated as best was a purely extractive summary built by concatenating summaries of document categories. This summary was coherent and thorough, but the method involved manual tuning to select categories, and the result still had some redundancy. All attempted methods are described, and the less successful summaries are also included. This report presents a framework for summarizing complex document collections with multiple relevant topics. The summary itself identifies the information that was most covered about the Facebook-Cambridge Analytica data breach and is a reasonable introduction to the topic. [en]
dc.description.sponsorship: Global Event and Trend Archive Research (GETAR) project [en]
dc.description.sponsorship: NSF: IIS-1619028 [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons Attribution-NonCommercial 3.0 United States [en]
dc.subject: natural language processing [en]
dc.subject: deep learning [en]
dc.subject: computer science [en]
dc.subject: news articles [en]
dc.subject: data breach [en]
dc.subject: solar eclipse [en]
dc.subject: Cambridge Analytica [en]
dc.subject: abstractive summarization [en]
dc.subject: extractive summarization [en]
dc.title: Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse [en]
dc.description.notes: Files Included:
FacebookBreachSummarization_FinalReport.pdf - The final report covering the approaches, results, and lessons learned as a part of this project. Includes the resulting summaries as appendices, the results from earlier natural language approaches as tables and figures, and a breakdown of the file structures of included code archives.
FacebookBreachSummarization_FinalReport.docx - An editable version of the final report. May not display correctly on all systems.
FacebookBreachSummarization_FinalPresentation.pdf - The slides used to give the final presentation for CS 4984/5984. An overview of important results and takeaways that was intended to be presented in 10 minutes.
FacebookBreachSummarization_FinalPresentation.pptx - An editable version of the final presentation. May not display correctly on all systems.
- The majority of the code used to obtain the results covered in the report and presentation, along with input data files and results, where the results are saved to a file instead of written to the console. If looking to replicate results, start here. See the report for details on the file structure.
- A clone of the original fast_abs_rl repository, modified to work with our corpus. One method of generating abstractive summaries that was tested in this project.
- A directory containing cleaned article text from the Facebook corpus saved as .story files for use with the abstractive summarizers.
- A zipped archive of the packages needed to run the PySpark code included in this project which are not a part of base Python. For ease of running the included PySpark code on Python 2.7.
- A copy of the pretrained pointer-generator network model, which is also available through a Google Drive link. Works with TensorFlow 1.2.1. Can be used to rapidly generate single-document abstractive summaries without having to train a new model. [en]

License: Creative Commons Attribution-NonCommercial 3.0 United States