Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse
dc.contributor.author | Hamilton, Leah | en |
dc.contributor.author | Robb, Esther | en |
dc.contributor.author | Fitzpatrick, April | en |
dc.contributor.author | Goel, Akshay | en |
dc.contributor.author | Nandigam, Ramya | en |
dc.date.accessioned | 2018-12-14T15:26:38Z | en |
dc.date.available | 2018-12-14T15:26:38Z | en |
dc.date.issued | 2018-12-13 | en |
dc.description.abstract | Summarization is often a time-consuming task for humans. Automated methods can summarize a larger volume of source material in a shorter amount of time, but creating a good summary with these methods remains challenging. This submission contains all work related to a semester-long project in CS 4984/5984 to generate the best possible summary of a collection of 10,829 web pages about the Facebook-Cambridge Analytica data breach, with some early prototyping done on 500 web pages about the 2017 Solar Eclipse. A final report, a final presentation, and several archives of code, input data, and results are included. The work implements basic natural language processing techniques such as word frequency, lemmatization, and part-of-speech tagging, and works up to a complete human-readable summary at the end of the course. Extractive, abstractive, and combination methods were used to generate the final summaries; all of these summaries are included and their results compared. The summary subjectively evaluated as best was a purely extractive summary built by concatenating summaries of document categories. This summary was coherent and thorough, but the method involved manual tuning to select categories, and the result still had some redundancy. All attempted methods are described, and the less successful summaries are also included. This report presents a framework for summarizing complex document collections with multiple relevant topics. The summary itself identifies the information that was most heavily covered about the Facebook-Cambridge Analytica data breach and is a reasonable introduction to the topic. | en |
dc.description.notes | Files Included:
FacebookBreachSummarization_FinalReport.pdf - The final report covering the approaches, results, and lessons learned as a part of this project. Includes the resulting summaries as appendices, the results from earlier natural language approaches as tables and figures, and a breakdown of the file structures of included code archives.
FacebookBreachSummarization_FinalReport.docx - An editable version of the final report. May not display correctly on all systems.
FacebookBreachSummarization_FinalPresentation.pdf - The slides used to give the final presentation for CS 4984/5984. An overview of important results and takeaways that was intended to be presented in 10 minutes.
FacebookBreachSummarization_FinalPresentation.pptx - An editable version of the final presentation. May not display correctly on all systems.
FacebookBreachSummarization_CodeAndResults.zip - The majority of the code used to obtain the results covered in the report and presentation, along with input data files and results where the results are saved to a file instead of written to the console. If looking to replicate results, start here. See the report for details on the file structure.
FacebookBreachSummarization_FastAbsRLFork.zip - A clone of the original fast_abs_rl repository (https://github.com/ChenRocks/fast_abs_rl), modified to work with our corpus. One method of generating abstractive summaries that was tested in this project.
FacebookBreachSummarization_FilesForAbsSum.zip - A directory containing cleaned article text from the Facebook corpus saved as .story files for use with the abstractive summarizers.
FacebookBreachSummarization_PySparkExtPkgs.zip - A zipped archive of the packages needed to run the PySpark code included in this project which are not a part of base Python. For ease of running the included PySpark code on Python 2.7.
FacebookBreackSummarization_PretrainedPGN.zip - A copy of the pretrained pointer-generator network model which is also available through a Google Drive link on https://github.com/abisee/pointer-generator. Works with TensorFlow 1.2.1. Can be used to rapidly generate single-document abstractive summaries without having to train a new model. | en |
dc.description.sponsorship | Global Event and Trend Archive Research (GETAR) project | en |
dc.description.sponsorship | NSF: IIS-1619028 | en |
dc.identifier.uri | http://hdl.handle.net/10919/86395 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons Attribution-NonCommercial 3.0 United States | en |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/3.0/us/ | en |
dc.subject | natural language processing | en |
dc.subject | summarization | en |
dc.subject | deep learning | en |
dc.subject | computer science | en |
dc.subject | news articles | en |
dc.subject | data breach | en |
dc.subject | solar eclipse | en |
dc.subject | Cambridge Analytica | en |
dc.subject | abstractive summarization | en |
dc.subject | extractive summarization | en |
dc.title | Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse | en |
dc.type | Dataset | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Software | en |
Files
Original bundle
1 - 5 of 9
- Name:
- FacebookBreachSummarization_FinalPresentation.pdf
- Size:
- 312.32 KB
- Format:
- Adobe Portable Document Format
- Name:
- FacebookBreachSummarization_FinalPresentation.pptx
- Size:
- 367.48 KB
- Format:
- Microsoft Powerpoint XML
License bundle
1 - 1 of 1
- Name:
- license.txt
- Size:
- 1.5 KB
- Format:
- Item-specific license agreed upon to submission