OutbreakSum: Automatic Summarization of Texts Relating to Disease Outbreaks
Abstract
The goal of the fall 2014 Disease Outbreak Project (OutbreakSum) was to develop software for automatically analyzing and summarizing large collections of texts pertaining to disease outbreaks. Although our code was tested on collections about specific diseases (a small one about encephalitis and a large one about Ebola), most of our tools should work on texts about any infectious disease, where the key information relates to locations, dates, numbers of cases, symptoms, prognosis, and interventions by government and healthcare organizations. In the course of the project, we developed a code base that performs several key Natural Language Processing (NLP) functions. Tools that could be useful for other Natural Language Generation (NLG) projects include:
- A framework for developing MapReduce programs in Python that allows for local running and debugging;
- Tools for document collection cleanup procedures such as small-file removal, duplicate-file removal (based on content hashes), sentence and paragraph tokenization, nonrelevant file removal, and encoding translation;
- Utilities to simplify and speed up Named Entity Recognition with Stanford NER by using the Java API directly;
- Utilities to leverage the full extent of the Stanford CoreNLP library, which includes tools for parsing and coreference resolution;
- Utilities to simplify using the OpenNLP Java library for text processing. By configuring and running a single Java class, you can use OpenNLP to perform part-of-speech tagging and named entity recognition on your entire collection in minutes.
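As an illustration of the cleanup procedures listed above, small-file removal and duplicate-file removal based on content hashes could be sketched roughly as follows. This is a minimal sketch and not the project's actual code: the function name `clean_collection`, the `min_bytes` threshold, and the choice of SHA-1 are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def clean_collection(directory, min_bytes=200):
    """Split the files in `directory` into those to keep and those to remove.

    Hypothetical helper: a file is marked for removal if it is smaller than
    `min_bytes` or if its content hash matches an earlier file.
    Nothing is deleted here; the caller decides what to do with the lists.
    """
    seen_hashes = set()
    kept, removed = [], []
    # Sort so that the first file (by name) of each duplicate group is kept.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if len(data) < min_bytes:
            removed.append(path)  # too small to carry useful text
            continue
        digest = hashlib.sha1(data).hexdigest()
        if digest in seen_hashes:
            removed.append(path)  # byte-identical to an earlier file
        else:
            seen_hashes.add(digest)
            kept.append(path)
    return kept, removed
```

Returning the two lists instead of deleting files immediately makes it easy to inspect what would be dropped before committing to the cleanup.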
We’ve classified the tools available in OutbreakSum into four major modules:
- Collection Processing;
- Local Language Processing;
- MapReduce with Apache Hadoop;
- Summarization.
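
The idea behind running MapReduce programs locally for debugging can be illustrated with a small pure-Python driver that imitates the map, shuffle, and reduce phases in a single process. This is only a sketch of the general technique, not the OutbreakSum framework itself: `run_local`, `mapper`, and `reducer` are hypothetical names, and word counting stands in for whatever job is being debugged.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Sum the counts collected for one key."""
    yield key, sum(values)


def run_local(lines, mapper, reducer):
    """Run a map/shuffle/reduce cycle in memory, without Hadoop."""
    groups = defaultdict(list)
    # Map phase, with the shuffle done by grouping values under their key.
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase, visiting keys in sorted order as Hadoop would present them.
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


if __name__ == "__main__":
    for word, count in run_local(["Ebola cases rise", "ebola cases"], mapper, reducer):
        print(word, count)
```

Because the mapper and reducer are ordinary generator functions, they can be stepped through in a debugger or unit-tested before the same logic is submitted to a cluster.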