Computational Linguistic Analysis of Earthquake Collections

Authors: Bialousz, Kenneth; Kokal, Kevin; Orleans-Pobee, Kwamina; Wakeley, Christopher
Dates: 2014-12-13; 2014-12-13; 2014-12
URI: http://hdl.handle.net/10919/51132

Description: Both PDF and Word versions of the final report, a ZIP file of source code, and a PDF and PowerPoint of the final presentation.

Abstract: CS4984 is a newly offered class at Virginia Tech with a unit-based, problem/project-based learning curriculum. This class style is based on NSF-funded work on curriculum for the field of digital libraries and related topics, and in this class it is used to guide a student-based investigation of computational linguistics. The specific problem this report addresses is how to automatically generate a short summary of a corpus of articles about earthquakes. Such a summary should best represent the texts and include all relevant information about the earthquakes. For our analysis, we operated on two corpora: one about a magnitude 5.8 earthquake in Virginia in August 2011, and another about a magnitude 6.6 earthquake in Lushan, China, in April 2013. Techniques used to analyze the articles include clustering, lemmatization, frequency analysis of n-grams, and regular expression searches.

Language: en
Rights: Creative Commons CC0 1.0 Universal Public Domain Dedication
Keywords: natural language processing; Hadoop; Mahout; LDA; K-means clustering; NLTK; Python; natural language generation; Solr; Stanford NER; part-of-speech tagging
Type: Dataset
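
The abstract names frequency analysis of n-grams and regular expression searches among the techniques applied to the earthquake corpora. The following is a minimal sketch of what those two steps might look like, not the report's actual code: the function names (tokenize, ngram_counts, find_magnitudes), the tokenization rule, and the magnitude pattern are all assumptions made for illustration, and only the Python standard library is used rather than NLTK.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase words and (possibly decimal) numbers.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[a-z]+")

# Hypothetical pattern for phrases such as "5.8 magnitude" or "magnitude 6.6".
MAGNITUDE_RE = re.compile(
    r"(\d+(?:\.\d+)?)[\s-]magnitude|magnitude[\s-](\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def tokenize(text):
    """Split text into lowercase word and number tokens."""
    return TOKEN_RE.findall(text.lower())


def ngram_counts(docs, n=2):
    """Count n-grams (as tuples of tokens) across a list of documents."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        # zip over n shifted views of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def find_magnitudes(text):
    """Return magnitudes mentioned next to the word 'magnitude' as floats."""
    return [float(before or after) for before, after in MAGNITUDE_RE.findall(text)]


if __name__ == "__main__":
    # Illustrative sentences only; these are not drawn from the actual corpora.
    docs = [
        "A 5.8 magnitude earthquake struck Virginia in August 2011.",
        "A magnitude 6.6 earthquake hit Lushan, China in April 2013.",
    ]
    for ngram, count in ngram_counts(docs, n=2).most_common(3):
        print(" ".join(ngram), count)
    for doc in docs:
        print(find_magnitudes(doc))
```

Counting n-grams per corpus and then inspecting the most frequent ones is one simple way to surface candidate phrases for a summary, while a targeted pattern such as the magnitude search pulls out specific facts that frequency alone would miss.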