Exploring the Blacksburg Community Events Collection

Abstract

With the advent of new technology, especially the combination of smartphones and widespread Internet access, people are increasingly absorbed in digital worlds that are not bounded by geography, and some worry about what this means for local communities. The Virtual Town Square project is an effort to harness people's use of these kinds of social networks, but with a focus on local communities. As part of the Fall 2014 CS4984 Computational Linguistics course, we explored the Blacksburg Events Collection, a collection of documents mined from the Virtual Town Square for the town of Blacksburg, Virginia. We describe our effort to summarize this collection in order to inform newcomers about the local community. Our approach consisted of first cleaning the dataset and then applying hierarchical clustering to the collection. The core idea is to cluster the documents of the collection into sub-clusters, cluster those sub-clusters again, and finally cluster the sentences of the final sub-clusters. We then choose the sentences closest to the centroids of the final sentence sub-clusters as our summaries. Some of the summary sentences capture very relevant information about specific events in the community, but the final results still contain a fair amount of noise and are not very concise. We discuss lessons learned over the course of the project, such as the importance of good project planning and of quickly iterating on actual solutions instead of only discussing the many approaches that could be taken. We also suggest improvements to our approach, especially ways to clean up the final sentence summaries. The appendix contains a Developer’s Manual that describes the included files and the final code in detail.
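
The clustering idea described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the project's code: the abstract does not name the clustering algorithm, so a small k-means with cosine similarity over bag-of-words counts is swapped in here, and every name in it (`vectorize`, `kmeans`, `group`, `summarize`, and the parameters `k_docs`, `k_sub`, `k_sent`) is hypothetical. Sentence splitting on periods is likewise a simplification.

```python
import math
import random
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def kmeans(vectors, k, iterations=10, seed=0):
    """Assumed stand-in clusterer: plain k-means using cosine similarity."""
    rng = random.Random(seed)
    k = min(k, len(vectors))
    centroids = rng.sample(vectors, k)
    labels = []
    for _ in range(iterations):
        # Recompute centroids from the previous assignment (skipped on pass 1).
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
        # Assign every vector to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
    return labels, centroids


def group(texts, k):
    """Split texts into at most k non-empty clusters."""
    labels, _ = kmeans([vectorize(t) for t in texts], k)
    groups = {}
    for text, label in zip(texts, labels):
        groups.setdefault(label, []).append(text)
    return list(groups.values())


def summarize(documents, k_docs=2, k_sub=2, k_sent=2):
    """Cluster documents, re-cluster each cluster, then cluster sentences and
    keep the sentence closest to each sentence-cluster centroid."""
    summaries = []
    for doc_cluster in group(documents, k_docs):
        for sub_cluster in group(doc_cluster, k_sub):
            sentences = [s.strip() for d in sub_cluster
                         for s in d.split(".") if s.strip()]
            vectors = [vectorize(s) for s in sentences]
            labels, centroids = kmeans(vectors, k_sent)
            picked = []
            for j, c in enumerate(centroids):
                members = [i for i, lab in enumerate(labels) if lab == j]
                if members:
                    best = max(members, key=lambda i: cosine(vectors[i], c))
                    picked.append(sentences[best])
            summaries.append(picked)
    return summaries


if __name__ == "__main__":
    # Made-up example documents, for illustration only.
    docs = [
        "The farmers market opens Saturday. Vendors sell produce downtown.",
        "The town council meets Tuesday. The council votes on the budget.",
    ]
    for sentences in summarize(docs, k_docs=2, k_sub=1, k_sent=1):
        print(sentences)
```

Each inner list returned by `summarize` holds the representative sentences for one final sub-cluster. On a real collection the cluster counts would need tuning, and the noise mentioned in the abstract suggests that sentence-level filtering would be needed before or after this step.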

Description

This submission comes with a number of files:

final_report.docx/final_report.pdf - The final report describing the approach that we took, along with some discussion of what we learned and what we would do to improve the results.

final_presentation.pptx/final_presentation.pdf - The final presentation that we gave to the class to explain our final approach.

be_files.zip - The Blacksburg Events collection, along with all additional files that were generated during cleaning and clustering. A more thorough explanation can be found in the appendix of the final report.

be_code.zip - The Python code and Bash scripts that we used in our final approach. A more thorough explanation can be found in the appendix of the final report.

be_results.zip - The output sentences for each of our clusters of interest, one file per cluster with one sentence per line.

Keywords

computational linguistics, blacksburg, community, virtual town square, summarization, hadoop, clustering, hierarchical clustering, mahout, python, cleaning up documents

Citation