Exploring the Blacksburg Community Events Collection

dc.contributor.author: Antol, Stanislaw
dc.contributor.author: Ayoub, Souleiman
dc.contributor.author: Folgar, Carlos
dc.contributor.author: Smith, Steve
dc.date.accessioned: 2014-12-14T00:40:00Z
dc.date.available: 2014-12-14T00:40:00Z
dc.date.issued: 2014-12
dc.description: This submission comes with a number of files:
final_report.docx / final_report.pdf - The final report, describing the approach that we took, what we learned, and what we would do to improve the results.
final_presentation.pptx / final_presentation.pdf - The final presentation that we gave to the class to explain our final approach.
be_files.zip - The Blacksburg Events collection, along with all additional files that were generated during cleaning and clustering. A more thorough explanation can be found in the appendix of the final report.
be_code.zip - The Python code and Bash scripts that we used in our final approach. A more thorough explanation can be found in the appendix of the final report.
be_results.zip - The output sentences for each cluster of interest, one file per cluster with one sentence per line.
dc.description.abstract: With the advent of new technology, especially the combination of smartphones and widespread Internet access, people are increasingly absorbed in digital worlds that are not bounded by geography, and some worry about what this means for local communities. The Virtual Town Square project is an effort to harness people's use of these kinds of social networks, but with a focus on local communities. As part of the Fall 2014 CS4984 Computational Linguistics course, we explored the Blacksburg Events Collection, a collection of documents mined from the Virtual Town Square for the town of Blacksburg, Virginia. We describe our work to summarize this collection in order to inform newcomers about the local community. We begin with our approach, which consisted of first cleaning the dataset and then applying hierarchical clustering to the collection. The core idea is to cluster the documents of the collection into sub-clusters, cluster those sub-clusters again, and finally cluster the sentences of the final sub-clusters. We then choose the sentences closest to the centroids of the final sentence sub-clusters as our summaries. Some of the summary sentences capture very relevant information about specific events in the community, but our final results still contain a fair amount of noise and are not very concise. We then discuss lessons learned over the course of the project, such as the importance of good project planning and of quickly iterating on actual solutions instead of only discussing the many approaches that could be taken. Finally, we suggest improvements to our approach, especially ways to clean up the final sentence summaries. The appendix contains a Developer's Manual that describes the included files and the final code in detail.
dc.description.sponsorship: NSF DUE-1141209 and IIS-1319578
dc.identifier.uri: http://hdl.handle.net/10919/51135
dc.language.iso: en_US
dc.rights: Creative Commons Attribution 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/
dc.subject: computational linguistics
dc.subject: blacksburg
dc.subject: community
dc.subject: virtual town square
dc.subject: summarization
dc.subject: hadoop
dc.subject: clustering
dc.subject: hierarchical clustering
dc.subject: mahout
dc.subject: python
dc.subject: cleaning up documents
dc.title: Exploring the Blacksburg Community Events Collection
dc.type: Dataset
dc.type: Presentation
dc.type: Software
dc.type: Technical report
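
The abstract describes clustering documents, then clustering sentences within the resulting clusters, and keeping the sentence closest to each sentence-cluster centroid. The sketch below illustrates that idea only; it is not the code in be_code.zip. The subject keywords suggest the authors' pipeline used Mahout on Hadoop, whereas this sketch swaps in a small pure-Python k-means over raw term counts, performs a single document-level clustering pass instead of two, and splits sentences naively on periods. All function names (vectorize, kmeans, summarize, summarize_collection) are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(texts):
    """Map each text to a term-count vector over a shared vocabulary."""
    tokenized = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for word, count in Counter(toks).items():
            vec[index[word]] = float(count)
        vectors.append(vec)
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, max_iterations=50):
    """Plain k-means. Assumption: the first k vectors seed the centroids,
    which keeps the sketch deterministic but is not a good general seeding."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def summarize(sentences, k):
    """Cluster sentences and return the one nearest each cluster centroid."""
    k = min(k, len(sentences))
    vectors = vectorize(sentences)
    assignment, centroids = kmeans(vectors, k)
    summary = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            best = min(members, key=lambda i: distance(vectors[i], centroids[c]))
            summary.append(sentences[best])
    return summary


def summarize_collection(documents, doc_k, sent_k):
    """Cluster documents, then summarize the sentences of each document cluster."""
    doc_k = min(doc_k, len(documents))
    doc_assignment, _ = kmeans(vectorize(documents), doc_k)
    summaries = []
    for c in range(doc_k):
        sentences = [
            s.strip()
            for d, a in zip(documents, doc_assignment) if a == c
            for s in d.split(".") if s.strip()
        ]
        if sentences:
            summaries.append(summarize(sentences, sent_k))
    return summaries
```

Calling summarize_collection(documents, doc_k, sent_k) on a list of document strings returns one list of representative sentences per document cluster. A second document-level pass, as the abstract describes, would amount to running kmeans again on the documents of each cluster before the sentence step.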

Files

Original bundle (showing 5 of 7 files)

Name: be_results.zip
Size: 43.19 KB
Format: Unknown data format
Description: The final results of our approach.

Name: be_code.zip
Size: 11.26 KB
Format: Unknown data format
Description: The Python and Bash code that we used (along with the commands in the report) for the final approach.

Name: be_files.zip
Size: 222.58 MB
Format: Unknown data format
Description: The collection files that were generated at various stages (e.g., removing stop words, clustering).

Name: final_presentation.pdf
Size: 231 KB
Format: Adobe Portable Document Format
Description: Final Presentation (PDF)

Name: final_presentation.pptx
Size: 184.4 KB
Format: Microsoft Powerpoint XML
Description: Final Presentation (PowerPoint)

License bundle

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission