NRV Tweets and RSS feeds
dc.contributor.author | Roble, Benjamin | en |
dc.contributor.author | Cheng, Justin | en |
dc.contributor.author | Sbitani, Marwan | en |
dc.date.accessioned | 2014-05-09T19:19:09Z | en |
dc.date.available | 2014-05-09T19:19:09Z | en |
dc.date.issued | 2014-05-09 | en |
dc.description | This collection contains the source code, programs, documentation, and example data used in the project. Please review the "Final Report and Technical Manual" for a comprehensive overview of the project. The open source library Mallet was used and is referenced here: McCallum, Andrew Kachites. "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu. 2002. | en |
dc.description.abstract | The goal of this project was to associate existing data in the Virtual Town Square database from the New River Valley area with topical metadata. We took a database of approximately 360,000 tweets and 15,000 RSS news stories collected in the last two years and associated each RSS story and tweet with topics. The open-source natural language processing library Mallet was used to perform topic modeling on the data using Latent Dirichlet Allocation, and the resulting topics were then used to create a Solr instance of searchable tweets and news stories. Topic modeling was not done around specific events; instead, the entire tweet data (and entire RSS data) was used as the corpus. The tweet data was analyzed separately from the RSS stories, so the generated topics are specific to each dataset. This report details the methodology used in our work in the Methodology section and contains a detailed Developer’s Guide and User’s Guide so that others may continue our work. The client was satisfied with the outcome of this project: although tweets have generally been considered too short to be run through a topic modeling process, we generated topics for each tweet that appear to be relevant and accurate. | en |
dc.description.sponsorship | Virginia Tech Center for Human-Computer Interaction Associate Director: Dr. Kavanaugh, kavan@vt.edu; Virginia Tech PhD Student: Ji Wang (InfoVis Lab), wji@cs.vt.edu; Virginia Tech PhD Student: Mohamed Magdy, mmagdy@vt.edu; Virginia Tech Professor: Dr. Edward Fox, fox@vt.edu | en |
dc.identifier.uri | http://hdl.handle.net/10919/47937 | en |
dc.language.iso | en_US | en |
dc.rights | Creative Commons Attribution 3.0 United States | en |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/us/ | en |
dc.subject | nlp | en |
dc.subject | natural language processing | en |
dc.subject | lda | en |
dc.subject | latent dirichlet allocation | en |
dc.subject | mallet | en |
dc.subject | open source | en |
dc.subject | tweets | en |
dc.subject | rss | en |
dc.subject | nrv | en |
dc.subject | new river valley | en |
dc.subject | blacksburg | en |
dc.subject | IDEAL | en |
dc.title | NRV Tweets and RSS feeds | en |
dc.type | Dataset | en |
dc.type | Presentation | en |
dc.type | Software | en |
dc.type | Technical report | en |
Files
Original bundle
- Name: JSONLoader.tar.gz
- Size: 1.24 MB
- Format: Unknown data format
- Description: JSONLoader Java class and libraries

- Name: mallet.tar.gz
- Size: 47.58 MB
- Format: Unknown data format
- Description: Mallet source code and script

- Name: nrvtweets_data.tar.gz
- Size: 2.35 KB
- Format: Unknown data format
- Description: Tweets and RSS data (both raw and processed)

- Name: solr_data.tar.gz
- Size: 14.91 KB
- Format: Unknown data format
- Description: Solr data and schema
- Name: CS 4624 NRV Tweets Midterm.pdf
- Size: 679.43 KB
- Format: Adobe Portable Document Format
- Description: Midterm Presentation PDF
License bundle
- Name: license.txt
- Size: 1.5 KB
- Format: Item-specific license agreed upon to submission