Tweet Analysis and Classification: Diabetes and Heartbleed Internet Virus as Use Cases

The proliferation of data on social media has driven the need for algorithms that filter and process this data into meaningful information. In this project, we consider the task of classifying tweets related to a topic or event as informational or non-informational, using features extracted from the tweets. We focus on two collections from different domains: a diabetes dataset in the health domain and a Heartbleed dataset in the security domain, and we report the performance of our method on each collection. We employ two approaches to generate features for our models: 1) a graph-based feature representation and 2) a vector space model, e.g., with TF-IDF weighting or a word embedding. The resulting representations are fed into different machine learning algorithms (Logistic Regression, Naïve Bayes, and Decision Tree) to perform the classification task. We evaluate these approaches using accuracy, precision, recall, and F1-score on a held-out test dataset. Our results show that our approach generalizes to tweets across different domains.
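As a rough illustration of two pieces named in the abstract, the sketch below computes TF-IDF weights for a handful of tweets and derives accuracy, precision, recall, and F1-score from a set of predictions. It is a minimal sketch under stated assumptions, not the project's implementation: the function names (`tokenize`, `tfidf_vectors`, `binary_metrics`), the whitespace tokenizer, the textbook `tf * log(N / df)` weighting, and the example tweets and labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization. A real tweet
    # pipeline would likely also handle URLs, mentions, and hashtags.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical tweets, invented for illustration only.
    tweets = ["diabetes insulin study", "diabetes lol", "insulin pump recall"]
    for tweet, vec in zip(tweets, tfidf_vectors(tweets)):
        print(tweet, vec)
    # Hypothetical labels: 1 = informational, 0 = non-informational.
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

In this weighting, a term that appears in every document receives a weight of zero, while terms concentrated in few tweets score higher. The resulting sparse vectors are the kind of representation that could then be passed to a classifier such as Logistic Regression or Naïve Bayes.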

Keywords

Twitter, Machine learning, Term-Document Matrix (TDM), Graph Based Model, Word Embedding

Persistent link

http://hdl.handle.net/10919/96396

Collections

CS6604: Digital Libraries