Tweet Analysis and Classification: Diabetes and Heartbleed Internet Virus as Use Cases

Karajeh, Ola; Arachie, Chidubem; Powell, Edward; Hussein, Eslam

dc.contributor.author	Karajeh, Ola	en
dc.contributor.author	Arachie, Chidubem	en
dc.contributor.author	Powell, Edward	en
dc.contributor.author	Hussein, Eslam	en
dc.date.accessioned	2020-01-11T01:55:58Z	en
dc.date.available	2020-01-11T01:55:58Z	en
dc.date.issued	2019-12-24	en
dc.description.abstract	The proliferation of data on social media has driven the need for algorithms that filter and process this data into meaningful information. In this project, we classify tweets related to a given topic or event as informational or non-informational, using features extracted from the tweets. We focus on two collections from different domains: a diabetes dataset in the health domain and a Heartbleed dataset in the security domain, and we report the performance of our method on each collection. We employ two approaches to generate features for our models: 1) a graph-based feature representation and 2) a vector space model, e.g., with TF-IDF weighting or a word embedding. The resulting representations are fed into different machine learning algorithms (Logistic Regression, Naïve Bayes, and Decision Tree) to perform the classification task. We evaluate these approaches on a held-out test dataset using accuracy, precision, recall, and F1-score. Our results show that the approach generalizes to tweets across different domains.	en
dc.description.notes	TweetCollectionFiles.zip - Contains all files relevant to the project. Folders in TweetCollectionFiles.zip: Code - Contains all of the code used for pre-processing, cleaning, feature extraction, classifiers, and evaluation. Datasets - Contains both the labeled and cleaned versions of the Diabetes and Heartbleed virus datasets. Presentation - Contains editable and viewable versions of a presentation. Methodology - Contains details and screenshots of our methodology. TweetCollectionsReport.zip - Contains the Overleaf download and a PDF version of our report.	en
dc.description.sponsorship	NSF IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/96396	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons CC0 1.0 Universal Public Domain Dedication	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	Twitter	en
dc.subject	Machine learning	en
dc.subject	Term-Document Matrix (TDM)	en
dc.subject	Graph Based Model	en
dc.subject	Word Embedding	en
dc.title	Tweet Analysis and Classification: Diabetes and Heartbleed Internet Virus as Use Cases	en
dc.type	Dataset	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en