Tweet Analysis and Classification: Diabetes and Heartbleed Internet Virus as Use Cases

dc.contributor.author: Karajeh, Ola [en]
dc.contributor.author: Arachie, Chidubem [en]
dc.contributor.author: Powell, Edward [en]
dc.contributor.author: Hussein, Eslam [en]
dc.date.accessioned: 2020-01-11T01:55:58Z [en]
dc.date.available: 2020-01-11T01:55:58Z [en]
dc.date.issued: 2019-12-24 [en]
dc.description.abstract: The proliferation of data on social media has driven the need for researchers to develop algorithms to filter and process this data into meaningful information. In this project, we consider the task of classifying tweets relative to some topic or event and labeling them as informational or non-informational, using the features in the tweets. We focus on two collections from different domains: a diabetes dataset in the health domain and a heartbleed dataset in the security domain. We show the performance of our method in classifying tweets in the different collections. We employ two approaches to generate features for our models: 1) a graph based feature representation and 2) a vector space model, e.g., with TF-IDF weighting or a word embedding. The representations generated are fed into different machine learning algorithms (Logistic Regression, Naïve Bayes, and Decision Tree) to perform the classification task. We evaluate these approaches using metrics (accuracy, precision, recall, and F1-score) on a held out test dataset. Our results show that we can generalize our approach with tweets across different domains. [en]
dc.description.notes: TweetCollectionFiles.zip - Contains all files relevant to the project Folders in TweetCollectionFiles.zip: Code - Contains all of the code used for pre-processing, cleaning, feature extraction, classifiers, and evaluation. Datasets - Contains both the labeled and cleaned versions of Diabetes and Heartbleed virus datasets. Presentation - Contains editable and viewable versions of a presentation. Methodology - Contains details and screenshots of our methodology. TweetCollectionsReport.zip: Contains the Overleaf download and a PDF version of our report. [en]
dc.description.sponsorship: NSF IIS-1619028 [en]
dc.identifier.uri: http://hdl.handle.net/10919/96396 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons CC0 1.0 Universal Public Domain Dedication [en]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [en]
dc.subject: Twitter [en]
dc.subject: Machine learning [en]
dc.subject: Term-Document Matrix (TDM) [en]
dc.subject: Graph Based Model [en]
dc.subject: Word Embedding [en]
dc.title: Tweet Analysis and Classification: Diabetes and Heartbleed Internet Virus as Use Cases [en]
dc.type: Dataset [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]
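
The abstract mentions TF-IDF weighting for the vector space model and evaluation by accuracy, precision, recall, and F1-score. The following is a minimal, standard-library sketch of those two ideas only. It is not the project's code (that is in TweetCollectionFiles.zip); the function names `tfidf_vectors` and `binary_metrics`, the whitespace tokenizer, the `tf * log(N / df)` weighting variant, and the toy tweets and labels are all assumptions made for illustration. The classifiers themselves are omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy tweets (not taken from the project's datasets).
toy_tweets = [
    "new insulin dosage guidelines published",
    "heartbleed patch released for openssl",
    "cannot believe this traffic today",
]
for vector in tfidf_vectors(toy_tweets):
    print(vector)

# Hypothetical gold labels and predictions: 1 = informational, 0 = not.
print(binary_metrics([1, 1, 0], [1, 0, 0]))
```

In this variant a term that occurs in every document gets weight 0, since log(N / df) is 0; smoothed variants avoid that, and which one the project used is not stated in this record.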

Files

Original bundle
Name: TweetCollectionReport.zip
Size: 2.57 MB
Name: TweetCollectionFiles.zip
Size: 2.64 MB

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission