Weakly Supervised Machine Learning for Cyberbullying Detection

Raisi, Elaheh

dc.contributor.author	Raisi, Elaheh	en
dc.contributor.committeechair	Huang, Bert	en
dc.contributor.committeemember	Ramakrishnan, Naren	en
dc.contributor.committeemember	Han, Richard Y.	en
dc.contributor.committeemember	Marathe, Madhav Vishnu	en
dc.contributor.committeemember	Wang, Gang Alan	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2019-04-24T08:00:54Z	en
dc.date.available	2019-04-24T08:00:54Z	en
dc.date.issued	2019-04-23	en
dc.description.abstract	The advent of social media has revolutionized human communication, significantly improving individuals' lives. It brings people closer to each other, provides access to enormous amounts of real-time information, and eases marketing and business. Despite its countless benefits, however, we must consider some of its negative implications, such as online harassment and cyberbullying. Cyberbullying is becoming a serious, large-scale problem damaging people's online lives. This phenomenon is creating a need for automated, data-driven techniques for analyzing and detecting such behaviors. In this research, we aim to address the computational challenges associated with harassment-based cyberbullying detection in social media by developing a machine-learning framework that requires only weak supervision. We propose a general framework that trains an ensemble of two learners in which each learner looks at the problem from a different perspective. One learner identifies bullying incidents by examining the language content in the message; another learner considers the social structure to discover bullying. Each learner uses a different body of information, and the individual learners co-train one another to come to an agreement about the bullying concept. The models estimate whether each social interaction is bullying by optimizing an objective function that maximizes the consistency between these detectors. We first developed a model we refer to as participant-vocabulary consistency, which is an ensemble of two linear models, one language-based and one user-based. The model is trained by providing a set of seed key-phrases that are indicative of bullying language. The results were promising, demonstrating the model's effectiveness and usefulness in recovering known bullying words, recognizing new bullying words, and discovering users involved in cyberbullying.
We have extended this co-trained ensemble approach with two complementary goals: (1) using nonlinear embeddings as model families, and (2) building a fair language-based detector. For the first goal, we took advantage of distributed representations of words and nodes learned by deep, nonlinear models. We represent words and users as low-dimensional vectors of real numbers, which serve as the input to the language-based and user-based classifiers, respectively. The models are trained by optimizing an objective function that balances a co-training loss with a weak-supervision loss. Our experiments on Twitter, Ask.fm, and Instagram data show that deep ensembles outperform non-deep methods for weakly supervised harassment detection. For the second goal, we geared this research toward a very important topic in any automated online harassment detection: fairness toward groups targeted on the basis of race, gender, religion, or sexual orientation. Our goal is to decrease the sensitivity of models to language describing particular social groups. We encourage the learning algorithm to avoid discrimination in its predictions by adding an unfairness penalty term to the objective function. We quantitatively and qualitatively evaluate the effectiveness of our proposed general framework on synthetic data and on data from Twitter using post-hoc, crowdsourced annotation. In summary, this dissertation introduces a weakly supervised machine learning framework for harassment-based cyberbullying detection using both messages and user roles in social media.	en
dc.description.abstractgeneral	Social media has become an inevitable part of individuals' social and business lives. Its benefits, however, come with various negative consequences such as online harassment, cyberbullying, hate speech, and online trolling, especially among the younger population. According to the American Academy of Child and Adolescent Psychiatry, victims of bullying can suffer interference to social and emotional development and even be drawn to extreme behavior such as attempted suicide. Any widespread bullying enabled by technology represents a serious social health threat. In this research, we develop automated, data-driven methods for harassment-based cyberbullying detection. The availability of tools such as these can enable technologies that reduce the harm and toxicity created by these detrimental behaviors. Our general framework is based on the consistency of two detectors that co-train one another. One learner identifies bullying incidents by examining the language content in the message; another learner considers social structure to discover bullying. When designing the general framework, we address three tasks. First, we use machine learning with weak supervision, which significantly alleviates the need for human experts to perform tedious data annotation. Second, we take advantage of distributed representations of words and nodes learned by deep, nonlinear models to improve the predictive power of the framework. Finally, we decrease the sensitivity of the framework to language describing particular social groups, including those defined by race, gender, religion, and sexual orientation. This research represents important steps toward improving technological capability for automatic cyberbullying detection.	en
dc.description.degree	Doctor of Philosophy	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:19737	en
dc.identifier.uri	http://hdl.handle.net/10919/89100	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Machine learning	en
dc.subject	Weak Supervision	en
dc.subject	Cyberbullying detection	en
dc.subject	Social Media	en
dc.subject	Co-trained Ensemble	en
dc.title	Weakly Supervised Machine Learning for Cyberbullying Detection	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Doctor of Philosophy	en

Files

Original bundle

Name: Raisi_E_D_2019.pdf
Size: 1.63 MB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations