Browsing by Author "Tao, Rongrong"
- Anomalous Information Detection in Social Media
  Tao, Rongrong (Virginia Tech, 2021-03-10)
  This dissertation focuses on identifying various types of anomalous information patterns in social media and news outlets. We focus on three types of anomalous information: (1) media censorship in news outlets, which is information that should be published but is actually missing, (2) fake news in social media, which is unreliable information shown to the public, and (3) media propaganda in news outlets, which is trustworthy information that is nonetheless over-populated. For the first problem, existing approaches to censorship detection mostly rely on monitoring posts in social media. Media censorship in news outlets has not received nearly as much attention, mostly because it is difficult to detect systematically. The contributions of our work include: (1) a hypothesis testing framework to identify and evaluate censored clusters of keywords, (2) a near-linear-time algorithm to identify the highest scoring clusters as indicators of censorship, and (3) extensive experiments on six Latin American countries for performance evaluation. For the second problem, existing approaches to studying fake news in social media primarily focus on topic-level modeling or on prediction based on a set of aggregated features from a collection of posts. However, the credibility of the individual information components within the same topic can differ considerably. The contributions of our work in this space include: (1) a new benchmark dataset for fake news research, (2) a cluster-based approach to improve instance-level prediction of information credibility, and (3) extensive experiments for performance evaluation. For the last problem, existing approaches to media propaganda detection primarily focus on investigating the pattern of information shared over social media or on evaluation by domain experts. However, these approaches cannot be generalized to a large-scale analysis of media propaganda in news outlets. The contributions of our work include: (1) non-parametric scan statistics to identify clusters of over-populated keywords, (2) a near-linear-time algorithm to identify the highest scoring clusters as indicators of propaganda, and (3) extensive experiments on two Latin American countries for performance evaluation.
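  To make the idea of scanning for high-scoring keyword clusters concrete, the following is a minimal sketch, not the dissertation's algorithm. It assumes each keyword already has an empirical p-value (small when the keyword is unusually frequent), and it assumes the Berk-Jones form of the non-parametric scan statistic, which the abstract does not specify. It also ignores any connectivity constraint between keywords. The names `berk_jones`, `best_subset`, and `alpha_max` are hypothetical.

```python
import math


def berk_jones(n_alpha, n, alpha):
    """Berk-Jones score n * KL(n_alpha/n || alpha).

    n_alpha: number of p-values in the subset that are <= alpha
    n:       subset size
    Returns 0 unless the observed significant fraction exceeds alpha.
    """
    if n == 0:
        return 0.0
    frac = n_alpha / n
    if frac <= alpha:
        return 0.0
    score = frac * math.log(frac / alpha)
    if frac < 1.0:
        score += (1.0 - frac) * math.log((1.0 - frac) / (1.0 - alpha))
    return n * score


def best_subset(pvalues, alpha_max=0.15):
    """Return (score, alpha, keywords) for the best unconstrained subset.

    For a fixed alpha the score grows with the number of significant
    keywords and shrinks with subset size, so the best subset is exactly
    the keywords with p-value <= alpha. Only the observed p-values need
    to be tried as thresholds.
    """
    best = (0.0, None, [])
    # Assumption: alpha_max < 1; a threshold of 0 is skipped to avoid log(1/0).
    thresholds = sorted({p for p in pvalues.values() if 0.0 < p <= alpha_max})
    for alpha in thresholds:
        chosen = sorted(k for k, p in pvalues.items() if p <= alpha)
        score = berk_jones(len(chosen), len(chosen), alpha)
        if score > best[0]:
            best = (score, alpha, chosen)
    return best
```

  Calling `best_subset({"a": 0.01, "b": 0.02, "c": 0.5, "d": 0.9})` selects keywords `a` and `b` at threshold 0.02, because two keywords at that threshold outscore one keyword at 0.01. Sorting the p-values once is what keeps an unconstrained scan of this kind near-linear; a cluster search over a keyword graph would need additional machinery.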
- Classification Team Project for IDEAL in CS5604, Spring 2015
  Cui, Xuewen; Tao, Rongrong; Zhang, Ruide (2015-05-10)
  Given the tweets from the instructor and the cleaned webpages from the Reducing Noise team, the planned tasks for our group were to find the best: (1) way to extract information that will be used for document representation; (2) feature selection method to construct feature vectors; and (3) way to classify each document into categories, considering the ontology developed in the IDEAL project. We identified an information extraction method for document representation, a feature selection method for feature vector construction, and a classification method. The categories will be associated with the documents to aid searching and browsing using Solr. Our team handles both tweets and webpages, which arrive as text files produced by the Reducing Noise team. The other input is a list of the specific events that the collections are about. After information extraction and feature selection, we construct feature vectors using Apache Mahout. For each document, a relational version of the raw data for an appropriate feature vector is generated. We applied the Naïve Bayes classification algorithm in Apache Mahout to generate the vector file and the trained model; the feature vectors are fed to the Mahout classifiers for training and testing. However, Mahout is not able to predict class labels for new data. We ultimately adopted a solution from Pangool.net, a low-level Java MapReduce API. This package provides a MapReduce Naïve Bayes classifier that can predict class labels for new data. After modification, the package can read from and write to Avro files in HDFS. The accuracy of our classifiers, measured with 5-fold cross-validation, was promising.
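  The classification and evaluation steps described above can be illustrated with a small, single-machine sketch. This is not the team's Mahout or Pangool code: it is plain Python, assumes documents are already tokenized into word lists, and uses add-one smoothing and a simple round-robin fold assignment, neither of which is stated in the abstract. The function names `train_nb`, `predict_nb`, and `cross_validate` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_counts = Counter(labels)        # documents per class
    word_counts = defaultdict(Counter)    # word frequencies per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for a new document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing, an assumption of this sketch.
                count = word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=5):
    """Mean accuracy over k folds; fold f holds out items f, f+k, f+2k, ..."""
    if len(docs) < k:
        raise ValueError("need at least k documents")
    accuracies = []
    for fold in range(k):
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        test_idx = [i for i in range(len(docs)) if i % k == fold]
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

  With `docs` as a list of token lists and `labels` as the matching event categories, `cross_validate(docs, labels, k=5)` returns the mean held-out accuracy. Real collections would normally be shuffled before folds are assigned.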