Classification Team Project for IDEAL in CS5604, Spring 2015

Cui, Xuewen; Tao, Rongrong; Zhang, Ruide

dc.contributor.author	Cui, Xuewen	en
dc.contributor.author	Tao, Rongrong	en
dc.contributor.author	Zhang, Ruide	en
dc.date.accessioned	2015-05-13T01:32:59Z	en
dc.date.available	2015-05-13T01:32:59Z	en
dc.date.issued	2015-05-10	en
dc.description.abstract	Given the tweets from the instructor and the cleaned webpages from the Reducing Noise team, the planned tasks for our group were to find the best: (1) way to extract information that will be used for document representation; (2) feature selection method to construct feature vectors; and (3) way to classify each document into categories, considering the ontology developed in the IDEAL project. We have settled on an information extraction method for document representation, a feature selection method for feature vector construction, and a classification method. The categories will be associated with the documents to aid searching and browsing using Solr. Our team handles both tweets and webpages, which come in the form of text files produced by the Reducing Noise team. The other input is a list of the specific events that the collections are about. We construct feature vectors using Apache Mahout, after information extraction and feature selection. For each document, a relational version of the raw data for an appropriate feature vector is generated. We applied the Naïve Bayes classification algorithm in Apache Mahout to generate the vector file and the trained model. The feature vectors are fed into the Mahout classifiers for training and testing. However, Mahout is not able to predict class labels for new data. We therefore adopted a solution provided by Pangool.net, a low-level Java MapReduce API. This package provides a MapReduce Naïve Bayes classifier that can predict class labels for new data. After modification, the package can read from and write to Avro files in HDFS. The accuracy of our classification algorithms, evaluated using 5-fold cross-validation, was promising.	en
dc.description.sponsorship	NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)	en
dc.identifier.uri	http://hdl.handle.net/10919/52253	en
dc.language.iso	en_US	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Classification	en
dc.subject	Apache Hadoop	en
dc.subject	Apache Mahout	en
dc.subject	Naive Bayes	en
dc.subject	Feature Selection	en
dc.subject	Pangool	en
dc.title	Classification Team Project for IDEAL in CS5604, Spring 2015	en
dc.type	Presentation	en
dc.type	Software	en
dc.type	Technical report	en