Classification Team Project for IDEAL in CS5604, Spring 2015

dc.contributor.author: Cui, Xuewen [en]
dc.contributor.author: Tao, Rongrong [en]
dc.contributor.author: Zhang, Ruide [en]
dc.date.accessioned: 2015-05-13T01:32:59Z [en]
dc.date.available: 2015-05-13T01:32:59Z [en]
dc.date.issued: 2015-05-10 [en]
dc.description.abstract: Given the tweets from the instructor and the cleaned webpages from the Reducing Noise team, the planned tasks for our group were to find the best: (1) way to extract information that will be used for document representation; (2) feature selection method to construct feature vectors; and (3) way to classify each document into categories, considering the ontology developed in the IDEAL project. We have identified an information extraction method for document representation, a feature selection method for feature vector construction, and a classification method. The categories will be associated with the documents to aid searching and browsing using Solr. Our team handles both tweets and webpages, which come in the form of text files produced by the Reducing Noise team. The other input is a list of the specific events that the collections are about. We construct feature vectors after information extraction and feature selection using Apache Mahout. For each document, a relational version of the raw data for an appropriate feature vector is generated. We applied the Naïve Bayes classification algorithm in Apache Mahout to generate the vector file and the trained model; the feature vectors are fed to the classifiers for training and testing within Mahout. However, Mahout is not able to predict class labels for new data. We therefore adopted a solution provided by Pangool (pangool.net), a low-level Java MapReduce API. This package provides a MapReduce Naïve Bayes classifier that can predict class labels for new data. After modification, the package can read from and write to Avro files in HDFS. The accuracy of our classifiers, evaluated with 5-fold cross-validation, was promising. [en]
dc.description.sponsorship: NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL) [en]
dc.identifier.uri: http://hdl.handle.net/10919/52253 [en]
dc.language.iso: en_US [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Classification [en]
dc.subject: Apache Hadoop [en]
dc.subject: Apache Mahout [en]
dc.subject: Naive Bayes [en]
dc.subject: Feature Selection [en]
dc.subject: Pangool [en]
dc.title: Classification Team Project for IDEAL in CS5604, Spring 2015 [en]
dc.type: Presentation [en]
dc.type: Software [en]
dc.type: Technical report [en]

Files

Original bundle
Name: code.rar
Size: 129.52 MB
Format: Unknown data format
Description: code
Name: ReportClassify.docx
Size: 9.25 MB
Format: Microsoft Word XML
Description: ReportClassdocx
Name: ReportClassify.pdf
Size: 3.15 MB
Format: Adobe Portable Document Format
Description: ReportClasspdf
Name: PresentClassif.pptx
Size: 479.66 KB
Format: Microsoft Powerpoint XML
Description: PresentClassppt
Name: PresentClassif.pdf
Size: 369.86 KB
Format: Adobe Portable Document Format
Description: PresentClasspdf
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: