Show simple item record

dc.contributor.author: Bock, Matthew
dc.contributor.author: Cantrell, Michael
dc.contributor.author: Shahin, Hossameldin
dc.date.accessioned: 2016-05-07T15:35:33Z
dc.date.available: 2016-05-07T15:35:33Z
dc.date.issued: 2016-05-04
dc.identifier.uri: http://hdl.handle.net/10919/70929
dc.description: The Classification team submission contains the following: (1) the final technical report; (2) the final presentation; (3) a zip file containing the project source code and data from the team’s GitHub repository; (4) a zip file containing the final report's LaTeX project. To build the report from scratch, follow the instructions in the README file, or import the zip file directly into www.overleaf.com, where it will be compiled online. [en_US]
dc.description.abstract: As part of a large Information Retrieval project, our team performed text classification on both tweet collections and their associated webpages. To accomplish this task, we pursued three primary goals. We began by researching the best way to extract information that can be used to represent a given document. Following that, we worked to determine the best method to select features and then construct feature vectors. Our final goal was to use the information gathered previously to build an effective way to classify each document in the tweet and webpage collections. These classifiers were built with consideration of the ontology developed for the IDEAL project. To show how well our work met these goals, we also provide an evaluation of our methodologies. Last year’s classification team researched various methods and tools relevant to these goals and developed a system that accomplished similar goals with a promising degree of success. Our goal for this year was to improve upon their successes using new technologies such as Apache Spark. Spark has provided us with the tools needed to build a well-optimized system capable of working with the provided small collections of tweets and webpages in a fast and efficient manner. Spark is also very scalable, and based on our results with the small collections we have confidence in the performance of our system on larger collections. Also included in this submission is our final presentation of the project as presented to the CS5604 class, professor, and GRAs. The presentation provides a high-level overview of the project requirements and our approach to them, as well as details about our implementation and evaluation.
The submission also includes our source code, so that future classes can expand on the work we have done this semester. [en_US]
dc.description.sponsorship: NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL) [en_US]
dc.language.iso: en_US [en_US]
dc.rights: CC0 1.0 Universal [*]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [*]
dc.subject: Classification [en_US]
dc.subject: Apache Spark [en_US]
dc.subject: Logistic Regression [en_US]
dc.subject: Feature Selection [en_US]
dc.subject: Frequent Pattern Mining [en_US]
dc.subject: FPM [en_US]
dc.subject: Text Classification [en_US]
dc.title: Classification Project in CS5604, Spring 2016 [en_US]
dc.type: Presentation [en_US]
dc.type: Software [en_US]
dc.type: Technical report [en_US]




License: CC0 1.0 Universal