Classification Project in CS5604, Spring 2016

Abstract

Within a large Information Retrieval project, our team was responsible for text classification of both tweet collections and their associated webpages. To accomplish this task, we pursued three primary goals. First, we researched the best way to extract information that can be used to represent a given document. Next, we determined the best method for selecting features and constructing feature vectors. Finally, we used that information to build an effective way to classify each document in the tweet and webpage collections. These classifiers were built with consideration of the ontology developed for the IDEAL project. To show how well our work meets these goals, we also provide an evaluation of our methodologies.

Last year’s classification team researched various methods and tools that could be useful for these goals, and developed a system that accomplished similar goals with a promising degree of success. Our goal this year was to improve on their results using new technologies such as Apache Spark. Spark gave us the tools needed to build a well-optimized system that works with the provided small collections of tweets and webpages quickly and efficiently. Spark is also highly scalable, and based on our results with the small collections we are confident in the performance of our system on larger collections.

This submission also includes our final presentation of the project, as given to the CS5604 class, professor, and GRAs. The presentation provides a high-level overview of the project requirements and our approach to them, as well as details about our implementation and evaluation. The submission also includes our source code, so that future classes can expand on the work we have done this semester.

Description
The Classification team submission contains the following:
1- Final technical report
2- Final presentation
3- Zip file containing the project source code and data from the team’s GitHub repository
4- Zip file containing the final report LaTeX project. To build the report from scratch, follow the instructions in the README file. Alternatively, import the zip file directly into www.overleaf.com, where it will be compiled online.
Keywords
Classification, Apache Spark, Logistic Regression, Feature Selection, Frequent Pattern Mining, FPM, Text Classification
Citation