Hadoop Project for IDEAL in CS5604

Cadena, Jose; Chen, Mengsu; Wen, Chengyuan

dc.contributor.author	Cadena, Jose	en
dc.contributor.author	Chen, Mengsu	en
dc.contributor.author	Wen, Chengyuan	en
dc.date.accessioned	2015-05-15T04:06:38Z	en
dc.date.available	2015-05-15T04:06:38Z	en
dc.date.issued	2015-05-11	en
dc.description.abstract	The Integrated Digital Event Archive and Library (IDEAL) system addresses the need for combining the best of digital library and archive technologies in support of stakeholders who are remembering and/or studying important events. It leverages and extends the capabilities of the Internet Archive to develop spontaneous event collections that can be permanently archived as well as searched and accessed. IDEAL connects the processing of tweets and web pages, combining informal and formal media to support building collections on chosen general or specific events. Integrated services include topic identification, categorization (building upon special ontologies being devised), clustering, and visualization of data, information, and context. The objective of the course is to build a state-of-the-art information retrieval system in support of the IDEAL project. Students were assigned to eight teams, each of which focused on a different part of the system to be built. These teams were Solr, Classification, Hadoop, Noise Reduction, LDA, Clustering, Social Networks, and NER. As the Hadoop team, we focus on making the information retrieval system scalable to large datasets by taking advantage of the distributed computing capabilities of the Apache Hadoop framework. We design and put in place a general schema for storing and updating data in our Hadoop cluster. Throughout the project, we coordinate with other teams to help them make use of readily available machine learning software for Hadoop, and we also provide support for using MapReduce. We found that the other teams were able to integrate their results easily into the design we developed and that uploading these results into a data store for communication with Solr can be done, in the best cases, in a few minutes. We conclude that Hadoop is an appropriate framework for the IDEAL project; however, we also recommend exploring the use of the Spark framework.	en
dc.description.sponsorship	NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)	en
dc.identifier.uri	http://hdl.handle.net/10919/52342	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/us/	en
dc.subject	IDEAL	en
dc.subject	Hadoop	en
dc.subject	Big Data	en
dc.subject	Information Retrieval	en
dc.title	Hadoop Project for IDEAL in CS5604	en
dc.type	Presentation	en
dc.type	Software	en
dc.type	Technical report	en