Solr Team Project Report

Gruss, Richard; Choudhury, Ananya; Komawar, Nikhil

dc.contributor.author	Gruss, Richard	en
dc.contributor.author	Choudhury, Ananya	en
dc.contributor.author	Komawar, Nikhil	en
dc.date.accessioned	2015-05-13T16:39:18Z	en
dc.date.available	2015-05-13T16:39:18Z	en
dc.date.issued	2015-05-13	en
dc.description	Solr Team Deliverables for Spring 2015 Information Retrieval.	en
dc.description.abstract	The Integrated Digital Event Archive and Library (IDEAL) is a Digital Library project that aims to collect, index, archive and provide access to digital content related to important events, including disasters, man-made or natural. It extracts event data mostly from social media sites such as Twitter and crawls related web pages. However, the volume of information currently on the web about any event is enormous and highly noisy, making it extremely difficult to find specific information. The objective of this course is to build a state-of-the-art information retrieval system in support of the IDEAL project. The class was divided into eight teams, each assigned a part of the project that, when successfully implemented, will enhance the IDEAL project’s functionality. The final product, which will be the culmination of these eight teams’ efforts, is a fast and efficient search engine for events occurring around the world. This report describes the work completed by the Solr team as a contribution towards searching and retrieving the tweets and web pages archived by IDEAL. If we visualize the class project as a tree structure, then Solr is the root of the tree, which builds on all the other teams’ efforts. Hence we actively interacted with all other teams to come up with a generic schema for the documents and their corresponding metadata to be indexed by Solr. Because Solr interacts with HDFS via HBase, where the data is stored, we also defined an HBase schema and configured the Lily Indexer to set up fast communication between HBase and Solr. We batch-indexed 8.5 million of the 84 million tweets before encountering memory limitations on the single-node Solr installation. Focusing our efforts therefore on building a search experience around the small collections, we created a 3.4-million-tweet collection and a 12,000-webpage collection.
Our custom search, which leverages the differential field weights in Solr’s edismax Query Parser and two custom Query Components, achieved precision levels in excess of 90%.	en
dc.description.sponsorship	NSF grant IIS - 1319578	en
dc.identifier.uri	http://hdl.handle.net/10919/52265	en
dc.language.iso	en	en
dc.rights	Creative Commons Attribution-ShareAlike 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-sa/3.0/us/	en
dc.subject	Solr	en
dc.subject	Information Retrieval	en
dc.subject	Hadoop	en
dc.subject	Cloudera	en
dc.subject	HBase	en
dc.title	Solr Team Project Report	en
dc.title.alternative	Solr Team Spring 2015 Report	en
dc.type	Presentation	en
dc.type	Software	en
dc.type	Technical report	en