Show simple item record

dc.contributor.author: Gruss, Richard
dc.contributor.author: Choudhury, Ananya
dc.contributor.author: Komawar, Nikhil
dc.date.accessioned: 2015-05-13T16:39:18Z
dc.date.available: 2015-05-13T16:39:18Z
dc.date.issued: 2015-05-13
dc.identifier.uri: http://hdl.handle.net/10919/52265
dc.description: Solr Team Deliverables for Spring 2015 Information Retrieval.
dc.description.abstract: The Integrated Digital Event Archive and Library (IDEAL) is a digital library project that aims to collect, index, archive, and provide access to digital content related to important events, including man-made and natural disasters. It extracts event data mostly from social media sites such as Twitter and crawls related web pages. However, the volume of information on the web about any event is enormous and highly noisy, making it extremely difficult to retrieve specific information. The objective of this course was to build a state-of-the-art information retrieval system in support of the IDEAL project. The class was divided into eight teams, each assigned a part of the project that, when successfully implemented, would enhance the IDEAL project's functionality. The final product, the culmination of these eight teams' efforts, is a fast and efficient search engine for events occurring around the world. This report describes the work completed by the Solr team as a contribution toward searching and retrieving the tweets and web pages archived by IDEAL. If we visualize the class project as a tree structure, then Solr is the root of the tree, building on all the other teams' efforts. Hence we actively interacted with every other team to arrive at a generic schema for the documents and their corresponding metadata to be indexed by Solr. Because Solr interacts with HDFS via HBase, where the data is stored, we also defined an HBase schema and configured the Lily Indexer to set up fast communication between HBase and Solr. We batch-indexed 8.5 million of the 84 million tweets before encountering memory limitations on the single-node Solr installation. Focusing our efforts thereafter on building a search experience around the small collections, we created a 3.4-million-tweet collection and a 12,000-webpage collection. Our custom search, which leverages the differential field weights in Solr's edismax query parser and two custom query components, achieved precision levels in excess of 90%.
dc.description.sponsorship: NSF grant IIS-1319578
dc.language.iso: en
dc.rights: Attribution-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/
dc.subject: Solr
dc.subject: Information Retrieval
dc.subject: Hadoop
dc.subject: Cloudera
dc.subject: HBase
dc.title: Solr Team Project Report
dc.title.alternative: Solr Team Spring 2015 Report
dc.type: Presentation
dc.type: Software
dc.type: Technical report
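The abstract above credits the team's precision results to differential field weights in Solr's edismax query parser. As a minimal sketch of what such a weighted query looks like, the snippet below builds an edismax parameter string; the field names and boost values are hypothetical, since the report's actual IDEAL/Solr schema is not reproduced in this record.

```python
from urllib.parse import urlencode

# Hypothetical fields and boosts, for illustration only; the real schema
# and weights used by the Solr team are described in the report itself.
params = {
    "defType": "edismax",                      # select the Extended DisMax query parser
    "q": "hurricane evacuation",               # free-text user query
    "qf": "text^1.0 hashtags^3.0 title^2.0",   # per-field boost weights (field^boost)
    "rows": 10,                                # number of results to return
}

# This string would be appended to a Solr select URL, e.g. /solr/tweets/select?...
query_string = urlencode(params)
print(query_string)
```

With `qf` set this way, a term match in `hashtags` contributes three times the score of the same match in `text`, which is how differential field weighting steers ranking without changing the index.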


Files in this item


This item appears in the following Collection(s)


License: Attribution-ShareAlike 3.0 United States