Solr Team Project Report

dc.contributor.author: Gruss, Richard
dc.contributor.author: Choudhury, Ananya
dc.contributor.author: Komawar, Nikhil
dc.date.accessioned: 2015-05-13T16:39:18Z
dc.date.available: 2015-05-13T16:39:18Z
dc.date.issued: 2015-05-13
dc.description: Solr Team Deliverables for Spring 2015 Information Retrieval.
dc.description.abstract: The Integrated Digital Event Archive and Library (IDEAL) is a digital library project that aims to collect, index, archive, and provide access to digital content related to important events, including disasters, whether man-made or natural. It extracts event data mostly from social media sites such as Twitter and crawls related web pages. However, the volume of information currently on the web about any event is enormous and highly noisy, making it extremely difficult to retrieve specific information. The objective of this course is to build a state-of-the-art information retrieval system in support of the IDEAL project. The class was divided into eight teams, each assigned a part of the project that, when successfully implemented, will enhance the IDEAL project's functionality. The final product, the culmination of these eight teams' efforts, is a fast and efficient search engine for events occurring around the world. This report describes the work completed by the Solr team as a contribution toward searching and retrieving the tweets and web pages archived by IDEAL. If we visualize the class project as a tree structure, then Solr is the root of the tree, which builds on all the other teams' efforts. We therefore interacted actively with every other team to arrive at a generic schema for the documents and their corresponding metadata to be indexed by Solr. Because Solr reaches the data stored in HDFS through HBase, we also defined an HBase schema and configured the Lily Indexer to provide fast communication between HBase and Solr. We batch-indexed 8.5 million of the 84 million tweets before encountering memory limitations on the single-node Solr installation. We therefore focused on building a search experience around the small collections, creating a 3.4-million-tweet collection and a 12,000-webpage collection. Our custom search, which leverages differential field weights in Solr's edismax query parser and two custom Query Components, achieved precision levels in excess of 90%. (An illustrative edismax field-weighting sketch follows the metadata fields below.)
dc.description.sponsorship: NSF grant IIS-1319578
dc.identifier.uri: http://hdl.handle.net/10919/52265
dc.language.iso: en
dc.rights: Creative Commons Attribution-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/
dc.subject: Solr
dc.subject: Information Retrieval
dc.subject: Hadoop
dc.subject: Cloudera
dc.subject: HBase
dc.title: Solr Team Project Report
dc.title.alternative: Solr Team Spring 2015 Report
dc.type: Presentation
dc.type: Software
dc.type: Technical report
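
The abstract above mentions differential field weights in Solr's edismax query parser. As a minimal sketch of how such weights are expressed, the qf (query fields) parameter can be set as defaults on a search handler in solrconfig.xml; the handler name, field names, and boost values here are illustrative assumptions rather than the team's actual configuration (the real settings are in the solrconfig.xml deliverable listed below):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Parse keyword queries with the Extended DisMax (edismax) parser -->
    <str name="defType">edismax</str>
    <!-- Hypothetical per-field boosts: hashtags and user names weigh more than tweet text -->
    <str name="qf">text^1.0 hashtags^3.0 user_screen_name^2.0</str>
    <str name="rows">10</str>
  </lst>
</requestHandler>

With defaults like these, a query such as q=hurricane+sandy scores a match in the hashtags field roughly three times as heavily as the same match in the tweet text, which is one way differential weighting of this kind can raise precision.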

Files

Original bundle
Now showing 5 of 12 files

Name: Solr Team Final Presentation.pdf
Size: 1.24 MB
Format: Adobe Portable Document Format
Description: Final Presentation

Name: schema.xml
Size: 23.13 KB
Format: Plain Text
Description: Tweet Solr schema

Name: schema 2.xml
Size: 22.71 KB
Format: Plain Text
Description: Webpages Solr schema

Name: solrconfig.xml
Size: 74.7 KB
Format: Plain Text
Description: Solr configuration that customizes query processing

Name: tweet_morphlines.conf
Size: 10.3 KB
Format: Plain Text
Description: Morphline file for indexing tweets from HBase to Solr (see the format sketch after this listing)
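
The tweet_morphlines.conf file above drives the Lily HBase Indexer, which reads HBase cells and turns them into Solr documents. The following is a minimal sketch of that morphline format, assuming a hypothetical column family and qualifiers (clean_tweet:text, clean_tweet:created_at); the team's actual mappings are in the file itself:

morphlines : [
  {
    id : tweetMorphline
    # Commands are loaded from Kite Morphlines and the Lily HBase Indexer packages
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        # Copy selected HBase cells into Solr document fields (column names are assumptions)
        extractHBaseCells {
          mappings : [
            { inputColumn : "clean_tweet:text", outputField : "text", type : string, source : value }
            { inputColumn : "clean_tweet:created_at", outputField : "created_at", type : string, source : value }
          ]
        }
      }
      # Log each record so the HBase-to-Solr flow can be traced while debugging
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

An indexer definition registered with the Lily HBase Indexer service points at a morphline like this one, whether indexing runs in near real time or as a batch MapReduce job.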
License bundle
Now showing 1 of 1 files

Name: license.txt
Size: 1.5 KB
Description: Item-specific license agreed upon to submission