Solr Project with IDEAL, in CS5604 (Information Storage and Retrieval)

Abstract

This submission describes the work of the Solr team as part of the IDEAL project, with the main goal of designing and developing a distributed search infrastructure. It includes the project reports, the final presentations, and the solutions developed (configuration files and Java code). Our team's main responsibility was to configure Near Real Time (NRT) Indexing and to implement Custom Ranking for the tweet and web page collections. NRT Indexing performs incremental updates from an HBase table into the Solr index, thereby reducing the time and compute resources that indexing requires. Custom Ranking aims to improve the system's precision and recall by transforming user queries using metadata provided by the other teams. The implementation combines three techniques: Query Expansion, Pseudo Relevance Feedback, and Query Boosting. Throughout the semester we collaborated closely with several other teams, both to gather requirements and to obtain input data.

Description

This submission describes the work of the Solr team as part of the IDEAL project, with the main goal of designing and developing a distributed search infrastructure. It includes the project reports, the final presentations, and the solutions developed (configuration files and Java code).

Keywords

IDEAL, Solr, Lucene, Custom Ranking, Query Expansion, Near Real Time Indexing, Batch-Indexing, Morphline, Lily Indexer, Cloudera Search, Pseudo Relevance Feedback

Citation