CS5604 Fall 2016 Solr Team Project Report

TR Number

Date

2016-12-07

Journal Title

Journal ISSN

Volume Title

Publisher

Virginia Tech

Abstract

This submission describes the work the SOLR team completed in Fall 2016. It includes the final report and presentation, as well as key relevant materials (indexing scripts & Java code). Based on the work in Spring 2016, the SOLR team improved the general search infrastructure supporting the IDEAL and GETAR projects, both funded by NSF. The main responsibility was to configure the Basic Indexing and Incremental Indexing (Near Real Time, NRT Indexing) for tweets and web page collections in DLRL's Hadoop Cluster. The goal of Basic Indexing was to index the big collection that contains more than 1.2 billion tweets. The idea of NRT Indexing was to monitor real-time changes in HBase and update the Solr results as appropriate. The main motivation behind the Custom Ranking was to design and implement a new scoring function to re-rank the retrieved results in Solr. Based on the text similarity, a basic document recommender was also created to retrieve the similar documents related to a specific one. Finally, new well written manuals could be easier for users and developers to read and get familiar with Solr and its relevant tools. Throughout the semester we closely collaborated with the Collection Management Tweets (CMT), Collection Management Webpages (CMW), Classification (CLA), Clustering and Topic Analysis (CTA), and Front-End (FE) teams in getting requirements, input data, and suggestions for data visualization.

Description

Keywords

Solr, Cloudera, Hadoop Cluster, IDEAL, GETAR, Custom Ranking, Incremental Indexing, Recommendation

Citation