Show simple item record

dc.contributor.author    Kumar, Abhinav
dc.contributor.author    Bangad, Anand
dc.contributor.author    Robertson, Jeff
dc.contributor.author    Garg, Mohit
dc.contributor.author    Ramesh, Shreyas
dc.contributor.author    Mi, Siyu
dc.contributor.author    Wang, Xinyue
dc.contributor.author    Wang, Yu
dc.date.accessioned    2018-01-16T01:54:27Z
dc.date.available    2018-01-16T01:54:27Z
dc.date.issued    2018-01-15
dc.identifier.uri    http://hdl.handle.net/10919/81794
dc.description.abstract    The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets and millions of webpages for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. We are using a 21-node Cloudera Hadoop cluster to store and retrieve this information. One goal of this project is to expand the data collection to include more web archives and geospatial data beyond what had previously been collected. Another important part of this project is optimizing the current system to analyze and allow access to the new data. To accomplish these goals, the project is separated into six parts with corresponding teams: Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Front-end (FE), and SOLR. This report describes the work completed by the SOLR team to improve the current search and storage system. We include the general architecture and an overview of the current system, and present in more detail the role that Solr plays within the whole system. We discuss our goals, procedures, and conclusions regarding the improvements we made to the current Solr system. This report also describes how we coordinated with other teams to accomplish the project at a higher level. Additionally, we provide manuals for future readers who might need to replicate our experiments. The main components within the Cloudera Hadoop cluster that the SOLR team interacts with include the Solr search engine, HBase database, Lily indexer, Hive database, HDFS file system, Solr recommendation plugin, and Mahout. Our work focuses on HBase design, data quality control, search recommendations, and result ranking. Overall, throughout the semester, we processed 12,564 web pages and 5.9 million tweets. To cooperate with GeoBlacklight, we made major changes to the Solr schema.
We also functioned as a data quality control gateway for the Front End team and delivered the finalized data to them. For search recommendation, we provide both the MoreLikeThis plugin within Solr, which recommends related records from search results, and a custom recommendation system that provides search recommendations based on user behavior. After fine-tuning over the final weeks of the semester, we successfully integrated the results and data provided by other teams and delivered them to the front end through a Solr core.    en_US
dc.description.sponsorship    NSF IIS-1619028    en_US
dc.language.iso    en_US    en_US
dc.publisher    Virginia Tech    en_US
dc.rights    Attribution 3.0 United States    *
dc.rights.uri    http://creativecommons.org/licenses/by/3.0/us/    *
dc.subject    Information Retrieval    en_US
dc.subject    Solr    en_US
dc.subject    CS5604    en_US
dc.subject    Indexing    en_US
dc.title    CS5604 Information Storage and Retrieval Fall 2017 Solr Report    en_US
dc.type    Presentation    en_US
dc.type    Report    en_US
dc.type    Software    en_US
dc.description.notes    Included are the following files: Final_presentation.pdf - final class presentation in PDF; Final_presentation.pptx - final class presentation in PowerPoint format; final-report-5604.pdf - final report in PDF; final-report-5604.zip - final report archive from LaTeX project in Overleaf; cs5604f17_solr_code.tgz - code developed for Solr including recommender.    en_US

