CS5604 Information Storage and Retrieval Fall 2017 Solr Report

dc.contributor.author: Kumar, Abhinav [en]
dc.contributor.author: Bangad, Anand [en]
dc.contributor.author: Robertson, Jeff [en]
dc.contributor.author: Garg, Mohit [en]
dc.contributor.author: Ramesh, Shreyas [en]
dc.contributor.author: Mi, Siyu [en]
dc.contributor.author: Wang, Xinyue [en]
dc.contributor.author: Wang, Yu [en]
dc.date.accessioned: 2018-01-16T01:54:27Z [en]
dc.date.available: 2018-01-16T01:54:27Z [en]
dc.date.issued: 2018-01-15 [en]
dc.description.abstract: The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets and millions of webpages for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. We use a 21-node Cloudera Hadoop cluster to store and retrieve this information. One goal of this project is to expand the data collection to include more web archives and geospatial data than had previously been collected. Another is to optimize the current system so that it can analyze and provide access to the new data. To accomplish these goals, the project is divided into six parts with corresponding teams: Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Front-end (FE), and SOLR. This report describes the work completed by the SOLR team, which improves the current search and storage system. We present the general architecture, an overview of the current system, and a more detailed account of the role Solr plays within it. We discuss our goals, procedures, and conclusions regarding the improvements we made to the current Solr system. The report also describes how we coordinated with the other teams to accomplish the project's higher-level goals, and provides manuals for future readers who might need to replicate our experiments. The main components within the Cloudera Hadoop cluster that the SOLR team interacts with are the Solr search engine, the HBase database, the Lily indexer, the Hive database, the HDFS file system, the Solr recommendation plugin, and Mahout. Our work focuses on HBase design, data quality control, search recommendations, and result ranking. Over the semester, we processed 12,564 web pages and 5.9 million tweets. To interoperate with GeoBlacklight, we made major changes to the Solr schema. We also served as a data quality control gateway for the Front-end team and delivered the finalized data to them. For search recommendation, we provide the MoreLikeThis plugin within Solr, which recommends related records from search results, and a custom recommendation system based on user behavior, which provides user-based search recommendations. After fine-tuning during the final weeks of the semester, we successfully connected the results from data provided by the other teams and delivered them to the front end through a Solr core. [en]
dc.description.notes: Included are the following files: Final_presentation.pdf - final class presentation in PDF; Final_presentation.pptx - final class presentation in PowerPoint format; final-report-5604.pdf - final report in PDF; final-report-5604.zip - final report archive from LaTeX project in Overleaf; cs5604f17_solr_code.tgz - code developed for Solr including recommender. [en]
dc.description.sponsorship: NSF IIS-1619028 [en]
dc.identifier.uri: http://hdl.handle.net/10919/81794 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons Attribution 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/ [en]
dc.subject: Information Retrieval [en]
dc.subject: Solr [en]
dc.subject: CS5604 [en]
dc.subject: Indexing [en]
dc.title: CS5604 Information Storage and Retrieval Fall 2017 Solr Report [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]

Files

Original bundle
Name:
cs5604f17_solr_code.tgz
Size:
58.23 MB
Format:
Unknown data format
Name:
final-report-5604.zip
Size:
4.03 MB
Format:
Name:
final-report-5604.pdf
Size:
3.33 MB
Format:
Adobe Portable Document Format
Name:
Final_presentation.pptx
Size:
1.22 MB
Format:
Microsoft Powerpoint XML
Name:
Final_presentation.pdf
Size:
537.64 KB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission