Elasticsearch (ELS) CS5604 Fall 2019

Date

2019-12-12

Publisher

Virginia Tech

Abstract

We are building an information retrieval system that works as a search engine to support searching, ranking, browsing, and recommendations over two large collections of data. The first is drawn from the Virginia Tech Library's large collection of Electronic Theses and Dissertations (ETDs); an effort is currently under way to digitize the pre-1997 theses and dissertations and load them into VTechWorks. Our data set contains over 30K ETDs. The second collection consists of tobacco settlement documents, with 14 million documents in the data set. We use a Ceph container to store and retrieve the data.

To achieve its goals, the project has six teams: Collection Management ETDs, Collection Management Tobacco Settlement Documents, Elasticsearch, Front-end and Kibana, Integration and Implementation, and Text Analytics and Machine Learning. This report addresses the work performed by the Elasticsearch team. The Elasticsearch team enables searching and browsing, which are supported by facets derived from information extracted from the documents through analysis, classification, clustering, summarization, and other processing. The report describes the team's goals, gives an overview of its work, and explains the process of implementation with Elasticsearch. The Elasticsearch team works closely with the Front-end and Kibana team and the Text Analytics and Machine Learning team, and the data ingested into Elasticsearch is provided to the Front-end and Kibana team for visualization. The report therefore also describes the connections established with the other teams, as a high-level overview of the course project. User manuals are provided for the other teams' reference.

Keywords

Information Retrieval, Elasticsearch
