Elasticsearch (ELS) CS5604 Fall 2019

We are building an information retrieval system that will work as a search engine to support searching, ranking, browsing, and recommendations for two large collections of data. The first collection is drawn from the Virginia Tech Library's large collection of Electronic Theses and Dissertations (ETDs); an effort is currently under way to digitize the pre-1997 theses and dissertations and load them into VTechWorks. Our data set contains over 30K ETDs. The second collection consists of tobacco settlement documents and contains 14 million documents. We are using a Ceph container to store and retrieve information.

To achieve its goals, the project has six teams: Collection Management ETDs, Collection Management Tobacco Settlement Documents, Elasticsearch, Front-end and Kibana, Integration and Implementation, and Text Analytics and Machine Learning. This report addresses the work performed by the Elasticsearch team. The Elasticsearch team helps to enable searching and browsing, which are supported by facets associated with information extracted from documents, along with analysis, classification, clustering, summarization, and other processing. The report describes the team's goals, gives an overview, and covers the process of implementation with Elasticsearch. The Elasticsearch team works closely with the Front-end and Kibana team and the Text Analytics and Machine Learning team. The data ingested into Elasticsearch is provided to the Front-end and Kibana team for further visualization. Thus, the report also describes the connections established with the other teams, as a high-level overview of the course project. User manuals are provided for the reference of the other teams.

Keywords

Information Retrieval, Elasticsearch

Persistent link

http://hdl.handle.net/10919/96310

Collections

CS5604: Information Retrieval
