AWS Document Retrieval

Abstract

In the course CS5604 Information Retrieval, the class built a functioning search engine/information retrieval system on the Computer Science Container Cluster. The objective of the original project was to create a system that allows users to request Electronic Theses and Dissertations (ETDs) and Tobacco Settlement Documents through queries over various fields. The objective of our project is to migrate this system onto Amazon Web Services (AWS) so that it can be stood up independently of Virginia Tech’s infrastructure; AWS was chosen for its robustness. The system needs to store the documents in an accessible way. This was accomplished by setting up a pipeline that streams data directly to the search engine using AWS S3 buckets, with each of the two document types placed in its own bucket. We set up an RDS instance for login verification. This database stores user information as users sign up through the front-end application and is referenced when the application validates a login attempt. The instance is publicly accessible and can be reached from developer environments outside the AWS group with the right endpoint and admin credentials. We worked with our client to set up an ElasticSearch instance to ingest the documents, as well as to communicate about and manage the health of the instance. The instance is accessible to all team members with permissions, and we can manually ingest data using cURL commands from the command line. Once the login verification database and the ElasticSearch search engine were properly implemented, we connected both components to the front-end application, where users can create accounts and search for desired documents. After both were connected and all features were working properly, we used Docker to containerize the front-end application. To migrate the front-end to AWS, we pushed the front-end container image to the Elastic Container Registry (ECR), and then used an ECS cluster running AWS Fargate, a serverless compute engine for containers, to deploy the front-end for all users to access. Additionally, we implemented data streaming with AWS Lambda so that new entries are automatically ingested into our ElasticSearch instance. We note that the system is not in a fully demonstrable state due to conflicts with the expected data fields; however, the infrastructure around the various components is established and only needs properly formatted data to read. Overall, our team learned many aspects of standing up and building the project’s infrastructure on AWS and gained experience with many different Amazon services. The new system serves as a functioning proof of concept and a feasible alternative to relying on Virginia Tech’s infrastructure.
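
To make the login verification flow concrete, the following is a minimal sketch of how the Flask front-end might validate a sign-in against the RDS (MySQL) user database. The table name, column names, and environment variables here are illustrative assumptions, not the project's actual schema or credentials.

    # Hypothetical sketch of login verification against the RDS instance.
    # Table "accounts", column "password_hash", and the environment variables
    # are placeholders; the real schema and settings may differ.
    import os

    import pymysql
    from flask import Flask, request, jsonify
    from werkzeug.security import check_password_hash

    app = Flask(__name__)

    def get_db():
        # Connect to the publicly accessible RDS endpoint with admin credentials.
        return pymysql.connect(
            host=os.environ["RDS_ENDPOINT"],
            user=os.environ["RDS_USER"],
            password=os.environ["RDS_PASSWORD"],
            database="users",
            cursorclass=pymysql.cursors.DictCursor,
        )

    @app.route("/login", methods=["POST"])
    def login():
        username = request.form["username"]
        password = request.form["password"]

        with get_db() as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT password_hash FROM accounts WHERE username = %s",
                (username,),
            )
            row = cur.fetchone()

        # Compare the stored hash against the submitted password.
        if row and check_password_hash(row["password_hash"], password):
            return jsonify({"status": "ok"})
        return jsonify({"status": "invalid credentials"}), 401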
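
The automatic ingestion step can likewise be illustrated with a small sketch of an S3-triggered Lambda handler that indexes each newly uploaded document into the ElasticSearch instance. The endpoint, index name, and document format are assumptions, and authentication details would depend on how the domain is actually secured.

    # Hypothetical sketch of the S3 -> Lambda -> ElasticSearch streaming step.
    # ES_ENDPOINT and ES_INDEX are placeholder environment variables; documents
    # are assumed to be uploaded to the bucket as JSON.
    import json
    import os
    import urllib.parse

    import boto3
    import requests  # bundled with the Lambda deployment package

    ES_ENDPOINT = os.environ["ES_ENDPOINT"]  # e.g. the domain's HTTPS endpoint
    ES_INDEX = os.environ.get("ES_INDEX", "etd")

    s3 = boto3.client("s3")

    def handler(event, context):
        """Triggered on s3:ObjectCreated; indexes each new document."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Fetch the newly uploaded document from the S3 bucket.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            doc = json.loads(body)

            # Index it; auth is omitted and depends on the domain's access policy.
            resp = requests.put(
                f"{ES_ENDPOINT}/{ES_INDEX}/_doc/{urllib.parse.quote_plus(key)}",
                json=doc,
                timeout=10,
            )
            resp.raise_for_status()

        return {"ingested": len(event["Records"])}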

Keywords

AWS, Document Retrieval, ElasticSearch, ETD, Tobacco Settlement Documents, Docker, MySQL, Login Verification, Common Storage, Flask Application, Kibana, UCSF Deposition Documents
