Utilizing Docker and Kafka for Highly Scalable Bulk Processing of Electronic Theses and Dissertations (ETDs)

Date

2023-05-09

Publisher

Virginia Tech

Abstract

This report discusses the use of Docker and Kafka for bulk processing of Electronic Thesis and Dissertation (ETD) data. Docker, a containerization platform, was used to build portable, platform-agnostic images that can be deployed anywhere. However, managing a large infrastructure of interconnected Docker containers can be complicated, so Kafka, an open-source, distributed message streaming platform, was incorporated into the pipeline to make each service independent and scalable. The report provides a comprehensive discussion of how the pipeline was developed to maximize resource utilization and create a highly scalable infrastructure using Docker and Kafka. Multiple Kafka brokers were deployed to ensure high availability and fault tolerance, with ZooKeeper tracking the status of the Kafka nodes. The infrastructure was deployed on the cloud using Rancher, which manages the deployments and services through Kubernetes.
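A multi-broker Kafka deployment coordinated by ZooKeeper, as described above, is commonly expressed as a container manifest. The following is a minimal Docker Compose sketch, not the report's actual manifests: image tags, service names, and ports follow common Confluent examples and are illustrative only.

```yaml
# Sketch of a fault-tolerant setup: one ZooKeeper node plus two Kafka
# brokers, so topic partitions can be replicated across brokers.
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka-1:
    image: confluentinc/cp-kafka:7.3.0
    depends_on: [zookeeper]
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-1:9092
      # Replicate the internal offsets topic across both brokers
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
  kafka-2:
    image: confluentinc/cp-kafka:7.3.0
    depends_on: [zookeeper]
    environment:
      KAFKA_BROKER_ID: 2
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-2:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
```

With two brokers, the loss of one node leaves a replica of each partition available, which is the fault-tolerance property the report relies on.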

The report also highlights the advantages of the current setup over previous workflow automation in terms of processing time and parallel processing of data. In the system design, a Kafka producer publishes the IDs of ETDs to be processed, and a segmentation container acts as a consumer, polling the Kafka broker. When an ETD ID is received, the container processes that ETD and stores the segmented chapters in a shared Ceph file space; the process continues until all of the ETDs are processed. This integration can benefit researchers who need large amounts of ETD data processed at a scale that was previously infeasible, enabling them to draw more robust, data-driven conclusions.
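The producer/consumer flow described above can be sketched in plain Python. This is an illustration of the design only, not the report's code: an in-process queue stands in for the Kafka topic, a dict stands in for the Ceph file space, and the segmentation step and all names (`etd_topic`, `ceph_store`, the chapter labels) are hypothetical.

```python
import queue
import threading

# Stand-ins for the real infrastructure: in the pipeline these roles are
# played by a Kafka topic and a shared Ceph file space.
etd_topic = queue.Queue()   # "Kafka topic" carrying ETD IDs to process
ceph_store = {}             # "Ceph file space" holding segmented chapters
SENTINEL = None             # signals the consumer that no IDs remain

def producer(etd_ids):
    """Kafka producer role: publish each ETD ID to be processed."""
    for etd_id in etd_ids:
        etd_topic.put(etd_id)
    etd_topic.put(SENTINEL)

def segmentation_consumer():
    """Segmentation-container role: poll for IDs, segment, store chapters."""
    while True:
        etd_id = etd_topic.get()   # poll for the next ETD ID
        if etd_id is SENTINEL:
            break
        # Hypothetical segmentation step: split the ETD into chapters.
        chapters = [f"{etd_id}-chapter-{n}" for n in (1, 2)]
        ceph_store[etd_id] = chapters  # write to the shared store

worker = threading.Thread(target=segmentation_consumer)
worker.start()
producer(["etd-101", "etd-102"])
worker.join()
print(sorted(ceph_store))  # → ['etd-101', 'etd-102']
```

Because the consumer only polls the queue, additional segmentation workers could be started against the same topic without changing the producer, which is the scalability property the Kafka-based design provides.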

Keywords

Kafka, Docker, ETD, Kubernetes
