Utilizing Docker and Kafka for Highly Scalable Bulk Processing of Electronic Theses and Dissertations (ETDs)

Dinesh, Dhanush

dc.contributor.author	Dinesh, Dhanush	en
dc.date.accessioned	2023-07-15T22:27:49Z	en
dc.date.available	2023-07-15T22:27:49Z	en
dc.date.issued	2023-05-09	en
dc.description.abstract	This report discusses the use of Docker and Kafka for the bulk processing of Electronic Thesis and Dissertation (ETD) data. Docker, a containerization platform, was used to build portable images that can be deployed on any platform. However, managing a large infrastructure of interconnected Docker containers can be complicated. To address this, Kafka, an open-source, distributed message streaming platform, was incorporated into the pipeline to make each service independent and scalable. The report describes how the pipeline was developed to maximize resource utilization and create a highly scalable infrastructure using Docker and Kafka. Multiple Kafka brokers were deployed to ensure high availability and fault tolerance, and ZooKeeper was used to track the status of Kafka nodes. The infrastructure was deployed on the cloud with Rancher, which uses Kubernetes to manage the deployment and services. The report also highlights the advantages of the current setup over the previous workflow automation in terms of processing time and parallel processing of data. The system design includes a Kafka producer that publishes the IDs of the ETDs to be processed and a segmentation container that acts as a consumer and polls the Kafka broker. Once the container receives ETD IDs, it starts processing them and stores the segmented chapters in a shared Ceph file space. The process continues until all of the ETDs are processed. This integration has the potential to benefit researchers who need large amounts of ETD data processed at a scale that was previously infeasible, enabling them to draw more robust, data-driven conclusions.	en
dc.description.notes	Final_Project_and_report.pdf - PDF version of the report; Final_Project_and_report.docx - Word version of the report; Final_Project_and_report_Dhanush-slides.pptx - PowerPoint version of the presentation; Final_Project_and_report_Dhanush-slides.pdf - PDF version of the presentation	en
dc.identifier.uri	http://hdl.handle.net/10919/115782	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	CC0 1.0 Universal	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	Kafka	en
dc.subject	Docker	en
dc.subject	ETD	en
dc.subject	Kubernetes	en
dc.title	Utilizing Docker and Kafka for Highly Scalable Bulk Processing of Electronic Theses and Dissertations (ETDs)	en
dc.type	Master's project	en
dc.type	Presentation	en
dc.type	Report	en
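
The producer/consumer design summarized in the abstract can be illustrated with a minimal, in-memory sketch. This is not the report's implementation: a standard-library queue.Queue stands in for the Kafka topic, a dictionary stands in for the shared Ceph file space, and every name here (produce_etd_ids, segment_etd, consume, run_pipeline, the SENTINEL marker, and the "etd-N" IDs) is a hypothetical placeholder chosen for illustration.

```python
import queue
import threading

# Hypothetical end-of-stream marker. A real Kafka consumer would keep polling
# the broker instead of stopping on a sentinel.
SENTINEL = None


def produce_etd_ids(topic, etd_ids, num_consumers):
    """Publish each ETD ID, then one stop marker per consumer."""
    for etd_id in etd_ids:
        topic.put(etd_id)
    for _ in range(num_consumers):
        topic.put(SENTINEL)


def segment_etd(etd_id):
    """Placeholder for chapter segmentation; returns a made-up output location."""
    return f"{etd_id}/chapters"


def consume(topic, shared_store, lock):
    """Take IDs from the queue until the stop marker arrives, storing each result."""
    while True:
        etd_id = topic.get()
        if etd_id is SENTINEL:
            break
        result = segment_etd(etd_id)
        with lock:  # the dict stands in for the shared file space
            shared_store[etd_id] = result


def run_pipeline(etd_ids, num_consumers=3):
    """Run one producer and several independent consumers to completion."""
    topic = queue.Queue()  # stands in for a Kafka topic
    shared_store = {}
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consume, args=(topic, shared_store, lock))
        for _ in range(num_consumers)
    ]
    for worker in workers:
        worker.start()
    produce_etd_ids(topic, etd_ids, num_consumers)
    for worker in workers:
        worker.join()
    return shared_store


if __name__ == "__main__":
    processed = run_pipeline(["etd-1", "etd-2", "etd-3", "etd-4"], num_consumers=2)
    print(sorted(processed))
```

Because each consumer only reads from the queue and writes to the shared store, the number of consumers can be changed without touching the producer, which is the kind of independence between services that the abstract attributes to placing Kafka between them.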