Utilizing Docker and Kafka for Highly Scalable Bulk Processing of Electronic Theses and Dissertations (ETDs)

dc.contributor.author: Dinesh, Dhanush [en]
dc.date.accessioned: 2023-07-15T22:27:49Z [en]
dc.date.available: 2023-07-15T22:27:49Z [en]
dc.date.issued: 2023-05-09 [en]
dc.description.abstract: This report discusses the utilization of Docker and Kafka for the bulk processing of Electronic Thesis and Dissertation (ETD) data. Docker, a containerization platform, was used to create portable Docker images that can be deployed on any platform, making them platform-agnostic. However, managing a large infrastructure with interconnected Docker containers can be complicated. To address this, Kafka, an open-source, distributed message streaming platform, was incorporated into the pipeline to make each service independent and scalable. The report provides a comprehensive discussion on how a pipeline was developed to maximize resource utilization and create a highly scalable infrastructure through the use of Docker and Kafka. Multiple Kafka brokers were deployed to ensure high availability and fault tolerance, and Zookeeper was used to track the status of Kafka nodes. Rancher was used to deploy the infrastructure on the cloud, which employs Kubernetes to manage the deployment and services. The report also highlights the advantages of the current setup over previous workflow automation in terms of processing time and parallel processing of data. The system design includes a Kafka producer that produces ETD IDs to be processed, and a segmentation container that acts as a consumer and polls the Kafka broker. Once the ETD IDs are received, the container starts processing, and the segmented chapters are stored in a shared Ceph file space. The process continues until all of the ETDs are processed. This integration has the potential to benefit researchers who require large amounts of ETD data processed at a scale that was previously unfeasible, enabling them to make more robust and data-driven conclusions. [en]
dc.description.notes: Final_Project_and_report.pdf - PDF version of the report; Final_Project_and_report.docx - Word version of the report; Final_Project_and_report_Dhanush-slides.pptx - PowerPoint version of the presentation; Final_Project_and_report_Dhanush-slides.pdf - PDF version of the presentation [en]
dc.identifier.uri: http://hdl.handle.net/10919/115782 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: CC0 1.0 Universal [en]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [en]
dc.subject: Kafka [en]
dc.subject: Docker [en]
dc.subject: ETD [en]
dc.subject: kubernetes [en]
dc.title: Utilizing Docker and Kafka for Highly Scalable Bulk Processing of Electronic Theses and Dissertations (ETDs) [en]
dc.type: Master's project [en]
dc.type: Presentation [en]
dc.type: Report [en]
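
Illustrative sketch (not taken from the report): the abstract describes a producer that publishes ETD IDs and segmentation containers that poll for them and write segmented chapters to shared storage. The Python below imitates that flow without a broker, as a rough aid to reading the abstract. It assumes that a standard-library `queue.Queue` can stand in for a Kafka topic, that threads can stand in for segmentation containers, and that a dictionary can stand in for the shared Ceph file space. All function names and the three-chapter placeholder output are hypothetical.

```python
import queue
import threading


def produce_etd_ids(topic, etd_ids):
    """Producer: publish each ETD ID to the topic (a Queue stands in for Kafka)."""
    for etd_id in etd_ids:
        topic.put(etd_id)


def segment_etd(etd_id):
    """Hypothetical placeholder for chapter segmentation of a single ETD."""
    return [f"{etd_id}/chapter_{n}" for n in (1, 2, 3)]


def segmentation_worker(topic, shared_store, lock):
    """Consumer: poll the topic and store segmented chapters until it is empty."""
    while True:
        try:
            etd_id = topic.get(timeout=0.1)  # poll with a short timeout
        except queue.Empty:
            return  # nothing left to process
        chapters = segment_etd(etd_id)
        with lock:  # the dict stands in for the shared file space
            shared_store[etd_id] = chapters


def run_pipeline(etd_ids, num_workers=3):
    """Publish all IDs, then let several independent workers drain the topic."""
    topic = queue.Queue()
    shared_store = {}
    lock = threading.Lock()
    produce_etd_ids(topic, etd_ids)
    workers = [
        threading.Thread(target=segmentation_worker, args=(topic, shared_store, lock))
        for _ in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return shared_store
```

Raising `num_workers` adds consumers without changing the producer, which is the decoupling the abstract attributes to Kafka. In a real deployment the queue would be replaced by a Kafka topic and the threads by separately deployed containers.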

Files

Original bundle

Name: Final_Project_and_report_Dhanush-slides.pptx
Size: 3.84 MB
Format: Microsoft Powerpoint XML

Name: Final_Project_and_report_Dhanush-slides.pdf
Size: 526.47 KB
Format: Adobe Portable Document Format

Name: Final_Project_and_report.docx
Size: 3.13 MB
Format: Microsoft Word XML

Name: Final_Project_and_report.pdf
Size: 4.23 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission