Text Data Mining Studio ETDs

Abstract

The goal of the Text Data Mining Studio ETD project was to develop a software system that lets less technically minded researchers easily access a large body of data for text analytics. Specifically, our team addressed the lack of a centralized tool for analyzing large collections of text files, which are valuable data for machine learning and other forms of analytics on long-form text documents.

Our team created a centralized tool using Elasticsearch, JupyterHub, and Jupyter Notebooks, in a Docker Compose architecture, to provide this service to researchers. We envision this tool being used by researchers whose work involves the ingestion and analysis of large bodies of text. This work could involve producing deep-learning-based systems for textual analysis and search, for automatic abstract generation, or for tracking trends across documents widely separated in time. The tool is intended to be a flexible foundation for these and other research tasks.

During the process of producing this tool, our team learned a great deal about Docker systems, compartmentalization, and JupyterHub, as well as about working in a widely distributed team using virtual substitutes for in-person meetings. Unfortunately, the tool is not yet complete, as the Elasticsearch systems are not linked to the JupyterHub frontend. We are hopeful that, given the information outlined in the report, completing the tool will be possible with future work.

Description

Keywords

ETD, OCR, Elasticsearch, Jupyter, JupyterHub, Jupyter Notebooks, Docker, Docker Compose

Citation