CS5604: Information Storage and Retrieval
Collection Management of Electronic Theses and Dissertations

Authors: Kulendra Kumar Kaushal, Rutwik Kulkarni, Aarohi Sumant, Chaoran Wang, Chenhan Yuan, Liling Yuan
Instructor: Dr. Edward A. Fox
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
December 24, 2019

CS5604: Information Storage and Retrieval, Team CME
This research was done under the supervision of Dr. Edward A. Fox as part of the course CS5604.

4th edition, December 7, 2019
3rd edition, October 31, 2019
2nd edition, October 10, 2019
1st edition, September 19, 2019

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Overview
  1.2 VTechWorks ETD Dataset
  1.3 Problem Definition
2 Literature Review
  2.1 PDF Processing
    2.1.1 Overview
    2.1.2 Evaluation of Open-Source Bibliographic Reference and Citation Parsers
    2.1.3 Big Data Text Summarization
    2.1.4 GROBID
    2.1.5 Science Parse
    2.1.6 Apache Tika
    2.1.7 PDFMiner
    2.1.8 PyPDF2
3 Requirements
  3.1 Extract Metadata and Text for ETD Corpus
  3.2 Preprocess the ETD corpus
  3.3 User Support
4 Approach, Design, Implementation
  4.1 Experiment Design
  4.2 Implementation
    4.2.1 Chapter Level Text Extraction
    4.2.2 TF-IDF Calculation
    4.2.3 Transforming Metadata for Ingestion in Elasticsearch
    4.2.4 Development of an Automated System
    4.2.5 List of Visualizations to be Provided in the Front End
    4.2.6 Text Preprocessing
5 Evaluation
  5.1 Manual Testing
    5.1.1 Testing of Chapter Level Text Extraction
    5.1.2 Testing of Extracted Text Preprocessing
    5.1.3 Metadata Extraction Testing
    5.1.4 Automated Testing
6 User Manual
  6.1 Where to Get Data
    6.1.1 VTechWorks ETD collection
    6.1.2 GitLab Repository
    6.1.3 Metadata Extraction and Ingestion in Ceph
7 Developer's Manual
  7.1 Timeline
  7.2 Slack
  7.3 GROBID
    7.3.1 Install in Ubuntu
  7.4 PDFMiner
  7.5 TF-IDF
8 Challenges and Limitations
9 Future Scope
  9.1 Improving Chapter Level Text Extraction
  9.2 Batch Processing of the Documents
  9.3 Improving Automation Suite
10 Acknowledgements
Bibliography

Abstract

The class "CS 5604: Information Storage and Retrieval" in the fall of 2019 was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from Virginia Tech University Libraries' VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing a custom front end, Kibana, integration, implementation, text analytics, and machine learning. The results of our work will help future researchers study the natural-language-processed data using deep learning technologies, address the challenges of extracting information from ETDs, and more.

The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used advanced deep learning and other tools like GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; doing document segmentation; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. We finally developed a system that automates all the above-mentioned tasks. The system also validates the output metadata, thereby ensuring the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch.

List of Figures

  1.1 Position in entire system
  2.1 The architecture of PDFMiner
  4.1 Folder structure of an ETD after chapter level text extraction
  4.2 Sample ETD Introduction chapter
  4.3 Parsed text of the same document (highlighted text indicates end of page shown in Figure 4.2)
  4.4 Part of TF-IDF of one document
  4.5 Part of BOW of one document
  4.6 Part of doc-index dictionary
  4.7 Flow diagram of the automated system
  4.8 Folder structure of an ETD
  4.9 GROBID unit test
  5.1 Chapter level text extraction by XPath vs. manual extraction by Diff Checker
  5.2 Original text generated by PDFMiner.six
  5.3 Processed text
  6.1 GitLab file structure
  6.2 GROBID Container
  6.3 Python client to access GROBID
  7.1 Timeline
  7.2 Slack
  7.3 Files in the Gradle folder
  7.4 Files in the GROBID folder

List of Tables

  2.1 Human assessment of GROBID and Science Parse outputs
  5.1 Chapter level text extraction by XPath and manual extraction
  5.2 Differences between chapter level text extraction by XPath and manual extraction
  5.3 Different test case scenarios

Chapter 1 Introduction

1.1 Overview

As a leading global research university, Virginia Tech became, on January 1, 1997, the first university to require graduate students to submit electronic theses and dissertations (ETDs) [21]. As of 2019, the local ETD dataset covers over 33,000 doctoral dissertations and master's theses. ETDs are valuable information sources, but due to their lack of discoverability, they are still underutilized. Hence, retrieving ETDs is important for researchers and universities.

Retrieving specific information from academic materials has many important applications, such as citation analysis [10]. It could also aid those working to prepare award-winning theses [9]. One of the most important problems in ETD information retrieval is how to extract text and metadata properly from PDF files. In this report, we address that problem, and also tackle problems related to the identification and extraction of sections and chapters. We hope our work will help future researchers discover and reuse potentially useful resources from the ETDs.

The position of our team in the whole system is shown in Fig. 1.1. Many different PDF parsers [3, 5, 17] have been implemented to convert PDF files to a structured format, e.g., XML or JSON. To extract metadata and elements (such as affiliation, tables, and images) from ETDs successfully, we also propose a new approach to avoid errors during conversion. Moreover, the issue of automatic segmentation to identify sections and chapters is also addressed in this project.

Figure 1.1: Position in entire system

1.2 VTechWorks ETD Dataset

The ETD corpus is downloaded from the Virginia Tech institutional repository, VTechWorks, and consists of over 33,000 documents: 14,055 doctoral dissertations, 19,246 master's theses, and some award-winning and undergraduate theses. The repository is maintained by the university library, and includes ETDs from all disciplines and all departments of Virginia Tech. For each ETD, there is one PDF document which is generally the main part, a metadata record, and some supporting documents. For older ETDs, the PDF files resulted from scanned paper documents. In such cases, full-text files were extracted using optical character recognition.
1.3 Problem Definition

This project works on managing ETDs by answering the following research questions.

RQ1: Can we extract metadata from an ETD document and transform it into a format that can be ingested into Elasticsearch? Elasticsearch is a search server based on the Lucene library. Lucene is an open-source search engine software library. Elasticsearch provides a distributed, multi-tenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents [7]. Generating our output in a format that Elasticsearch can ingest should extend the applicability of our work.

RQ2: Can we extract text files from PDF files with content suitable for subsequent indexing and searching? A suitable structure, properly populated with text that is used in the subsequent indexing, would help future researchers discover and retrieve the specific information they need.

RQ3: Can we expand the extracted data by including a file for each chapter? Sometimes researchers are interested only in specific sections. Providing chapter-level files can increase search specificity and save time for users.

RQ4: Can we develop an automated system that can extract the metadata from new documents, process it, and ingest it into Elasticsearch? New ETDs need to be added to our system as and when they are added to VTechWorks. So, in order to make our system more robust and up to date, an automated system to process and add the new ETDs to our system is necessary.

Chapter 2 Literature Review

2.1 PDF Processing

2.1.1 Overview

All of our electronic theses and dissertations are available as PDF files. It is difficult to extract the key data from such a file. Additionally, the formatting of different sections, as well as of the bibliography, changes from document to document. Thus, parsing a PDF file becomes a big challenge.

Preprocessing and extraction of metadata from the ETDs are important steps in related works that have been carried out in this domain. The rest of this chapter includes descriptions of some of the work done by researchers related to the extraction of metadata, text parsing, and providing support for big data text summarization. We include descriptions of popular tools and parsers, and highlight the comparison between them on different parameters, as discussed in various works.

2.1.2 Evaluation of Open-Source Bibliographic Reference and Citation Parsers

The growth in the volume of available scientific literature has resulted in a scientific information overload problem, which refers to the end user being overwhelmed by the abundance of information. To leverage the information available in that literature, there is a need for intelligent information retrieval systems to provide the desired information in an organised manner.

One such type of information is machine-readable rich bibliographic metadata. As a consequence, there is demand for tools which can parse scientific documents and extract the bibliographic content. Researchers have devised interesting solutions based on regular expressions, template matching, knowledge bases, and supervised machine learning. Software tools have been proposed, such as Biblio (regular expression based), Bibpro (template matching based), Citation Parser (knowledge based or rule based), and GROBID (machine learning, or ML, based) [20].
The quality of ML-based tools, measured using precision, is similar to that of tools employing rules, regular expressions, or template matching (0.77 for ML-based tools vs. 0.76 for non-ML-based tools). However, ML-based tools are popular and often preferred because they also achieve higher recall (0.66 vs. 0.22) [20]. Only a few tools, such as GROBID (F1 = 0.89), Cermine (F1 = 0.83), and ParsCit (F1 = 0.75), have performed reasonably well. Retraining with task-specific data definitely increases the performance of almost all of the tools: the F1 measure of GROBID increased by 3% (0.89 to 0.92), Cermine achieved an F1 increase of 11% (0.83 to 0.92), and ParsCit had an F1 increase of 16% (0.75 to 0.87) [20].

2.1.3 Big Data Text Summarization

For summarizing Electronic Theses and Dissertations (ETDs), three Fall 2018 student teams in Virginia Tech CS4984/5984 (Big Data Text Summarization) [14, 6, 8] used Science Parse and GROBID to extract information from PDFs. Both GROBID and Science Parse have their respective pros and cons. Table 2.1 summarizes how GROBID outperforms Science Parse in many situations [21].

Table 2.1: Human assessment of GROBID and Science Parse outputs

Output file format.  GROBID: XML.  Science Parse: JSON.
Table of contents.  GROBID: adds the table of contents and list of figures at the end.  Science Parse: maintains the order of the table of contents and list of figures.
Abstract.  GROBID: occasionally misses the abstract.  Science Parse: often detects the abstract correctly.
Chapters.  GROBID: occasionally skips chapters, especially for ETDs of disciplines such as Architecture where a large number of images are present along with the text.  Science Parse: often skips chapters and merges some chapters together.
Figures.  GROBID: adds a figure tag to indicate the existence of a figure.  Science Parse: does not indicate the existence of a figure; often appends the figure title as part of the text.
Tables.  GROBID: adds a table tag to indicate the existence of a table.  Science Parse: does not indicate the existence of a table.
References.  GROBID: parses the reference string into title, author, venue, and year; does not further split these values; skips some references while extracting.  Science Parse: parses the reference string into author (first and last name), publication, volume, issue, and published date.

2.1.4 GROBID

GROBID (GeneRation Of BIbliographic Data) is a parser used to extract metadata from a PDF document into XML format. GROBID takes the PDF of each scholarly document as input and makes use of machine learning models (a cascade of linear-chain CRFs) for extracting the metadata from the document in XML format. It uses the lexical (POS), layout (font, font size), and position information (start/end) of a line in a document in order to train the models and obtain the metadata in the required format. It does not provide an explicit chapter tag. Therefore, chapter-level text and metadata extraction from the ETD documents is a challenging task using GROBID [3, 13].

2.1.5 Science Parse

Science Parse parses scientific documents from PDF into a structured JSON format. It is written in a combination of Java and Scala and can be used as a library in any JVM-based language. Science Parse can be used in three different ways:

• Server: It functions as a wrapper and makes Science Parse available as a web service. It uses heap memory (about 2 GB).
• CLI: Science Parse has a command line interface known as RunSP. It uses heap memory (about 6 GB). RunSP can also be used to parse multiple files at a time.
• Core: It provides flexibility in Science Parse but is also quite complex to use as a library.
Four model files – a general CRF model for extracting the title and authors, and one model each for bibliographies, the gazetteer, and word vectors – are available in this service. Science Parse is difficult to set up and sometimes skips or merges some of the content [19][5].

2.1.6 Apache Tika

Apache Tika is a file extraction framework written in Java. The big advantage of Tika is that "it can extract over thousands of different types of files to metadata and text" [2]. In addition, another powerful capability of this library is that it can extract image metadata from Portable Document Format (PDF) files. However, it is harder to get the images themselves than to get their metadata. At the same time, since Apache Tika is written in Java, it is complicated to set up for users working in other programming languages. Another disadvantage is that Tika can only extract a PDF to plain text, which makes chapter-wise extraction difficult.

2.1.7 PDFMiner

PDFMiner.six (or PDFMiner) is a Python-compatible parser that can convert PDF files into text, HTML, or XML. The architecture of PDFMiner is shown in Figure 2.1. As a rule-based parser, PDFMiner runs efficiently; tested with an ETD document, it converts a PDF to text or other formats in around 18 seconds. Moreover, it supports various font types and CJK language extraction [17]. Practically, it can extract specific pages and tables (output without structure) from a PDF file. However, because PDFMiner is designed to extract text data, its ability to process images and tables in PDF files is still unstable according to its documentation.

Figure 2.1: The architecture of PDFMiner

2.1.8 PyPDF2

PyPDF2 is a Python-based tool for extraction of metadata and text from a PDF file. It also allows splitting, merging, and extraction of data from the file. Predominantly it is used for the extraction of text from a PDF file. It works on StringIO objects as opposed to file streams and so allows for PDF manipulation in memory [4].

Chapter 3 Requirements

In this project, the CME team is responsible for extracting metadata and text from the ETD documents. By the end of this project, we intend to complete the tasks listed below.

• Convert ETD documents from PDF to text format to enable full-text search.
• Extract metadata for each ETD document.
• Extract chapter-level text from ETDs.
• Preprocess the ETD corpus, i.e., tokenize, lemmatize, and remove stopwords.
• Develop a pipeline to enable ingestion of new ETDs into Elasticsearch.

3.1 Extract Metadata and Text for ETD Corpus

Metadata containing fields such as author name, date of publication, author email, and contributor department has been extracted and put into ceph (mnt/ceph/cme). It contains both the data of a small subset of the ETD dataset (i.e., the 2017 ETDs), which includes 691 PDF documents, and the large dataset (all 30K ETDs). Each folder contains PDF as well as text files of the theses/dissertations.

3.2 Preprocess the ETD corpus

We have performed tokenization and stopword removal on the ETD corpus. This should help the Text Analysis and Machine Learning team to cluster the documents efficiently.

3.3 User Support

Currently, the IP address of the GROBID server is static. Other users are allowed to extract metadata from PDF files in any environment by using the URL we provided. An automated system is also provided through which a user can run a driver script to implement all the tasks, from extraction of metadata from PDF to its ingestion into Elasticsearch. Details regarding the same are provided in Section 6.1.3.
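As an illustration of this kind of use, the sketch below shows one way a user might call a GROBID service from Python to obtain TEI XML for a single PDF. This is a minimal sketch, not the class deployment: the server address and file names are placeholders, and only the standard GROBID REST endpoint is assumed.

    import requests

    GROBID_URL = "http://localhost:8070"  # placeholder; substitute the provided server URL

    def process_fulltext(pdf_path, tei_path):
        """Send one ETD PDF to GROBID and save the returned TEI XML."""
        with open(pdf_path, "rb") as pdf_file:
            response = requests.post(
                GROBID_URL + "/api/processFulltextDocument",
                files={"input": pdf_file},
                timeout=300,
            )
        response.raise_for_status()
        with open(tei_path, "w", encoding="utf-8") as tei_file:
            tei_file.write(response.text)

    if __name__ == "__main__":
        process_fulltext("etd.pdf", "etd.tei.xml")  # hypothetical file names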
Chapter 4 Approach, Design, Implementation

4.1 Experiment Design

This project addresses problems related to the management of ETDs by answering the research questions that were listed in the problem definition of Section 1.3.

ETDs in our database are mostly in the form of PDF documents. The main objective is to parse and extract metadata from the ETDs. However, it is difficult to perform this action on the PDF files since they do not contain tags to delimit their elements. The structures of PDF files are often different, and vary according to the domain. To overcome these limitations, suitable machine learning tools need to be used which can extract metadata and represent all the ETDs in the same format. After exploring and evaluating all the mentioned parsers, as discussed in Section 2.1, we decided to use GROBID for extracting metadata.

4.2 Implementation

4.2.1 Chapter Level Text Extraction

XPath-based Chapter Level Text Extraction

Projects like [14, 6, 8] have successfully used GROBID [3] for capturing the structure of ETD documents. Therefore, due to previous successful usage and ease of installation, we decided to use GROBID for chapter level text extraction. GROBID extracts the information from the PDF document of an ETD and converts it into a TEI (Text Encoding Initiative) [1] document. The structure of the TEI document is as shown in Listing 1.

Listing 1: Overall structure of a typical TEI document [1]

    <TEI>
      <teiHeader>
        <!-- metadata describing the document -->
      </teiHeader>
      <text>
        <front> <!-- front matter --> </front>
        <body>
          <div>
            <head> ... </head>
            <p> ... </p>
          </div>
        </body>
        <back> <!-- back matter --> </back>
      </text>
    </TEI>

The TEI Guidelines for Electronic Text Encoding and Interchange [1, 18] use XML as a markup language for representing the structural and semantic features of texts. The comprehensive tags offered by XML provide a way of incorporating the entire semantic structure of the ETD document. The TEI output format does not explicitly define a chapter tag. Neither does it provide a @type=chapter attribute for the <div> element. Therefore, due to the lack of explicit tags indicating the start or end of a chapter, chapter level extraction from ETD documents is a difficult task.

We use XPath expressions for extracting the chapters from the ETD documents. We can see in Listing 1 that the chapter name is generally present in the <head> tag, which is wrapped inside the <div> tag. Therefore, in order to locate the start of a chapter and the end of the preceding chapter, we need to capture such a pattern of tags from the TEI XML metadata extracted by GROBID. The detailed evaluation of this method is explained in Section 5. The steps involved in chapter level text extraction are:

• Convert the ETD document from PDF into TEI XML format by using a web service provided by GROBID: /api/processFulltextDocument.
• Use the XPath expression /tei:TEI/tei:text/tei:body/tei:div[tei:head] for the extraction of chapters, and store each chapter in text format [14].

The folder structure after chapter level text extraction is shown in Figure 4.1.

Figure 4.1: Folder structure of an ETD after chapter level text extraction

Chapter Level Text Extraction Based on Table of Contents

XPath-based text extraction sometimes recognizes each subsection of the document as a chapter. In order to overcome this drawback, we explored other methods of chapter level text extraction. The table of contents provides information about all the sections and subsections that are present in an ETD document, along with the page numbers on which a user can find these sections and subsections. We decided to use the page numbers from the table of contents to track the start and end of each chapter. This method has a limitation, as most ETD documents do not contain the keyword 'Chapter' to distinguish between chapters and their subsections.

PDF parsers do not maintain the inherent formatting of a PDF document (for example, they skip spacing between paragraphs), and convert it into a single text file. An example of the text output from the parser and the content in the original PDF document is shown in Figures 4.2 and 4.3. As we can see from Figure 4.3, there is no delimiter in the parsed text file to indicate the end of a page. Additionally, the parser does not capture text from the header or the footer of a document, so the page numbers present in the header or footer could not be used as an indicator for the start or end of a page in the parsed text document. Therefore, when the text is extracted from a PDF document, the mapping of page numbers to chapters is lost.

Figure 4.2: Sample ETD Introduction chapter
Figure 4.3: Parsed text of the same document (highlighted text indicates end of page shown in Figure 4.2)

Manual Chapter Level Extraction

Apart from exploring various other techniques, such as OCR on the basis of font size, we did a manual chapter level extraction from 21 ETD documents. This method gives us a gold standard result. The detailed evaluation of the XPath-based method (Section 4.2.1) against the manual chapter level extraction on various parameters is discussed in Section 5. These documents were submitted to the Text Analysis and Machine Learning team for solving the big data summarization problem.

4.2.2 TF-IDF Calculation

Term frequency–inverse document frequency (TF-IDF) is calculated to help the Text Analysis and Machine Learning team perform related analysis and calculations. As a weighting technique commonly used in text mining [16], TF-IDF characterizes the importance of a term in a document by combining the term frequency with the number of documents in which the term appears. The TF-IDF value can be calculated using Equation 4.1.
TfIdf_{i,j} = Tf_{i,j} \times Idf_i = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{|D|}{|\{ j : t_i \in d_j \}|}    (4.1)

Here n_{i,j} is the number of occurrences of term t_i in document d_j, and |D| is the total number of documents in the corpus.

Initially, we convert all ETD PDF documents to text format. Then a Python script reads these documents to calculate TF-IDF according to Equation 4.1. The TF-IDF representation is implemented using gensim [15], a Python library, which indexes the documents and saves the indexes and TF-IDF vectors as key-value pairs. So users need to provide the index of a document to obtain the corresponding TF-IDF vector. To avoid this complicated process, we provide an optional toolkit in which the user enters the path to the saved TF-IDF file and the name of the document in order to obtain its corresponding TF-IDF vector.

As shown in Figure 4.4, the TF-IDF output of each document saved in gensim format is a list of tuples. The first element of each tuple is the index of a term, while the second element is its corresponding TF-IDF value. The gensim TF-IDF method takes the bag-of-words (BOW) representation of each document as input. As shown in Figure 4.5, the format for BOW is similar to that of the TF-IDF module; however, the second element of each tuple is the frequency of the term in the document. In addition, the BOW of the whole set of ETD documents is indexed. A dictionary, which gives the corresponding index of each document, is also provided. Part of this dictionary is shown in Figure 4.6.

Figure 4.4: Part of TF-IDF of one document
Figure 4.5: Part of BOW of one document
Figure 4.6: Part of doc-index dictionary

4.2.3 Transforming Metadata for Ingestion in Elasticsearch

Elasticsearch ingests data in bulk as well as one by one. The bulk API is far more complex in terms of the required data format. Hence, we decided to ingest each document one by one. Elasticsearch ingests data only if it is in a particular format: it can consume a JSON array only if all the entries of the array are of the same data type, i.e., either string or object. By default, GROBID output contains arrays having entries of mixed data types. For example, in Listing 4.1, description-provenance has one entry of string type and two entries of object type. We have written a Python script that iterates through the metadata file and converts each entry to the same data type. If there is a mismatch, all entries are converted to the object data type, with the key taken from the immediate parent key.

Listing 4.1: Raw metadata extracted from an ETD using GROBID

    "description-provenance": [
        "Made available in DSpace on 2017-01-06T13:34:06Z (GMT). No. of bitstreams1 Bailey_JM_D_2017.pdf9128042 bytes, checksum7438e886322739e17247ed2c907658b0 (MD5) Previous issue date 2017-01-05",
        {
            "Author Email": [
                "jmb@vt.edu"
            ]
        },
        {
            "Advisor Email": []
        }
    ]

4.2.4 Development of an Automated System

The automated system performs all of the tasks, from the extraction of metadata from an ETD document to its ingestion into Elasticsearch, automatically for any new document that has been fed to the system developed by the CS5604 Fall 2019 class. The features of this system are:

• Automated unit testing to ensure that all the development scripts are error-free
• Tests to check whether all the dependent services are running (Figure 4.9 shows the output of a unit test that checks whether GROBID is running.)
• Validation of the generated metadata to ensure that it is in a format that can be ingested into Elasticsearch
• Automatic extraction and preprocessing of the text from the document
• Automatic merging of the metadata of new documents with the existing metadata

The limitations of this system are:

• The system cannot scrape the new data from VTechWorks. (The new data should be added to a folder called "temp" on ceph.)
• The folder structure of an ETD document should be in the format shown in Figure 4.8.

Such automation ensures the proper functionality of the system developed by the class and also the correctness of the data that is passed to the Elasticsearch (ELS), Front End and Kibana (FEK), and Text Analysis and Machine Learning (TML) teams for further processing and analysis. A detailed description of the unit tests is given in Section 5. Figure 4.7 shows the workflow of the automated system.

Figure 4.7: Flow diagram of the automated system
Figure 4.8: Folder structure of an ETD
Figure 4.9: GROBID unit test

4.2.5 List of Visualizations to be Provided in the Front End

Visualization types:

• Type-none: "Dissertation" (pie chart)
• Degree-level: "doctoral" (bar chart)
• Contributor-department: "Mechanical Engineering" (pie chart)
• Year: "2017" (taken from "date-issued") (bar chart)

4.2.6 Text Preprocessing

ASCII does not correctly encode all the characters in the PDF files; the text files converted from these PDF files contain many meaningless and wrong characters. These characters may have a negative impact on the query process. To address this problem, the stop words are removed using the "corpus" package in NLTK [12]. The other issue concerns numbers and garbage characters that appear in the text files. In general, the numbers shown in ETD files are reference numbers and other numeric values. The reference numbers are not useful for query search; therefore, we use regular expressions to remove them. The following operations were used to clean the data:

• "[\d{1,20}]" to remove words with length greater than 20
• replace("...", "") to remove "..."
• re.sub("[\(\[].*?[\)\]]", "") to remove braces
• replace("b \' ", "") to remove byte literals
• encode('ascii', 'ignore') to remove non-ASCII characters

Note that this is an optional process. We provide two different versions, one that contains the raw data and another that contains the processed data, which are required by the Elasticsearch and Text Analysis and Machine Learning teams, respectively.

Chapter 5 Evaluation

5.1 Manual Testing

5.1.1 Testing of Chapter Level Text Extraction

In Section 4.2.1, we explained how we use XPath to extract text at the chapter level. We noticed some problems after comparing the results to the chapter-wise results extracted from ETDs manually. We use Justin Mark Bailey's dissertation "Full Scale Experimental Transonic Fan Interaction with a Boundary Layer Ingesting Total Pressure Distortion" as an example to show the differences; see Table 5.1 and Figure 5.1. For XPath-based extraction, we counted the first file for each chapter, as some chapters were divided into a number of files. This is why the completeness of the XPath-based chapter level extraction technique is low.
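Completeness here is computed by counting words. A minimal sketch of one plausible reading of that metric, under the assumption that it is the ratio of the extracted chapter's word count to that of the manually extracted gold standard (file paths are hypothetical):

    def completeness(extracted_path, manual_path):
        """Word-count ratio of an extracted chapter to its manual gold standard."""
        with open(extracted_path, encoding="utf-8") as f:
            extracted_words = f.read().split()
        with open(manual_path, encoding="utf-8") as f:
            manual_words = f.read().split()
        return len(extracted_words) / len(manual_words)

    # e.g., completeness("73987/xpath/chapter1.txt", "73987/manual/chapter1.txt")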
Figure 5.1: Chapter level text extraction by XPath vs. manual extraction by Diff Checker

Table 5.1: Chapter level text extraction by XPath and manual extraction

Feature                           XPath              Manual
Appendix                          Just one section   Yes
Captions                          No                 Yes
Chapter completeness on average   43.90%             90.88%
  (calculated by counting words)
Formulas                          No                 Yes, but lots of illegal characters
Headers                           No                 Repeated on each page
Illegal characters                No                 Some letters are converted to {cid:}
References in-text                No                 Yes
References                        No                 Yes
Space between sentences           No                 Yes
Text in figures                   No                 Yes, but many illegal characters

From Table 5.1 we can see that the performance of chapter level text extraction by XPath is not as good as that of manual chapter level extraction. The XPath-based technique ignored captions, text in figures, and formulas, which might include useful information. The percentage of chapter completeness on average is a good indicator of the performance of the extractions. Manual extraction has 90.88% completeness instead of 100%, since there are many special characters, figure captions, and formulas that could not be parsed correctly by the PDF-to-text parser [4]. However, it still performs much better than chapter level text extraction by XPath, which has 43.90% completeness on average. The differences in the number of chapters generated for 21 ETD documents by the two extraction methods mentioned in Section 4.2.1 are shown in Table 5.2. We can see that XPath does not perform well, as only one of the 21 documents has the correct number of chapters.

5.1.2 Testing of Extracted Text Preprocessing

The ETD text files extracted by PDFMiner.six [17] include many incorrect characters. As shown in Figure 5.2, these illegal characters usually come from non-English words. To remove these garbage characters, we use NLTK to detect and remove them.

Figure 5.2: Original text generated by PDFMiner.six

Table 5.2: Differences between chapter level text extraction by XPath and manual extraction

Document   XPath   Manual   Match
73987      15      5        No
73988      9       7        Close
74003      52      5        No
74047      3       1
74048      36      5        No
74049      46      5        No
74050      75      5        No
74233      5       5        Yes
74234      40      7        No
74235      12      5        No
74236      31      6        No
74237      23      5        No
74238      2       5        No
74239      154     7        No
74275      13               ETD in slides format
74302      50      7        No
74383      85      5        No
74395      21      5        No
74396      3       1
74398      0       1
74423      31      6        No

Figure 5.3: Processed text

In general, the reference numbers of equations and citations are not useful during the processing of search queries. We use regular expressions to remove these characters. The processed text is shown in Figure 5.3. The long string of characters in the last line of Figure 5.2 has been removed in Figure 5.3, and the numbers in parentheses have also been removed.

5.1.3 Metadata Extraction Testing

We prepare a JSON file manually for a given ETD using the list of keys and then run the tool to extract metadata from the same ETD. We inspect and compare both JSON files; if all the key-value pairs match, it means that our script to extract metadata using GROBID is working properly.

5.1.4 Automated Testing

Unit Test

Unit testing is the first level of software testing, where the smallest testable parts of a piece of software are tested. This is used to validate that each unit of the software performs as designed. A test case is a set of conditions used to determine whether a system under test works correctly. A test suite is a collection of test cases that are used to test a software program to show that it has some specified set of behaviours by executing the aggregated tests together.
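As a sketch of how the first check in Table 5.3 might look, the unit test below uses Python's unittest module and assumes a local GROBID instance reachable through its standard status endpoint; the service address is a placeholder and this is an illustration rather than the automation suite's exact test code.

    import unittest

    import requests

    GROBID_URL = "http://localhost:8070"  # placeholder service address

    class TestDependentServices(unittest.TestCase):
        def test_grobid_is_alive(self):
            """Passes only if the GROBID service answers its status endpoint."""
            response = requests.get(GROBID_URL + "/api/isalive", timeout=10)
            self.assertEqual(response.status_code, 200)

    if __name__ == "__main__":
        unittest.main()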
Stub

A stub is an object that holds predefined data and uses it to answer calls during tests. It is used when you cannot or do not want to involve objects that would answer with real data or have undesirable side effects. An example is an object that needs to grab some data from the database to respond to a method call. Instead of the real object, we introduce a stub and define what data should be returned [11].

Unit test cases and their details

Table 5.3: Different test case scenarios.

testGrobid: Hits the GROBID service status API. If the service is up, the test case passes; else it fails.
testInputPDFPath: Checks whether files are present at the expected file path. If files are present, the test case passes; else it fails.
testGrobidAndInputPath: Tests both scenarios, i.e., whether GROBID is up and whether PDF files are present at the expected location. If the files are present and GROBID is running, the test case passes.
testMetaDataFormat: Tests whether the extracted metadata is in a format acceptable to Elasticsearch. If the metadata is in a suitable format, the test case passes; else it fails.

Chapter 6 User Manual

6.1 Where to Get Data

6.1.1 VTechWorks ETD collection

The electronic theses and dissertations used for the project are available in VTechWorks, the Virginia Tech institutional repository maintained by the University Libraries. These ETDs are open access and can be viewed and downloaded free of charge. The following are the links through which the documents can be accessed:

• ETDs: Virginia Tech Electronic Theses and Dissertations: http://hdl.handle.net/10919/5534
• Masters Theses: http://hdl.handle.net/10919/9291
• Doctoral Dissertations: http://hdl.handle.net/10919/11041

For the initial phase, a subset of these documents, namely the documents from the year 2017, was considered. Metadata extraction, chapter-wise segregation, and full-text extraction were performed on this subset using GROBID. Metadata, which includes fields such as author name, title, date of publication, and department, has been extracted and stored in MongoDB.

6.1.2 GitLab Repository

All files required to run the system are present in the GitLab repository. Figure 6.1 shows all the files that are available in the repository.
https://code.vt.edu/cs5604/cme

Figure 6.1: GitLab file structure

6.1.3 Metadata Extraction and Ingestion in Ceph

The general steps to extract metadata from the ETDs and ingest it into ceph are given below.

1. GROBID is used to process the ETD PDF and extract the metadata in XML format. The container for running GROBID is available at the following address:
http://2001.0468.0c80.6102.0001.7015.d574.516b.ip6.name:8070/
Full text as well as header processing of ETDs can be performed using the TEI option.

Figure 6.2: GROBID Container

The GROBID server can also be accessed using a Python client. Figure 6.3 shows a sample code snippet used to access GROBID through a Python client.

Figure 6.3: Python client to access GROBID

2. Elasticsearch requires the data to be in JSON format, but the default output generated using GROBID is in XML format. Moreover, the JSON file needs to have a key value for each object and be in NDJSON (newline delimited JSON) format, as mentioned in Section 4.2.3. A Python script (XML2JSONConverter.py) converts the XML file generated using GROBID to a JSON format compatible with Elasticsearch.
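A minimal sketch of the normalization rule from Section 4.2.3 that this conversion step applies is shown below; it assumes the metadata record has already been loaded as a Python dictionary, and the "-summary" wrapping key mirrors the pattern visible in Listing 6.1 rather than the script's exact implementation.

    def normalize_arrays(record):
        """Wrap string entries of mixed-type arrays in objects keyed by the parent key."""
        for key, value in record.items():
            if isinstance(value, list):
                has_strings = any(isinstance(item, str) for item in value)
                has_objects = any(isinstance(item, dict) for item in value)
                if has_strings and has_objects:
                    record[key] = [
                        {key + "-summary": item} if isinstance(item, str) else item
                        for item in value
                    ]
        return record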
The sample metadata format is shown in Listing 6.1:

Listing 6.1: Raw metadata extracted from an ETD using GROBID

    {
        "format-medium": "ETD",
        "description-abstract": "Future commercial transport aircraft will feature more aerodynamic architectures to accommodate stringent design goals for higher fuel efficiency, reduced cruise and taxi NOx emissions, and reduced noise.",
        "date-issued": "2017-01-05",
        "publisher-none": "Virginia Tech",
        "title-none": "Full Scale Experimental Transonic Fan Interaction with a Boundary Layer Ingesting Total Pressure Distortion",
        "contributor-author": "Bailey, Justin Mark",
        "contributor-committeemember": [
            "Dancey, Clinton L",
            "Lowe, Kevin T",
            "Wicks, Alfred L",
            "Ng, Wing Fai"
        ],
        "type-none": "Dissertation",
        "description-degree": "PHD",
        "degree-discipline": "Mechanical Engineering",
        "subject-none": [
            "Experimental Engine Testing",
            "Distortion",
            "Interaction",
            "Total Pressure",
            "Boundary Layer Ingesting"
        ],
        "contributor-department": "Mechanical Engineering",
        "degree-level": "doctoral",
        "identifier-uri": "http://hdl.handle.net/10919/73987",
        "date-available": "2017-01-06T13:34:06Z",
        "handle": "73987",
        "description-provenance": [
            {
                "description-provenance-summary": "Made available in DSpace on 2017-01-06T13:34:06Z (GMT). No. of bitstreams1 Bailey_JM_D_2017.pdf9128042 bytes, checksum7438e886322739e17247ed2c907658b0 (MD5) Previous issue date2017-01-05"
            },
            {
                "Author Email": [
                    "jmb@vt.edu"
                ]
            },
            {
                "Advisor Email": []
            }
        ],
        "identifier-other": "vt_gsexam:9274",
        "rights-none": "This item is protected by copyright and/or related rights. Some uses of this item may be deemed fair and permitted by law even without permission from the rights holder(s), or the rights holder(s) may have licensed the work for use under certain conditions. For other uses you need to obtain permission from the rights holder(s).",
        "degree-grantor": "Virginia Polytechnic Institute and State University",
        "date-accessioned": "2017-01-06T13:34:06Z",
        "contributor-committeechair": "O'Brien, Walter F",
        "degree-name": "PHD"
    }

A similar output is generated for all the ETDs, and a JSON file containing the metadata for all the ETDs is created.

3. Another script, AddTextToMetadata.py, converts the ETD to text and adds it as a field to the extracted JSON metadata. This allows for full-text search on all ETD documents.

4. A Python script to ingest the data into ceph has been written by the ELS team. The data is available at mnt/ceph/cme/metadata_subset.json.

5. A driver script (DriverScript) is also present to run all the above scripts, enabling all tasks from metadata extraction to ingestion into Elasticsearch.

Chapter 7 Developer's Manual

In this chapter, we provide details about the timeline of this project, the applications we used to communicate within the team, and what we have done, with a focus on how the project can be used to extract the metadata and text.

7.1 Timeline

Figure 7.1 shows the task completion timeline.

Figure 7.1: Timeline

7.2 Slack

Our group used the "cme" channel in Slack to communicate among all team members. We used the "general" channel to communicate with the other groups in this project. Figure 7.2 shows the different Slack channels we used to communicate with the other teams.
Figure 7.2: Slack

7.3 GROBID

To install GROBID on a local computer, use the following instructions.

7.3.1 Install in Ubuntu

Step 1: Update the system

    apt-get update

Step 2: Install the JDK
Before installing GROBID on a local computer or in an empty container, Java JDK version 1.8 must already be set up.

    apt-get -y install openjdk-8-jdk wget unzip

Step 3: Download and install GROBID in /opt

    wget https://github.com/kermitt2/grobid/archive/0.5.5.zip
    unzip 0.5.5.zip

Step 4: Download Gradle
Gradle is a dependency required for running GROBID.

    wget https://services.gradle.org/distributions/gradle-3.4.1-bin.zip

Step 5: Install Gradle

    mkdir /opt/gradle
    unzip -d /opt/gradle gradle-3.4.1-bin.zip
    export PATH=$PATH:/opt/gradle/gradle-3.4.1/bin

After installing everything, Figures 7.3 and 7.4 show what is available in the directories.

Figure 7.3: Files in the Gradle folder
Figure 7.4: Files in the GROBID folder

Step 6: Run GROBID
First, change into the directory /opt/grobid-0.5.5, and then run the command below:

    ./gradlew run

Step 7: Run Grobid_cURL.py
Once GROBID is running, call the command below to run the Python file and get the metadata.

    python Grobid_cURL.py
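Once the service is running and a TEI file has been produced, the chapter-level extraction of Section 4.2.1 reduces to the XPath query listed there. The sketch below uses lxml with placeholder file names; it illustrates the idea rather than reproducing the team's extraction script.

    from lxml import etree

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    tree = etree.parse("etd.tei.xml")  # hypothetical TEI file produced by GROBID
    chapters = tree.xpath("/tei:TEI/tei:text/tei:body/tei:div[tei:head]",
                          namespaces=TEI_NS)
    for number, div in enumerate(chapters, start=1):
        heading = div.findtext("tei:head", namespaces=TEI_NS)
        text = " ".join(div.itertext())
        with open("chapter_%d.txt" % number, "w", encoding="utf-8") as out:
            out.write(text)
        print(number, heading)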
Batch processing will considerably reduce the time required for converting the ETD documents which are in PDF to a TEI XML format. 9.3 Improving Automation Suite Loggers can be implemented to log the dierent steps of the automation suite so that it is easier to understand what is going on in the background. Code coverage can be improved signicantly. More trigger points can be added to initiate the automation suite to give additional options to the user. This allows users to choose whether they want to execute batch processing or use single-threaded processing. 42 Chapter 10 Acknowledgements The project has been implemented during the course of CS5604, Information Storage and Retrieval, at Virginia Tech. The data used was the ETDs available on VTechWorks. We would like to thank Dr. Edward Fox for giving us the opportunity to work on this interesting and challenging project. We are grateful for his advice and guidance. We would also like to thank the GTA, Ziqian Song, for her guidance and support throughout the course project. We thank Bipasha Banerjee for her expertise about the ETD data and also for guiding us in the proper direction. We thank other teams for their help in integration, and for sharing their knowledge and insights with us. We also acknowledge the creators of all the open source tools and software packages and libraries we used to implement this project. We also thank IMLS for its support of ETD-related research through grant LG-37-19-0078-19. 43 Bibliography [1] The TEI Guidelines. https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index. html, accessed on Oct. 20, 2019. [2] Apache Tika, 2007 — 2019. https://tika.apache.org/, accessed on October 12, 2019. [3] Grobid, 2008 — 2019. https://github.com/kermitt2/grobid, accessed on October 30, 2019. [4] PyPDF2, May 2014 — 2016. https://pythonhosted.org/PyPDF2/, accessed on October 15, 2019. [5] Science Parse, 2015 — 2019. https://github.com/allenai/science-parse, accessed on October 30, 2019. [6] Ashish, B., Guangchen, L., Beichen, L., and Stephen, L. CS4984/CS5984: Big data text summarization team 10 etds, 2018. http://hdl.handle.net/10919/86418, accessed on October 25, 2019. [7] Elastic. Elasticsearch. https://xebialabs.com/technology/elasticsearch/, accessed on October 20, 2019. [8] Farnaz, K., Ashin, M. T., Chinmaya, P., Dhruv, S., and John, A. CS4984/CS5984: Big data text summarization team 17 etds, 2018. http://hdl.handle.net/10919/86420, accessed on October 25, 2019. [9] Glatthorn, A. A., and Joyner, R. L. Writing the winning thesis or dissertation: A step-by-step guide. Corwin Press, 2005. [10] Haycock, L. A. Citation analysis of education dissertations for collection develop- ment. Library Resources & Technical Services 48, 2 (2013), 102–106. 44 [11] Lipski, M. Stub. https://www.softwaretestingmagazine.com/knowledge/unit- testing-fakes-mocks-and-stubs/, accessed on October 25, 2019. [12] Loper, E., and Bird, S. NLTK: the natural language toolkit. arXiv preprint cs/0205028 (2002). [13] Lopez, P. Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries (Berlin, Heidelberg, 2009), M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou, and G. Tsakonas, Eds., Springer Berlin Heidelberg, pp. 473–474. [14] Naman, A., Ritesh, B., William, I., Palakh, J., Sampanna, K., and Xinyue, W. Big data text summarization: Using deep learning to summarize theses and disserta- tions, 2018. 
http://hdl.handle.net/10919/86406, accessed on October 25, 2019.
[15] Řehůřek, R., and Sojka, P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (Valletta, Malta, May 2010), ELRA, pp. 45–50. http://is.muni.cz/publication/884893/en.
[16] Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[17] Shinyama, Y. PDFMiner, Oct. 2007. https://github.com/euske/pdfminer.
[18] Sperberg-McQueen, C. M., and Burnard, L., Eds. Guidelines for the Encoding and Interchange of Machine-Readable Texts, 1.0 ed. Text Encoding Initiative, Chicago, 1990.
[19] Tkaczyk, D., Collins, A., Sheridan, P., and Beel, J. Evaluation and comparison of open source bibliographic reference parsers: A business use case. CoRR abs/1802.01168 (2018). http://arxiv.org/abs/1802.01168.
[20] Tkaczyk, D., Collins, A., Sheridan, P., and Beel, J. Machine learning vs. rules and out-of-the-box vs. retrained: An evaluation of open-source bibliographic reference and citation parsers. arXiv.org (2018).
[21] Virginia Tech University Libraries. ETDs: Virginia Tech Electronic Theses and Dissertations. https://vtechworks.lib.vt.edu/handle/10919/5534, accessed on Oct. 20, 2019.