CS5604: Information Storage and Retrieval
Collection Management of Electronic Theses and Dissertations

Authors: Kulendra Kumar Kaushal, Rutwik Kulkarni, Aarohi Sumant, Chaoran Wang, Chenhan Yuan, Liling Yuan
Instructor: Dr. Edward A. Fox
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
December 24, 2019

CS5604: Information Storage and Retrieval, Team CME
This research was done under the supervision of Dr. Edward A. Fox as part of the course CS5604.

4th edition, December 7, 2019
3rd edition, October 31, 2019
2nd edition, October 10, 2019
1st edition, September 19, 2019

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Overview
  1.2 VTechWorks ETD Dataset
  1.3 Problem Definition
2 Literature Review
  2.1 PDF Processing
    2.1.1 Overview
    2.1.2 Evaluation of Open-Source Bibliographic Reference and Citation Parsers
    2.1.3 Big Data Text Summarization
    2.1.4 GROBID
    2.1.5 Science Parse
    2.1.6 Apache Tika
    2.1.7 PDFMiner
    2.1.8 PyPDF2
3 Requirements
  3.1 Extract Metadata and Text for ETD Corpus
  3.2 Preprocess the ETD corpus
  3.3 User Support
4 Approach, Design, Implementation
  4.1 Experiment Design
  4.2 Implementation
    4.2.1 Chapter Level Text Extraction
    4.2.2 TF-IDF Calculation
    4.2.3 Transforming Metadata for Ingestion in Elasticsearch
    4.2.4 Development of an Automated System
    4.2.5 List of Visualizations to be Provided in the Front End
    4.2.6 Text Preprocessing
5 Evaluation
  5.1 Manual Testing
    5.1.1 Testing of Chapter Level Text Extraction
    5.1.2 Testing of Extracted Text Preprocessing
    5.1.3 Metadata Extraction Testing
    5.1.4 Automated Testing
6 User Manual
  6.1 Where to Get Data
    6.1.1 VTechWorks ETD collection
    6.1.2 GitLab Repository
    6.1.3 Metadata Extraction and Ingestion in Ceph
7 Developer's Manual
  7.1 Timeline
  7.2 Slack
  7.3 GROBID
    7.3.1 Install in Ubuntu
  7.4 PDFMiner
  7.5 TF-IDF
8 Challenges and Limitations
9 Future Scope
  9.1 Improving Chapter Level Text Extraction
  9.2 Batch Processing of the Documents
  9.3 Improving Automation Suite
10 Acknowledgements
Bibliography

Abstract

The class "CS 5604: Information Storage and Retrieval" in the fall of 2019 was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from Virginia Tech University Libraries' VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing a custom front end, Kibana, integration, implementation, text analytics, and machine learning. The results of our work will help future researchers study the natural-language-processed data using deep learning technologies, address the challenges of extracting information from ETDs, and more.

The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used advanced deep learning and other tools like GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; doing document segmentation; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. We finally developed a system that automates all the above-mentioned tasks. The system also validates the output metadata, thereby ensuring the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch.

List of Figures

  1.1 Position in entire system
  2.1 The architecture of PDFMiner
  4.1 Folder structure of an ETD after chapter level text extraction
  4.2 Sample ETD Introduction chapter
  4.3 Parsed text of the same document (highlighted text indicates end of page shown in Figure 4.2)
  4.4 Part of TF-IDF of one document
  4.5 Part of BOW of one document
  4.6 Part of doc-index dictionary
  4.7 Flow diagram of the automated system
  4.8 Folder structure of an ETD
  4.9 GROBID unit test
  5.1 Chapter level text extraction by XPath vs. manual extraction by Diff Checker
  5.2 Original text generated by PDFMiner.six
  5.3 Processed text
  6.1 GitLab file structure
  6.2 GROBID Container
  6.3 Python client to access GROBID
  7.1 Timeline
  7.2 Slack
  7.3 Files in the Gradle folder
  7.4 Files in the GROBID folder

List of Tables

  2.1 Human assessment of GROBID and Science Parse outputs
  5.1 Chapter level text extraction by XPath and manual extraction
  5.2 Differences between chapter level text extraction by XPath and manual extraction
  5.3 Different test case scenarios

Chapter 1 Introduction

1.1 Overview

As a leading global research university, Virginia Tech became, on January 1, 1997, the first university to require graduate students to submit electronic theses and dissertations (ETDs) [21]. As of 2019, the local ETD dataset covers over 33,000 doctoral dissertations and master's theses. ETDs are valuable information sources, but due to their lack of discoverability, they are still underutilized. Hence, retrieving ETDs is important for researchers and universities.

Retrieving specific information from academic materials has many important applications, such as citation analysis [10]. It could also aid those working to prepare award-winning theses [9]. One of the most important problems in ETD information retrieval is how to extract text and metadata properly from PDF files. In this report, we address that problem, and also tackle problems related to the identification and extraction of sections and chapters. We hope our work will help future researchers discover and reuse potentially useful resources from the ETDs.

The position of our team in the whole system is shown in Fig. 1.1. Many different PDF parsers [3, 5, 17] have been implemented to convert PDF files to a structured format, e.g., XML or JSON. To extract metadata and elements (such as affiliation, tables, and images) from ETDs successfully, we also propose a new approach to avoid errors during conversion. Moreover, the issue of automatic segmentation to identify sections and chapters is also addressed in this project.

Figure 1.1: Position in entire system

1.2 VTechWorks ETD Dataset

The ETD corpus is downloaded from the Virginia Tech institutional repository, VTechWorks, and consists of over 33,000 documents: 14,055 doctoral dissertations, 19,246 master's theses, and some award-winning and undergraduate theses. The repository is maintained by the university library, and includes ETDs from all disciplines and all departments of Virginia Tech. For each ETD, there is one PDF document which is generally the main part, a metadata record, and some supporting documents. For older ETDs, the PDF files resulted from scanned paper documents. In such cases, full-text files were extracted using optical character recognition.
1.3 Problem Definition

This project works on managing ETDs by answering the following research questions.

RQ1: Can we extract metadata from an ETD document and transform it into a format that can be ingested into Elasticsearch? Elasticsearch is a search server based on the Lucene library. Lucene is an open-source search engine software library. Elasticsearch provides a distributed, multi-tenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents [7]. Generating our output in a format that Elasticsearch can ingest should extend the applicability of our work.

RQ2: Can we extract text files from PDF files with content suitable for subsequent indexing and searching? A suitable structure, properly populated with text that is used in the subsequent indexing, would help future researchers discover and retrieve the specific information they need.

RQ3: Can we expand the extracted data by including a file for each chapter? Sometimes researchers are interested only in specific sections. Providing chapter-level files can increase search specificity and save time for users.

RQ4: Can we develop an automated system that can extract the metadata from new documents, process it, and ingest it into Elasticsearch? New ETDs need to be added to our system as and when they are added to VTechWorks. So, in order to make our system more robust and up to date, an automated system to process and add the new ETDs to our system is necessary.

Chapter 2 Literature Review

2.1 PDF Processing

2.1.1 Overview

All of our electronic theses and dissertations are available as PDF files. It is difficult to extract the key data from such a file. Additionally, the formatting of different sections, as well as of the bibliography, changes from document to document. Thus, parsing a PDF file becomes a big challenge.

Preprocessing and extraction of metadata from the ETDs are important steps in related works that have been carried out in this domain. The rest of this chapter includes descriptions of some of the work done by researchers related to the extraction of metadata, text parsing, and providing support for big data text summarization. We include descriptions of popular tools and parsers, and highlight the comparison between them on different parameters, as discussed in various works.

2.1.2 Evaluation of Open-Source Bibliographic Reference and Citation Parsers

The growth in the volume of available scientific literature has resulted in a scientific information overload problem, which refers to the end user being overwhelmed by the abundance of information. To leverage the information available in that literature, there is a need for intelligent information retrieval systems to provide the desired information in an organised manner.

One such type of information is machine-readable rich bibliographic metadata. As a consequence, there is demand for tools which can parse scientific documents and extract the bibliographic content. Researchers have devised interesting solutions based on regular expressions, template matching, knowledge bases, and supervised machine learning. Software tools have been proposed, such as Biblio (regular expression based), Bibpro (template matching based), Citation Parser (knowledge based or rule based), and GROBID (machine learning, or ML, based) [20].
The quality of ML-based tools, measured using precision, is similar to that of tools employing rules, regular expressions, or template matching (0.77 for ML-based tools vs. 0.76 for non-ML-based tools). However, ML-based tools are popular and often preferred because they also achieve higher recall (0.66 vs. 0.22) [20]. Only a few tools, such as GROBID (F1 = 0.89), Cermine (F1 = 0.83), and ParsCit (F1 = 0.75), have performed reasonably well. Retraining with task-specific data definitely increases the performance of almost all of the tools: the F1 measure of GROBID increased by 3% (0.89 to 0.92), Cermine achieved an F1 increase of 11% (0.83 to 0.92), and ParsCit had an F1 increase of 16% (0.75 to 0.87) [20].

2.1.3 Big Data Text Summarization

For summarizing Electronic Theses and Dissertations (ETDs), three Fall 2018 student teams in Virginia Tech CS4984/5984 (Big Data Text Summarization) [14, 6, 8] used Science Parse and GROBID to extract information from PDFs. Both GROBID and Science Parse have their respective pros and cons. Table 2.1 summarizes how GROBID outperforms Science Parse in many situations [21].

Table 2.1: Human assessment of GROBID and Science Parse outputs

Output file format.  GROBID: XML.  Science Parse: JSON.
Table of contents.  GROBID: adds the table of contents and list of figures at the end.  Science Parse: maintains the order of the table of contents and list of figures.
Abstract.  GROBID: occasionally misses the abstract.  Science Parse: often detects the abstract correctly.
Chapters.  GROBID: occasionally skips chapters, especially for ETDs of disciplines such as Architecture where a large number of images are present along with the text.  Science Parse: often skips chapters and merges some chapters together.
Figures.  GROBID: adds a figure tag to indicate the existence of a figure.  Science Parse: does not indicate the existence of a figure; often appends the figure title as part of the text.
Tables.  GROBID: adds a table tag to indicate the existence of a table.  Science Parse: does not indicate the existence of a table.
References.  GROBID: parses the reference string into title, author, venue, and year; does not further split these values; skips some references while extracting.  Science Parse: parses the reference string into author (first and last name), publication, volume, issue, and published date.

2.1.4 GROBID

GROBID (GeneRation Of BIbliographic Data) is a parser used to extract metadata from a PDF document into XML format. GROBID takes the PDF of each scholarly document as input and makes use of machine learning models (a cascade of linear-chain CRFs) for extracting the metadata from the document in XML format. It uses the lexical (POS), layout (font, font size), and position information (start/end) of a line in a document in order to train the models and obtain the metadata in the required format. It does not provide an explicit chapter tag. Therefore, chapter-level text and metadata extraction from the ETD documents is a challenging task using GROBID [3, 13].

2.1.5 Science Parse

Science Parse parses scientific documents from PDF into a structured JSON format. It is written in a combination of Java and Scala and can be used as a library in any JVM-based language. Science Parse can be used in three different ways:

• Server: It functions as a wrapper and makes Science Parse available as a web service. It uses heap memory (about 2 GB).
• CLI: Science Parse has a command line interface known as RunSP. It uses heap memory (about 6 GB). RunSP can also be used to parse multiple files at a time.
• Core: It provides flexibility in Science Parse but is also quite complex to use as a library.
Four model files – a general CRF model for extracting the title and authors, and one model each for bibliographies, the gazetteer, and word vectors – are available in this service. Science Parse is difficult to set up and sometimes skips or merges some of the content [19][5].

2.1.6 Apache Tika

Apache Tika is a file extraction framework written in Java. The big advantage of Tika is that "it can extract over thousands of different types of files to metadata and text" [2]. In addition, another powerful capability of this library is that it can extract image metadata from Portable Document Format (PDF) files. However, it is harder to get the images themselves than to get their metadata. At the same time, since Apache Tika is written in Java, it is complicated to set up for users working in other programming languages. Another disadvantage is that Tika can only extract a PDF to plain text, which makes chapter-wise extraction difficult.

2.1.7 PDFMiner

PDFMiner.six (or PDFMiner) is a Python-compatible parser that can convert PDF files into text, HTML, or XML. The architecture of PDFMiner is shown in Figure 2.1. As a rule-based parser, PDFMiner runs efficiently; tested with an ETD document, it converts a PDF to text or other formats in around 18 seconds. Moreover, it supports various font types and CJK language extraction [17]. Practically, it can extract specific pages and tables (output without structure) from a PDF file. However, because PDFMiner is designed to extract text data, its ability to process images and tables in PDF files is still unstable according to its documentation.

Figure 2.1: The architecture of PDFMiner

2.1.8 PyPDF2

PyPDF2 is a Python-based tool for extraction of metadata and text from a PDF file. It also allows splitting, merging, and extraction of data from the file. Predominantly it is used for the extraction of text from a PDF file. It works on StringIO objects as opposed to file streams and so allows for PDF manipulation in memory [4].

Chapter 3 Requirements

In this project, the CME team is responsible for extracting metadata and text from the ETD documents. By the end of this project, we intend to complete the tasks listed below.

• Convert ETD documents from PDF to text format to enable full-text search.
• Extract metadata for each ETD document.
• Extract chapter-level text from ETDs.
• Preprocess the ETD corpus, i.e., tokenize, lemmatize, and remove stopwords.
• Develop a pipeline to enable ingestion of new ETDs into Elasticsearch.

3.1 Extract Metadata and Text for ETD Corpus

Metadata containing fields such as author name, date of publication, author email, and contributor department has been extracted and put into ceph (mnt/ceph/cme). It contains both the data of a small subset of the ETD dataset (i.e., the 2017 ETDs), which includes 691 PDF documents, and the large dataset (all 30K ETDs). Each folder contains PDF as well as text files of the theses/dissertations.

3.2 Preprocess the ETD corpus

We have performed tokenization and stopword removal on the ETD corpus. This should help the Text Analysis and Machine Learning team to cluster the documents efficiently.

3.3 User Support

Currently, the IP address of the GROBID server is static. Other users are allowed to extract metadata from PDF files in any environment by using the URL we provided. An automated system is also provided through which a user can run a driver script to implement all the tasks, from extraction of metadata from PDF to its ingestion into Elasticsearch. Details regarding the same are provided in Section 6.1.3.
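As an illustration of this kind of use, the sketch below shows one way a user might call a GROBID service from Python to obtain TEI XML for a single PDF. This is a minimal sketch, not the class deployment: the server address and file names are placeholders, and only the standard GROBID REST endpoint is assumed.

    import requests

    GROBID_URL = "http://localhost:8070"  # placeholder; substitute the provided server URL

    def process_fulltext(pdf_path, tei_path):
        """Send one ETD PDF to GROBID and save the returned TEI XML."""
        with open(pdf_path, "rb") as pdf_file:
            response = requests.post(
                GROBID_URL + "/api/processFulltextDocument",
                files={"input": pdf_file},
                timeout=300,
            )
        response.raise_for_status()
        with open(tei_path, "w", encoding="utf-8") as tei_file:
            tei_file.write(response.text)

    if __name__ == "__main__":
        process_fulltext("etd.pdf", "etd.tei.xml")  # hypothetical file names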
Chapter 4 Approach, Design, Implementation

4.1 Experiment Design

This project addresses problems related to the management of ETDs by answering the research questions that were listed in the problem definition of Section 1.3.

ETDs in our database are mostly in the form of PDF documents. The main objective is to parse and extract metadata from the ETDs. However, it is difficult to perform this action on the PDF files since they do not contain tags to delimit their elements. The structures of PDF files are often different, and vary according to the domain. To overcome these limitations, suitable machine learning tools need to be used which can extract metadata and represent all the ETDs in the same format. After exploring and evaluating all the mentioned parsers, as discussed in Section 2.1, we decided to use GROBID for extracting metadata.

4.2 Implementation

4.2.1 Chapter Level Text Extraction

XPath-based Chapter Level Text Extraction

Projects like [14, 6, 8] have successfully used GROBID [3] for capturing the structure of ETD documents. Therefore, due to previous successful usage and ease of installation, we decided to use GROBID for chapter level text extraction. GROBID extracts the information from the PDF document of an ETD and converts it into a TEI (Text Encoding Initiative) [1] document. The structure of the TEI document is as shown in Listing 1.

Listing 1: Overall structure of a typical TEI document [1]

    <TEI>
      <teiHeader>
        <!-- metadata describing the document -->
      </teiHeader>
      <text>
        <front> <!-- front matter --> </front>
        <body>
          <div>
            <head> ... </head>
            <p> ... </p>
          </div>
        </body>
        <back> <!-- back matter --> </back>
      </text>
    </TEI>

The TEI Guidelines for Electronic Text Encoding and Interchange [1, 18] use XML as a markup language for representing the structural and semantic features of texts. The comprehensive tags offered by XML provide a way of incorporating the entire semantic structure of the ETD document. The TEI output format does not explicitly define a chapter tag. Neither does it provide a @type=chapter attribute for the <div> element. Therefore, due to the lack of explicit tags indicating the start or end of a chapter, chapter level extraction from ETD documents is a difficult task.

We use XPath expressions for extracting the chapters from the ETD documents. We can see in Listing 1 that the chapter name is generally present in the <head> tag, which is wrapped inside the <div> tag. Therefore, in order to locate the start of a chapter and the end of the preceding chapter, we need to capture such a pattern of tags from the TEI XML metadata extracted by GROBID. The detailed evaluation of this method is explained in Section 5. The steps involved in chapter level text extraction are:

• Convert the ETD document from PDF into TEI XML format by using a web service provided by GROBID: /api/processFulltextDocument.
• Use the XPath expression /tei:TEI/tei:text/tei:body/tei:div[tei:head] for the extraction of chapters, and store each chapter in text format [14].

The folder structure after chapter level text extraction is shown in Figure 4.1.

Figure 4.1: Folder structure of an ETD after chapter level text extraction

Chapter Level Text Extraction Based on Table of Contents

XPath-based text extraction sometimes recognizes each subsection of the document as a chapter. In order to overcome this drawback, we explored other methods of chapter level text extraction. The table of contents provides information about all the sections and subsections that are present in an ETD document, along with the page numbers on which a user can find these sections and subsections. We decided to use the page numbers from the table of contents to track the start and end of each chapter. This method has a limitation, as most ETD documents do not contain the keyword 'Chapter' to distinguish between chapters and their subsections.

PDF parsers do not maintain the inherent formatting of a PDF document (for example, they skip spacing between paragraphs), and convert it into a single text file. An example of the text output from the parser and the content in the original PDF document is shown in Figures 4.2 and 4.3. As we can see from Figure 4.3, there is no delimiter in the parsed text file to indicate the end of a page. Additionally, the parser does not capture text from the header or the footer of a document, so the page numbers present in the header or footer could not be used as an indicator for the start or end of a page in the parsed text document. Therefore, when the text is extracted from a PDF document, the mapping of page numbers to chapters is lost.

Figure 4.2: Sample ETD Introduction chapter
Figure 4.3: Parsed text of the same document (highlighted text indicates end of page shown in Figure 4.2)

Manual Chapter Level Extraction

Apart from exploring various other techniques, such as OCR on the basis of font size, we did a manual chapter level extraction from 21 ETD documents. This method gives us a gold standard result. The detailed evaluation of the XPath-based method (Section 4.2.1) against the manual chapter level extraction on various parameters is discussed in Section 5. These documents were submitted to the Text Analysis and Machine Learning team for solving the big data summarization problem.

4.2.2 TF-IDF Calculation

Term frequency–inverse document frequency (TF-IDF) is calculated to help the Text Analysis and Machine Learning team perform related analysis and calculations. As a weighting technique commonly used in text mining [16], TF-IDF characterizes the importance of a term in a document by combining the term frequency with the number of documents in which the term appears. The TF-IDF value can be calculated using Equation 4.1.
TfIdf_{i,j} = Tf_{i,j} \times Idf_i = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{|D|}{|\{ j : t_i \in d_j \}|}    (4.1)

Here n_{i,j} is the number of occurrences of term t_i in document d_j, and |D| is the total number of documents in the corpus.

Initially, we convert all ETD PDF documents to text format. Then a Python script reads these documents to calculate TF-IDF according to Equation 4.1. The TF-IDF representation is implemented using gensim [15], a Python library, which indexes the documents and saves the indexes and TF-IDF vectors as key-value pairs. So users need to provide the index of a document to obtain the corresponding TF-IDF vector. To avoid this complicated process, we provide an optional toolkit in which the user enters the path to the saved TF-IDF file and the name of the document in order to obtain its corresponding TF-IDF vector.

As shown in Figure 4.4, the TF-IDF output of each document saved in gensim format is a list of tuples. The first element of each tuple is the index of a term, while the second element is its corresponding TF-IDF value. The gensim TF-IDF method takes the bag-of-words (BOW) representation of each document as input. As shown in Figure 4.5, the format for BOW is similar to that of the TF-IDF module; however, the second element of each tuple is the frequency of the term in the document. In addition, the BOW of the whole set of ETD documents is indexed. A dictionary, which gives the corresponding index of each document, is also provided. Part of this dictionary is shown in Figure 4.6.

Figure 4.4: Part of TF-IDF of one document
Figure 4.5: Part of BOW of one document
Figure 4.6: Part of doc-index dictionary

4.2.3 Transforming Metadata for Ingestion in Elasticsearch

Elasticsearch ingests data in bulk as well as one by one. The bulk API is far more complex in terms of the required data format. Hence, we decided to ingest each document one by one. Elasticsearch ingests data only if it is in a particular format: it can consume a JSON array only if all the entries of the array are of the same data type, i.e., either string or object. By default, GROBID output contains arrays having entries of mixed data types. For example, in Listing 4.1, description-provenance has one entry of string type and two entries of object type. We have written a Python script that iterates through the metadata file and converts each entry to the same data type. If there is a mismatch, all entries are converted to the object data type, with the key taken from the immediate parent key.

Listing 4.1: Raw metadata extracted from an ETD using GROBID

    "description-provenance": [
        "Made available in DSpace on 2017-01-06T13:34:06Z (GMT). No. of bitstreams1 Bailey_JM_D_2017.pdf9128042 bytes, checksum7438e886322739e17247ed2c907658b0 (MD5) Previous issue date 2017-01-05",
        {
            "Author Email": [
                "jmb@vt.edu"
            ]
        },
        {
            "Advisor Email": []
        }
    ]

4.2.4 Development of an Automated System

The automated system performs all of the tasks, from the extraction of metadata from an ETD document to its ingestion into Elasticsearch, automatically for any new document that has been fed to the system developed by the CS5604 Fall 2019 class. The features of this system are:

• Automated unit testing to ensure that all the development scripts are error-free
• Tests to check whether all the dependent services are running (Figure 4.9 shows the output of a unit test that checks whether GROBID is running.)
• Validation of the generated metadata to ensure that it is in a format that can be ingested into Elasticsearch
• Automatic extraction and preprocessing of the text from the document
• Automatic merging of the metadata of new documents with the existing metadata

The limitations of this system are:

• The system cannot scrape the new data from VTechWorks. (The new data should be added to a folder called "temp" on ceph.)
• The folder structure of an ETD document should be in the format shown in Figure 4.8.

Such automation ensures the proper functionality of the system developed by the class and also the correctness of the data that is passed to the Elasticsearch (ELS), Front End and Kibana (FEK), and Text Analysis and Machine Learning (TML) teams for further processing and analysis. A detailed description of the unit tests is given in Section 5. Figure 4.7 shows the workflow of the automated system.

Figure 4.7: Flow diagram of the automated system
Figure 4.8: Folder structure of an ETD
Figure 4.9: GROBID unit test

4.2.5 List of Visualizations to be Provided in the Front End

Visualization types:

• Type-none: "Dissertation" (pie chart)
• Degree-level: "doctoral" (bar chart)
• Contributor-department: "Mechanical Engineering" (pie chart)
• Year: "2017" (taken from "date-issued") (bar chart)

4.2.6 Text Preprocessing

ASCII does not correctly encode all the characters in the PDF files; the text files converted from these PDF files contain many meaningless and wrong characters. These characters may have a negative impact on the query process. To address this problem, the stop words are removed using the "corpus" package in NLTK [12]. The other issue concerns numbers and garbage characters that appear in the text files. In general, the numbers shown in ETD files are reference numbers and other numeric values. The reference numbers are not useful for query search; therefore, we use regular expressions to remove them. The following operations were used to clean the data:

• "[\d{1,20}]" to remove words with length greater than 20
• replace("...", "") to remove "..."
• re.sub("[\(\[].*?[\)\]]", "") to remove braces
• replace("b \' ", "") to remove byte literals
• encode('ascii', 'ignore') to remove non-ASCII characters

Note that this is an optional process. We provide two different versions, one that contains the raw data and another that contains the processed data, which are required by the Elasticsearch and Text Analysis and Machine Learning teams, respectively.

Chapter 5 Evaluation

5.1 Manual Testing

5.1.1 Testing of Chapter Level Text Extraction

In Section 4.2.1, we explained how we use XPath to extract text at the chapter level. We noticed some problems after comparing the results to the chapter-wise results extracted from ETDs manually. We use Justin Mark Bailey's dissertation "Full Scale Experimental Transonic Fan Interaction with a Boundary Layer Ingesting Total Pressure Distortion" as an example to show the differences; see Table 5.1 and Figure 5.1. For XPath-based extraction, we counted the first file for each chapter, as some chapters were divided into a number of files. This is why the completeness of the XPath-based chapter level extraction technique is low.
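Completeness here is computed by counting words. A minimal sketch of one plausible reading of that metric, under the assumption that it is the ratio of the extracted chapter's word count to that of the manually extracted gold standard (file paths are hypothetical):

    def completeness(extracted_path, manual_path):
        """Word-count ratio of an extracted chapter to its manual gold standard."""
        with open(extracted_path, encoding="utf-8") as f:
            extracted_words = f.read().split()
        with open(manual_path, encoding="utf-8") as f:
            manual_words = f.read().split()
        return len(extracted_words) / len(manual_words)

    # e.g., completeness("73987/xpath/chapter1.txt", "73987/manual/chapter1.txt")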
Figure 5.1: Chapter level text extraction by XPath vs. manual extraction by Diff Checker

Table 5.1: Chapter level text extraction by XPath and manual extraction

Feature                           XPath              Manual
Appendix                          Just one section   Yes
Captions                          No                 Yes
Chapter completeness on average   43.90%             90.88%
  (calculated by counting words)
Formulas                          No                 Yes, but lots of illegal characters
Headers                           No                 Repeated on each page
Illegal characters                No                 Some letters are converted to {cid:}
References in-text                No                 Yes
References                        No                 Yes
Space between sentences           No                 Yes
Text in figures                   No                 Yes, but many illegal characters

From Table 5.1 we can see that the performance of chapter level text extraction by XPath is not as good as that of manual chapter level extraction. The XPath-based technique ignored captions, text in figures, and formulas, which might include useful information. The percentage of chapter completeness on average is a good indicator of the performance of the extractions. Manual extraction has 90.88% completeness instead of 100%, since there are many special characters, figure captions, and formulas that could not be parsed correctly by the PDF-to-text parser [4]. However, it still performs much better than chapter level text extraction by XPath, which has 43.90% completeness on average. The differences in the number of chapters generated for 21 ETD documents by the two extraction methods mentioned in Section 4.2.1 are shown in Table 5.2. We can see that XPath does not perform well, as only one of the 21 documents has the correct number of chapters.

5.1.2 Testing of Extracted Text Preprocessing

The ETD text files extracted by PDFMiner.six [17] include many incorrect characters. As shown in Figure 5.2, these illegal characters usually come from non-English words. To remove these garbage characters, we use NLTK to detect and remove them.

Figure 5.2: Original text generated by PDFMiner.six

Table 5.2: Differences between chapter level text extraction by XPath and manual extraction

Document   XPath   Manual   Match
73987      15      5        No
73988      9       7        Close
74003      52      5        No
74047      3       1
74048      36      5        No
74049      46      5        No
74050      75      5        No
74233      5       5        Yes
74234      40      7        No
74235      12      5        No
74236      31      6        No
74237      23      5        No
74238      2       5        No
74239      154     7        No
74275      13               ETD in slides format
74302      50      7        No
74383      85      5        No
74395      21      5        No
74396      3       1
74398      0       1
74423      31      6        No

Figure 5.3: Processed text

In general, the reference numbers of equations and citations are not useful during the processing of search queries. We use regular expressions to remove these characters. The processed text is shown in Figure 5.3. The long string of characters in the last line of Figure 5.2 has been removed in Figure 5.3, and the numbers in parentheses have also been removed.

5.1.3 Metadata Extraction Testing

We prepare a JSON file manually for a given ETD using the list of keys and then run the tool to extract metadata from the same ETD. We inspect and compare both JSON files; if all the key-value pairs match, it means that our script to extract metadata using GROBID is working properly.

5.1.4 Automated Testing

Unit Test

Unit testing is the first level of software testing, where the smallest testable parts of a piece of software are tested. This is used to validate that each unit of the software performs as designed. A test case is a set of conditions used to determine whether a system under test works correctly. A test suite is a collection of test cases that are used to test a software program to show that it has some specified set of behaviours by executing the aggregated tests together.
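As a sketch of how the first check in Table 5.3 might look, the unit test below uses Python's unittest module and assumes a local GROBID instance reachable through its standard status endpoint; the service address is a placeholder and this is an illustration rather than the automation suite's exact test code.

    import unittest

    import requests

    GROBID_URL = "http://localhost:8070"  # placeholder service address

    class TestDependentServices(unittest.TestCase):
        def test_grobid_is_alive(self):
            """Passes only if the GROBID service answers its status endpoint."""
            response = requests.get(GROBID_URL + "/api/isalive", timeout=10)
            self.assertEqual(response.status_code, 200)

    if __name__ == "__main__":
        unittest.main()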
Stub

A stub is an object that holds predefined data and uses it to answer calls during tests. It is used when you cannot or do not want to involve objects that would answer with real data or have undesirable side effects. An example is an object that needs to grab some data from the database to respond to a method call. Instead of the real object, we introduce a stub and define what data should be returned [11].

Unit test cases and their details

Table 5.3: Different test case scenarios.

testGrobid: Hits the GROBID service status API. If the service is up, the test case passes; else it fails.
testInputPDFPath: Checks whether files are present at the expected file path. If files are present, the test case passes; else it fails.
testGrobidAndInputPath: Tests both scenarios, i.e., whether GROBID is up and whether PDF files are present at the expected location. If the files are present and GROBID is running, the test case passes.
testMetaDataFormat: Tests whether the extracted metadata is in a format acceptable to Elasticsearch. If the metadata is in a suitable format, the test case passes; else it fails.

Chapter 6 User Manual

6.1 Where to Get Data

6.1.1 VTechWorks ETD collection

The electronic theses and dissertations used for the project are available in VTechWorks, the Virginia Tech institutional repository maintained by the University Libraries. These ETDs are open access and can be viewed and downloaded free of charge. The following are the links through which the documents can be accessed:

• ETDs: Virginia Tech Electronic Theses and Dissertations: http://hdl.handle.net/10919/5534
• Masters Theses: http://hdl.handle.net/10919/9291
• Doctoral Dissertations: http://hdl.handle.net/10919/11041

For the initial phase, a subset of these documents, namely the documents from the year 2017, was considered. Metadata extraction, chapter-wise segregation, and full-text extraction were performed on this subset using GROBID. Metadata, which includes fields such as author name, title, date of publication, and department, has been extracted and stored in MongoDB.

6.1.2 GitLab Repository

All files required to run the system are present in the GitLab repository. Figure 6.1 shows all the files that are available in the repository.
https://code.vt.edu/cs5604/cme

Figure 6.1: GitLab file structure

6.1.3 Metadata Extraction and Ingestion in Ceph

The general steps to extract metadata from the ETDs and ingest it into ceph are given below.

1. GROBID is used to process the ETD PDF and extract the metadata in XML format. The container for running GROBID is available at the following address:
http://2001.0468.0c80.6102.0001.7015.d574.516b.ip6.name:8070/
Full text as well as header processing of ETDs can be performed using the TEI option.

Figure 6.2: GROBID Container

The GROBID server can also be accessed using a Python client. Figure 6.3 shows a sample code snippet used to access GROBID through a Python client.

Figure 6.3: Python client to access GROBID

2. Elasticsearch requires the data to be in JSON format, but the default output generated using GROBID is in XML format. Moreover, the JSON file needs to have a key value for each object and be in NDJSON (newline delimited JSON) format, as mentioned in Section 4.2.3. A Python script (XML2JSONConverter.py) converts the XML file generated using GROBID to a JSON format compatible with Elasticsearch.
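A minimal sketch of the normalization rule from Section 4.2.3 that this conversion step applies is shown below; it assumes the metadata record has already been loaded as a Python dictionary, and the "-summary" wrapping key mirrors the pattern visible in Listing 6.1 rather than the script's exact implementation.

    def normalize_arrays(record):
        """Wrap string entries of mixed-type arrays in objects keyed by the parent key."""
        for key, value in record.items():
            if isinstance(value, list):
                has_strings = any(isinstance(item, str) for item in value)
                has_objects = any(isinstance(item, dict) for item in value)
                if has_strings and has_objects:
                    record[key] = [
                        {key + "-summary": item} if isinstance(item, str) else item
                        for item in value
                    ]
        return record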
The sample metadata format is shown in Listing 6.1:

Listing 6.1: Raw metadata extracted from an ETD using GROBID

    {
        "format-medium": "ETD",
        "description-abstract": "Future commercial transport aircraft will feature more aerodynamic architectures to accommodate stringent design goals for higher fuel efficiency, reduced cruise and taxi NOx emissions, and reduced noise.",
        "date-issued": "2017-01-05",
        "publisher-none": "Virginia Tech",
        "title-none": "Full Scale Experimental Transonic Fan Interaction with a Boundary Layer Ingesting Total Pressure Distortion",
        "contributor-author": "Bailey, Justin Mark",
        "contributor-committeemember": [
            "Dancey, Clinton L",
            "Lowe, Kevin T",
            "Wicks, Alfred L",
            "Ng, Wing Fai"
        ],
        "type-none": "Dissertation",
        "description-degree": "PHD",
        "degree-discipline": "Mechanical Engineering",
        "subject-none": [
            "Experimental Engine Testing",
            "Distortion",
            "Interaction",
            "Total Pressure",
            "Boundary Layer Ingesting"
        ],
        "contributor-department": "Mechanical Engineering",
        "degree-level": "doctoral",
        "identifier-uri": "http://hdl.handle.net/10919/73987",
        "date-available": "2017-01-06T13:34:06Z",
        "handle": "73987",
        "description-provenance": [
            {
                "description-provenance-summary": "Made available in DSpace on 2017-01-06T13:34:06Z (GMT). No. of bitstreams1 Bailey_JM_D_2017.pdf9128042 bytes, checksum7438e886322739e17247ed2c907658b0 (MD5) Previous issue date2017-01-05"
            },
            {
                "Author Email": [
                    "jmb@vt.edu"
                ]
            },
            {
                "Advisor Email": []
            }
        ],
        "identifier-other": "vt_gsexam:9274",
        "rights-none": "This item is protected by copyright and/or related rights. Some uses of this item may be deemed fair and permitted by law even without permission from the rights holder(s), or the rights holder(s) may have licensed the work for use under certain conditions. For other uses you need to obtain permission from the rights holder(s).",
        "degree-grantor": "Virginia Polytechnic Institute and State University",
        "date-accessioned": "2017-01-06T13:34:06Z",
        "contributor-committeechair": "O'Brien, Walter F",
        "degree-name": "PHD"
    }

A similar output is generated for all the ETDs, and a JSON file containing the metadata for all the ETDs is created.

3. Another script, AddTextToMetadata.py, converts the ETD to text and adds it as a field to the extracted JSON metadata. This allows for full-text search on all ETD documents.

4. A Python script to ingest the data into ceph has been written by the ELS team. The data is available at mnt/ceph/cme/metadata_subset.json.

5. A driver script (DriverScript) is also present to run all the above scripts, enabling all tasks from metadata extraction to ingestion into Elasticsearch.

Chapter 7 Developer's Manual

In this chapter, we provide details about the timeline of this project, the applications we used to communicate within the team, and what we have done, with a focus on how the project can be used to extract the metadata and text.

7.1 Timeline

Figure 7.1 shows the task completion timeline.

Figure 7.1: Timeline

7.2 Slack

Our group used the "cme" channel in Slack to communicate among all team members. We used the "general" channel to communicate with the other groups in this project. Figure 7.2 shows the different Slack channels we used to communicate with the other teams.
Figure 7.2: Slack

7.3 GROBID

To install GROBID on a local computer, use the following instructions.

7.3.1 Install in Ubuntu

Step 1: Update the system

    apt-get update

Step 2: Install the JDK
Before installing GROBID on a local computer or in an empty container, Java JDK version 1.8 must already be set up.

    apt-get -y install openjdk-8-jdk wget unzip

Step 3: Download and install GROBID in /opt

    wget https://github.com/kermitt2/grobid/archive/0.5.5.zip
    unzip 0.5.5.zip

Step 4: Download Gradle
Gradle is a dependency required for running GROBID.

    wget https://services.gradle.org/distributions/gradle-3.4.1-bin.zip

Step 5: Install Gradle

    mkdir /opt/gradle
    unzip -d /opt/gradle gradle-3.4.1-bin.zip
    export PATH=$PATH:/opt/gradle/gradle-3.4.1/bin

After installing everything, Figures 7.3 and 7.4 show what is available in the directories.

Figure 7.3: Files in the Gradle folder
Figure 7.4: Files in the GROBID folder

Step 6: Run GROBID
First, change into the directory /opt/grobid-0.5.5, and then run the command below:

    ./gradlew run

Step 7: Run Grobid_cURL.py
Once GROBID is running, call the command below to run the Python file and get the metadata.

    python Grobid_cURL.py
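Once the service is running and a TEI file has been produced, the chapter-level extraction of Section 4.2.1 reduces to the XPath query listed there. The sketch below uses lxml with placeholder file names; it illustrates the idea rather than reproducing the team's extraction script.

    from lxml import etree

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    tree = etree.parse("etd.tei.xml")  # hypothetical TEI file produced by GROBID
    chapters = tree.xpath("/tei:TEI/tei:text/tei:body/tei:div[tei:head]",
                          namespaces=TEI_NS)
    for number, div in enumerate(chapters, start=1):
        heading = div.findtext("tei:head", namespaces=TEI_NS)
        text = " ".join(div.itertext())
        with open("chapter_%d.txt" % number, "w", encoding="utf-8") as out:
            out.write(text)
        print(number, heading)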
Batch processing will considerably reduce the time required for converting the ETD documents which are in PDF to a TEI XML format. 9.3 Improving Automation Suite Loggers can be implemented to log the dierent steps of the automation suite so that it is easier to understand what is going on in the background. Code coverage can be improved signicantly. More trigger points can be added to initiate the automation suite to give additional options to the user. This allows users to choose whether they want to execute batch processing or use single-threaded processing. 42 Chapter 10 Acknowledgements The project has been implemented during the course of CS5604, Information Storage and Retrieval, at Virginia Tech. The data used was the ETDs available on VTechWorks. We would like to thank Dr. Edward Fox for giving us the opportunity to work on this interesting and challenging project. We are grateful for his advice and guidance. We would also like to thank the GTA, Ziqian Song, for her guidance and support throughout the course project. We thank Bipasha Banerjee for her expertise about the ETD data and also for guiding us in the proper direction. We thank other teams for their help in integration, and for sharing their knowledge and insights with us. We also acknowledge the creators of all the open source tools and software packages and libraries we used to implement this project. We also thank IMLS for its support of ETD-related research through grant LG-37-19-0078-19. 43 Bibliography [1] The TEI Guidelines. https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index. html, accessed on Oct. 20, 2019. [2] Apache Tika, 2007 — 2019. https://tika.apache.org/, accessed on October 12, 2019. [3] Grobid, 2008 — 2019. https://github.com/kermitt2/grobid, accessed on October 30, 2019. [4] PyPDF2, May 2014 — 2016. https://pythonhosted.org/PyPDF2/, accessed on October 15, 2019. [5] Science Parse, 2015 — 2019. https://github.com/allenai/science-parse, accessed on October 30, 2019. [6] Ashish, B., Guangchen, L., Beichen, L., and Stephen, L. CS4984/CS5984: Big data text summarization team 10 etds, 2018. http://hdl.handle.net/10919/86418, accessed on October 25, 2019. [7] Elastic. Elasticsearch. https://xebialabs.com/technology/elasticsearch/, accessed on October 20, 2019. [8] Farnaz, K., Ashin, M. T., Chinmaya, P., Dhruv, S., and John, A. CS4984/CS5984: Big data text summarization team 17 etds, 2018. http://hdl.handle.net/10919/86420, accessed on October 25, 2019. [9] Glatthorn, A. A., and Joyner, R. L. Writing the winning thesis or dissertation: A step-by-step guide. Corwin Press, 2005. [10] Haycock, L. A. Citation analysis of education dissertations for collection develop- ment. Library Resources & Technical Services 48, 2 (2013), 102–106. 44 [11] Lipski, M. Stub. https://www.softwaretestingmagazine.com/knowledge/unit- testing-fakes-mocks-and-stubs/, accessed on October 25, 2019. [12] Loper, E., and Bird, S. NLTK: the natural language toolkit. arXiv preprint cs/0205028 (2002). [13] Lopez, P. Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries (Berlin, Heidelberg, 2009), M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou, and G. Tsakonas, Eds., Springer Berlin Heidelberg, pp. 473–474. [14] Naman, A., Ritesh, B., William, I., Palakh, J., Sampanna, K., and Xinyue, W. Big data text summarization: Using deep learning to summarize theses and disserta- tions, 2018. 
http://hdl.handle.net/10919/86406, accessed on October 25, 2019.
[15] Řehůřek, R., and Sojka, P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (Valletta, Malta, May 2010), ELRA, pp. 45–50. http://is.muni.cz/publication/884893/en.
[16] Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[17] Shinyama, Y. PDFMiner, Oct. 2007. https://github.com/euske/pdfminer.
[18] Sperberg-McQueen, C. M., and Burnard, L., Eds. Guidelines for the Encoding and Interchange of Machine-Readable Texts, 1.0 ed. Text Encoding Initiative, Chicago, 1990.
[19] Tkaczyk, D., Collins, A., Sheridan, P., and Beel, J. Evaluation and comparison of open source bibliographic reference parsers: A business use case. CoRR abs/1802.01168 (2018). http://arxiv.org/abs/1802.01168.
[20] Tkaczyk, D., Collins, A., Sheridan, P., and Beel, J. Machine learning vs. rules and out-of-the-box vs. retrained: An evaluation of open-source bibliographic reference and citation parsers. arXiv.org (2018).
[21] Virginia Tech University Libraries. ETDs: Virginia Tech Electronic Theses and Dissertations. https://vtechworks.lib.vt.edu/handle/10919/5534, accessed on Oct. 20, 2019.