Browsing by Author "Kahu, Sampanna"
- Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations
  Ahuja, Naman; Bansal, Ritesh; Ingram, William A.; Jude, Palakh; Kahu, Sampanna; Wang, Xinyue (Virginia Tech, 2018-12-05)
  Team 16 in the fall 2018 course "CS 4984/5984 Big Data Text Summarization," in partnership with the University Libraries and the Digital Library Research Laboratory, prepared a corpus of electronic theses and dissertations (ETDs) for students to study natural language processing with state-of-the-art deep learning technology. The ETD corpus is made up of 13,071 doctoral dissertations and 17,890 master's theses downloaded from the University Libraries' VTechWorks system. This study explores big data summarization for ETDs, a relatively under-explored area. The results of the project will help address the difficulty of information extraction from ETD documents, the potential of transfer learning for automatic summarization of ETD chapters, and the quality of state-of-the-art deep learning summarization technologies when applied to the ETD corpus. The goal of this project is to generate chapter-level abstractive summaries for an ETD collection through deep learning. Major challenges of the project include accurately extracting well-formatted chapter text from PDF files and the lack of labeled data for supervised deep learning models. For PDF processing, we compare two state-of-the-art scholarly PDF data-extraction tools, Grobid and Science-Parse, which generate structured documents from which we can further extract metadata and chapter-level text. For the second challenge, we perform transfer learning by training supervised learning models on a labeled dataset of Wikipedia articles related to the ETD collection. Our experimental models include Sequence-to-Sequence and Pointer-Generator summarization models. Besides the supervised models, we also experiment with an unsupervised reinforcement learning model, Fast Abstractive Summarization-RL. The general pipeline for our experiments consists of the following steps: PDF data processing and chapter extraction, collecting a training dataset of Wikipedia articles, manually creating gold-standard summaries for testing and validation, building deep learning models for chapter summarization, evaluating and tuning the models based on results, and then iteratively refining the whole process.
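  The entry above compares Grobid and Science-Parse for turning ETD PDFs into structured documents. As a minimal sketch of what that extraction step can look like (not the report's actual code), the snippet below posts a PDF to a locally running Grobid server's processFulltextDocument endpoint and pulls section headings and body text out of the returned TEI XML. The server URL and the assumption that chapter or section text appears as top-level <div> elements in the TEI body are illustrative assumptions.

  ```python
  # Minimal sketch: section-level text extraction from an ETD PDF via a
  # local Grobid server. Assumes Grobid is running at GROBID_URL (an
  # assumption; adjust for your deployment) and that section text appears
  # as <div> elements inside the TEI <body>.
  import requests
  from xml.etree import ElementTree as ET

  GROBID_URL = "http://localhost:8070"          # assumed local Grobid instance
  TEI = {"tei": "http://www.tei-c.org/ns/1.0"}  # TEI XML namespace

  def extract_sections(pdf_path):
      """Return a list of (heading, text) pairs, one per TEI <div> in the body."""
      with open(pdf_path, "rb") as f:
          resp = requests.post(
              f"{GROBID_URL}/api/processFulltextDocument",
              files={"input": f},
              timeout=300,
          )
      resp.raise_for_status()
      root = ET.fromstring(resp.text)

      sections = []
      for div in root.findall(".//tei:text/tei:body/tei:div", TEI):
          head = div.find("tei:head", TEI)
          heading = head.text.strip() if head is not None and head.text else ""
          paragraphs = ["".join(p.itertext()).strip() for p in div.findall("tei:p", TEI)]
          sections.append((heading, "\n".join(paragraphs)))
      return sections

  if __name__ == "__main__":
      for heading, text in extract_sections("example_etd.pdf"):
          print(heading, "-", len(text.split()), "words")
  ```

  Chapter boundaries in long ETDs do not always line up neatly with TEI divisions, which is part of why the report compares extraction tools before the summarization step.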
- Classification and extraction of information from ETD documents
  Aromando, John; Banerjee, Bipasha; Ingram, William A.; Jude, Palakh; Kahu, Sampanna (Virginia Tech, 2020-01-30)
  In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well adapted to long documents such as electronic theses and dissertations (ETDs). This report describes three areas of study aimed at improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use to perform multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning.
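  The second entry's first goal is multi-label subject classification of ETDs. Purely as an illustration of that problem setup (not the report's actual model, which the abstract does not specify at this level of detail), the sketch below trains a generic TF-IDF plus one-vs-rest logistic regression baseline on placeholder abstracts, each carrying one or more subject labels.

  ```python
  # Illustrative multi-label classification setup for assigning subject
  # categories to ETD abstracts. The pipeline (TF-IDF + one-vs-rest
  # logistic regression) is a generic baseline, not the report's model;
  # the texts and labels are placeholder data.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.multiclass import OneVsRestClassifier
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import MultiLabelBinarizer

  abstracts = [
      "Deep learning methods for summarizing long scholarly documents.",
      "Finite element analysis of stress in composite beam structures.",
      "Neural networks for parsing bibliographic reference strings.",
      "Experimental study of turbulent flow over bridge decks.",
  ]
  labels = [
      {"Computer Science"},
      {"Civil Engineering", "Mechanics"},
      {"Computer Science", "Library Science"},
      {"Civil Engineering"},
  ]

  # Turn the label sets into a multi-hot matrix (one column per subject).
  mlb = MultiLabelBinarizer()
  y = mlb.fit_transform(labels)

  # One binary logistic-regression classifier per subject category.
  model = make_pipeline(
      TfidfVectorizer(ngram_range=(1, 2)),
      OneVsRestClassifier(LogisticRegression(max_iter=1000)),
  )
  model.fit(abstracts, y)

  pred = model.predict(["Convolutional networks for figure extraction from theses."])
  print(mlb.inverse_transform(pred))
  ```

  The one-vs-rest formulation matters because an ETD can legitimately belong to several subject categories at once, so each category gets its own binary decision rather than competing in a single softmax.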