Browsing by Author "Ingram, William A."
- Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations
  Choudhury, Muntabir; Jayanetti, Himarsha R.; Wu, Jian; Ingram, William A.; Fox, Edward (IEEE, 2021-09-27)
  Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods, such as GROBID, CERMINE, and ParsCit, are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods rely mainly on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved F1 measures of 81.3%-96% on seven metadata fields. The data and source code are publicly available on Google Drive and in a GitHub repository.
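  A minimal sketch of the paper's central idea, a CRF sequence tagger whose per-token features mix text cues with visual/layout cues, is shown below using sklearn-crfsuite. The feature names and token structure are illustrative assumptions, not the paper's actual feature set.

  ```python
  # Sketch: a CRF tagger mixing text-based and visual features with
  # sklearn-crfsuite. Features and toy data are assumptions for illustration.
  import sklearn_crfsuite

  def token_features(tok):
      return {
          "lower": tok["text"].lower(),       # text-based feature
          "is_digit": tok["text"].isdigit(),  # text-based feature
          "font_size": tok["size"],           # visual feature: font size
          "is_bold": tok["bold"],             # visual feature: bold flag
      }

  # Toy training data: one cover-page token sequence with field labels.
  page = [
      {"text": "Mining", "size": 24, "bold": True},
      {"text": "ETDs", "size": 24, "bold": True},
      {"text": "John", "size": 12, "bold": False},
      {"text": "Doe", "size": 12, "bold": False},
  ]
  labels = ["title", "title", "author", "author"]

  crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
  crf.fit([[token_features(t) for t in page]], [labels])
  print(crf.predict([[token_features(t) for t in page]]))
  ```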
- Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations
  Ahuja, Naman; Bansal, Ritesh; Ingram, William A.; Jude, Palakh; Kahu, Sampanna; Wang, Xinyue (Virginia Tech, 2018-12-05)
  Team 16 in the fall 2018 course "CS 4984/5984 Big Data Text Summarization," in partnership with the University Libraries and the Digital Library Research Laboratory, prepared a corpus of electronic theses and dissertations (ETDs) for students to study natural language processing with the power of state-of-the-art deep learning technology. The ETD corpus is made up of 13,071 doctoral dissertations and 17,890 master's theses downloaded from the University Libraries' VTechWorks system. This study explores big data summarization for ETDs, a relatively under-explored area. The results of the project will help address the difficulty of information extraction from ETD documents, the potential of transfer learning for automatic summarization of ETD chapters, and the quality of state-of-the-art deep learning summarization technologies when applied to the ETD corpus. The goal of this project is to generate chapter-level abstractive summaries for an ETD collection through deep learning. Major challenges of the project include accurately extracting well-formatted chapter text from PDF files and the lack of labeled data for supervised deep learning models. For PDF processing, we compare two state-of-the-art scholarly PDF data extraction tools, GROBID and Science-Parse, which generate structured documents from which we can further extract metadata and chapter-level text. For the second challenge, we perform transfer learning by training supervised learning models on a labeled dataset of Wikipedia articles related to the ETD collection. Our experimental models include Sequence-to-Sequence and Pointer-Generator summarization models. Besides supervised models, we also experiment with an unsupervised reinforcement model, Fast Abstractive Summarization-RL. The general pipeline for our experiments consists of the following steps: PDF data processing and chapter extraction, collecting a training data set of Wikipedia articles, manually creating human-generated gold standard summaries for testing and validation, building deep learning models for chapter summarization, evaluating and tuning the models based on results, and then iteratively refining the whole process.
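  The first pipeline step, chapter extraction from PDFs, can be roughly illustrated against GROBID's REST API. The sketch below assumes a GROBID server running locally; treating each top-level <div> of the TEI body as a chapter is an illustrative heuristic, not the team's exact logic.

  ```python
  # Sketch: extract chapter-level text from an ETD PDF via a local GROBID
  # server. The chapter-splitting heuristic is an assumption for illustration.
  import requests
  from xml.etree import ElementTree as ET

  TEI = "{http://www.tei-c.org/ns/1.0}"

  # Send the PDF to GROBID's full-text endpoint (returns TEI XML).
  with open("thesis.pdf", "rb") as f:
      resp = requests.post(
          "http://localhost:8070/api/processFulltextDocument",
          files={"input": f},
          timeout=300,
      )
  resp.raise_for_status()

  # Heuristic: treat each top-level <div> in the TEI body as one chapter.
  root = ET.fromstring(resp.content)
  body = root.find(f".//{TEI}body")
  chapters = []
  for div in body.findall(f"{TEI}div"):
      head = div.find(f"{TEI}head")
      title = "".join(head.itertext()) if head is not None else "(untitled)"
      text = " ".join("".join(p.itertext()) for p in div.findall(f"{TEI}p"))
      chapters.append((title, text))
  print(f"extracted {len(chapters)} chapters")
  ```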
- Building A Large Collection of Multi-domain Electronic Theses and Dissertations
  Uddin, Sami; Banerjee, Bipasha; Wu, Jian; Ingram, William A.; Fox, Edward A. (IEEE, 2021-12-15)
  In this work, we report our progress on building a collection of over 450k Electronic Theses and Dissertations (ETDs), including full text and metadata. Our goal is to close the accessibility gap between long-text and short-text documents and to create new research opportunities for the scholarly community. To that end, we developed an ETD Ingestion Framework (EIF) that automatically harvests metadata and PDFs of ETDs from university libraries. We faced multiple challenges and learned many lessons in the process, which led to proposed solutions for overcoming or mitigating the limitations of the current data. We also describe the data we have collected. We hope our methods will be useful for building similar collections from university libraries and that the data can be used for research and education.
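  The abstract does not say which protocol EIF uses to harvest from university libraries; a plausible sketch uses OAI-PMH via the Sickle library. The endpoint URL and set name below are hypothetical.

  ```python
  # Sketch: harvest ETD metadata from a repository over OAI-PMH with Sickle.
  # OAI-PMH is an assumed protocol here; the endpoint and set are hypothetical.
  from sickle import Sickle

  sickle = Sickle("https://repository.example.edu/oai/request")
  records = sickle.ListRecords(metadataPrefix="oai_dc", set="etd",
                               ignore_deleted=True)

  for record in records:
      meta = record.metadata          # Dublin Core fields -> lists of values
      print(record.header.identifier, meta.get("title"))
      # A full EIF-style harvester would also resolve and download the PDF
      # referenced in fields such as meta.get("identifier").
  ```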
- Classification and extraction of information from ETD documents
  Aromando, John; Banerjee, Bipasha; Ingram, William A.; Jude, Palakh; Kahu, Sampanna (Virginia Tech, 2020-01-30)
  In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well adapted to long documents such as electronic theses and dissertations (ETDs). This report describes three areas of study aimed at improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use for multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning.
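  A minimal sketch of the first task, multi-label subject classification, follows, using scikit-learn with invented toy data; the report's actual models and label set may differ.

  ```python
  # Sketch: multi-label subject classification of ETD text. Documents and
  # labels are invented for illustration.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.multiclass import OneVsRestClassifier
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import MultiLabelBinarizer

  docs = [
      "We train a convolutional network for image segmentation.",
      "A randomized trial of a new vaccine in mice.",
      "Deep learning for medical image diagnosis.",
  ]
  labels = [{"computer science"}, {"biology"}, {"computer science", "biology"}]

  mlb = MultiLabelBinarizer()
  Y = mlb.fit_transform(labels)     # binary indicator matrix, one row per doc

  clf = make_pipeline(
      TfidfVectorizer(),
      OneVsRestClassifier(LogisticRegression(max_iter=1000)),
  )
  clf.fit(docs, Y)
  pred = clf.predict(["Neural networks applied to protein folding."])
  print(mlb.inverse_transform(pred))
  ```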
- Ensuring Scholarly Access to Government Archives and Records
  Ingram, William A.; Johnson, Sylvester A. (Virginia Tech, 2022-01-31)
  This report summarizes the activities and outcomes of a collaborative planning project supported by The Andrew W. Mellon Foundation and organized by the University Libraries at Virginia Tech, in collaboration with the Virginia Tech Center for Humanities and the National Archives and Records Administration (NARA). A diverse group of archivists, librarians, humanists, technologists, information scientists, and computer scientists was convened for a five-part online workshop series to discuss and plan how artificial intelligence and machine learning could be used to ensure public access to the massive and ever-growing collection of government records in the NARA digital catalog. During the workshop, participants identified requirements, developed conceptual models, and discussed a work plan for a subsequent pilot project that would apply state-of-the-art tools and technologies to increase the effectiveness of archival programs and broaden public access to the important content in the NARA catalog. The workshop focused on humanistic and equity issues of artificial intelligence and on developing ethical, human-centered technology that promotes the public good. As such, intentional mitigation of AI bias was a thread that ran through the entire workshop.
- Ensuring Scholarly Access to Government Archives and Records: A Collaboration of Virginia Tech and the National Archives and Records Administration
  Ingram, William A.; Johnson, Sylvester A. (2021-05-19)
- Integrated Digital Library System for Long Documents and their Elements
  Chekuri, Satvik; Chandrasekar, Prashant; Banerjee, Bipasha; Park, Sung Hee; Masrourisaadat, Nila; Ahuja, Aman; Ingram, William A.; Fox, Edward A. (ACM, 2023)
  We describe a next-generation integrated Digital Library (DL) system that addresses the numerous goals associated with long documents such as Electronic Theses and Dissertations (ETDs). Our extensible, workflow-centric design supports a variety of users/personas (e.g., researchers, curators, and experimenters) who can benefit from improved access to ETDs and the content buried therein. Our approach leverages natural language processing, deep learning, information retrieval, and software engineering methods. The services cover ingesting, storing, curating, analyzing, detecting, extracting, classifying, summarizing, topic modeling, browsing, searching, retrieving, recommending, visualizing/reporting, and interacting with ETDs and derivative text/image-based elements/objects. Workflows connect the services and their APIs, along with UI-based access. We believe our approach can guide others in combining tailored user support, research, and education by way of extensible DLs.
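  A toy sketch of the workflow-centric design follows: services are functions over a shared ETD object, and a workflow is an ordered composition of them. All names and signatures are invented for illustration; the system's real APIs are not given in this abstract.

  ```python
  # Sketch: a workflow chaining hypothetical services over one ETD object.
  from dataclasses import dataclass, field

  @dataclass
  class ETD:
      pdf_path: str
      metadata: dict = field(default_factory=dict)
      chapters: list = field(default_factory=list)

  def ingest(pdf_path: str) -> ETD:
      return ETD(pdf_path)

  def extract_metadata(etd: ETD) -> ETD:
      etd.metadata = {"title": "stub title"}    # stand-in for a real extractor
      return etd

  def segment_chapters(etd: ETD) -> ETD:
      etd.chapters = ["chapter 1 text ..."]     # stand-in for a real segmenter
      return etd

  def summarize(etd: ETD) -> ETD:
      etd.metadata["summaries"] = [c[:40] for c in etd.chapters]  # stand-in
      return etd

  # A workflow is an ordered composition of services over the same object.
  workflow = [extract_metadata, segment_chapters, summarize]
  etd = ingest("thesis.pdf")
  for service in workflow:
      etd = service(etd)
  print(etd.metadata)
  ```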
- Maximizing Equitable Reach and Accessibility of ETDs
  Ingram, William A.; Wu, Jian; Fox, Edward A. (ACM, 2023)
  This poster addresses accessibility issues of electronic theses and dissertations (ETDs) in digital libraries (DLs). ETDs are available primarily as PDF files, which present barriers to equitable access, especially for users with visual impairments or cognitive or learning disabilities, or for anyone needing more efficient and effective ways of finding relevant information within these long documents. We propose using AI techniques, including natural language processing (NLP), computer vision, and text analysis, to convert PDFs into machine-readable HTML documents with semantic tags and structure, extracting figures and tables, and generating summaries and keywords. Our goal is to increase the accessibility of ETDs and to make this important scholarship available to a wider audience.
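  The PDF-to-HTML idea can be sketched as rendering extracted elements into semantic tags. The intermediate element list below is a made-up representation; the poster does not specify the actual extraction output format.

  ```python
  # Sketch: render extracted document elements as semantic HTML. The element
  # list is an invented intermediate representation for illustration.
  from html import escape

  elements = [
      ("h1", "Chapter 1: Introduction"),
      ("p", "Theses are long documents..."),
      ("figcaption", "Figure 1.1: System overview"),
  ]

  parts = ["<html><body>"]
  for tag, text in elements:
      parts.append(f"<{tag}>{escape(text)}</{tag}>")
  parts.append("</body></html>")
  print("\n".join(parts))
  ```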
- MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries
  Choudhury, Muntabir Hasan; Salsabil, Lamia; Jayanetti, Himarsha R.; Wu, Jian; Ingram, William A.; Fox, Edward A. (ACM, 2023)
  Metadata quality is crucial for discovering digital objects through digital library (DL) interfaces. However, for various reasons, the metadata of digital objects often exhibits incomplete, inconsistent, and incorrect values. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that utilizes state-of-the-art artificial intelligence (AI) methods to improve the quality of these fields. To evaluate MetaEnhance, we compiled a metadata quality evaluation benchmark containing 500 ETDs by combining subsets sampled using multiple criteria. We evaluated MetaEnhance against this benchmark and found that the proposed methods achieved nearly perfect F1-scores in detecting errors and F1-scores ranging from 0.85 to 1.00 in correcting five of seven key metadata fields. The code and data are publicly available on GitHub: https://github.com/lamps-lab/ETDMiner/tree/master/metadata-correction.
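  Two toy field-level checks in the spirit of MetaEnhance are sketched below; these simple rules are illustrative stand-ins, not the framework's actual AI methods.

  ```python
  # Sketch: detect-and-correct checks for two ETD metadata fields. The rules
  # are illustrative stand-ins for MetaEnhance's actual methods.
  import re

  def detect_year_error(year: str) -> bool:
      # Flag anything that is not a plausible four-digit year.
      return not re.fullmatch(r"(19|20)\d{2}", year or "")

  def canonicalize_degree(degree: str) -> str:
      # Map common variants to one canonical form.
      mapping = {"ph.d.": "PhD", "phd": "PhD", "m.s.": "MS", "ms": "MS"}
      key = re.sub(r"[^a-z.]", "", (degree or "").lower())
      return mapping.get(key, degree)

  record = {"year": "19994", "degree": "Ph.D."}
  if detect_year_error(record["year"]):
      print("year field flagged:", record["year"])
  print("canonical degree:", canonicalize_degree(record["degree"]))
  ```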
- Mining ETDs for Trends in Graduate Research
  Ingram, William A. (2020-11-12)
- Multi-tenancy Cloud Access and Preservation
  Tuttle, James; Chen, Yinlin; Jiang, Tingting; Hunter, Lee; Waldren, Andrea; Ghosh, Soumik; Ingram, William A. (ACM, 2020-08)
  Virginia Tech Libraries has developed a cloud-native, microservices-based digital library platform to consolidate diverse access and preservation infrastructure into a set of flexible, independent microservices in Amazon Web Services. We have been an implementer of and contributor to various community digital library and repository projects, including DSpace, Fedora, and Samvera. However, the complexity and cost of maintaining disparate application stacks have reduced our capacity to build new infrastructure.
- A Multi-Tenancy Cloud-Native Digital Library Platform
  Chen, Yinlin; Ingram, William A.; Tuttle, James (2019-06-11)
  Virginia Tech Libraries presents our next-generation digital library platform. Our design and implementation address the maintainability, sustainability, modularity, and scalability of a digital repository using a cloud-native architecture, in which the entire platform is deployed in a cloud environment, Amazon Web Services (AWS). Our next-gen digital library eschews the old model of multiple siloed systems and embraces a common, sustainable infrastructure. This facilitates a more maintainable approach to managing and providing access to collections, allowing us to focus on content and user experience. The platform is composed of a suite of microservices and cloud services. Microservices implemented as Lambda functions handle specific tasks and communicate with each other and with other cloud services using lightweight asynchronous messaging. Cloud-native application development embodies the future of digital asset management and content delivery. Shared infrastructure throughout the stack and a clear demarcation between front end and back end make the platform more generalizable and support independent replacement of components. We share our experiences and lessons learned developing this digital library platform, including architecture design, microservice implementation, cloud integration, best practices, and practical strategies and directions for developing a cloud-native repository.
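  A minimal sketch of one such microservice follows: an AWS Lambda handler consuming SQS-delivered messages. The event shape follows the standard SQS-to-Lambda record format, but the payload fields are illustrative assumptions.

  ```python
  # Sketch: a Lambda microservice driven by lightweight asynchronous messages.
  # The bucket/key payload fields are assumptions for illustration.
  import json

  def handler(event, context):
      """Process one repository object per SQS-delivered message."""
      for record in event["Records"]:          # standard SQS->Lambda shape
          payload = json.loads(record["body"])
          bucket = payload["bucket"]           # assumed payload field
          key = payload["key"]                 # assumed payload field
          print(f"processing s3://{bucket}/{key}")
          # Task-specific work (e.g., derivative generation) would go here.
      return {"status": "ok", "processed": len(event["Records"])}
  ```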
- A New Annotation Method and Dataset for Layout Analysis of Long Documents
  Ahuja, Aman; Dinh, Kevin; Dinh, Brian; Ingram, William A.; Fox, Edward A. (ACM, 2023-05)
  Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the inherent noise resulting from the capture process, and (c) the imbalanced distribution of various types of elements in the documents. In this paper, we address some of the challenges related to object-detection-based layout analysis for long scholarly documents. First, we propose an AI-aided annotation method to help develop training datasets for object-detection-based layout analysis. This leverages the knowledge of existing trained models to help human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem by guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images, primarily consisting of rare classes, that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.
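  The AI-aided annotation idea can be sketched as triage: a trained detector pre-labels pages, and pages containing rare classes are routed to annotators first. The detector below is a hypothetical stand-in; the paper's framework is more involved.

  ```python
  # Sketch: AI-aided annotation triage. `run_detector` is a hypothetical
  # stand-in for a trained layout-detection model.
  RARE_CLASSES = {"equation", "algorithm", "footnote"}  # assumed rare labels

  def run_detector(page_image):
      # A real system would run a trained model and return
      # (class_name, bbox, score) triples for each detected element.
      return [("paragraph", (0, 0, 100, 20), 0.98),
              ("equation", (0, 30, 100, 50), 0.71)]

  def triage(pages):
      priority, backlog = [], []
      for page in pages:
          detections = run_detector(page)
          classes = {cls for cls, _box, _score in detections}
          (priority if classes & RARE_CLASSES else backlog).append(
              (page, detections)  # predictions double as draft annotations
          )
      return priority, backlog

  priority, backlog = triage(["page-001.png", "page-002.png"])
  print(f"{len(priority)} pages routed to annotators first")
  ```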
- A Study of Computational Reproducibility using URLs Linking to Open Access Datasets and Software
  Salsabil, Lamia; Wu, Jian; Choudhury, Muntabir; Ingram, William A.; Fox, Edward A.; Rajtmajer, Sarah; Giles, C. Lee (ACM, 2022-04-25)
  Datasets and software packages are considered important resources that can be used for replicating computational experiments. With the advocacy of Open Science and growing interest in investigating the reproducibility of scientific claims, including URLs linking to publicly available datasets and software packages has become an institutionalized part of research publications. In this preliminary study, we investigated the disciplinary dependence and chronological trends of including open access datasets and software (OADS) in electronic theses and dissertations (ETDs), based on a hybrid classifier called OADSClassifier, consisting of a heuristic and a supervised learning model. The classifier achieves a best F1 of 0.92. We found that the inclusion of OADS-URLs exhibits a strong disciplinary dependence and that the fraction of ETDs containing OADS-URLs has gradually increased over the past 20 years. We developed and share a ground truth corpus consisting of 500 manually labeled sentences containing URLs from scientific papers. The dataset and source code are available at https://github.com/lamps-lab/oadsclassifier.
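  A hybrid detector in the spirit of OADSClassifier can be sketched as a regex heuristic followed by a supervised model; the training sentences and labels below are invented for illustration.

  ```python
  # Sketch: heuristic URL filter + supervised classifier for OADS sentences.
  # Training data is invented; the actual OADSClassifier is on GitHub.
  import re
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  URL_RE = re.compile(r"https?://\S+")

  train_sents = [
      "Our code is available at https://github.com/example/repo.",
      "The dataset can be downloaded from https://zenodo.org/record/123.",
      "See the conference homepage at https://example.org for details.",
  ]
  train_labels = [1, 1, 0]  # 1 = OADS-URL sentence, 0 = other URL sentence

  clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
  clf.fit(train_sents, train_labels)

  def classify(sentence):
      if not URL_RE.search(sentence):         # heuristic stage
          return 0
      return int(clf.predict([sentence])[0])  # supervised stage

  print(classify("Scripts are released at https://github.com/foo/bar."))
  ```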
- Teaching Natural Language Processing through Big Data Text Summarization with Problem-Based Learning
  Li, Liuqing; Geissinger, Jack H.; Ingram, William A.; Fox, Edward A. (Sciendo, 2020)
  Natural language processing (NLP) covers a large number of topics and tasks related to data and information management, leading to a complex and challenging teaching process. Meanwhile, problem-based learning is a teaching technique specifically designed to motivate students to learn efficiently, work collaboratively, and communicate effectively. With this aim, we developed a problem-based learning course for both undergraduate and graduate students to teach NLP. We provided student teams with big data sets, basic guidelines, cloud computing resources, and other aids to help the teams summarize two types of big collections: Web pages related to events, and electronic theses and dissertations (ETDs). Student teams then deployed different libraries, tools, methods, and algorithms to solve the task of big data text summarization. Summarization is an ideal problem for learning NLP, since it involves all levels of linguistics as well as many of the tools and techniques used by NLP practitioners. The evaluation results showed that all teams generated coherent and readable summaries. Many summaries were of high quality and accurately described their corresponding events or ETD chapters, and the teams produced them, along with NLP pipelines, in a single semester. Further, both undergraduate and graduate students gave statistically significant positive feedback relative to other courses in the Department of Computer Science. Accordingly, we encourage educators in the data and information management field to use our approach or similar methods in their teaching, and we hope that other researchers will also use our data sets and synergistic solutions to approach the new and challenging tasks we addressed.
- Why and How We Went Serverless, and How You Can Too
  Chen, Yinlin; Ingram, William A. (2021-03-15)
  In this presentation, we share our experience adopting serverless techniques and building the next generation of our digital library platform in the AWS cloud. We use this platform to manage complex digital objects and preserve large-scale datasets, something that was very challenging for us to build on premises at a similar scale in terms of storage, networking, scalability, and availability. We further present how serverless removes technical barriers and how we can now achieve more precise cost management, resource utilization, and automation than we were ever able to before.
- Why We’re Here: Ensuring Scholarly Access to Government Archives and Records
  Ingram, William A. (2021-04-09)