Browsing by Author "Ahuja, Aman"
Now showing 1 - 4 of 4
- Analyzing and Navigating Electronic Theses and Dissertations
  Ahuja, Aman (Virginia Tech, 2023-07-21)
  Electronic Theses and Dissertations (ETDs) contain scholarly information that can be of immense value to the scholarly community. Millions of ETDs are now publicly available online, often through one of many digital libraries. However, since a majority of these digital libraries are institutional repositories whose objective is content archiving, they often lack the end-user services needed to make this data useful for the scholarly community. To effectively utilize such data to address the information needs of users, digital libraries should support various end-user services such as document search and browsing, document recommendation, and services that make navigation of long PDF documents easier. In recent years, with advances in machine learning for text data, several techniques have been proposed to support such end-user services. However, limited research has been conducted on integrating such techniques with digital libraries. This research is aimed at building tools and techniques for discovering and accessing the knowledge buried in ETDs, and at supporting end-user services for digital libraries, such as document browsing and long document navigation. First, we review several machine learning models that can be used to support such services. Next, to support a comprehensive evaluation of different models, as well as to train models that are tailored to the ETD data, we introduce several new datasets from the ETD domain. To minimize the resources required to develop the high-quality datasets needed for supervised training, a novel AI-aided annotation method is also discussed. Finally, we propose techniques and frameworks to support the various digital library services such as search, browsing, and recommendation.
  The key contributions of this research are as follows:
  - A system to help with parsing long scholarly documents such as ETDs by means of object-detection methods trained to extract digital objects from long documents. The parsed documents can be used for further downstream tasks such as long document navigation, figure and/or table search, etc.
  - Datasets to support supervised training of object detection models on scholarly documents of multiple types, such as born-digital and scanned. In addition to manually annotated datasets, a framework (along with the resulting dataset) for AI-aided annotation is also proposed.
  - A web-based system for information extraction from long PDF theses and dissertations into a structured format such as XML, aimed at making scholarly literature more accessible to users with disabilities.
  - A topic-modeling-based framework to support exploration tasks such as searching and/or browsing documents (and document portions, e.g., chapters) by topic, document recommendation, topic recommendation, and describing temporal topic trends.
- CS5604 Fall 2017 Clustering and Topic Analysis
  Baghudana, Ashish; Ahuja, Aman; Bellam, Pavan; Chintha, Rammohan; Sambaturu, Pratyush; Malpani, Ashish; Shetty, Shruti; Yang, Mo (Virginia Tech, 2018-01-13)
  One of the key objectives of the CS-5604 course titled Information Storage and Retrieval is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and to develop a retrieval system to extract information from that archive. Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project. Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework, and finally, outline a user manual describing the fields that we populate in HBase.
- Integrated Digital Library System for Long Documents and their Elements
  Chekuri, Satvik; Chandrasekar, Prashant; Banerjee, Bipasha; Park, Sung Hee; Masrourisaadat, Nila; Ahuja, Aman; Ingram, William A.; Fox, Edward A. (ACM, 2023)
  We describe a next-generation integrated Digital Library (DL) system that addresses the numerous goals associated with long documents such as Electronic Theses and Dissertations (ETDs). Our extensible workflow-centric design supports a variety of users/personas (e.g., researchers, curators, and experimenters) who can benefit from improved access to ETDs and the content buried therein. Our approach leverages natural language processing, deep learning, information retrieval, and software engineering methods. The services cover ingesting, storing, curating, analyzing, detecting, extracting, classifying, summarizing, topic modeling, browsing, searching, retrieving, recommending, visualizing/reporting, and interacting with ETDs and derivative text/image-based elements/objects. Workflows connect the services and their APIs, along with UI-based access. We believe our approach can guide others to combine tailored user support, research, and education by way of extensible DLs.
- A New Annotation Method and Dataset for Layout Analysis of Long Documents
  Ahuja, Aman; Dinh, Kevin; Dinh, Brian; Ingram, William A.; Fox, Edward A. (ACM, 2023-05)
  Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the inherent noise resulting from the capture process, and (c) the imbalanced distribution of various types of elements in the documents. In this paper, we address some of the challenges related to object-detection-based layout analysis for scholarly long documents. First, we propose an AI-aided annotation method to help develop training datasets for object-detection-based layout analysis. This leverages the knowledge of existing trained models to help human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem, guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images primarily consisting of rare classes that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.