CS4994: Undergraduate Research/Independent Study

Recent Submissions

  • Knowledge Graph Aided Retrieval System for Electronic Theses and Dissertations (ETDs)
    Clemmitt, Keenan; Kondaka, Kashyap; Hill, Andrew (2024-08-14)
    Electronic Theses and Dissertations (ETDs) are digital versions of academic theses and dissertations. These documents present the research and findings of master’s and doctoral students. ETDs are typically a requirement for graduation and are accessible online through university repositories or academic databases. They are an integral contribution to scholarly work because they make research accessible to a global audience. Their length, which can run to hundreds of pages, allows the inclusion of helpful details but can be a challenge for readers. This project’s goal was to build on the previous teams’ work, first by analyzing the existing machine-learning models and determining how they could be improved. To do so, we familiarized ourselves with optical character recognition (OCR) and object detection (OD). Next, we each generated XML files and analyzed the robustness of the OD model. Where the model made errors, we annotated seven ETDs to provide training data for further improving the OD model. We also studied the existing Postgres database and how to better integrate it with the knowledge graph (KG). We ran into issues with the API calls responsible for posting the ETD metadata to Postgres, so we modified those calls and restructured the ETD metadata table. Once a valid XML document has been created, it can be analyzed and enhanced with IDs from the Postgres database. The XML file, with its object IDs correctly inserted, is then converted into a JSON file and subsequently into RDF triples. These RDF triples are uploaded to the Virtuoso database to constitute our knowledge graph. The KG stores the objects as nodes, and the edges represent the relationships between the objects. We worked to improve the pipeline from XML to KG and recommended further work to ensure correctness and scalability.
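
    The XML-to-JSON-to-RDF step described in the abstract can be illustrated with a minimal sketch. It is not the project's actual code: the element names (etd, chapter, figure, table), the id attribute, the example.org base IRI, and the hasPart predicate are all assumptions made for illustration, and loading the resulting triples into Virtuoso is left out. The sketch uses only the Python standard library and emits N-Triples lines, with each object becoming a node and each parent-child relationship becoming an edge.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical base IRI; the project's real namespace is not given in the abstract.
BASE = "http://example.org/etd/"
# Standard rdf:type predicate IRI.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical XML layout: every detected object carries an "id" attribute,
# standing in for the object IDs assigned via the Postgres database.
SAMPLE_XML = """<etd id="etd-1">
  <chapter id="ch-1">
    <figure id="fig-1"/>
    <table id="tab-1"/>
  </chapter>
</etd>"""


def xml_to_dict(elem):
    """Recursively convert an XML element into a JSON-serializable dict."""
    return {
        "type": elem.tag,
        "id": elem.attrib["id"],
        "children": [xml_to_dict(child) for child in elem],
    }


def dict_to_triples(node):
    """Yield N-Triples lines: a type triple per node, a hasPart edge per child."""
    subject = f"<{BASE}{node['id']}>"
    yield f"{subject} <{RDF_TYPE}> <{BASE}class/{node['type']}> ."
    for child in node["children"]:
        yield f"{subject} <{BASE}hasPart> <{BASE}{child['id']}> ."
        yield from dict_to_triples(child)


def xml_to_ntriples(xml_text):
    """Run the XML -> JSON -> RDF triples conversion on one document."""
    root = ET.fromstring(xml_text)
    as_json = json.dumps(xml_to_dict(root))  # intermediate JSON form
    return list(dict_to_triples(json.loads(as_json)))


if __name__ == "__main__":
    for line in xml_to_ntriples(SAMPLE_XML):
        print(line)
```

    Round-tripping through JSON here simply mirrors the intermediate file the abstract mentions; a production pipeline would more likely use an RDF library to validate and serialize the triples.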