Information Storage and Retrieval - CS5604 Final Report Team 1 (Knowledge Graphs)

Date

2023-12-04

Publisher

Virginia Tech

Abstract

This project presents an ontological framework that makes Electronic Theses and Dissertations (ETDs) more useful by transforming them into a semantically enriched Knowledge Graph representation. The central technical approach uses widely accepted standards, namely the Resource Description Framework (RDF) and the Web Ontology Language (OWL), to convert ETD metadata into machine-readable RDF triples, each associated with unique Uniform Resource Identifiers (URIs). RDF Schema and OWL ontologies explicitly define the classes, properties, and relationships of entities within the ETD domain. The resulting structured ETD Knowledge Graph encodes rich semantics and interconnections between entities based on these predefined ontologies. A major innovative aspect of the framework is its semantic search capability, which supports complex contextual queries by exploiting the graph structure. This search functionality relies on the standards-based SPARQL Protocol and RDF Query Language (SPARQL). Query processing employs graph traversal algorithms, which let users perform in-depth exploratory searches and uncover non-obvious insights and connections in the Knowledge Graph. We began by analyzing a corpus of roughly 200 ETDs to understand the structure of, and relationships among, the constituent components of an ETD. Based on this study, we compiled 76 predicates, 7 objects, and 37 subjects, which together cover the relationships within an ETD in an ontology built both top-down and bottom-up. The total number of triples generated for the 200 ETDs was roughly 95,000. This brings the average number of triples per ETD to around 600. Based on this ontology, we process the ETDs, which we ingest in XML format, to generate the RDF triples and persist them in the Virtuoso database.
We provide a SPARQL query interface for executing heterogeneous queries against this repository of ETD data. To build our data pipeline and back-end system, we used event-driven programming, Representational State Transfer (REST) APIs, version control, web technologies, and Agile methodologies. To supply ancillary metadata for the URIs stored in the Knowledge Graph, we designed a URI resolution microservice that integrates with the central PostgreSQL database. To ensure scalability, our framework uses Dockerized deployments in Virginia Tech's cloud ecosystem, which relies on distributed computing techniques. This configuration enables the processing of RDF graphs even for very large ETD datasets. The cluster infrastructure scales horizontally, so it can handle growing Knowledge Graphs and increasing query workloads. The OpenLink Virtuoso graph database stores and indexes the ontology-based entities and their relationships. To automate the data pipeline and scale data storage in Virtuoso, we aim to use Kafka's event-based architecture. We have laid the groundwork in terms of design, but the implementation could not be completed this semester due to complications within other teams. We also outline an extension of the current version of the Knowledge Graph (KG) as part of future work. To effectively represent ETDs and capture semantic information, the KG should reflect both intra- and inter-document entity-entity relations. The KG structure, a directed heterogeneous graph with domain-related semantics, addresses the limitations of transformer models in handling long documents.
We propose building an entity-based KG for a document collection by extracting entity-related triples with improved Open Information Extraction (OpenIE) and Named Entity Recognition (NER) methods, which enriches the graph and improves coverage and performance in downstream tasks.
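
As a rough illustration of the XML-to-triples step described in the abstract, the following minimal Python sketch turns one ETD metadata record into subject-predicate-object triples and serializes them as N-Triples. It is only a sketch under stated assumptions, not the team's implementation: the XML layout, the `http://example.org/etd/` namespace, the three Dublin Core predicates, and the helper names `etd_to_triples`, `to_ntriples`, and `match` are all hypothetical, and the pattern matcher merely stands in for what a SPARQL engine such as Virtuoso would do.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces: the report's actual URIs are not given in the abstract.
BASE = "http://example.org/etd/"
DC = "http://purl.org/dc/terms/"  # Dublin Core terms, chosen only for illustration

# Hypothetical XML layout for one ETD metadata record.
SAMPLE_XML = """<etd id="etd-001">
  <title>Graph Methods for Digital Libraries</title>
  <author>Jane Doe</author>
  <year>2023</year>
</etd>"""

# Illustrative element-to-predicate mapping (not the report's 76 predicates).
PREDICATES = {
    "title": DC + "title",
    "author": DC + "creator",
    "year": DC + "date",
}


def etd_to_triples(xml_text):
    """Return (subject, predicate, literal) tuples for one ETD record."""
    root = ET.fromstring(xml_text)
    subject = BASE + root.attrib["id"]
    triples = []
    for child in root:
        predicate = PREDICATES.get(child.tag)
        if predicate and child.text:
            triples.append((subject, predicate, child.text.strip()))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples lines with minimal literal escaping."""
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    triples = etd_to_triples(SAMPLE_XML)
    print(to_ntriples(triples))
    print(match(triples, p=DC + "creator"))
```

In a deployment like the one the abstract describes, the serialized triples would be loaded into the triple store and the lookup would be written as a SPARQL query rather than as an in-memory filter.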

Description

Team1Report.pdf is the PDF version of the final report. Team1Report.zip is the Overleaf project version of that report. Team1Presentation.pdf is the PDF version of the final presentation. Team1Presentation.pptx is the PowerPoint version of the final presentation.

Keywords

Knowledge Graph, RDF Triple, Virtuoso, SPARQL Query, ETD, URI Resolution

Citation