CS5604: Information Retrieval

This collection contains the final projects of students in various offerings of the course Computer Science 5604: Information Retrieval, taught by Professor Ed Fox. The course covers analyzing, indexing, representing, storing, searching, retrieving, processing, and presenting information and documents using fully automatic systems. The information may be in the form of text, hypertext, multimedia, or hypermedia. The systems are based on various models, such as Boolean logic, fuzzy logic, and probability theory, and they are implemented using inverted files, relational thesauri, special hardware, and other approaches. The course also covers evaluation of the systems' efficiency and effectiveness.

Recent Submissions

Showing 1–20 of 51 submissions
  • Team 3: Object Detection and Topic Modeling (Objects&Topics) CS 5604 F2022
    Devera, Alan; Sahu, Raj; Masrourisaadat, Nila; Amirthalingam, Nirmal; Mao, Chenyu (Virginia Tech, 2023-01-17)
    The CS 5604: Information Storage and Retrieval class (Fall 2022), led by Dr. Edward Fox, has been assigned the task of designing and implementing a state-of-the-art information retrieval and analysis system that will support Electronic Theses & Dissertations (ETDs). Given a large collection of ETDs, we want to run different kinds of learning algorithms to categorize them into logical groups and, by the end, be able to suggest to an end user the documents that are strongly related to the one they are looking for. The overall goal of the project is to have a service that can upload, search, and retrieve ETDs with their derived digital objects, in a human-readable format. Specifically, our team is tasked with analyzing documents using object detection and topic models, with the final deliverable being the Experimenter web page for the derived objects and topics. The object detection team worked with Faster R-CNN and YOLOv7 models, and implemented post-processing rules for saving objects in a structured format. As the final deliverable for object detection, inference on 5k ETDs has been completed, and the refined objects have been saved to the Repository. The topic modeling team worked on clustering ETDs into 10, 25, 50, and 100 topics with different models (LDA, NeuralLDA, CTM, ProdLDA). As the final deliverable for topic modeling, we store the related topics and related documents for 5k ETDs in the Team 1 database, so that Team 2 can provide the related topics and documents on the documents page. By the end of the semester, the team was able to deliver the Experimenter web page for the derived objects and topics, and the related objects and topics for 5k ETDs stored in the Team 1 database.
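    To give a concrete sense of the topic modeling workflow this abstract refers to, the sketch below fits a small LDA model with scikit-learn. It is illustrative only and not the team's pipeline; the toy corpus, vectorizer settings, and topic count are assumptions chosen for brevity.

```python
# Minimal LDA topic-modeling sketch (illustrative only; not the Team 3 pipeline).
# The tiny in-memory corpus, topic count, and vectorizer settings are assumptions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

etd_abstracts = [
    "deep learning models for image classification and object detection",
    "finite element analysis of structural beams under dynamic load",
    "topic models and clustering for large document collections",
]

# Bag-of-words features; real ETD text would need cleaning and richer preprocessing.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(etd_abstracts)

# Fit a small LDA model; the report experiments with 10, 25, 50, and 100 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

# Print the top words per topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```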
  • Team 2 for End Users
    Paidiparthy, Manoj Prabhakar; Ramanujan, Ramaraja; Teegalapally, Akshita; Muralikrishnan, Madhuvanti; Balar, Romil Khimraj; Juvekar, Shaunak; Murali, Vivek (Virginia Tech, 2023-01-11)
    A large collection of Electronic Theses and Dissertations (ETDs) holds valuable information. However, accessing the information in these documents has proven challenging, as the process is mostly manual. We propose to build a unique Information Retrieval System that will support searching, ranking, browsing, and recommendations for a large collection of ETDs. The system indexes the digital objects related to each ETD, such as documents and chapters. The user can then query the indexed objects through a carefully designed web interface. The web interface provides users with utilities to sort, filter, and query specific fields. We have incorporated machine learning models to support semantic search. To enhance user engagement, we provide the user with a list of recommended documents based on the user's actions and topics of interest. A total of 57,130 documents and 21,537 chapters were indexed. The system was tested by the Fall 2022 CS 5604 class, which had 28 members, and was found to fulfill most of the goals set out at the beginning of the semester.
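    The indexing and querying described above is the kind of workload commonly handled with an Elasticsearch client. The sketch below is a minimal illustration, not Team 2's code; the index name, document fields, and localhost endpoint are assumptions (Python client 8.x assumed).

```python
# Minimal Elasticsearch index-and-search sketch (illustrative; not Team 2's system).
# The index name, document fields, and localhost endpoint are assumptions.
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x assumed

es = Elasticsearch("http://localhost:9200")

# Index a single ETD-like document; a real pipeline would bulk-index thousands.
doc = {
    "title": "A Study of Neural Topic Models",
    "abstract": "We compare LDA and neural topic models on a large ETD corpus.",
    "year": 2022,
}
es.index(index="etds", id="etd-0001", document=doc)
es.indices.refresh(index="etds")

# Full-text query over title and abstract, the kind of search a web UI would issue.
resp = es.search(
    index="etds",
    query={"multi_match": {"query": "topic models", "fields": ["title", "abstract"]}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```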
  • CS5604: Team 1 ETD Collection Management
    Jain, Tanya; Bhagat, Hirva; Lee, Wen-Yu; Thukkaraju, Ashrith Reddy; Sethi, Raghav (Virginia Tech, 2023-01-13)
    Academic institutions the world over produce hundreds of thousands of ETDs (Electronic Theses and Dissertations) every year. At the end of an academic year, we are left with large volumes of ETD data that are rarely used for further research or cited in future work, writings, or publications. As part of the CS5604: Information Storage and Retrieval graduate-level course at Virginia Polytechnic Institute and State University (Virginia Tech), we collectively created a search engine for a collection of more than 500,000 ETDs from academic institutions in the United States, which constitutes the class-wide project. This system enables users to ingest, pre-process, and store ETDs in a repository, and to apply deep learning models to perform topic modeling, text segmentation, chapter summarization, and classification, backed by a DevOps, user experience, and integrations team. We are Team 1, the “ETD Collection Management” team. During the Fall 2022 semester at Virginia Tech, we were responsible for setting up the repository of ETDs, which broadly encompasses the following three components: (1) setting up a database, (2) storing digital objects in a file system, and (3) creating a knowledge graph. Our work enabled other teams to efficiently retrieve the stored ETD data, perform appropriate pre-processing operations, and, during the final few months of the semester, apply the aforementioned deep learning models to the ETD collection we created. The key deliverable for Team 1 was an interactive user interface for performing CRUD operations (create, retrieve, update, and delete) against the repository of ETDs, which extends the work already taken up at Virginia Tech’s Digital Library Research Laboratory. Because the other teams had no direct access to the repository we set up, we designed a set of Application Programming Interfaces (APIs), which are elaborated in depth in the subsequent sections of the report. The end goal for Team 1 was an accessible repository of ETDs that can be used for further research, given that each ETD is a well-curated resource that can serve as an excellent asset for in-depth analysis of a topic, not limited to academic or research purposes.
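    The CRUD-style APIs this abstract refers to could take many forms; the sketch below shows one minimal way to expose create/retrieve/update/delete endpoints over an in-memory store using FastAPI. It is purely illustrative and not Team 1's actual API; the routes, fields, and storage are assumptions.

```python
# Minimal CRUD-style API sketch (illustrative; not Team 1's actual API).
# Routes, field names, and the in-memory store are assumptions.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
etds: dict[str, dict] = {}  # stand-in for the real database / file system


class ETD(BaseModel):
    title: str
    author: str
    year: int


@app.post("/etds/{etd_id}")
def create_etd(etd_id: str, etd: ETD):
    etds[etd_id] = etd.model_dump()  # pydantic v2
    return {"id": etd_id, **etds[etd_id]}


@app.get("/etds/{etd_id}")
def read_etd(etd_id: str):
    if etd_id not in etds:
        raise HTTPException(status_code=404, detail="ETD not found")
    return etds[etd_id]


@app.put("/etds/{etd_id}")
def update_etd(etd_id: str, etd: ETD):
    if etd_id not in etds:
        raise HTTPException(status_code=404, detail="ETD not found")
    etds[etd_id] = etd.model_dump()
    return etds[etd_id]


@app.delete("/etds/{etd_id}")
def delete_etd(etd_id: str):
    etds.pop(etd_id, None)
    return {"deleted": etd_id}
```

    Served with a standard ASGI server such as uvicorn, endpoints like these are the kind of interface other teams' services would call.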
  • CS5604 Fall 2022 - Team 5 INT
    Shukla, Anmol; Travasso, Aaron; Manogaran, Harish Babu; Sisodia, Pallavi Kishor; Li, Yuze (Virginia Tech, 2022-01-08)
    The primary objective of the project is to build a state-of-the-art system to search and retrieve relevant information effectively from a large corpus of electronic theses and dissertations. The system is targeted toward documents such as academic textbooks, dissertations, and theses, where the amount of information is enormous compared to the websites or blogs that conventional search engines are equipped to handle effectively. The work involved in developing the system has been divided into five areas: data management (Team-1, Curator); search and retrieval (Team-2, User); object detection and topic analysis (Team-3, Objects & Topics); language models, classification, summarization, and segmentation (Team-4, Classification & Summarization); and lastly integration (Team-5, Integration). The teams and their operations are structured to mirror the environment of a company working on new product development. The Integration (INT) team focuses on setting up work environments with all requirements for the teams, integrating the work done by the other four teams, and deploying suitable Docker containers for seamless operation (workflow), along with maintaining the cluster infrastructure. The INT team archives this distribution of code and containers on the Virginia Tech Docker Container Registry and deploys it on the Virginia Tech CS Cloud. The INT team also guides team evaluations of prospective container components and workflows. Additionally, the team implements continuous integration and continuous deployment to enable seamless integration, building, and testing of code as it is developed. Furthermore, the team works on setting up a workflow management system that employs Apache Airflow to automate the creation, scheduling, and monitoring of workflows. We have created customized containers for each team based on their individual requirements. We have developed a workflow management system using Apache Airflow that creates and manages workflows to achieve the goals of each team, such as indexing, object detection, segmentation, summarization, and classification. We have also implemented a Continuous Integration and Continuous Deployment (CI/CD) pipeline to automatically create, test, and deploy the updated image whenever a new push is made to a Git repository. Additionally, we extended our support to other teams in troubleshooting the issues they faced in deployment. Our current cluster statistics (i.e., Kubernetes Resource Definitions) are: 45 deployments, 40 ingresses, 39 pods, 180 services, and 13 secrets. Lastly, the INT team would like to express its gratitude for the work of the INT-2020 team and the predecessors who did the substantial work upon which we built, and to acknowledge here their significant contribution.
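    The Airflow-based workflow management mentioned above boils down to defining DAGs of tasks. The sketch below is a minimal, illustrative DAG, not the INT team's actual workflow; the task names, schedule, and no-op callable are assumptions (Airflow 2.x assumed).

```python
# Minimal Apache Airflow DAG sketch (illustrative; not the INT team's actual workflows).
# Task names, schedule, and the no-op Python callable are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_step(step: str):
    # A real task would trigger a team's containerized service instead of printing.
    print(f"running {step}")


with DAG(
    dag_id="etd_processing_sketch",
    start_date=datetime(2022, 9, 1),
    schedule_interval=None,  # Airflow 2.x; triggered manually in this sketch
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=run_step, op_args=["ingest"])
    detect = PythonOperator(task_id="object_detection", python_callable=run_step, op_args=["object_detection"])
    summarize = PythonOperator(task_id="summarization", python_callable=run_step, op_args=["summarization"])
    index = PythonOperator(task_id="indexing", python_callable=run_step, op_args=["indexing"])

    # Ingest first, then the analysis steps in parallel, then indexing.
    ingest >> [detect, summarize] >> index
```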
  • Team 4: Segmentation, Summarization, and Classification
    Ganesan, Kaushik; Nanjundan, Deepak; Srivastava, Deval; Neog, Abhilash; Jayaprakash, Dharneeshkar; Shah, Aditya (Virginia Tech, 2023-01-11)
    Under the guidance of Dr. Edward A. Fox, the CS5604 class of the Fall 2022 semester at Virginia Tech was assigned the task of building an Information Retrieval and Analysis System that can support a collection of at least 5000 Electronic Theses and Dissertations (ETDs). The system would act as a search engine, supporting a number of features, such as searching, providing recommendations, ranking search results, and browsing. In order to achieve this, the class was divided into five teams, each assigned separate tasks with the intent of collaborating through CI/CD. The roles can be described as follows: Content and Representation, End-user Recommendation and Search, Object Detection and Topic Models, Classification and Summarization with Language Models, and Integration and Coordination. The intent of this report is to outline the contribution of Team 4, which focuses on language models, classification, summarization, and segmentation. In this project, Team 4 was successful in reproducing Akbar Javaid Manzoor’s pipeline to segment ETDs into chapters, summarize the segmented chapters using extractive and abstractive summarization techniques, and classify the chapters using deep learning and language models. Using the APIs developed by Team 1, Team 4 was also tasked with storing the outcomes for 5000 ETDs in the file system and database. Team 4 containerized the services and assisted Team 5 with workflow automation to help automate the services. The project’s main lessons were effective team collaboration, efficient code maintenance, containerization of services, upkeep of a CI/CD workflow, and finally effective information storage and retrieval at scale. The report describes the goals, tasks, and achievements, along with our coordination with the other teams in completing the higher-level tasks concerning the entire project.
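    As a small illustration of the extractive side of the summarization work described above, the sketch below scores sentences by mean TF-IDF weight and keeps the top-ranked ones. It is not Team 4's method (which relies on language models); the sentence splitter, scoring scheme, and example text are assumptions.

```python
# Minimal extractive-summarization sketch (illustrative; not Team 4's language models).
# Scores sentences by mean TF-IDF weight and keeps the top k; all settings are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def extractive_summary(text: str, k: int = 2) -> str:
    # Naive sentence splitting; a real pipeline would use a proper sentence tokenizer.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) <= k:
        return text
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    top = sorted(np.argsort(scores)[::-1][:k])  # keep the original sentence order
    return ". ".join(sentences[i] for i in top) + "."


chapter = (
    "This chapter describes the segmentation pipeline. "
    "Chapters are detected from the table of contents. "
    "Each chapter is then summarized and classified. "
    "Results are stored through the Team 1 APIs."
)
print(extractive_summary(chapter, k=2))
```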
  • Integration and Implementation (INT) CS 5604 F2020
    Hicks, Alexander; Thazhath, Mohit; Gupta, Suraj; Long, Xingyu; Poland, Cherie; Hsieh, Hsinhan; Mahajan, Yash (Virginia Tech, 2020-12-18)
    The first major goal of this project is to build a state-of-the-art information storage, retrieval, and analysis system that utilizes the latest technology and industry methods. This system is leveraged to accomplish another major goal, supporting modern search and browse capabilities for a large collection of tweets from the Twitter social media platform, web pages, and electronic theses and dissertations (ETDs). The backbone of the information system is a Docker container cluster running with Rancher and Kubernetes. Information retrieval and visualization are accomplished with containers in a pipelined fashion, whether in the cluster or on virtual machines, for Elasticsearch and Kibana, respectively. In addition to traditional searching and browsing, the system supports full-text and metadata searching. Search results include facets as a modern means of browsing among related documents. The system supports text analysis and machine learning to reveal new properties of collection data. These new properties assist in the generation of available facets. Recommendations are also presented with search results based on associations among documents and with logged user activity. The information system is co-designed by five teams of Virginia Tech graduate students, all members of the same computer science class, CS 5604. Although the project is an academic exercise, it is the practice of the teams to work and interact as though they are groups within a company developing a product. The teams on this project include three collection management groups -- Electronic Theses and Dissertations (ETD), Tweets (TWT), and Web-Pages (WP) -- as well as the Front-end (FE) group and the Integration (INT) group, which helps provide the overarching structure for the application. This submission focuses on the work of the Integration (INT) team, which creates and administers Docker containers for each team in addition to administering the cluster infrastructure. Each container is a customized application environment that is specific to the needs of the corresponding team. Each team has several of these containers set up in a pipeline formation to allow scaling and extension of the current system. The INT team also contributes to a cross-team effort exploring the use of Elasticsearch and its internally associated database. The INT team administers the integration of the Ceph data storage system into the CS Department Cloud and provides support for interactions between containers and the Ceph filesystem. During formative stages of development, the INT team also has a role in guiding team evaluations of prospective container components and workflows. The INT team is responsible for the overall project architecture and for facilitating the tools and tutorials that assist the other teams in deploying containers in a development environment according to mutual specifications agreed upon with each team. The INT team maintains the status of the Kubernetes cluster, deploying new containers and pods as needed by the collection management teams as they expand their workflows. This team is responsible for utilizing a continuous integration process to update existing containers. During the development stage the INT team collaborates specifically with the collection management teams to create the pipeline for the ingestion and processing of new collection documents, crossing services between those teams as needed.
The INT team develops a reasoner engine to construct workflows with an information goal as input; the workflows are then programmatically authored, scheduled, and monitored using Apache Airflow. The INT team is responsible for the flow, management, and logging of system performance data and for making any adjustments necessary based on the analysis of testing results. The INT team has established a GitLab repository for archival code related to the entire project and has provided the other groups with the documentation needed to deposit their code in the repository. This repository will be expanded using GitLab CI in order to provide continuous integration and testing once it is available. Finally, the INT team will provide a production distribution that includes all embedded Docker containers and sub-embedded Git source code repositories. The INT team will archive this distribution on the Virginia Tech Docker Container Registry and deploy it on the Virginia Tech CS Cloud. The INT-2020 team owes a sincere debt of gratitude to the work of the INT-2019 team. This is a very large undertaking, and the wrangling of all of the products and processes would not have been possible without their guidance in both direct and written form. We have relied heavily on the foundation they and their predecessors have provided for us. We continue their work with systematic improvements, but also want to acknowledge their efforts here. Without them, our progress to date would not have been possible.
  • CS 5604: Information Storage and Retrieval - Webpages (WP) Team
    Barry-Straume, Jostein; Vives, Cristian; Fan, Wentao; Tan, Peng; Zhang, Shuaicheng; Hu, Yang; Wilson, Tishauna (Virginia Tech, 2020-12-18)
    The first major goal of this project is to build a state-of-the-art information retrieval engine for searching webpages and for opening up access to existing and new webpage collections resulting from Digital Library Research Laboratory (DLRL) projects relating to eventsarchive.org. The task of the Webpage (WP) team was to provide the functionality of making any archived webpage accessible and indexed. The webpages can be obtained either through event-focused crawlers or from collections of data, such as WARC files containing webpages or sets of tweets which contain embedded URLs. Toward completion of the project, the WP team worked on four major tasks: (1) making the contents of WARC files searchable through Elasticsearch; (2) making the cleaned contents of WARC files searchable through Elasticsearch; (3) running an event-focused crawler and producing WARC files; and (4) making additional extracted/derived information (e.g., dates, classes) searchable. The foundation of the software is a Docker container cluster employing Airflow, a Reasoner, and Kubernetes. The raw data of the information content of the given webpage collections is stored using the Network File System (NFS), while Ceph is used for persistent storage for the Docker containers. Retrieval and analysis of the webpage collection are carried out with Elasticsearch, and visualization with Kibana. These two technologies form an Elastic Stack application which serves as the vehicle with which the WP team indexes, maps, and stores the processed data and model outputs with regard to webpage collections. The software is co-designed by 7 Virginia Tech graduate students, all members of the same computer science class, CS 5604: Information Storage and Retrieval. The course is taught by Professor Edward A. Fox. Dr. Fox structures the class in a way for his students to perform in a “mock” business development setting. In other words, the academic project submitted by the WP team can, for all intents and purposes, be viewed as a microcosm of software development within a corporate structure. This submission focuses on the work of the WP team, which creates and administers Docker containers such that various services are tested and deployed as a whole. Said services pertain solely to the ingestion, cleansing, analysis, extraction, classification, and indexing of webpages and their respective content.
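    The WARC ingestion step described above is commonly done with a WARC-reading library. The sketch below uses warcio to iterate over HTTP response records and pull out each page's URL and payload; it is illustrative only, not the WP team's ingestion code, and the file path is an assumption.

```python
# Minimal WARC-reading sketch (illustrative; not the WP team's ingestion code).
# Requires the `warcio` package; the file path is an assumption.
from warcio.archiveiterator import ArchiveIterator


def iter_pages(warc_path: str):
    """Yield (url, payload_bytes) for each HTTP response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                payload = record.content_stream().read()
                yield url, payload


if __name__ == "__main__":
    for url, html in iter_pages("sample.warc.gz"):
        # Downstream steps would clean the HTML and index the text in Elasticsearch.
        print(url, len(html))
```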
  • CS5604 (Information Retrieval) Fall 2020 Front-end (FE) Team Project
    Cao, Yusheng; Mazloom, Reza; Ogunleye, Makanjuola (Virginia Tech, 2020-12-16)
    With the demand and abundance of information increasing over the last two decades, generations of computer scientists have been trying to improve the whole process of information searching, retrieval, and storage. With the diversification of information sources, users' requirements for the data have also changed drastically, both in terms of usability and performance. Due to the growth of the source material and requirements, correctly sorting, filtering, and storing data has given rise to many new challenges in the field. With the help of all four other teams on this project, we are developing an information retrieval, analysis, and storage system to retrieve data from Virginia Tech's Electronic Thesis and Dissertation (ETD), Twitter, and Web Page archives. We seek to provide an appropriate data research and management tool to the users to access specific data. The system will also give certain users the authority to manage and add more data to the system. This project's deliverable will be combined with four others to produce a system usable by Virginia Tech's library system to manage, maintain, and analyze these archives. This report introduces the system components and the design decisions regarding how the system has been planned and implemented. Our team has developed a front-end web interface that is able to search, retrieve, and manage three important content collection types: ETDs, tweets, and web pages. The interface incorporates a simple hierarchical user permission system, providing different levels of access to its users. In order to facilitate the workflow with other teams, we have containerized this system and made it available on the Virginia Tech cloud server. The system also makes use of a dynamic workflow system using a KnowledgeGraph and Apache Airflow, providing high levels of functional extensibility to the system. This allows curators and researchers to use containerized services for crawling, pre-processing, parsing, and indexing the custom corpora and collections that are available to them in the system.
  • CS 5604 2020: Information Storage and Retrieval TWT - Tweet Collection Management Team
    Baadkar, Hitesh; Chimote, Pranav; Hicks, Megan; Juneja, Ikjot; Kusuma, Manisha; Mehta, Ujjval; Patil, Akash; Sharma, Irith (Virginia Tech, 2020-12-16)
    The Tweet Collection Management (TWT) Team aims to ingest 5 billion tweets, clean this data, analyze the metadata present, extract key information, classify tweets into categories, and finally, index these tweets into Elasticsearch to support browsing and querying. The main deliverable of this project is a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects. As a starting point, we focused on two development goals: (1) hashtag-based and (2) username-based search for tweets. For IR1, we completed extraction of two fields within our sample collection: hashtags and username. Sample code for TwiRole, a user-classification program, was investigated for use in our project. We were able to sample from multiple collections of tweets, spanning topics like COVID-19 and hurricanes. Initial work encompassed using a sample collection, provided via Google Drive. NFS-based persistent storage was later used to allow access to larger collections. In total, we have developed 9 services to extract key information like username, hashtags, geo-location, and keywords from tweets. We have also developed services to allow for parsing and cleaning of raw API data, and backup of data in an Apache Parquet filestore. All services are Dockerized and added to the GitLab Container Registry. The services are deployed in the CS cloud cluster to integrate services into the full search engine workflow. A service was created to convert WARC files to JSON for reading archive files into the application. Unit testing of services is complete, and end-to-end tests have been conducted to improve system robustness and avoid failure during deployment. The TWT team has indexed 3,200 tweets into the Elasticsearch index. Future work could involve parallelization of the extraction of metadata, an alternative feature-flag approach, advanced geo-location inference, and adoption of the DMI-TCAT format. Key deliverables include a data body that allows for search, sort, filter, and visualization of raw tweet collections and metadata analysis; a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects; and a user guide to assist those using the system.
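    The field-extraction and Parquet-backup services described above amount to pulling hashtags, usernames, and cleaned text out of raw tweet JSON and writing them to columnar storage. The sketch below is a minimal illustration with pandas, not one of the team's nine services; the sample tweets, field names, and output path are assumptions, and Parquet output requires pyarrow.

```python
# Minimal tweet field-extraction sketch (illustrative; not the TWT team's services).
# The sample tweets, field names, and output path are assumptions; Parquet output needs pyarrow.
import re

import pandas as pd

raw_tweets = [
    {"id": 1, "user": {"screen_name": "vt_news"}, "text": "Eclipse viewing on the Drillfield #SolarEclipse"},
    {"id": 2, "user": {"screen_name": "storm_watch"}, "text": "Stay safe everyone #hurricane #weather"},
]


def extract(tweet: dict) -> dict:
    text = tweet["text"]
    return {
        "id": tweet["id"],
        "username": tweet["user"]["screen_name"],
        "hashtags": re.findall(r"#(\w+)", text),
        "mentions": re.findall(r"@(\w+)", text),
        "clean_text": re.sub(r"[#@]\w+|https?://\S+", "", text).strip(),
    }


df = pd.DataFrame([extract(t) for t in raw_tweets])
df.to_parquet("tweets_sample.parquet", index=False)  # Parquet backup, as in the abstract
print(df[["username", "hashtags"]])
```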
  • CS5604 Fall 2020: Electronic Thesis and Dissertation (ETD) Team
    Fan, Jiahui; Hardy, Nicolas; Furman, Samuel; Manzoor, Javaid; Nguyen, Alexander; Raghuraman, Aarathi (Virginia Tech, 2020-12-16)
    The Fall 2020 CS 5604 (Information Storage and Retrieval) class, led by Dr. Edward Fox, is building an information retrieval and analysis system that supports electronic theses and dissertations, tweets, and webpages. We are the Electronic Thesis and Dissertation Collection Management (ETD) team. The Virginia Tech Library maintains a corpus of 19,779 theses and 14,691 dissertations within the VTechWorks system. Our task was to research this data, implement data preprocessing and extraction techniques, index the data using Elasticsearch, and use machine learning techniques for each ETD's classification. These efforts were made in concert with teams working to process other content areas and build content-agnostic infrastructure. Prior work towards these tasks had been done in previous semesters of CS5604, and by students advised by Dr. Fox. That prior work serves as a foundation for our own work. Items of note were the works of Sampanna Kahu, Palakh Jude, and the Fall 2019 CS5604 CME team, which have been used either as part of our pipeline or as the starting point for our work. Our team divided the creation of an ETD IR system into five subtasks: verify metadata of past teams, ingest text, index using Elasticsearch, extract figures and tables, and classify full documents and chapters. Each member of the team was assigned to a role, and a timeline was created to keep everyone on track. Text ingestion was done via ITCore and is accessible through an API. Our team did not perform any base metadata extraction since the Fall 2019 CS5604 CME team had already done so; however, we still verified the quality of the data. Verification was done by hand and showed that most of the errors found in metadata from previous semesters were minor, but there were a few errors that could have led to misclassification. However, since those major errors were few and far between, we decided that, given the length of the project, we could continue to use this metadata, and we added improved metadata extraction to our future goals. For figure and table extraction, we incorporated the work of Kahu. For classification, we first implemented the work of Jude, who had done previous work related to chapter classification. In addition, we created another classifier that is more accurate. Both methods are available as accessible APIs. The latter classifier is also available as a microservice. In addition, an Elasticsearch service was created as a single point of contact between the pipeline and Elasticsearch. It acts as the final part of the pipeline; the processing is complete when the document is indexed into Elasticsearch. The final deliverable is a pipeline of containerized services that can be used to store and index ETDs. Staging of the pipeline was handled by the integration team using Airflow and a reasoner engine to control resource management and data flow. The entire pipeline is then accessible via a website created by the frontend team. Given that Airflow defines the application pipeline based on dependencies between services and our chapter extraction service failed to build due to MySQL dependencies, we were unable to deploy an end-to-end Airflow system. All other services have been unit tested on Git Runner, containerized, and deployed to cloud.cs.vt.edu following a CI/CD pipeline.
Future work includes expanding the available models for classification; expanding the available options for the extraction of text, figures, tables, and chapters; and adding more features that may be useful to researchers who would be interested in leveraging this pipeline. Another improvement would be to tackle some of the errors in metadata, such as that from previous teams.
  • Integration and Implementation (INT) CS5604 Fall 2019
    Agarwal, Rahul; Albahar, Hadeel; Roth, Eric; Sen, Malabika; Yu, Lixing (Virginia Tech, 2019-12-11)
    The first major goal of this project is to build a state-of-the-art information storage, retrieval, and analysis system that utilizes the latest technology and industry methods. This system is leveraged to accomplish the second major goal, supporting modern search and browse capabilities for two major content collections: (1) 200,000 ETDs (electronic theses and dissertations), and (2) 14 million settlement documents from the lawsuit wherein 39 U.S. states sued the major tobacco companies. The backbone of the information system is a Docker container cluster running with Rancher and Kubernetes. Information retrieval and visualization are accomplished with containers for Elasticsearch and Kibana, respectively. In addition to traditional searching and browsing, the system supports full-text and metadata searching. Search results include facets as a modern means of browsing among related documents. The system exercises text analysis and machine learning to reveal new properties of collection data. These new properties assist in the generation of available facets. Recommendations are also presented with search results based on associations among documents and with logged user activity. The information system is co-designed by 6 teams of Virginia Tech graduate students, all members of the same computer science class, CS 5604. Although the project is an academic exercise, it is the practice of the teams to work and interact as though they are groups within a company developing a product. These are the teams on this project: Collection Management ETDs (CME), Collection Management Tobacco Settlement Documents (CMT), Elasticsearch (ELS), Front-end and Kibana (FEK), Integration and Implementation (INT), and Text Analysis and Machine Learning (TML). This submission focuses on the work of the Integration (INT) team, which creates and administers Docker containers for each team in addition to administering the cluster infrastructure. Each container is a customized application environment that is specific to the needs of the corresponding team. For example, the ELS team container environment shall include Elasticsearch with its internally associated database. INT also administers the integration of the Ceph data storage system into the CS Department Cloud and provides support for interactions between containers and Ceph. During formative stages of development, INT also has a role in guiding team evaluations of prospective container components. Beyond the project formative stages, INT has the responsibility of deploying containers in a development environment according to mutual specifications agreed upon with each team. The development process is fluid. INT services team requests for new containers and updates to existing containers in a continuous integration process until the first system testing environment is completed. During the development stage INT also collaborates with the CME and CMT teams on the data pipeline subsystems for the ingestion and processing of new collection documents. With the testing environment established, the focus of the INT team shifts toward gathering of system performance data and making any systemic adjustments necessary based on the analysis of testing results. Finally, INT provides a production distribution that includes all embedded Docker containers and sub-embedded Git source code repositories. INT archives this distribution on Docker Hub and deploys it on the Virginia Tech CS Cloud.
  • Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019
    Kaushal, Kulendra Kumar; Kulkarni, Rutwik; Sumant, Aarohi; Wang, Chaoran; Yuan, Chenhan; Yuan, Liling (Virginia Tech, 2019-12-23)
    The class “CS 5604: Information Storage and Retrieval” in the fall of 2019 was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from Virginia Tech University Libraries’ VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing a custom front-end, Kibana, integration, implementation, text analytics, and machine learning. The results of our work will help future researchers study the natural-language-processed data using deep learning technologies and address the challenges of extracting information from ETDs. The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used advanced deep learning and other tools like GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; doing document segmentation; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. We finally developed a system that automates all the above-mentioned tasks. The system also validates the output metadata, thereby ensuring the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch.
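    GROBID, mentioned above, is typically run as a service and called over HTTP to turn a PDF into TEI XML. The sketch below is a minimal, illustrative client rather than the CME team's pipeline; the local server URL and PDF path are assumptions.

```python
# Minimal GROBID client sketch (illustrative; not the CME team's pipeline).
# Assumes a GROBID server running locally; the URL and PDF path are assumptions.
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def pdf_to_tei(pdf_path: str) -> str:
    """Send a PDF to GROBID and return the TEI XML it produces."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=300)
    resp.raise_for_status()
    return resp.text  # TEI XML containing metadata, sections, and references


if __name__ == "__main__":
    tei_xml = pdf_to_tei("sample_etd.pdf")
    print(tei_xml[:500])  # downstream code would parse this into chapter-wise text
```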
  • Collection Management Tobacco Settlement Documents (CMT) CS5604 Fall 2019
    Muhundan, Sushmethaa; Bendelac, Alon; Zhao, Yan; Svetovidov, Andrei; Biswas, Debasmita; Marin Thomas, Ashin (Virginia Tech, 2019-12-11)
    Consumption of tobacco causes health issues, both mental and physical. Despite this widely known fact, tobacco companies sustained a huge presence in the market over the past century owing to a variety of successful marketing strategies. This report documents the work of the Collection Management Tobacco Settlement Documents (CMT) team, the data ingestion team for the tobacco documents. We deal with an archive of tobacco documents that were produced during litigation between the United States and seven major tobacco industry organizations. Our aim is to process these documents and assist Dr. David M. Townsend, an assistant professor at Virginia Polytechnic Institute and State University (Virginia Tech) Pamplin College of Business, in his research toward understanding the marketing strategies of the tobacco companies. The team is part of a larger initiative: to build a state-of-the-art information retrieval and analysis system. We handle over 14 million tobacco settlement documents as part of this project. Our tasks include extracting the data as well as metadata from these documents. We cater to the needs of the Elasticsearch (ELS) team and the Text Analytics and Machine Learning (TML) team. We provide tobacco settlement data in suitable formats to enable them to process and feed the data into the information retrieval system. We have successfully processed both the metadata and the document texts into a usable format. For the metadata, this involved collaborating with the above-mentioned teams to come up with a suitable format. We retrieved the metadata from a MySQL database and converted it into JSON for Elasticsearch ingestion. For the data, this involved lemmatization, tokenization, and text cleaning. We have supplied the entire dataset to the ELS and TML teams. Data, as well as metadata of these documents, were cleaned and provided. Python scripts were used to query the database and output the results in the required format. We also closely interacted with Dr. Townsend to understand his research needs in order to guide the Front-end and Kibana (FEK) team in terms of insights about features that can be used for visualizations. In this way, the information retrieval system we build will add more value for our client.
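    The metadata conversion described above (query MySQL, emit JSON suitable for Elasticsearch ingestion) follows a common pattern. The sketch below shows one minimal way to do it with mysql-connector-python, writing newline-delimited JSON; it is not the team's script, and the connection settings, table, and columns are assumptions.

```python
# Minimal MySQL-to-JSON sketch (illustrative; not the CMT team's actual scripts).
# Connection settings, table name, and columns are assumptions; requires mysql-connector-python.
import json

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="reader", password="secret", database="tobacco"
)
cursor = conn.cursor(dictionary=True)  # rows come back as dicts keyed by column name
cursor.execute("SELECT id, title, author, doc_date FROM documents LIMIT 1000")

# Write newline-delimited JSON, a convenient shape for bulk ingestion pipelines.
with open("tobacco_metadata.jsonl", "w", encoding="utf-8") as out:
    for row in cursor:
        out.write(json.dumps(row, default=str) + "\n")

cursor.close()
conn.close()
```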
  • Front-End Kibana (FEK) CS5604 Fall 2019
    Powell, Edward; Liu, Han; Huang, Rong; Sun, Yanshen; Xu, Chao (Virginia Tech, 2020-01-13)
    During the last two decades, web search engines have been driven to new quality levels due to the continuous efforts made to optimize the effectiveness of information retrieval. More and more people are satisfied with their information retrieval experiences, and web search has gradually replaced older methods, where people obtained information from each other or from libraries. Information retrieval systems are in constant interaction with users and help users interpret and analyze data. Currently, we are building the front end of a search engine, where users can explore information related to Tobacco Settlement documents from the University of California, San Francisco, as well as the Electronic Theses and Dissertations (ETDs) of Virginia Tech (and possibly other sites). This submission introduces the current work of the front-end team to build a functional user interface, which is one of the key components of a larger project to build a state-of-the-art search engine for two large datasets. We also seek to understand how users search for data, and accordingly provide the users with more insight and utilities from the two datasets with the help of the visualization tool Kibana. A search website where users can explore the two datasets, the Tobacco Settlement dataset and the ETD dataset, has already been created. A number of functions of the search page have been implemented, for instance the login system, searching, filters, a Q&A page, and a visualization page.
  • Elasticsearch (ELS) CS5604 Fall 2019
    Li, Yuan; Chekuri, Satvik; Hu, Tianrui; Kumar, Soumya Arvind; Gill, Nicholas (Virginia Tech, 2019-12-12)
    We are building an Information Retrieval System that will work as a search engine to support searching, ranking, browsing, and recommendations for two large collections of data. The first collection is part of Virginia Tech's collection of Electronic Theses and Dissertations (ETDs). The Virginia Tech Library has a large collection of ETDs, and an effort is currently being made to digitize the pre-1997 theses and dissertations and load them into VTechWorks. Our data set contains over 30K ETDs. The second collection is of tobacco settlement documents; there are 14 million documents in this data set. We are using a Ceph container to store and retrieve information. To achieve its goals, the project has six teams: Collection Management ETDs, Collection Management Tobacco Settlement Documents, Elasticsearch, Front-end and Kibana, Integration and Implementation, and Text Analytics and Machine Learning. This report addresses the work performed by the Elasticsearch team. The Elasticsearch team helps to enable searching and browsing, which are supported by facets associated with information extracted from documents through analysis, classification, clustering, summarization, and other processing. The report describes the goals, gives an overview, and explains the process of implementation with Elasticsearch. The Elasticsearch team works closely with the Front-end and Kibana and Text Analytics and Machine Learning groups. The data ingested into Elasticsearch is provided to the Front End team for further visualization. Thus, the report also describes the connections established with the other groups, as a high-level overview of the course project. User manuals have been provided for the reference of the other groups.
  • Text Analytics and Machine Learning (TML) CS5604 Fall 2019
    Mansur, Rifat Sabbir; Mandke, Prathamesh; Gong, Jiaying; Bharadwaj, Sandhya M.; Juvekar, Adheesh Sunil; Chougule, Sharvari (Virginia Tech, 2019-12-29)
    In order to use the burgeoning amount of data for knowledge discovery, it is becoming increasingly important to build efficient and intelligent information retrieval systems. The challenge in information retrieval lies not only in fetching the documents relevant to a query but also in ranking them in the order of relevance. The large size of the corpora as well as the variety in the content and the format of information pose additional challenges in the retrieval process. This calls for the use of text analytics and machine learning techniques to analyze and extract insights from the data to build an efficient retrieval system that enhances the overall user experience. With this background, the goal of the Text Analytics and Machine Learning team is to suitably augment the document indexing and demonstrate a qualitative improvement in document retrieval. Further, we also plan to make use of document browsing and viewing logs to provide meaningful recommendations to the user. The goal of the class is to build an end-to-end information retrieval system for two document corpora, viz., Electronic Theses & Dissertations (ETDs) and Tobacco Settlement Records (TSRs). The ETDs are a collection of over 33,000 thesis and dissertation documents in VTechWorks at Virginia Tech. The challenge in building a retrieval system around this corpus lies in the distinct nature of ETDs as opposed to other well studied document formats such as conference/journal publications and web-pages. The TSR corpus consists of over 14M records covering formats ranging from letters and memos to image-based advertisements. We seek to understand the nature of both these corpora as well as the information need patterns of the users in order to augment the index based search with domain specific information using machine learning based methods. Extending prior experiments, we investigate reasons for the unbalanced nature of the clusters from the previous iterations of the K-Means algorithm on the tobacco data. In addition, we explore and present preliminary results of running Agglomerative Clustering on a small subset of the tobacco data. We also explored different pre-trained models for detecting sentiment. We identified a package, empath, that shows better results in identifying emotions in the tobacco deposition documents. Furthermore, we implemented text summarization based on both Latent Semantic Analysis and the Luhn Algorithm on the tobacco (article) data (38,038 documents). We also implemented text summarization on a sample ETD chapter dataset.
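    The K-Means experiments described above cluster documents represented as TF-IDF vectors. The sketch below shows that pattern on a toy corpus; it is illustrative only, not the TML team's experiments, and the corpus and number of clusters are assumptions.

```python
# Minimal document-clustering sketch (illustrative; not the TML team's experiments).
# The toy corpus and number of clusters are assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "internal memo on cigarette advertising budgets",
    "letter regarding a television advertising campaign",
    "deposition transcript on nicotine research",
    "laboratory report on nicotine and tar levels",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for label, doc in zip(km.labels_, docs):
    print(label, doc)
# Counting how many documents fall under each label is one simple way to spot the
# unbalanced clusters the abstract above mentions.
```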
  • Collection Management Tweets Project Fall 2017
    Khaghani, Farnaz; Zeng, Junkai; Bhuiyan, Momen; Tabassum, Anika; Bandyopadhyay, Payel (Virginia Tech, 2018-01-17)
    The report included in this submission documents the work by the Collection Management Tweets (CMT) team, which is a part of the bigger effort in CS5604 on building a state-of-the-art information retrieval and analysis system for the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects. The mission of the CMT team had two parts: 1) Cleaning 6.2 million tweets from two 2017 event collections named "Solar Eclipse" and "Las Vegas Shooting", and loading them into HBase, an open source, non-relational, distributed database that runs on the Hadoop distributed file system, in support of further use; and 2) Building and storing a social network for the tweet data using a triple-store. For the first part, our work included: A) Making use of the work done by the previous year's class group, where incremental updates were done, to introduce a faster development process of data collection and storing; B) Improving the performance of work done by the group from last year. Previously, the cleaning part, e.g., removing profane words, plus extracting hashtags and mentions, utilized Python. This becomes very slow when the dataset scales up. We introduced parallelization in our tweet cleaning process with the help of Scala and the Hadoop cluster, and made use of different Natural Language Processing libraries for stop word and profanity removal; C) Along with tweet cleaning, we also identified and stored Named Entity Recognition (NER) entries and Part-of-Speech (POS) tags with the tweets, which had not been done by the previous team. The cleaned data in HBase from this task is provided to the Classification team for spam detection and to the Clustering and Topic Analysis team for topic analysis. The Collection Management Webpage team uses the extracted URLs from the tweets for further processing. Finally, after the data is indexed by the SOLR team, the Front-End team visualizes the tweets for users, and provides access for searching and browsing. In addition to the aforementioned tasks, our responsibilities also included building a network of tweets. This entailed doing research into the types of databases that are appropriate for this graph. For storing the network, we used a triple-store database to record different types of edges and relationships in the graph. We also researched methods for ascribing importance to nodes and edges in our social networks once they were constructed, and analyzed our networks using these techniques.
  • CS5604 Information Storage and Retrieval Fall 2017 Solr Report
    Kumar, Abhinav; Bangad, Anand; Robertson, Jeff; Garg, Mohit; Ramesh, Shreyas; Mi, Siyu; Wang, Xinyue; Wang, Yu (Virginia Tech, 2018-01-15)
    The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets and millions of webpages for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. We are using a 21 node Cloudera Hadoop cluster to store and retrieve this information. One goal of this project is to expand the data collection to include more web archives and geospatial data beyond what previously had been collected. Another important part of this project is optimizing the current system to analyze and allow access to the new data. To accomplish these goals, this project is separated into 6 parts with corresponding teams: Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Front-end (FE), and SOLR. The report describes the work completed by the SOLR team, which improves the current searching and storage system. We include the general architecture and an overview of the current system. We present the part that Solr plays within the whole system in more detail. We discuss our goals, procedures, and conclusions on the improvements we made to the current Solr system. This report also describes how we coordinate with other teams to accomplish the project at a higher level. Additionally, we provide manuals for future readers who might need to replicate our experiments. The main components within the Cloudera Hadoop cluster that the SOLR team interacts with include: the Solr search engine, HBase database, Lily indexer, Hive database, HDFS file system, Solr recommendation plugin, and Mahout. Our work focuses on HBase design, data quality control, search recommendations, and result ranking. Overall, throughout the semester, we have processed 12,564 web pages and 5.9 million tweets. In order to cooperate with GeoBlacklight, we made major changes to the Solr schema. We also function as a data quality control gateway for the Front End team and deliver the finalized data to them. For search recommendations, we provide the MoreLikeThis plugin within Solr for recommending related records from search results, as well as a custom recommendation system based on user behavior to provide user-based search recommendations. After fine-tuning over the final weeks of the semester, we successfully connected the data provided by other teams and delivered it to the front end through a Solr core.
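    The MoreLikeThis recommendation mentioned above is usually exercised by adding MLT parameters to a Solr query. The sketch below sends such a query with the requests library and prints the similar documents Solr returns; it is illustrative only, not the team's configuration, and the core name, host, and field names are assumptions.

```python
# Minimal Solr MoreLikeThis query sketch (illustrative; not the SOLR team's setup).
# The core name, host, and field names are assumptions.
import requests

SOLR_SELECT = "http://localhost:8983/solr/tweets/select"

params = {
    "q": "id:tweet_12345",   # seed document to find similar records for
    "mlt": "true",            # enable the MoreLikeThis component
    "mlt.fl": "text",         # field(s) used for similarity
    "mlt.count": 5,           # number of similar documents to return
    "wt": "json",
}

resp = requests.get(SOLR_SELECT, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

# Similar documents come back under "moreLikeThis", keyed by the seed document id.
for seed_id, similar in data.get("moreLikeThis", {}).items():
    for doc in similar["docs"]:
        print(seed_id, "->", doc.get("id"))
```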
  • CS5604 Fall 2017 Clustering and Topic Analysis
    Baghudana, Ashish; Ahuja, Aman; Bellam, Pavan; Chintha, Rammohan; Sambaturu, Pratyush; Malpani, Ashish; Shetty, Shruti; Yang, Mo (Virginia Tech, 2018-01-13)
    One of the key objectives of the CS-5604 course titled Information Storage and Retrieval is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and develop a retrieval system to extract information from that archive. Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project. Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework, and finally, outline a user manual describing the fields that we populate in HBase.
  • CS5604 Fall 2017 Classification Team Submission
    Azizi, Ahmadreza; Mulchandani, Deepika; Naik, Amit; Ngo, Khai; Patil, Suraj; Vezvaee, Arian; Yang, Robin (Virginia Tech, 2018-01-03)
    This project submission includes the work of the 'Classification' team of the CS5604 'Information Storage and Retrieval' course of Fall 2017 toward the GETAR project. Classification of the GETAR data would allow users to analyze, visualize, and explore content related to crises, disasters, human rights, inequality, population growth, shootings, violence, etc. Binary classification models were trained for different events for both tweet and webpage collections. Word2Vec was used as the feature extraction technique, and the Word2Vec model was trained on the entire corpus available. Logistic Regression was used as our classification technique. As part of this submission, we detail our classification framework and the experiments that we conducted. We also give an insight into the challenges we faced, how we overcame those challenges, and what we learned in the process. We also provide the code that we implemented and the models that were built to classify 1,562,215 tweets and 4,366 webpages.
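    The Word2Vec-plus-Logistic-Regression approach described above is commonly implemented by averaging word vectors into a document vector and fitting a linear classifier on top. The sketch below illustrates that pattern with gensim and scikit-learn; it is not the team's models, and the toy corpus, labels, and hyperparameters are assumptions.

```python
# Minimal Word2Vec + Logistic Regression sketch (illustrative; not the CLA team's models).
# The toy corpus, labels, and hyperparameters are assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

texts = [
    "hurricane causes severe flooding along the coast",
    "storm surge floods several coastal towns",
    "new library opens on campus this fall",
    "university announces new engineering program",
]
labels = [1, 1, 0, 0]  # 1 = event-related, 0 = not related

tokens = [t.split() for t in texts]
# vector_size is the gensim 4.x parameter name (older versions call it `size`).
w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1, epochs=50, seed=0)


def doc_vector(words):
    """Average the word vectors of a document (ignoring out-of-vocabulary words)."""
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)


X = np.vstack([doc_vector(t) for t in tokens])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # sanity check on the training data
```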