CS5604: Information Retrieval

This collection contains the final projects of students in various offerings of the course Computer Science 5604: Information Retrieval, taught by Professor Ed Fox. The course covers analyzing, indexing, representing, storing, searching, retrieving, processing, and presenting information and documents using fully automatic systems. The information may be in the form of text, hypertext, multimedia, or hypermedia. The systems are based on various models, such as Boolean logic, fuzzy logic, and probability theory, and they are implemented using inverted files, relational thesauri, special hardware, and other approaches. The course also covers evaluation of the systems' efficiency and effectiveness.

Recent Submissions

Showing 1–20 of 51 submissions
  • Team 3: Object Detection and Topic Modeling (Objects&Topics) CS 5604 F2022
    Devera, Alan; Sahu, Raj; Masrourisaadat, Nila; Amirthalingam, Nirmal; Mao, Chenyu (Virginia Tech, 2023-01-17)
    The CS 5604: Information Storage and Retrieval class (Fall 2022), led by Dr. Edward Fox, has been assigned the task of designing and implementing a state-of-the-art information retrieval and analysis system that will support Electronic Theses & Dissertations (ETDs). Given a large collection of ETDs, we want to run different kinds of learning algorithms to categorize them into logical groups and, by the end, be able to suggest to an end user the documents that are strongly related to the one they are looking for. The overall goal of the project is to have a service that can upload, search, and retrieve ETDs with their derived digital objects, in a human-readable format. Specifically, our team is tasked with analyzing documents using object detection and topic models, with the final deliverable being the Experimenter web page for the derived objects and topics. The object detection team worked with Faster R-CNN and YOLOv7 models, and implemented post-processing rules for saving objects in a structured format. As the final deliverable for object detection, inference on 5k ETDs has been completed, and the refined objects have been saved to the Repository. The topic modeling team worked on clustering ETDs into 10, 25, 50, and 100 topics with different models (LDA, NeuralLDA, CTM, ProdLDA). As the final deliverable for topic modeling, we store the related topics and related documents for 5k ETDs in the Team 1 database, so that Team 2 can provide the related topics and documents on the documents page. By the end of the semester, the team was able to deliver the Experimenter web page for the derived objects and topics, and the related objects and topics for 5k ETDs stored in the Team 1 database.
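    To give a concrete sense of the topic modeling workflow this abstract refers to, the sketch below fits a small LDA model with scikit-learn. It is illustrative only and not the team's pipeline; the toy corpus, vectorizer settings, and topic count are assumptions chosen for brevity.

```python
# Minimal LDA topic-modeling sketch (illustrative only; not the Team 3 pipeline).
# The tiny in-memory corpus, topic count, and vectorizer settings are assumptions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

etd_abstracts = [
    "deep learning models for image classification and object detection",
    "finite element analysis of structural beams under dynamic load",
    "topic models and clustering for large document collections",
]

# Bag-of-words features; real ETD text would need cleaning and richer preprocessing.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(etd_abstracts)

# Fit a small LDA model; the report experiments with 10, 25, 50, and 100 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

# Print the top words per topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```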
  • Team 2 for End Users
    Paidiparthy, Manoj Prabhakar; Ramanujan, Ramaraja; Teegalapally, Akshita; Muralikrishnan, Madhuvanti; Balar, Romil Khimraj; Juvekar, Shaunak; Murali, Vivek (Virginia Tech, 2023-01-11)
    A large collection of Electronic Theses and Dissertations (ETDs) holds valuable information. However, accessing the information in these documents has proven challenging, as the process is mostly manual. We propose to build a unique Information Retrieval System that will support searching, ranking, browsing, and recommendations for a large collection of ETDs. The system indexes the digital objects related to each ETD, such as documents and chapters. The user can then query the indexed objects through a carefully designed web interface. The web interface provides users with utilities to sort, filter, and query specific fields. We have incorporated machine learning models to support semantic search. To enhance user engagement, we provide the user with a list of recommended documents based on the user's actions and topics of interest. A total of 57,130 documents and 21,537 chapters were indexed. The system was tested by the Fall 2022 CS 5604 class, which had 28 members, and was found to fulfill most of the goals set out at the beginning of the semester.
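    The indexing and querying described above is the kind of workload commonly handled with an Elasticsearch client. The sketch below is a minimal illustration, not Team 2's code; the index name, document fields, and localhost endpoint are assumptions (Python client 8.x assumed).

```python
# Minimal Elasticsearch index-and-search sketch (illustrative; not Team 2's system).
# The index name, document fields, and localhost endpoint are assumptions.
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x assumed

es = Elasticsearch("http://localhost:9200")

# Index a single ETD-like document; a real pipeline would bulk-index thousands.
doc = {
    "title": "A Study of Neural Topic Models",
    "abstract": "We compare LDA and neural topic models on a large ETD corpus.",
    "year": 2022,
}
es.index(index="etds", id="etd-0001", document=doc)
es.indices.refresh(index="etds")

# Full-text query over title and abstract, the kind of search a web UI would issue.
resp = es.search(
    index="etds",
    query={"multi_match": {"query": "topic models", "fields": ["title", "abstract"]}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```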
  • CS5604: Team 1 ETD Collection Management
    Jain, Tanya; Bhagat, Hirva; Lee, Wen-Yu; Thukkaraju, Ashrith Reddy; Sethi, Raghav (Virginia Tech, 2023-01-13)
    Academic institutions the world over produce hundreds of thousands of ETDs (Electronic Theses and Dissertations) every year. At the end of an academic year, we are left with large volumes of ETD data that are rarely used for further research or cited in future work, writings, or publications. As part of the CS5604: Information Storage and Retrieval graduate-level course at Virginia Polytechnic Institute and State University (Virginia Tech), we collectively created a search engine for a collection of more than 500,000 ETDs from academic institutions in the United States, which constitutes the class-wide project. This system enables users to ingest, pre-process, and store ETDs in a repository, and to apply deep learning models to perform topic modeling, text segmentation, chapter summarization, and classification, backed by a DevOps, user experience, and integrations team. We are Team 1, the “ETD Collection Management” team. During the Fall 2022 semester at Virginia Tech, we were responsible for setting up the repository of ETDs, which broadly encompasses the following three components: (1) setting up a database, (2) storing digital objects in a file system, and (3) creating a knowledge graph. Our work enabled other teams to efficiently retrieve the stored ETD data, perform appropriate pre-processing operations, and, during the final few months of the semester, apply the aforementioned deep learning models to the ETD collection we created. The key deliverable for Team 1 was an interactive user interface for performing CRUD operations (create, retrieve, update, and delete) against the repository of ETDs, which extends the work already taken up at Virginia Tech’s Digital Library Research Laboratory. Because the other teams had no direct access to the repository we set up, we designed a set of Application Programming Interfaces (APIs), which are elaborated in depth in the subsequent sections of the report. The end goal for Team 1 was an accessible repository of ETDs that can be used for further research, given that each ETD is a well-curated resource that can serve as an excellent asset for in-depth analysis of a topic, not limited to academic or research purposes.
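    The CRUD-style APIs this abstract refers to could take many forms; the sketch below shows one minimal way to expose create/retrieve/update/delete endpoints over an in-memory store using FastAPI. It is purely illustrative and not Team 1's actual API; the routes, fields, and storage are assumptions.

```python
# Minimal CRUD-style API sketch (illustrative; not Team 1's actual API).
# Routes, field names, and the in-memory store are assumptions.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
etds: dict[str, dict] = {}  # stand-in for the real database / file system


class ETD(BaseModel):
    title: str
    author: str
    year: int


@app.post("/etds/{etd_id}")
def create_etd(etd_id: str, etd: ETD):
    etds[etd_id] = etd.model_dump()  # pydantic v2
    return {"id": etd_id, **etds[etd_id]}


@app.get("/etds/{etd_id}")
def read_etd(etd_id: str):
    if etd_id not in etds:
        raise HTTPException(status_code=404, detail="ETD not found")
    return etds[etd_id]


@app.put("/etds/{etd_id}")
def update_etd(etd_id: str, etd: ETD):
    if etd_id not in etds:
        raise HTTPException(status_code=404, detail="ETD not found")
    etds[etd_id] = etd.model_dump()
    return etds[etd_id]


@app.delete("/etds/{etd_id}")
def delete_etd(etd_id: str):
    etds.pop(etd_id, None)
    return {"deleted": etd_id}
```

    Served with a standard ASGI server such as uvicorn, endpoints like these are the kind of interface other teams' services would call.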
  • CS5604 Fall 2022 - Team 5 INT
    Shukla, Anmol; Travasso, Aaron; Manogaran, Harish Babu; Sisodia, Pallavi Kishor; Li, Yuze (Virginia Tech, 2022-01-08)
    The primary objective of the project is to build a state-of-the-art system to search and retrieve relevant information effectively from a large corpus of electronic theses and dissertations. The system is targeted toward documents such as academic textbooks, dissertations, and theses, where the amount of information is enormous compared to the websites or blogs that conventional search engines are equipped to handle effectively. The work involved in developing the system has been divided into five areas: data management (Team-1, Curator); search and retrieval (Team-2, User); object detection and topic analysis (Team-3, Objects & Topics); language models, classification, summarization, and segmentation (Team-4, Classification & Summarization); and lastly integration (Team-5, Integration). The teams and their operations are structured to mirror the environment of a company working on new product development. The Integration (INT) team focuses on setting up work environments with all requirements for the teams, integrating the work done by the other four teams, and deploying suitable Docker containers for seamless operation (workflow), along with maintaining the cluster infrastructure. The INT team archives this distribution of code and containers on the Virginia Tech Docker Container Registry and deploys it on the Virginia Tech CS Cloud. The INT team also guides team evaluations of prospective container components and workflows. Additionally, the team implements continuous integration and continuous deployment to enable seamless integration, building, and testing of code as it is developed. Furthermore, the team works on setting up a workflow management system that employs Apache Airflow to automate the creation, scheduling, and monitoring of workflows. We have created customized containers for each team based on their individual requirements. We have developed a workflow management system using Apache Airflow that creates and manages workflows to achieve the goals of each team, such as indexing, object detection, segmentation, summarization, and classification. We have also implemented a Continuous Integration and Continuous Deployment (CI/CD) pipeline to automatically create, test, and deploy the updated image whenever a new push is made to a Git repository. Additionally, we extended our support to other teams in troubleshooting the issues they faced in deployment. Our current cluster statistics (i.e., Kubernetes Resource Definitions) are: 45 deployments, 40 ingresses, 39 pods, 180 services, and 13 secrets. Lastly, the INT team would like to express its gratitude for the work of the INT-2020 team and the predecessors who did the substantial work upon which we built, and to acknowledge here their significant contribution.
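    The Airflow-based workflow management mentioned above boils down to defining DAGs of tasks. The sketch below is a minimal, illustrative DAG, not the INT team's actual workflow; the task names, schedule, and no-op callable are assumptions (Airflow 2.x assumed).

```python
# Minimal Apache Airflow DAG sketch (illustrative; not the INT team's actual workflows).
# Task names, schedule, and the no-op Python callable are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_step(step: str):
    # A real task would trigger a team's containerized service instead of printing.
    print(f"running {step}")


with DAG(
    dag_id="etd_processing_sketch",
    start_date=datetime(2022, 9, 1),
    schedule_interval=None,  # Airflow 2.x; triggered manually in this sketch
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=run_step, op_args=["ingest"])
    detect = PythonOperator(task_id="object_detection", python_callable=run_step, op_args=["object_detection"])
    summarize = PythonOperator(task_id="summarization", python_callable=run_step, op_args=["summarization"])
    index = PythonOperator(task_id="indexing", python_callable=run_step, op_args=["indexing"])

    # Ingest first, then the analysis steps in parallel, then indexing.
    ingest >> [detect, summarize] >> index
```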
  • Team 4: Segmentation, Summarization, and Classification
    Ganesan, Kaushik; Nanjundan, Deepak; Srivastava, Deval; Neog, Abhilash; Jayaprakash, Dharneeshkar; Shah, Aditya (Virginia Tech, 2023-01-11)
    Under the guidance of Dr. Edward A. Fox, the CS5604 class of the Fall 2022 semester at Virginia Tech was assigned the task of building an Information Retrieval and Analysis System that can support a collection of at least 5000 Electronic Theses and Dissertations (ETDs). The system would act as a search engine, supporting a number of features, such as searching, providing recommendations, ranking search results, and browsing. In order to achieve this, the class was divided into five teams, each assigned separate tasks with the intent of collaborating through CI/CD. The roles can be described as follows: Content and Representation, End-user Recommendation and Search, Object Detection and Topic Models, Classification and Summarization with Language Models, and Integration and Coordination. The intent of this report is to outline the contribution of Team 4, which focuses on language models, classification, summarization, and segmentation. In this project, Team 4 was successful in reproducing Akbar Javaid Manzoor’s pipeline to segment ETDs into chapters, summarize the segmented chapters using extractive and abstractive summarization techniques, and classify the chapters using deep learning and language models. Using the APIs developed by Team 1, Team 4 was also tasked with storing the outcomes for 5000 ETDs in the file system and database. Team 4 containerized the services and assisted Team 5 with workflow automation to help automate the services. The project’s main lessons were effective team collaboration, efficient code maintenance, containerization of services, upkeep of a CI/CD workflow, and finally effective information storage and retrieval at scale. The report describes the goals, tasks, and achievements, along with our coordination with the other teams in completing the higher-level tasks concerning the entire project.
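    As a small illustration of the extractive side of the summarization work described above, the sketch below scores sentences by mean TF-IDF weight and keeps the top-ranked ones. It is not Team 4's method (which relies on language models); the sentence splitter, scoring scheme, and example text are assumptions.

```python
# Minimal extractive-summarization sketch (illustrative; not Team 4's language models).
# Scores sentences by mean TF-IDF weight and keeps the top k; all settings are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def extractive_summary(text: str, k: int = 2) -> str:
    # Naive sentence splitting; a real pipeline would use a proper sentence tokenizer.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) <= k:
        return text
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    top = sorted(np.argsort(scores)[::-1][:k])  # keep the original sentence order
    return ". ".join(sentences[i] for i in top) + "."


chapter = (
    "This chapter describes the segmentation pipeline. "
    "Chapters are detected from the table of contents. "
    "Each chapter is then summarized and classified. "
    "Results are stored through the Team 1 APIs."
)
print(extractive_summary(chapter, k=2))
```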
  • Integration and Implementation (INT) CS 5604 F2020
    Hicks, Alexander; Thazhath, Mohit; Gupta, Suraj; Long, Xingyu; Poland, Cherie; Hsieh, Hsinhan; Mahajan, Yash (Virginia Tech, 2020-12-18)
    The first major goal of this project is to build a state-of-the-art information storage, retrieval, and analysis system that utilizes the latest technology and industry methods. This system is leveraged to accomplish another major goal, supporting modern search and browse capabilities for a large collection of tweets from the Twitter social media platform, web pages, and electronic theses and dissertations (ETDs). The backbone of the information system is a Docker container cluster running with Rancher and Kubernetes. Information retrieval and visualization are accomplished with containers in a pipelined fashion, whether in the cluster or on virtual machines, for Elasticsearch and Kibana, respectively. In addition to traditional searching and browsing, the system supports full-text and metadata searching. Search results include facets as a modern means of browsing among related documents. The system supports text analysis and machine learning to reveal new properties of collection data. These new properties assist in the generation of available facets. Recommendations are also presented with search results based on associations among documents and with logged user activity. The information system is co-designed by five teams of Virginia Tech graduate students, all members of the same computer science class, CS 5604. Although the project is an academic exercise, it is the practice of the teams to work and interact as though they are groups within a company developing a product. The teams on this project include three collection management groups -- Electronic Theses and Dissertations (ETD), Tweets (TWT), and Web-Pages (WP) -- as well as the Front-end (FE) group and the Integration (INT) group, which helps provide the overarching structure for the application. This submission focuses on the work of the Integration (INT) team, which creates and administers Docker containers for each team in addition to administering the cluster infrastructure. Each container is a customized application environment that is specific to the needs of the corresponding team. Each team has several of these containers set up in a pipeline formation to allow scaling and extension of the current system. The INT team also contributes to a cross-team effort exploring the use of Elasticsearch and its internally associated database. The INT team administers the integration of the Ceph data storage system into the CS Department Cloud and provides support for interactions between containers and the Ceph filesystem. During formative stages of development, the INT team also has a role in guiding team evaluations of prospective container components and workflows. The INT team is responsible for the overall project architecture and for facilitating the tools and tutorials that assist the other teams in deploying containers in a development environment according to mutual specifications agreed upon with each team. The INT team maintains the status of the Kubernetes cluster, deploying new containers and pods as needed by the collection management teams as they expand their workflows. This team is responsible for utilizing a continuous integration process to update existing containers. During the development stage the INT team collaborates specifically with the collection management teams to create the pipeline for the ingestion and processing of new collection documents, crossing services between those teams as needed.
The INT team develops a reasoner engine to construct workflows with an information goal as input; the workflows are then programmatically authored, scheduled, and monitored using Apache Airflow. The INT team is responsible for the flow, management, and logging of system performance data and for making any adjustments necessary based on the analysis of testing results. The INT team has established a GitLab repository for archival code related to the entire project and has provided the other groups with the documentation needed to deposit their code in the repository. This repository will be expanded using GitLab CI in order to provide continuous integration and testing once it is available. Finally, the INT team will provide a production distribution that includes all embedded Docker containers and sub-embedded Git source code repositories. The INT team will archive this distribution on the Virginia Tech Docker Container Registry and deploy it on the Virginia Tech CS Cloud. The INT-2020 team owes a sincere debt of gratitude to the work of the INT-2019 team. This is a very large undertaking, and the wrangling of all of the products and processes would not have been possible without their guidance in both direct and written form. We have relied heavily on the foundation they and their predecessors have provided for us. We continue their work with systematic improvements, but also want to acknowledge their efforts here. Without them, our progress to date would not have been possible.
  • CS 5604: Information Storage and Retrieval - Webpages (WP) Team
    Barry-Straume, Jostein; Vives, Cristian; Fan, Wentao; Tan, Peng; Zhang, Shuaicheng; Hu, Yang; Wilson, Tishauna (Virginia Tech, 2020-12-18)
    The first major goal of this project is to build a state-of-the-art information retrieval engine for searching webpages and for opening up access to existing and new webpage collections resulting from Digital Library Research Laboratory (DLRL) projects relating to eventsarchive.org. The task of the Webpage (WP) team was to provide the functionality of making any archived webpage accessible and indexed. The webpages can be obtained either through event-focused crawlers or from collections of data, such as WARC files containing webpages or sets of tweets which contain embedded URLs. Toward completion of the project, the WP team worked on four major tasks: (1) making the contents of WARC files searchable through Elasticsearch; (2) making the cleaned contents of WARC files searchable through Elasticsearch; (3) running an event-focused crawler and producing WARC files; and (4) making additional extracted/derived information (e.g., dates, classes) searchable. The foundation of the software is a Docker container cluster employing Airflow, a Reasoner, and Kubernetes. The raw data of the information content of the given webpage collections is stored using the Network File System (NFS), while Ceph is used for persistent storage for the Docker containers. Retrieval and analysis of the webpage collection are carried out with Elasticsearch, and visualization with Kibana. These two technologies form an Elastic Stack application which serves as the vehicle with which the WP team indexes, maps, and stores the processed data and model outputs with regard to webpage collections. The software is co-designed by 7 Virginia Tech graduate students, all members of the same computer science class, CS 5604: Information Storage and Retrieval. The course is taught by Professor Edward A. Fox. Dr. Fox structures the class in a way for his students to perform in a “mock” business development setting. In other words, the academic project submitted by the WP team can, for all intents and purposes, be viewed as a microcosm of software development within a corporate structure. This submission focuses on the work of the WP team, which creates and administers Docker containers such that various services are tested and deployed as a whole. Said services pertain solely to the ingestion, cleansing, analysis, extraction, classification, and indexing of webpages and their respective content.
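    The WARC ingestion step described above is commonly done with a WARC-reading library. The sketch below uses warcio to iterate over HTTP response records and pull out each page's URL and payload; it is illustrative only, not the WP team's ingestion code, and the file path is an assumption.

```python
# Minimal WARC-reading sketch (illustrative; not the WP team's ingestion code).
# Requires the `warcio` package; the file path is an assumption.
from warcio.archiveiterator import ArchiveIterator


def iter_pages(warc_path: str):
    """Yield (url, payload_bytes) for each HTTP response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                payload = record.content_stream().read()
                yield url, payload


if __name__ == "__main__":
    for url, html in iter_pages("sample.warc.gz"):
        # Downstream steps would clean the HTML and index the text in Elasticsearch.
        print(url, len(html))
```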
  • CS5604 (Information Retrieval) Fall 2020 Front-end (FE) Team Project
    Cao, Yusheng; Mazloom, Reza; Ogunleye, Makanjuola (Virginia Tech, 2020-12-16)
    With the demand and abundance of information increasing over the last two decades, generations of computer scientists have been trying to improve the whole process of information searching, retrieval, and storage. With the diversification of information sources, users' requirements for the data have also changed drastically, both in terms of usability and performance. Due to the growth of the source material and requirements, correctly sorting, filtering, and storing data has given rise to many new challenges in the field. With the help of all four other teams on this project, we are developing an information retrieval, analysis, and storage system to retrieve data from Virginia Tech's Electronic Thesis and Dissertation (ETD), Twitter, and Web Page archives. We seek to provide an appropriate data research and management tool to the users to access specific data. The system will also give certain users the authority to manage and add more data to the system. This project's deliverable will be combined with four others to produce a system usable by Virginia Tech's library system to manage, maintain, and analyze these archives. This report introduces the system components and the design decisions regarding how the system has been planned and implemented. Our team has developed a front-end web interface that is able to search, retrieve, and manage three important content collection types: ETDs, tweets, and web pages. The interface incorporates a simple hierarchical user permission system, providing different levels of access to its users. In order to facilitate the workflow with other teams, we have containerized this system and made it available on the Virginia Tech cloud server. The system also makes use of a dynamic workflow system using a KnowledgeGraph and Apache Airflow, providing high levels of functional extensibility to the system. This allows curators and researchers to use containerized services for crawling, pre-processing, parsing, and indexing the custom corpora and collections that are available to them in the system.
  • CS 5604 2020: Information Storage and Retrieval TWT - Tweet Collection Management Team
    Baadkar, Hitesh; Chimote, Pranav; Hicks, Megan; Juneja, Ikjot; Kusuma, Manisha; Mehta, Ujjval; Patil, Akash; Sharma, Irith (Virginia Tech, 2020-12-16)
    The Tweet Collection Management (TWT) Team aims to ingest 5 billion tweets, clean this data, analyze the metadata present, extract key information, classify tweets into categories, and finally, index these tweets into Elasticsearch to support browsing and querying. The main deliverable of this project is a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects. As a starting point, we focused on two development goals: (1) hashtag-based and (2) username-based search for tweets. For IR1, we completed extraction of two fields within our sample collection: hashtags and username. Sample code for TwiRole, a user-classification program, was investigated for use in our project. We were able to sample from multiple collections of tweets, spanning topics like COVID-19 and hurricanes. Initial work encompassed using a sample collection, provided via Google Drive. NFS-based persistent storage was later used to allow access to larger collections. In total, we have developed 9 services to extract key information like username, hashtags, geo-location, and keywords from tweets. We have also developed services to allow for parsing and cleaning of raw API data, and backup of data in an Apache Parquet filestore. All services are Dockerized and added to the GitLab Container Registry. The services are deployed in the CS cloud cluster to integrate services into the full search engine workflow. A service was created to convert WARC files to JSON for reading archive files into the application. Unit testing of services is complete, and end-to-end tests have been conducted to improve system robustness and avoid failure during deployment. The TWT team has indexed 3,200 tweets into the Elasticsearch index. Future work could involve parallelization of the extraction of metadata, an alternative feature-flag approach, advanced geo-location inference, and adoption of the DMI-TCAT format. Key deliverables include a data body that allows for search, sort, filter, and visualization of raw tweet collections and metadata analysis; a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects; and a user guide to assist those using the system.
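    The field-extraction and Parquet-backup services described above amount to pulling hashtags, usernames, and cleaned text out of raw tweet JSON and writing them to columnar storage. The sketch below is a minimal illustration with pandas, not one of the team's nine services; the sample tweets, field names, and output path are assumptions, and Parquet output requires pyarrow.

```python
# Minimal tweet field-extraction sketch (illustrative; not the TWT team's services).
# The sample tweets, field names, and output path are assumptions; Parquet output needs pyarrow.
import re

import pandas as pd

raw_tweets = [
    {"id": 1, "user": {"screen_name": "vt_news"}, "text": "Eclipse viewing on the Drillfield #SolarEclipse"},
    {"id": 2, "user": {"screen_name": "storm_watch"}, "text": "Stay safe everyone #hurricane #weather"},
]


def extract(tweet: dict) -> dict:
    text = tweet["text"]
    return {
        "id": tweet["id"],
        "username": tweet["user"]["screen_name"],
        "hashtags": re.findall(r"#(\w+)", text),
        "mentions": re.findall(r"@(\w+)", text),
        "clean_text": re.sub(r"[#@]\w+|https?://\S+", "", text).strip(),
    }


df = pd.DataFrame([extract(t) for t in raw_tweets])
df.to_parquet("tweets_sample.parquet", index=False)  # Parquet backup, as in the abstract
print(df[["username", "hashtags"]])
```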
  • CS5604 Fall 2020: Electronic Thesis and Dissertation (ETD) Team
    Fan, Jiahui; Hardy, Nicolas; Furman, Samuel; Manzoor, Javaid; Nguyen, Alexander; Raghuraman, Aarathi (Virginia Tech, 2020-12-16)
    The Fall 2020 CS 5604 (Information Storage and Retrieval) class, led by Dr. Edward Fox, is building an information retrieval and analysis system that supports electronic theses and dissertations, tweets, and webpages. We are the Electronic Thesis and Dissertation Collection Management (ETD) team. The Virginia Tech Library maintains a corpus of 19,779 theses and 14,691 dissertations within the VTechWorks system. Our task was to research this data, implement data preprocessing and extraction techniques, index the data using Elasticsearch, and use machine learning techniques for each ETD's classification. These efforts were made in concert with teams working to process other content areas and build content-agnostic infrastructure. Prior work towards these tasks had been done in previous semesters of CS5604, and by students advised by Dr. Fox. That prior work serves as a foundation for our own work. Items of note were the works of Sampanna Kahu, Palakh Jude, and the Fall 2019 CS5604 CME team, which have been used either as part of our pipeline or as the starting point for our work. Our team divided the creation of an ETD IR system into five subtasks: verify metadata of past teams, ingest text, index using Elasticsearch, extract figures and tables, and classify full documents and chapters. Each member of the team was assigned to a role, and a timeline was created to keep everyone on track. Text ingestion was done via ITCore and is accessible through an API. Our team did not perform any base metadata extraction since the Fall 2019 CS5604 CME team had already done so; however, we still verified the quality of the data. Verification was done by hand and showed that most of the errors found in metadata from previous semesters were minor, but there were a few errors that could have led to misclassification. However, since those major errors were few and far between, we decided that, given the length of the project, we could continue to use this metadata, and we added improved metadata extraction to our future goals. For figure and table extraction, we incorporated the work of Kahu. For classification, we first implemented the work of Jude, who had done previous work related to chapter classification. In addition, we created another classifier that is more accurate. Both methods are available as accessible APIs. The latter classifier is also available as a microservice. In addition, an Elasticsearch service was created as a single point of contact between the pipeline and Elasticsearch. It acts as the final part of the pipeline; the processing is complete when the document is indexed into Elasticsearch. The final deliverable is a pipeline of containerized services that can be used to store and index ETDs. Staging of the pipeline was handled by the integration team using Airflow and a reasoner engine to control resource management and data flow. The entire pipeline is then accessible via a website created by the frontend team. Given that Airflow defines the application pipeline based on dependencies between services and our chapter extraction service failed to build due to MySQL dependencies, we were unable to deploy an end-to-end Airflow system. All other services have been unit tested on Git Runner, containerized, and deployed to cloud.cs.vt.edu following a CI/CD pipeline.
Future work includes expanding the available models for classification; expanding the available options for the extraction of text, figures, tables, and chapters; and adding more features that may be useful to researchers who would be interested in leveraging this pipeline. Another improvement would be to tackle some of the errors in metadata, such as that from previous teams.
  • Integration and Implementation (INT) CS5604 Fall 2019
    Agarwal, Rahul; Albahar, Hadeel; Roth, Eric; Sen, Malabika; Yu, Lixing (Virginia Tech, 2019-12-11)
    The first major goal of this project is to build a state-of-the-art information storage, retrieval, and analysis system that utilizes the latest technology and industry methods. This system is leveraged to accomplish the second major goal, supporting modern search and browse capabilities for two major content collections: (1) 200,000 ETDs (electronic theses and dissertations), and (2) 14 million settlement documents from the lawsuit wherein 39 U.S. states sued the major tobacco companies. The backbone of the information system is a Docker container cluster running with Rancher and Kubernetes. Information retrieval and visualization are accomplished with containers for Elasticsearch and Kibana, respectively. In addition to traditional searching and browsing, the system supports full-text and metadata searching. Search results include facets as a modern means of browsing among related documents. The system exercises text analysis and machine learning to reveal new properties of collection data. These new properties assist in the generation of available facets. Recommendations are also presented with search results based on associations among documents and with logged user activity. The information system is co-designed by 6 teams of Virginia Tech graduate students, all members of the same computer science class, CS 5604. Although the project is an academic exercise, it is the practice of the teams to work and interact as though they are groups within a company developing a product. These are the teams on this project: Collection Management ETDs (CME), Collection Management Tobacco Settlement Documents (CMT), Elasticsearch (ELS), Front-end and Kibana (FEK), Integration and Implementation (INT), and Text Analysis and Machine Learning (TML). This submission focuses on the work of the Integration (INT) team, which creates and administers Docker containers for each team in addition to administering the cluster infrastructure. Each container is a customized application environment that is specific to the needs of the corresponding team. For example, the ELS team container environment shall include Elasticsearch with its internally associated database. INT also administers the integration of the Ceph data storage system into the CS Department Cloud and provides support for interactions between containers and Ceph. During formative stages of development, INT also has a role in guiding team evaluations of prospective container components. Beyond the project formative stages, INT has the responsibility of deploying containers in a development environment according to mutual specifications agreed upon with each team. The development process is fluid. INT services team requests for new containers and updates to existing containers in a continuous integration process until the first system testing environment is completed. During the development stage INT also collaborates with the CME and CMT teams on the data pipeline subsystems for the ingestion and processing of new collection documents. With the testing environment established, the focus of the INT team shifts toward gathering of system performance data and making any systemic adjustments necessary based on the analysis of testing results. Finally, INT provides a production distribution that includes all embedded Docker containers and sub-embedded Git source code repositories. INT archives this distribution on Docker Hub and deploys it on the Virginia Tech CS Cloud.
  • Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019
    Kaushal, Kulendra Kumar; Kulkarni, Rutwik; Sumant, Aarohi; Wang, Chaoran; Yuan, Chenhan; Yuan, Liling (Virginia Tech, 2019-12-23)
    The class “CS 5604: Information Storage and Retrieval” in the fall of 2019 was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from Virginia Tech University Libraries’ VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing a custom front-end, Kibana, integration, implementation, text analytics, and machine learning. The results of our work will help future researchers study the natural-language-processed data using deep learning technologies and address the challenges of extracting information from ETDs. The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used advanced deep learning and other tools like GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; doing document segmentation; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. We finally developed a system that automates all the above-mentioned tasks. The system also validates the output metadata, thereby ensuring the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch.
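    GROBID, mentioned above, is typically run as a service and called over HTTP to turn a PDF into TEI XML. The sketch below is a minimal, illustrative client rather than the CME team's pipeline; the local server URL and PDF path are assumptions.

```python
# Minimal GROBID client sketch (illustrative; not the CME team's pipeline).
# Assumes a GROBID server running locally; the URL and PDF path are assumptions.
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def pdf_to_tei(pdf_path: str) -> str:
    """Send a PDF to GROBID and return the TEI XML it produces."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=300)
    resp.raise_for_status()
    return resp.text  # TEI XML containing metadata, sections, and references


if __name__ == "__main__":
    tei_xml = pdf_to_tei("sample_etd.pdf")
    print(tei_xml[:500])  # downstream code would parse this into chapter-wise text
```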
  • Collection Management Tobacco Settlement Documents (CMT) CS5604 Fall 2019
    Muhundan, Sushmethaa; Bendelac, Alon; Zhao, Yan; Svetovidov, Andrei; Biswas, Debasmita; Marin Thomas, Ashin (Virginia Tech, 2019-12-11)
    Consumption of tobacco causes health issues, both mental and physical. Despite this widely known fact, tobacco companies sustained a huge presence in the market over the past century owing to a variety of successful marketing strategies. This report documents the work of the Collection Management Tobacco Settlement Documents (CMT) team, the data ingestion team for the tobacco documents. We deal with an archive of tobacco documents that were produced during litigation between the United States and seven major tobacco industry organizations. Our aim is to process these documents and assist Dr. David M. Townsend, an assistant professor at Virginia Polytechnic Institute and State University (Virginia Tech) Pamplin College of Business, in his research toward understanding the marketing strategies of the tobacco companies. The team is part of a larger initiative: to build a state-of-the-art information retrieval and analysis system. We handle over 14 million tobacco settlement documents as part of this project. Our tasks include extracting the data as well as metadata from these documents. We cater to the needs of the Elasticsearch (ELS) team and the Text Analytics and Machine Learning (TML) team. We provide tobacco settlement data in suitable formats to enable them to process and feed the data into the information retrieval system. We have successfully processed both the metadata and the document texts into a usable format. For the metadata, this involved collaborating with the above-mentioned teams to come up with a suitable format. We retrieved the metadata from a MySQL database and converted it into JSON for Elasticsearch ingestion. For the data, this involved lemmatization, tokenization, and text cleaning. We have supplied the entire dataset to the ELS and TML teams. Data, as well as metadata of these documents, were cleaned and provided. Python scripts were used to query the database and output the results in the required format. We also closely interacted with Dr. Townsend to understand his research needs in order to guide the Front-end and Kibana (FEK) team in terms of insights about features that can be used for visualizations. In this way, the information retrieval system we build will add more value for our client.
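    The metadata conversion described above (query MySQL, emit JSON suitable for Elasticsearch ingestion) follows a common pattern. The sketch below shows one minimal way to do it with mysql-connector-python, writing newline-delimited JSON; it is not the team's script, and the connection settings, table, and columns are assumptions.

```python
# Minimal MySQL-to-JSON sketch (illustrative; not the CMT team's actual scripts).
# Connection settings, table name, and columns are assumptions; requires mysql-connector-python.
import json

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="reader", password="secret", database="tobacco"
)
cursor = conn.cursor(dictionary=True)  # rows come back as dicts keyed by column name
cursor.execute("SELECT id, title, author, doc_date FROM documents LIMIT 1000")

# Write newline-delimited JSON, a convenient shape for bulk ingestion pipelines.
with open("tobacco_metadata.jsonl", "w", encoding="utf-8") as out:
    for row in cursor:
        out.write(json.dumps(row, default=str) + "\n")

cursor.close()
conn.close()
```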
  • Front-End Kibana (FEK) CS5604 Fall 2019
    Powell, Edward; Liu, Han; Huang, Rong; Sun, Yanshen; Xu, Chao (Virginia Tech, 2020-01-13)
    During the last two decades, web search engines have been driven to new quality levels due to the continuous efforts made to optimize the effectiveness of information retrieval. More and more people are satisfied with their information retrieval experiences, and web search has gradually replaced older methods, where people obtained information from each other or from libraries. Information retrieval systems are in constant interaction with users and help users interpret and analyze data. Currently, we are building the front end of a search engine, where users can explore information related to Tobacco Settlement documents from the University of California, San Francisco, as well as the Electronic Theses and Dissertations (ETDs) of Virginia Tech (and possibly other sites). This submission introduces the current work of the front-end team to build a functional user interface, which is one of the key components of a larger project to build a state-of-the-art search engine for two large datasets. We also seek to understand how users search for data, and accordingly provide the users with more insight and utilities from the two datasets with the help of the visualization tool Kibana. A search website where users can explore the two datasets, the Tobacco Settlement dataset and the ETD dataset, has already been created. A number of functions of the search page have been implemented, for instance the login system, searching, filters, a Q&A page, and a visualization page.
  • Elasticsearch (ELS) CS5604 Fall 2019
    Li, Yuan; Chekuri, Satvik; Hu, Tianrui; Kumar, Soumya Arvind; Gill, Nicholas (Virginia Tech, 2019-12-12)
    We are building an Information Retrieval System that will work as a search engine to support searching, ranking, browsing, and recommendations for two large collections of data. The first collection is part of Virginia Tech's collection of Electronic Theses and Dissertations (ETDs). The Virginia Tech Library has a large collection of ETDs, and an effort is currently being made to digitize the pre-1997 theses and dissertations and load them into VTechWorks. Our data set contains over 30K ETDs. The second collection is of tobacco settlement documents; there are 14 million documents in this data set. We are using a Ceph container to store and retrieve information. To achieve its goals, the project has six teams: Collection Management ETDs, Collection Management Tobacco Settlement Documents, Elasticsearch, Front-end and Kibana, Integration and Implementation, and Text Analytics and Machine Learning. This report addresses the work performed by the Elasticsearch team. The Elasticsearch team helps to enable searching and browsing, which are supported by facets associated with information extracted from documents through analysis, classification, clustering, summarization, and other processing. The report describes the goals, gives an overview, and explains the process of implementation with Elasticsearch. The Elasticsearch team works closely with the Front-end and Kibana and Text Analytics and Machine Learning groups. The data ingested into Elasticsearch is provided to the Front End team for further visualization. Thus, the report also describes the connections established with the other groups, as a high-level overview of the course project. User manuals have been provided for the reference of the other groups.
  • Text Analytics and Machine Learning (TML) CS5604 Fall 2019
    Mansur, Rifat Sabbir; Mandke, Prathamesh; Gong, Jiaying; Bharadwaj, Sandhya M.; Juvekar, Adheesh Sunil; Chougule, Sharvari (Virginia Tech, 2019-12-29)
    In order to use the burgeoning amount of data for knowledge discovery, it is becoming increasingly important to build efficient and intelligent information retrieval systems. The challenge in information retrieval lies not only in fetching the documents relevant to a query but also in ranking them in the order of relevance. The large size of the corpora as well as the variety in the content and the format of information pose additional challenges in the retrieval process. This calls for the use of text analytics and machine learning techniques to analyze and extract insights from the data to build an efficient retrieval system that enhances the overall user experience. With this background, the goal of the Text Analytics and Machine Learning team is to suitably augment the document indexing and demonstrate a qualitative improvement in document retrieval. Further, we also plan to make use of document browsing and viewing logs to provide meaningful recommendations to the user. The goal of the class is to build an end-to-end information retrieval system for two document corpora, viz., Electronic Theses & Dissertations (ETDs) and Tobacco Settlement Records (TSRs). The ETDs are a collection of over 33,000 thesis and dissertation documents in VTechWorks at Virginia Tech. The challenge in building a retrieval system around this corpus lies in the distinct nature of ETDs as opposed to other well studied document formats such as conference/journal publications and web-pages. The TSR corpus consists of over 14M records covering formats ranging from letters and memos to image-based advertisements. We seek to understand the nature of both these corpora as well as the information need patterns of the users in order to augment the index based search with domain specific information using machine learning based methods. Extending prior experiments, we investigate reasons for the unbalanced nature of the clusters from the previous iterations of the K-Means algorithm on the tobacco data. In addition, we explore and present preliminary results of running Agglomerative Clustering on a small subset of the tobacco data. We also explored different pre-trained models for detecting sentiment. We identified a package, empath, that shows better results in identifying emotions in the tobacco deposition documents. Furthermore, we implemented text summarization based on both Latent Semantic Analysis and the Luhn Algorithm on the tobacco (article) data (38,038 documents). We also implemented text summarization on a sample ETD chapter dataset.
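    The K-Means experiments described above cluster documents represented as TF-IDF vectors. The sketch below shows that pattern on a toy corpus; it is illustrative only, not the TML team's experiments, and the corpus and number of clusters are assumptions.

```python
# Minimal document-clustering sketch (illustrative; not the TML team's experiments).
# The toy corpus and number of clusters are assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "internal memo on cigarette advertising budgets",
    "letter regarding a television advertising campaign",
    "deposition transcript on nicotine research",
    "laboratory report on nicotine and tar levels",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for label, doc in zip(km.labels_, docs):
    print(label, doc)
# Counting how many documents fall under each label is one simple way to spot the
# unbalanced clusters the abstract above mentions.
```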
  • Collection Management Tweets Project Fall 2017
    Khaghani, Farnaz; Zeng, Junkai; Bhuiyan, Momen; Tabassum, Anika; Bandyopadhyay, Payel (Virginia Tech, 2018-01-17)
    The report included in this submission documents the work by the Collection Management Tweets (CMT) team, which is a part of the bigger effort in CS5604 on building a state-of-the-art information retrieval and analysis system for the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects. The mission of the CMT team had two parts: 1) Cleaning 6.2 million tweets from two 2017 event collections named "Solar Eclipse" and "Las Vegas Shooting", and loading them into HBase, an open source, non-relational, distributed database that runs on the Hadoop distributed file system, in support of further use; and 2) Building and storing a social network for the tweet data using a triple-store. For the first part, our work included: A) Making use of the work done by the previous year's class group, where incremental updates were done, to introduce a faster development process of data collection and storing; B) Improving the performance of work done by the group from last year. Previously, the cleaning part, e.g., removing profane words, plus extracting hashtags and mentions, utilized Python. This becomes very slow when the dataset scales up. We introduced parallelization in our tweet cleaning process with the help of Scala and the Hadoop cluster, and made use of different Natural Language Processing libraries for stop word and profanity removal; C) Along with tweet cleaning, we also identified and stored Named Entity Recognition (NER) entries and Part-of-Speech (POS) tags with the tweets, which had not been done by the previous team. The cleaned data in HBase from this task is provided to the Classification team for spam detection and to the Clustering and Topic Analysis team for topic analysis. The Collection Management Webpage team uses the extracted URLs from the tweets for further processing. Finally, after the data is indexed by the SOLR team, the Front-End team visualizes the tweets for users, and provides access for searching and browsing. In addition to the aforementioned tasks, our responsibilities also included building a network of tweets. This entailed doing research into the types of databases that are appropriate for this graph. For storing the network, we used a triple-store database to record different types of edges and relationships in the graph. We also researched methods for ascribing importance to nodes and edges in our social networks once they were constructed, and analyzed our networks using these techniques.
  • CS5604 Information Storage and Retrieval Fall 2017 Solr Report
    Kumar, Abhinav; Bangad, Anand; Robertson, Jeff; Garg, Mohit; Ramesh, Shreyas; Mi, Siyu; Wang, Xinyue; Wang, Yu (Virginia Tech, 2018-01-15)
    The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets and millions of webpages for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. We are using a 21 node Cloudera Hadoop cluster to store and retrieve this information. One goal of this project is to expand the data collection to include more web archives and geospatial data beyond what previously had been collected. Another important part of this project is optimizing the current system to analyze and allow access to the new data. To accomplish these goals, this project is separated into 6 parts with corresponding teams: Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Front-end (FE), and SOLR. The report describes the work completed by the SOLR team, which improves the current searching and storage system. We include the general architecture and an overview of the current system. We present the part that Solr plays within the whole system in more detail. We discuss our goals, procedures, and conclusions on the improvements we made to the current Solr system. This report also describes how we coordinate with other teams to accomplish the project at a higher level. Additionally, we provide manuals for future readers who might need to replicate our experiments. The main components within the Cloudera Hadoop cluster that the SOLR team interacts with include: the Solr search engine, HBase database, Lily indexer, Hive database, HDFS file system, Solr recommendation plugin, and Mahout. Our work focuses on HBase design, data quality control, search recommendations, and result ranking. Overall, throughout the semester, we have processed 12,564 web pages and 5.9 million tweets. In order to cooperate with GeoBlacklight, we made major changes to the Solr schema. We also function as a data quality control gateway for the Front End team and deliver the finalized data to them. For search recommendations, we provide the MoreLikeThis plugin within Solr for recommending related records from search results, as well as a custom recommendation system based on user behavior to provide user-based search recommendations. After fine-tuning over the final weeks of the semester, we successfully connected the data provided by other teams and delivered it to the front end through a Solr core.
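    The MoreLikeThis recommendation mentioned above is usually exercised by adding MLT parameters to a Solr query. The sketch below sends such a query with the requests library and prints the similar documents Solr returns; it is illustrative only, not the team's configuration, and the core name, host, and field names are assumptions.

```python
# Minimal Solr MoreLikeThis query sketch (illustrative; not the SOLR team's setup).
# The core name, host, and field names are assumptions.
import requests

SOLR_SELECT = "http://localhost:8983/solr/tweets/select"

params = {
    "q": "id:tweet_12345",   # seed document to find similar records for
    "mlt": "true",            # enable the MoreLikeThis component
    "mlt.fl": "text",         # field(s) used for similarity
    "mlt.count": 5,           # number of similar documents to return
    "wt": "json",
}

resp = requests.get(SOLR_SELECT, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

# Similar documents come back under "moreLikeThis", keyed by the seed document id.
for seed_id, similar in data.get("moreLikeThis", {}).items():
    for doc in similar["docs"]:
        print(seed_id, "->", doc.get("id"))
```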
  • CS5604 Fall 2017 Clustering and Topic Analysis
    Baghudana, Ashish; Ahuja, Aman; Bellam, Pavan; Chintha, Rammohan; Sambaturu, Pratyush; Malpani, Ashish; Shetty, Shruti; Yang, Mo (Virginia Tech, 2018-01-13)
    One of the key objectives of the CS-5604 course titled Information Storage and Retrieval is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and develop a retrieval system to extract information from that archive. Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project. Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework, and finally, outline a user manual describing the fields that we populate in HBase.
  • CS5604 Fall 2017 Classification Team Submission
    Azizi, Ahmadreza; Mulchandani, Deepika; Naik, Amit; Ngo, Khai; Patil, Suraj; Vezvaee, Arian; Yang, Robin (Virginia Tech, 2018-01-03)
    This project submission includes the work of the 'Classification' team of the CS5604 'Information Storage and Retrieval' course of Fall 2017 toward the GETAR project. Classification of the GETAR data would allow users to analyze, visualize, and explore content related to crises, disasters, human rights, inequality, population growth, shootings, violence, etc. Binary classification models were trained for different events for both tweet and webpage collections. Word2Vec was used as the feature extraction technique, and the Word2Vec model was trained on the entire corpus available. Logistic Regression was used as our classification technique. As part of this submission, we detail our classification framework and the experiments that we conducted. We also give an insight into the challenges we faced, how we overcame those challenges, and what we learned in the process. We also provide the code that we implemented and the models that were built to classify 1,562,215 tweets and 4,366 webpages.
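    The Word2Vec-plus-Logistic-Regression approach described above is commonly implemented by averaging word vectors into a document vector and fitting a linear classifier on top. The sketch below illustrates that pattern with gensim and scikit-learn; it is not the team's models, and the toy corpus, labels, and hyperparameters are assumptions.

```python
# Minimal Word2Vec + Logistic Regression sketch (illustrative; not the CLA team's models).
# The toy corpus, labels, and hyperparameters are assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

texts = [
    "hurricane causes severe flooding along the coast",
    "storm surge floods several coastal towns",
    "new library opens on campus this fall",
    "university announces new engineering program",
]
labels = [1, 1, 0, 0]  # 1 = event-related, 0 = not related

tokens = [t.split() for t in texts]
# vector_size is the gensim 4.x parameter name (older versions call it `size`).
w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1, epochs=50, seed=0)


def doc_vector(words):
    """Average the word vectors of a document (ignoring out-of-vocabulary words)."""
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)


X = np.vstack([doc_vector(t) for t in tokens])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # sanity check on the training data
```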