CS5604: Information Retrieval
This collection contains the final projects of students in various offerings of the course Computer Science 5604: Information Retrieval, taught by Professor Ed Fox.
Analyzing, indexing, representing, storing, searching, retrieving, processing, and presenting information and documents using fully automatic systems. The information may be in the form of text, hypertext, multimedia, or hypermedia. The systems are based on various models (e.g., Boolean logic, fuzzy logic, probability theory) and are implemented using inverted files, relational thesauri, special hardware, and other approaches. Evaluation of the systems' efficiency and effectiveness.
Browsing CS5604: Information Retrieval by Content Type "Report"
Now showing 1 - 20 of 35
- CINETGraphCrawl - Constructing graphs from blogs
  Kaw, Rushi; Subbiah, Rajesh; Makkapati, Hemanth (2012-12-11)
  Internet forums, weblogs, social networks, and photo and video sharing websites are some of the forms of social media at the forefront of enabling communication among individuals. The rich information captured in social media has enabled a variety of behavioral research in domains such as marketing, finance, public health, and governance. Furthermore, social media is believed to be capable of providing valuable insights into information diffusion phenomena such as social influence, opinion formation, and rumor spread. Here, we propose a semi-automated approach, with a prototype implementation, that constructs interaction graphs to enable such behavioral studies. We construct first and second degree interaction graphs from Stackoverflow, a programming forum, and the CNN Political Ticker, a political news blog.
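As an illustration of the approach this abstract describes, here is a minimal sketch of building first and second degree interaction graphs with the networkx library. The thread structure, field names, and user names are hypothetical; the report does not specify the authors' actual implementation.

```python
# Sketch: building a first- and second-degree interaction graph from forum
# threads, in the spirit of the approach described above. The `threads`
# structure and field names are hypothetical assumptions; networkx is one
# reasonable graph library, not necessarily the one the authors used.
import networkx as nx

# Each thread: the asker plus the users who replied.
threads = [
    {"author": "alice", "repliers": ["bob", "carol"]},
    {"author": "bob",   "repliers": ["dave"]},
]

G = nx.DiGraph()
for t in threads:
    for r in t["repliers"]:
        # First-degree interaction: replier -> thread author.
        G.add_edge(r, t["author"], degree=1)
    # Second-degree interaction: repliers on the same thread.
    for i, u in enumerate(t["repliers"]):
        for v in t["repliers"][i + 1:]:
            G.add_edge(v, u, degree=2)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```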
- Clustering and Topic Analysis in CS 5604 Information Retrieval Fall 2016
  Bartolome, Abigail; Islam, M. D.; Vundekode, Soumya (Virginia Tech, 2016-12-08)
  The IDEAL (Integrated Digital Event Archiving and Library) and Global Event and Trend Archive Research (GETAR) projects aim to build a robust Information Retrieval (IR) system by retrieving tweets and webpages from social media and the World Wide Web, and indexing them to be easily retrieved and analyzed. The project has been divided into different segments: Classification (CLA), Collection Management (tweets - CMT and webpages - CMW), Clustering and Topic Analysis (CTA), SOLR, and Front-End (FE). In building IR systems, documents are scored for relevance. To assist in determining a document's relevance to a query, it is useful to know what topics are associated with the document and what other documents relate to it. We, as the CTA team, used topic analysis and clustering techniques to aid in building this IR system. Our contributions were useful in scoring which documents are most relevant to a user's query. We ran clustering and topic analysis algorithms on collections of tweets and webpages to identify the most discussed topics and to group the documents into clusters, along with their respective probabilities. We also labeled the topics and clusters, aiming for intuitive labels. The report and presentation cover the background, requirements, design, and implementation of our contributions to this project. We evaluated the quality of our methodologies and describe improvements and future work that could extend our project. Furthermore, we include a user manual and a developer manual to assist in any future work that may build on our efforts.
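A minimal sketch of the kind of clustering pipeline this abstract describes, using Spark ML (tokenize, TF-IDF, then k-means). The toy data, column names, and k are illustrative assumptions, not the team's actual configuration.

```python
# Sketch: clustering tweet text with Spark ML, roughly the kind of
# pipeline the CTA team describes (tokenize -> TF-IDF -> k-means).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("cta-kmeans").getOrCreate()
tweets = spark.createDataFrame(
    [("hurricane relief efforts underway",),
     ("solar eclipse visible across the US",)],
    ["text"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    KMeans(k=2, seed=42),  # k would be tuned on the real collections
])
model = pipeline.fit(tweets)
model.transform(tweets).select("text", "prediction").show(truncate=False)
```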
- Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019
  Kaushal, Kulendra Kumar; Kulkarni, Rutwik; Sumant, Aarohi; Wang, Chaoran; Yuan, Chenhan; Yuan, Liling (Virginia Tech, 2019-12-23)
  The class "CS 5604: Information Storage and Retrieval" in the fall of 2019 was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations (ETDs) maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from the Libraries' VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing a custom front-end, Kibana integration, implementation, text analytics, and machine learning. The results of our work will help future researchers study natural-language-processed data using deep learning technologies and address the challenges of extracting information from ETDs. The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We used advanced deep learning and other tools like GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; doing document segmentation; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. We finally developed a system that automates all the above-mentioned tasks. The system also validates the output metadata, thereby ensuring the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps ingest new documents into Elasticsearch.
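For context on the GROBID step mentioned above, here is a hedged sketch of calling a running GROBID service's full-text endpoint over REST. The host, port, file name, and output handling are assumptions; GROBID returns TEI XML from which sections and metadata can be parsed.

```python
# Sketch: extracting TEI-encoded full text from an ETD PDF via a running
# GROBID service, as one step of the kind the CME team describes.
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"  # assumed host/port

def pdf_to_tei(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=120)
    resp.raise_for_status()
    return resp.text  # TEI XML; chapter/section structure can be parsed from here

tei = pdf_to_tei("etd_12345.pdf")  # hypothetical file name
print(tei[:200])
```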
- Collection Management Tobacco Settlement Documents (CMT) CS5604 Fall 2019
  Muhundan, Sushmethaa; Bendelac, Alon; Zhao, Yan; Svetovidov, Andrei; Biswas, Debasmita; Marin Thomas, Ashin (Virginia Tech, 2019-12-11)
  Consumption of tobacco causes health issues, both mental and physical. Despite this widely known fact, tobacco companies have sustained a huge market presence over the past century owing to a variety of successful marketing strategies. This report documents the work of the Collection Management Tobacco Settlement Documents (CMT) team, the data ingestion team for the tobacco documents. We deal with an archive of tobacco documents that were produced during litigation between the United States and seven major tobacco industry organizations. Our aim is to process these documents and assist Dr. David M. Townsend, an assistant professor at the Virginia Polytechnic Institute and State University (Virginia Tech) Pamplin College of Business, in his research toward understanding the marketing strategies of the tobacco companies. The team is part of a larger initiative to build a state-of-the-art information retrieval and analysis system. We handle over 14 million tobacco settlement documents as part of this project. Our tasks include extracting the data as well as metadata from these documents. We cater to the needs of the ElasticSearch (ELS) team and the Text Analytics and Machine Learning (TML) team, providing tobacco settlement data in suitable formats so that they can process and feed the data into the information retrieval system. We have successfully processed both the metadata and the document texts into a usable format. For metadata, this involved collaborating with the above-mentioned teams to agree on a suitable format; we retrieved the metadata from a MySQL database and converted it into JSON for Elasticsearch ingestion. For the data, this involved lemmatization, tokenization, and text cleaning. We have supplied the entire dataset, both data and metadata cleaned, to the ELS and TML teams. Python scripts were used to query the database and output the results in the required format. We also interacted closely with Dr. Townsend to understand his research needs, in order to guide the Front-end and Kibana (FEK) team with insights about features that can be used for visualizations. This way, the information retrieval system we build adds more value for our client.
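A minimal sketch of the MySQL-to-JSON step this abstract describes, emitting newline-delimited JSON in Elasticsearch bulk format. The connection details, table, and column names are hypothetical, not the project's actual schema.

```python
# Sketch: pulling document metadata out of MySQL and emitting newline-
# delimited JSON suitable for Elasticsearch bulk ingestion.
import json
import mysql.connector  # from the mysql-connector-python package

conn = mysql.connector.connect(
    host="localhost", user="reader", password="...", database="tobacco")  # hypothetical
cur = conn.cursor(dictionary=True)
cur.execute("SELECT id, title, date, pages FROM documents LIMIT 1000")

with open("metadata.jsonl", "w") as out:
    for row in cur:
        # One action line plus one source line per document (bulk format).
        out.write(json.dumps({"index": {"_id": row["id"]}}) + "\n")
        out.write(json.dumps(row, default=str) + "\n")

cur.close()
conn.close()
```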
- Collection Management Tweets Project Fall 2017
  Khaghani, Farnaz; Zeng, Junkai; Bhuiyan, Momen; Tabassum, Anika; Bandyopadhyay, Payel (Virginia Tech, 2018-01-17)
  The report included in this submission documents the work of the Collection Management Tweets (CMT) team, part of the bigger effort in CS5604 to build a state-of-the-art information retrieval and analysis system for the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects. The mission of the CMT team had two parts: 1) cleaning 6.2 million tweets from two 2017 event collections, "Solar Eclipse" and "Las Vegas Shooting", and loading them into HBase, an open-source, non-relational, distributed database that runs on the Hadoop distributed file system, in support of further use; and 2) building and storing a social network for the tweet data using a triple-store. For the first part, our work included: A) making use of the work done by the previous year's class, which implemented incremental update, to speed up development of data collection and storing; B) improving the performance of the previous year's work: the cleaning steps, e.g., removing profanity and extracting hashtags and mentions, previously used Python, which becomes very slow as the dataset scales up, so we parallelized our tweet cleaning process with the help of Scala and the Hadoop cluster, and made use of different Natural Language Processing libraries for stop word and profanity removal; and C) along with tweet cleaning, identifying and storing Named-Entity Recognition (NER) entries and Part-of-Speech (POS) tags with the tweets, which was not done by the previous team. The cleaned data in HBase from this task is provided to the Classification team for spam detection and to the Clustering and Topic Analysis team for topic analysis. The Collection Management Webpages team uses the URLs extracted from the tweets for further processing. Finally, after the data is indexed by the SOLR team, the Front-End team visualizes the tweets for users and provides access for searching and browsing. In addition to the aforementioned tasks, our responsibilities included building a network of tweets. This entailed research into the types of database appropriate for this graph. For storing the network, we used a triple-store database to record different types of edges and relationships in the graph. We also researched methods for ascribing importance to nodes and edges in our social networks once they were constructed, and analyzed our networks using these techniques.
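The team reports parallelizing cleaning with Scala on the Hadoop cluster; as a hedged stand-in, here is an equivalent sketch in PySpark showing the same kind of cleaning and hashtag/mention extraction. The regexes, column names, and toy tweet are illustrative assumptions.

```python
# Sketch: parallel tweet cleaning, in the spirit of the distributed
# cleaning the CMT team describes (theirs used Scala, not PySpark).
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType

spark = SparkSession.builder.appName("cmt-clean").getOrCreate()
tweets = spark.createDataFrame(
    [("Totality soon! #SolarEclipse2017 @nasa https://t.co/x",)], ["text"])

hashtag_re = re.compile(r"#\w+")
mention_re = re.compile(r"@\w+")
url_re = re.compile(r"https?://\S+")

extract_hashtags = udf(lambda t: hashtag_re.findall(t), ArrayType(StringType()))
extract_mentions = udf(lambda t: mention_re.findall(t), ArrayType(StringType()))
clean = udf(lambda t: url_re.sub("", mention_re.sub("", t)).strip(), StringType())

cleaned = (tweets
           .withColumn("hashtags", extract_hashtags("text"))
           .withColumn("mentions", extract_mentions("text"))
           .withColumn("clean_text", clean("text")))
cleaned.show(truncate=False)
```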
- Collection Management Webpages
  Eagan, Mackenzie; Liang, Xiao; Michael, Louis; Patil, Supritha (Virginia Polytechnic Institute and State University, 2017-12-25)
  The Collection Management Webpages team is responsible for collecting, processing, and storing webpages from different sources. Our team worked on familiarizing ourselves with the tools and data required to produce the output used by other teams in this class (Fall 2017 CS 5604). Input includes URLs generated by the Event Focused Crawler (EFC), URLs obtained from tweets by the Collection Management Tweets team, and webpage content from Web ARChive (WARC) files from the Internet Archive or other sources. Our team fetches raw HTML from the obtained URLs and extracts HTML from WARC files. From this raw data, we obtain metadata about the corresponding webpage. The raw data is also cleaned and processed for other teams' consumption. This processing is accomplished using various Python libraries. The cleaned information is made available in a variety of formats, including tokens, stemmed or lemmatized text, and text tagged with parts of speech. Both the raw and processed webpage data are stored in HBase, and intermediately in HDFS (Hadoop Distributed File System). Our team successfully executed all individual portions of our proposed process. We ran the EFC and obtained URLs from these runs. Using these URLs, we created WARC files. We obtained the raw HTML, extracted metadata from it, and cleaned and processed the webpage information before uploading it to HBase. We iteratively expanded the functionality of our cleaning and processing scripts in order to provide more relevant information to other groups. We processed and cleaned information from WARC files provided by the instructor in a similar manner. We have acquired webpage data from URLs obtained by the Collection Management Tweets (CMT) team; at this time, however, there is no end-to-end process in place. Due to the volume of data our team has been dealing with, we explored various methods for parallelizing and speeding up our processes. Our team used the PySpark library for obtaining information from URLs and the multiprocessing library in Python for processing information stored in WARC files.
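A minimal sketch of extracting raw HTML from a WARC file, of the kind described above, using the warcio library (one common Python choice; the report does not name the exact library). The file name is hypothetical.

```python
# Sketch: pulling raw HTML response records out of a WARC file with warcio.
from warcio.archiveiterator import ArchiveIterator

with open("crawl-output.warc.gz", "rb") as stream:  # hypothetical file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request/metadata records
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()  # raw bytes of the page
        print(url, len(html))
```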
- Collection Management Webpages - Fall 2016 CS5604
  Dao, Tung; Wakeley, Christopher; Weigang, Liu (Virginia Tech, 2017-03-23)
  The Collection Management Webpages (CMW) team is responsible for collecting, processing, and storing webpages from different sources, including tweets from multiple collections and contributors, such as those related to events and trends studied in projects like IDEAL/GETAR, and webpage archives collected by Pranav Nakate, Mohamed Farag, and others. Based on webpage sources, we divide our work into the three following deliverable and manageable tasks. The first task is to fetch the webpages mentioned in the tweets that are collected by the Collection Management Tweets (CMT) team; those webpages are then stored in WARC files, processed, and loaded into HBase. The second task is to run focused crawls for all of the events mentioned in IDEAL/GETAR to collect relevant webpages; similar to the first task, we then store the webpages in WARC files, process them, and load them into HBase. The third task is similar to the first two, except that the webpages come from archives collected by people previously involved in the project. Since these tasks are time-consuming and sensitive to real-time processing requirements, it is essential that our approach be incremental, meaning that webpages are incrementally collected, processed, and stored to HBase. We have conducted multiple experiments for all three tasks, on our local machines as well as the cluster. For the second task, we manually collected a number of seed URLs for events, namely "South China Sea Disputes", "USA President Election 2016", and "South Korean President Protest", to train the focused event crawler, and then ran the trained model on a small number of URLs that were randomly generated as well as manually collected. Encouragingly, these experiments ran successfully; however, we still have to scale up the experimental data so that it can be run systematically on the cluster. The two main components to be further improved and tested are the HBase data connector and handler, and the focused event crawler. While focusing on our own tasks, the CMW team works closely with other teams whose inputs and outputs depend on our team. For example, the Front-End (FE) team might use our results for their front-end content. We reached agreements with the Classification (CLA) team on filtering and noise-reduction tasks, and made sure that we would get URLs in the right format from the Collection Management Tweets (CMT) team. In addition, two other teams, Clustering and Topic Analysis (CTA) and SOLR, will use our team's outputs for topic analysis and indexing, respectively. For instance, based on the SOLR team's requests and consensus, we have finalized a schema (i.e., specific fields of information) for a webpage to be collected and stored. In this final report, we present our CMW team's overall results and progress. Essentially, this report is a revised version of our three interim reports, based on Dr. Fox's and peer reviewers' comments. In addition to these revisions, we report our ongoing work, challenges, processes, evaluations, and plans.
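A hedged sketch of the "load into HBase" step mentioned above, using the happybase Thrift client. The Thrift host, table name, column families, and row-key scheme are assumptions; the class's real schema was agreed with the SOLR team as the abstract notes.

```python
# Sketch: loading a processed webpage record into HBase via happybase.
import happybase

connection = happybase.Connection("hbase-thrift-host", port=9090)  # assumed host
table = connection.table("webpages")  # hypothetical table name

def store_page(url: str, clean_text: str, collection: str) -> None:
    # Row key and column layout are illustrative, not the project's schema.
    table.put(url.encode("utf-8"), {
        b"content:clean_text": clean_text.encode("utf-8"),
        b"metadata:collection": collection.encode("utf-8"),
    })

store_page("http://example.com/article", "cleaned body text...", "elections")
connection.close()
```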
- CS 5604 2020: Information Storage and Retrieval TWT - Tweet Collection Management Team
  Baadkar, Hitesh; Chimote, Pranav; Hicks, Megan; Juneja, Ikjot; Kusuma, Manisha; Mehta, Ujjval; Patil, Akash; Sharma, Irith (Virginia Tech, 2020-12-16)
  The Tweet Collection Management (TWT) team aims to ingest 5 billion tweets, clean this data, analyze the metadata present, extract key information, classify tweets into categories, and finally, index these tweets into Elasticsearch to browse and query. The main deliverable of this project is a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects. As a starting point, we focused on two development goals: (1) hashtag-based and (2) username-based search for tweets. For IR1, we completed extraction of two fields within our sample collection: hashtags and username. Sample code for TwiRole, a user-classification program, was investigated for use in our project. We were able to sample from multiple collections of tweets, spanning topics like COVID-19 and hurricanes. Initial work used a sample collection provided via Google Drive; NFS-based persistent storage was later added to allow access to larger collections. In total, we have developed 9 services to extract key information like username, hashtags, geo-location, and keywords from tweets. We have also developed services for parsing and cleaning raw API data, and for backing up data in an Apache Parquet filestore. All services are Dockerized, added to the GitLab Container Registry, and deployed in the CS cloud cluster to integrate them into the full search engine workflow. A service was created to convert WARC files to JSON for reading archive files into the application. Unit testing of services is complete, and end-to-end tests have been conducted to improve system robustness and avoid failure during deployment. The TWT team has indexed 3,200 tweets into the Elasticsearch index. Future work could involve parallelization of metadata extraction, an alternative feature-flag approach, advanced geo-location inference, and adoption of the DMI-TCAT format. Key deliverables include a data body that allows for search, sort, filter, and visualization of raw tweet collections and metadata analysis; the running search application described above; and a user guide to assist those using the system.
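A minimal sketch of the Parquet backup step mentioned above, using pandas with pyarrow. The field names loosely follow the Twitter API and are assumptions here, as is the output file name.

```python
# Sketch: backing up parsed tweets to an Apache Parquet filestore.
import pandas as pd

tweets = pd.DataFrame([
    {"id": "1", "user": "alice", "hashtags": ["covid19"], "text": "..."},
    {"id": "2", "user": "bob", "hashtags": ["hurricane"], "text": "..."},
])

# Requires pyarrow (or fastparquet) to be installed.
tweets.to_parquet("tweets-backup.parquet", engine="pyarrow", index=False)

restored = pd.read_parquet("tweets-backup.parquet")
print(restored.shape)
```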
- CS 5604 INFORMATION STORAGE AND RETRIEVAL Front-End Team Fall 2016 Final Report
  Kohler, Rachel; Tasooji, Reza; Sullivan, Patrick (Virginia Tech, 2016-12-08)
  Information Retrieval systems are a common tool for supporting research and disseminating knowledge. For this to be possible, these systems must be able to effectively show varying amounts of relevant information to the user. The information retrieval system is in constant interaction with the user, who can modify the direction of their search as they gain more information. The front-end of the information retrieval system is where this important communication happens. As members of Dr. Fox's class on Information Storage and Retrieval, we are tasked with understanding and making progress toward answering the question: how can we best build a state-of-the-art information retrieval and analysis system in support of the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects? As the front-end design and development team, our responsibility in this project is to create an interface for users to explore large collections of tweet and webpage data. Our goal in this research effort is to understand how users search for information and to support these efforts with an accurate and usable interface. We support various methods of searching, such as query-driven search, faceted search and browsing, and filtering of information by topic. We implemented user management and logging to support future work in recommendations. Additionally, we integrated a framework for future efforts in providing users with insightful visualizations that will allow them to explore social network and document interrelation data.
- CS 5604: Information Storage and Retrieval - Webpages (WP) Team
  Barry-Straume, Jostein; Vives, Cristian; Fan, Wentao; Tan, Peng; Zhang, Shuaicheng; Hu, Yang; Wilson, Tishauna (Virginia Tech, 2020-12-18)
  The first major goal of this project is to build a state-of-the-art information retrieval engine for searching webpages and for opening up access to existing and new webpage collections resulting from Digital Library Research Laboratory (DLRL) projects relating to eventsarchive.org. The task of the Webpages (WP) team was to make any archived webpage accessible and indexed. The webpages can be obtained either through event focused crawlers or from collections of data, such as WARC files containing webpages or sets of tweets with embedded URLs. Toward completion of the project, the WP team worked on four major tasks: 1) making the contents of WARC files searchable through Elasticsearch; 2) cleaning the contents of WARC files and making them searchable through Elasticsearch; 3) running an event focused crawler and producing WARC files; and 4) making additional extracted/derived information (e.g., dates, classes) searchable. The foundation of the software is a Docker container cluster employing Airflow, a Reasoner, and Kubernetes. The raw data of the given webpage collections is stored using the Network File System (NFS), while Ceph provides persistent storage for the Docker containers. Retrieval and analysis of the webpage collections are carried out with Elasticsearch, and visualization with Kibana. These two technologies form an Elastic Stack application which serves as the vehicle with which the WP team indexes, maps, and stores the processed data and model outputs for the webpage collections. The software was co-designed by 7 Virginia Tech graduate students, all members of the same computer science class, CS 5604: Information Storage and Retrieval, taught by Professor Edward A. Fox. Dr. Fox structures the class so that his students perform in a "mock" business development setting; the academic project submitted by the WP team can, for all intents and purposes, be viewed as a microcosm of software development within a corporate structure. This submission focuses on the work of the WP team, which creates and administers Docker containers such that various services are tested and deployed as a whole. Said services pertain solely to the ingestion, cleansing, analysis, extraction, classification, and indexing of webpages and their respective content.
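A hedged sketch of indexing cleaned webpage text into Elasticsearch with the official Python client (8.x-style keyword arguments). The index name, document fields, and URL are assumptions, not the WP team's actual schema.

```python
# Sketch: indexing cleaned WARC page text into Elasticsearch and querying it.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

doc = {
    "url": "http://example.com/article",
    "title": "Example article",
    "clean_text": "cleaned body text extracted from the WARC record...",
    "crawl_date": "2020-11-01",
}
# `document=` / `query=` are 8.x client keywords (7.x used `body=`).
es.index(index="webpages", id=doc["url"], document=doc)

hits = es.search(index="webpages",
                 query={"match": {"clean_text": "article"}})
print(hits["hits"]["total"])
```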
- CS5604 (Information Retrieval) Fall 2020 Front-end (FE) Team Project
  Cao, Yusheng; Mazloom, Reza; Ogunleye, Makanjuola (Virginia Tech, 2020-12-16)
  With the demand for and abundance of information increasing over the last two decades, generations of computer scientists have been trying to improve the whole process of information searching, retrieval, and storage. With the diversification of information sources, users' requirements for the data have also changed drastically, both in terms of usability and performance. Due to the growth of the source material and requirements, correctly sorting, filtering, and storing it has given rise to many new challenges in the field. With the help of the four other teams on this project, we are developing an information retrieval, analysis, and storage system to retrieve data from Virginia Tech's Electronic Thesis and Dissertation (ETD), Twitter, and Web Page archives. We seek to provide an appropriate data research and management tool for users to access specific data. The system will also give certain users the authority to manage and add more data to the system. This project's deliverable will be combined with four others to produce a system usable by Virginia Tech's library system to manage, maintain, and analyze these archives. This report introduces the system components and the design decisions regarding how it has been planned and implemented. Our team has developed a front-end web interface that is able to search, retrieve, and manage three important content collection types: ETDs, tweets, and web pages. The interface incorporates a simple hierarchical user permission system, providing different levels of access to its users. In order to facilitate the workflow with other teams, we have containerized this system and made it available on the Virginia Tech cloud server. The system also makes use of a dynamic workflow system using a KnowledgeGraph and Apache Airflow, providing high levels of functional extensibility. This allows curators and researchers to use containerized services for crawling, pre-processing, parsing, and indexing the custom corpora and collections available to them in the system.
- CS5604 Fall 2016 Classification Team Final Report
  Williamson, Eric R.; Chakravarty, Saurabh (Virginia Tech, 2016-12-08)
  Content is generated on the Web at an exponential rate. The type of content varies from text on a traditional webpage to text on social media portals (e.g., social network sites and microblogs). One such example of social media is the microblogging site Twitter. Twitter is known for its high level of activity during live events, natural disasters, and events of global importance. Improving text classification results on Twitter data would pave the way to categorizing tweets into human-defined real-world events. This would allow diverse stakeholder communities to interactively collect, organize, browse, visualize, analyze, summarize, and explore content and sources related to crises, disasters, human rights, inequality, population growth, resiliency, shootings, sustainability, violence, etc. One challenge with data in the Twitter universe is that tweet length is limited to 140 characters. Because of this limitation, the vocabulary in the Twitter universe has taken its own form, with short abbreviations of sentences, emojis, hashtags, and other non-standard usage of written language. Consequently, traditional text classification techniques are not effective on tweets. Sophisticated text processing techniques like cleaning, lemmatizing, and removal of stop words and special characters give us clean text, which can be further processed to derive richer word semantic and syntactic relationships using state-of-the-art feature selection techniques like Word2Vec. Machine learning techniques using word features that capture semantic and context relationships have been shown to give state-of-the-art classification accuracy. To check the efficacy of our classifier, we compared our experimental results with those of an association rules (AR) classifier. This classifier composes its rules around the most discriminating words in the training data. The hierarchy of rules, along with the ability to tune the support threshold, makes it an effective classifier for scenarios involving short text. We developed a system where we read the tweets from HBase and write the classification label back after the classification step. We use domain-oriented pre-processing on the tweets, and Word2Vec as the feature selection and transformation technique. We use a multi-class Logistic Regression algorithm for our classifier. We are able to achieve an F1 score of 0.96 when classifying a test set of 320 tweets across 9 classes; the AR classifier achieved an F1 score of 0.90 on the same data. Our system can classify collections of any size by utilizing a 20-node Hadoop cluster in parallel, through Spark. Our experiments suggest that the high accuracy of our classifier can be attributed primarily to the pre-processing and feature selection techniques we used. Understanding the Twitter universe vocabulary helped us frame the text cleaning and pre-processing rules used to eliminate noise from the text. The Word2Vec feature selection technique captures the word contexts in a low-dimensional feature space, resulting in high classification accuracy and low model training time. Executing our classification pipeline in a distributed fashion on the Spark framework allows us to classify large collections without running into out-of-memory exceptions.
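A minimal sketch of the Word2Vec-plus-Logistic-Regression pipeline this abstract describes, in Spark ML. The toy labels, vector size, and data are assumptions; the real system read tweets from HBase on a 20-node cluster.

```python
# Sketch: Word2Vec features feeding Logistic Regression in a Spark ML
# pipeline, roughly as the Classification team describes.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("cla-w2v-lr").getOrCreate()
train = spark.createDataFrame(
    [("flood waters rising downtown", 0.0),
     ("voters head to the polls today", 1.0)],
    ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    # Word2Vec averages word vectors into one feature vector per tweet.
    Word2Vec(inputCol="tokens", outputCol="features",
             vectorSize=100, minCount=1),
    LogisticRegression(maxIter=50),  # handles multi-class labels as well
])
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
```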
- CS5604 Fall 2016 Solr Team Project Report
  Li, Liuqing; Pillai, Anusha; Wang, Ye; Tian, Ke (Virginia Tech, 2016-12-07)
  This submission describes the work the SOLR team completed in Fall 2016. It includes the final report and presentation, as well as key relevant materials (indexing scripts and Java code). Building on the work from Spring 2016, the SOLR team improved the general search infrastructure supporting the IDEAL and GETAR projects, both funded by NSF. The main responsibility was to configure Basic Indexing and Incremental Indexing (Near Real Time, NRT, indexing) for tweet and web page collections in DLRL's Hadoop cluster. The goal of Basic Indexing was to index the big collection, which contains more than 1.2 billion tweets. The idea of NRT Indexing was to monitor real-time changes in HBase and update the Solr results as appropriate. The main motivation behind Custom Ranking was to design and implement a new scoring function to re-rank the retrieved results in Solr. Based on text similarity, a basic document recommender was also created to retrieve documents similar to a given one. Finally, we wrote new manuals to make it easier for users and developers to get familiar with Solr and its relevant tools. Throughout the semester we closely collaborated with the Collection Management Tweets (CMT), Collection Management Webpages (CMW), Classification (CLA), Clustering and Topic Analysis (CTA), and Front-End (FE) teams in getting requirements, input data, and suggestions for data visualization.
- CS5604 Fall 2017 Classification Team Submission
  Azizi, Ahmadreza; Mulchandani, Deepika; Naik, Amit; Ngo, Khai; Patil, Suraj; Vezvaee, Arian; Yang, Robin (Virginia Tech, 2018-01-03)
  This project submission includes the work of the Classification team of the CS5604 Information Storage and Retrieval course of Fall 2017 toward the GETAR project. Classification of the GETAR data allows users to analyze, visualize, and explore content related to crises, disasters, human rights, inequality, population growth, shootings, violence, etc. Binary classification models were trained for different events for both tweet and webpage collections. Word2Vec was used as the feature selection technique, and the Word2Vec model was trained on the entire available corpus. Logistic Regression was used as our classification technique. As part of this submission, we detail our classification framework and the experiments that we conducted. We also give insight into the challenges we faced, how we overcame them, and what we learned in the process. We also provide the code that we implemented and the models that were built to classify 1,562,215 tweets and 4,366 webpages.
- CS5604 Fall 2017 Clustering and Topic Analysis
  Baghudana, Ashish; Ahuja, Aman; Bellam, Pavan; Chintha, Rammohan; Sambaturu, Pratyush; Malpani, Ashish; Shetty, Shruti; Yang, Mo (Virginia Tech, 2018-01-13)
  One of the key objectives of the CS-5604 course, titled Information Storage and Retrieval, is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and to develop a retrieval system to extract information from that archive. Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project. Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework, and finally, a user manual describing the fields that we populate in HBase.
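A minimal sketch of the latent Dirichlet allocation step described above, using gensim on a toy corpus. The documents and num_topics are illustrative assumptions; the real runs used large tweet/webpage collections.

```python
# Sketch: LDA topic modeling with gensim over a toy, pre-tokenized corpus.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["eclipse", "sun", "moon", "totality"],
    ["shooting", "police", "vegas", "victims"],
    ["eclipse", "glasses", "viewing", "sun"],
]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]  # bag-of-words per document

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2,
               random_state=42, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```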
- CS5604 Fall 2020: Electronic Thesis and Dissertation (ETD) Team
  Fan, Jiahui; Hardy, Nicolas; Furman, Samuel; Manzoor, Javaid; Nguyen, Alexander; Raghuraman, Aarathi (Virginia Tech, 2020-12-16)
  The Fall 2020 CS 5604 (Information Storage and Retrieval) class, led by Dr. Edward Fox, is building an information retrieval and analysis system that supports electronic theses and dissertations, tweets, and webpages. We are the Electronic Thesis and Dissertation Collection Management (ETD) team. The Virginia Tech Library maintains a corpus of 19,779 theses and 14,691 dissertations within the VTechWorks system. Our task was to research this data, implement data preprocessing and extraction techniques, index the data using Elasticsearch, and use machine learning techniques to classify each ETD. These efforts were made in concert with teams working to process other content areas and to build content-agnostic infrastructure. Prior work toward these tasks had been done in previous semesters of CS5604 and by students advised by Dr. Fox; that work serves as a foundation for our own. Items of note were the works of Sampanna Kahu, Palakh Jude, and the Fall 2019 CS5604 CME team, which have been used either as part of our pipeline or as the starting point for our work. Our team divided the creation of an ETD IR system into five subtasks: verify metadata of past teams, ingest text, index using Elasticsearch, extract figures and tables, and classify full documents and chapters. Each member of the team was assigned a role, and a timeline was created to keep everyone on track. Text ingestion was done via ITCore and is accessible through an API. Our team did not perform any base metadata extraction, since the Fall 2019 CS5604 CME team had already done so; however, we did verify the quality of the data. Verification was done by hand and showed that most of the errors found in metadata from previous semesters were minor, but there were a few errors that could have led to misclassification. Since those major errors were few and far between, we decided that, given the length of the project, we could continue to use this metadata, and we added improved metadata extraction to our future goals. For figure and table extraction, we incorporated the work of Kahu. For classification, we first implemented the work of Jude, who had done previous work on chapter classification; in addition, we created another, more accurate classifier. Both methods are available as accessible APIs, and the latter classifier is also available as a microservice. In addition, an Elasticsearch service was created as a single point of contact between the pipeline and Elasticsearch. It acts as the final part of the pipeline: the processing is complete when the document is indexed into Elasticsearch. The final deliverable is a pipeline of containerized services that can be used to store and index ETDs. Staging of the pipeline was handled by the integration team using Airflow and a reasoner engine to control resource management and data flow. The entire pipeline is then accessible via a website created by the front-end team. Given that Airflow defines the application pipeline based on dependencies between services, and our chapter extraction service failed to build due to MySQL dependencies, we were unable to deploy an end-to-end Airflow system. All other services have been unit tested on Git Runner, containerized, and deployed to cloud.cs.vt.edu following a CI/CD pipeline.
  Future work includes expanding the available models for classification; expanding the available options for the extraction of text, figures, tables, and chapters; and adding features useful to researchers interested in leveraging this pipeline. Another improvement would be to tackle some of the errors in the metadata inherited from previous teams.
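A hedged sketch of a "single point of contact" indexing service like the one described above, wrapping Elasticsearch behind a small FastAPI endpoint. The route, index name, fields, and URL are hypothetical, not the team's actual API.

```python
# Sketch: a minimal indexing microservice that finishes the pipeline by
# writing a processed ETD record into Elasticsearch.
from fastapi import FastAPI
from pydantic import BaseModel
from elasticsearch import Elasticsearch

app = FastAPI()
es = Elasticsearch("http://localhost:9200")  # assumed endpoint

class ETDDocument(BaseModel):
    etd_id: str
    title: str
    abstract: str
    classification: str  # label produced by an upstream classifier service

@app.post("/index")
def index_document(doc: ETDDocument):
    # Indexing completes the pipeline: the ETD becomes searchable here.
    es.index(index="etds", id=doc.etd_id, document=doc.dict())
    return {"status": "indexed", "etd_id": doc.etd_id}
```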
- CS5604 Fall 2022 - Team 5 INT
  Shukla, Anmol; Travasso, Aaron; Manogaran, Harish Babu; Sisodia, Pallavi Kishor; Li, Yuze (Virginia Tech, 2022-01-08)
  The primary objective of the project is to build a state-of-the-art system to search and retrieve relevant information effectively from a large corpus of electronic theses and dissertations. The system is targeted at documents such as academic textbooks, dissertations, and theses, where the information available is enormous compared to the websites or blogs that conventional search engines are equipped to handle effectively. The work involved in developing the system has been divided into five areas: data management (Team 1, Curator); search and retrieval (Team 2, User); object detection and topic analysis (Team 3, Objects & Topics); language models, classification, summarization, and segmentation (Team 4, Classification & Summarization); and integration (Team 5, Integration). The teams and their operations are structured to mirror a company working on new product development. The Integration (INT) team focuses on setting up work environments with all requirements for the teams, integrating the work done by the other four teams, and deploying suitable Docker containers for seamless operation (workflow), along with maintaining the cluster infrastructure. The INT team archives this distribution of code and containers on the Virginia Tech Docker Container Registry and deploys it on the Virginia Tech CS Cloud. The INT team also guides team evaluations of prospective container components and workflows. Additionally, the team implements continuous integration and continuous deployment to enable seamless integration, building, and testing of code as it is developed. Furthermore, the team set up a workflow management system that employs Apache Airflow to automate the creation, scheduling, and monitoring of workflows. We have created customized containers for each team based on their individual requirements. We have developed a workflow management system using Apache Airflow that creates and manages workflows to achieve the goals of each team, such as indexing, object detection, segmentation, summarization, and classification. We have also implemented a Continuous Integration and Continuous Deployment (CI/CD) pipeline to automatically create, test, and deploy the updated image whenever a new push is made to a Git repository. Additionally, we supported other teams in troubleshooting the issues they faced in deployment. Our current cluster statistics (i.e., Kubernetes resource definitions) are: 45 deployments, 40 ingresses, 39 pods, 180 services, and 13 secrets. Lastly, the INT team would like to express its gratitude for the substantial work of the INT 2020 team and other predecessors, upon which we built, and to acknowledge their significant contribution.
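A minimal sketch of an Airflow DAG of the kind the INT team describes, chaining per-team stages. The task names and trivial Python callables are placeholders; the real stages ran as Docker containers managed in the cluster.

```python
# Sketch: a small Airflow 2.x DAG chaining pipeline stages
# (index -> object detection -> summarization).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_stage(stage: str) -> None:
    print(f"running {stage}")  # the real tasks would invoke team services

with DAG(dag_id="etd_pipeline",
         start_date=datetime(2022, 9, 1),
         schedule_interval=None,  # triggered manually per batch
         catchup=False) as dag:
    index = PythonOperator(task_id="index",
                           python_callable=run_stage, op_args=["index"])
    detect = PythonOperator(task_id="object_detection",
                            python_callable=run_stage, op_args=["detect"])
    summarize = PythonOperator(task_id="summarization",
                               python_callable=run_stage, op_args=["summarize"])
    index >> detect >> summarize  # downstream tasks wait on upstream ones
```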
- CS5604 Information Storage and Retrieval Fall 2017 Solr Report
  Kumar, Abhinav; Bangad, Anand; Robertson, Jeff; Garg, Mohit; Ramesh, Shreyas; Mi, Siyu; Wang, Xinyue; Wang, Yu (Virginia Tech, 2018-01-15)
  The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets and millions of webpages for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. We are using a 21-node Cloudera Hadoop cluster to store and retrieve this information. One goal of this project is to expand the data collection to include more web archives and geospatial data beyond what had previously been collected. Another important part of this project is optimizing the current system to analyze, and allow access to, the new data. To accomplish these goals, the project is separated into 6 parts with corresponding teams: Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Front-end (FE), and SOLR. This report describes the work completed by the SOLR team, which improves the current searching and storage system. We include the general architecture and an overview of the current system, and present the part that Solr plays within the whole system in more detail. We discuss our goals, procedures, and conclusions on the improvements we made to the current Solr system. This report also describes how we coordinated with other teams to accomplish the project at a higher level. Additionally, we provide manuals for future readers who might need to replicate our experiments. The main components within the Cloudera Hadoop cluster that the SOLR team interacts with include: the Solr search engine, HBase database, Lily indexer, Hive database, HDFS file system, Solr recommendation plugin, and Mahout. Our work focuses on HBase design, data quality control, search recommendations, and result ranking. Overall, throughout the semester, we processed 12,564 web pages and 5.9 million tweets. In order to cooperate with GeoBlacklight, we made major changes to the Solr schema. We also functioned as a data quality control gateway for the Front-End team and delivered the finalized data to them. For search recommendation, we used the MoreLikeThis plugin within Solr to recommend related records from search results, and built a custom recommendation system based on user behavior to provide user-based search recommendations. After fine tuning over the final weeks of the semester, we successfully connected results built from data provided by other teams and delivered them to the front end through a Solr core.
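A hedged sketch of querying Solr's MoreLikeThis handler to recommend records similar to a given document, as mentioned above. The core name, field names, seed document id, and Solr URL are assumptions.

```python
# Sketch: asking Solr's MoreLikeThis handler for records similar to a seed
# document, the kind of related-record recommendation described above.
import requests

SOLR_MLT = "http://localhost:8983/solr/getar/mlt"  # hypothetical core

params = {
    "q": "id:tweet_12345",    # seed document (hypothetical id)
    "mlt.fl": "clean_text",   # field(s) to mine for similar terms
    "mlt.mindf": 1,
    "mlt.mintf": 1,
    "rows": 5,
    "wt": "json",
}
resp = requests.get(SOLR_MLT, params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"))
```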
- CS5604: Information and Storage Retrieval Fall 2016 - CMT (Collection Management Tweets)
  Wagner, Mitchell J.; Abidi, Faiz; Fan, Shuangfei (Virginia Tech, 2016-12-08)
  As the Collection Management Tweets team in the Fall 2016 CS5604 class, we were responsible for processing >1.2 billion tweets, including data transfer, noise reduction, tweet augmentation, and storage via several technologies. Our work was the first step in a pipeline that included many teams and ultimately culminated in a comprehensive information retrieval system. We were also responsible for building a social network (or set of networks) for those tweets, along with their tweeters. In this report, we detail our experience with this project. Additionally, we propose solutions for transferring incremental database updates from MySQL to HDFS and subsequently to HBase, derive a graph structure and relationships from entities identified in tweet collections, and offer a query-independent method for estimating the importance of those entities. We achieve these goals through the use of several open-source software packages, and present open, scalable solutions addressing the objectives we were given.
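For the query-independent importance estimate mentioned above, here is a sketch using PageRank over a small interaction graph. PageRank is one standard choice for this kind of node-importance method, not necessarily the one the team used; the edges and user names are toy assumptions.

```python
# Sketch: query-independent importance of users in a tweet interaction
# graph via PageRank (illustrative; the report's exact method may differ).
import networkx as nx

G = nx.DiGraph()
# Edges point from the mentioning/retweeting user to the mentioned user.
G.add_edges_from([
    ("alice", "bob"), ("carol", "bob"), ("dave", "alice"), ("bob", "alice"),
])

scores = nx.pagerank(G, alpha=0.85)
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {score:.3f}")
```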
- CS5604: Information and Storage Retrieval Fall 2017 - FE (Front-End Team)
  Chon, Jieun; Wang, Haitao; Bian, Yali; Niu, Shuo (Virginia Tech, 2017-12-24)
  Social media and Web data are becoming important sources of information for researchers to monitor and study global events. GETAR, led by Dr. Edward Fox, is a project aiming to collect, organize, browse, visualize, study, analyze, summarize, and explore content and sources related to biodiversity, climate change, crises, disasters, elections, energy policy, environmental policy/planning, geospatial information, green engineering, human rights, inequality, migrations, nuclear power, population growth, resiliency, shootings, sustainability, violence, etc. This report introduces the work of the Front-End (FE) team in analyzing users' requirements and building user interfaces for people to explore tweet/webpage data. The work of the FE team relies heavily on the results from other teams. Our duties include presenting the collected tweets/webpages, visualizing the clusters and topics, showing the indexed and clustered search results, and, last but not least, allowing users to perform customized queries and exploration. Therefore, the team needed to consider how other teams collect and manage the data, as well as how people use the information to gain insights from the data repository. Throughout Fall 2017, our team aimed to bridge the data archive and users' needs, focusing on providing various user interfaces for tweet/webpage exploration and analysis. Overall, two main user interfaces were designed and implemented during the semester: (1) a visualization-based analytical tool for people to create categories by searching and interacting with filtering tools, presented in visualizations such as a bar chart, tag cloud, and node-link graph; and (2) a geo-based interface for location-based information, implemented with GeoBlacklight, enabling users to view tweets/webpages on maps. This report documents the background, plans, schedule, design, implementation, software installation, and other related useful information. We used Solr and a triple-store to provide data, and the "getar-cs5604f17-final_shard1_replica1" collection was used in the final testing and delivery. An overview of the team's work and detailed design and implementation are both provided. We highlight the visualization-based interface and the location-based interface, as they provide visual tools for people to better understand the data collected by all the teams. We seek to explain how we extracted users' requirements, how user needs are reflected in light of the related literature, and how that led to the design of the visualization and geo-interface. An installation manual is also provided, to help other software engineers who will keep working on GETAR reuse our work.