CS5604: Information Retrieval
This collection contains the final projects of the students in various offerings of the course Computer Science 5604: Information Retrieval. This course is taught by Professor Ed Fox.
Analyzing, indexing, representing, storing, searching, retrieving, processing and presenting information and documents using fully automatic systems. The information may be in the form of text, hypertext, multimedia, or hypermedia. The systems are based on various models, e.g., Boolean logic, fuzzy logic, probability theory, etc., and they are implemented using inverted files, relational thesauri, special hardware, and other approaches. Evaluation of the systems' efficiency and effectiveness.
Browsing CS5604: Information Retrieval by Title
Now showing 1 - 20 of 57
- Analyzing and Visualizing Disaster Phases from Social Media Streams. Lin, Xiao; Chen, Liangzhe; Wood, Andrew (2012-12-11). Working under the direction of CTRNet, we developed a procedure for classifying Twitter data related to natural/man-made disasters into one of the Four Phases of Emergency Management (response, recovery, mitigation, and preparedness) as well as a multi-view system for visualizing the resulting data.
- CINET GDS-Calculator: Graph Dynamical Systems Visualization. Wu, Sichao; Zhang, Yao (2012-12-10). This report summarizes the Graph Dynamical Systems Visualization project, a subproject under the umbrella of the CINET project. Based on the input information, we extract the characteristics of the system dynamics and output corresponding diagrams and charts that reflect the dynamical properties of the system, providing an intuitive and easy way for researchers to analyze GDSs. The introduction gives background information about graph dynamical systems and their applications. Then we present the requirement analysis, including the tasks of the project as well as the challenges we met. Next, the report records the system development process, including the workflow, user's manual, and developer's manual. Finally, we summarize the future work. This report can serve as a combination of user's manual, developer's manual, and system manual.
- CINETGraphCrawl - Constructing graphs from blogs. Kaw, Rushi; Subbiah, Rajesh; Makkapati, Hemanth (2012-12-11). Internet forums, weblogs, social networks, and photo and video sharing websites are some forms of social media that are at the forefront of enabling communication among individuals. The rich information captured in social media has enabled a variety of behavioral research assisting domains such as marketing, finance, public health, and governance. Furthermore, social media is believed to be capable of providing valuable insights into information diffusion phenomena such as social influence, opinion formation, and rumor spread. Here, we propose a semi-automated approach, with a prototype implementation, that constructs interaction graphs to enable such behavioral studies. We construct first- and second-degree interaction graphs from Stackoverflow, a programming forum, and CNN Political Ticker, a political news blog. (An interaction-graph sketch appears after this list.)
- Classification Project in CS5604, Spring 2016. Bock, Matthew; Cantrell, Michael; Shahin, Hossameldin L. (2016-05-04). In the grand scheme of a large Information Retrieval project, the work of our team was that of performing text classification on both tweet collections and their associated webpages. In order to accomplish this task, we sought to complete three primary goals. We began by performing research to determine the best way to extract information that can be used to represent a given document. Following that, we worked to determine the best method to select features and then construct feature vectors. Our final goal was to use the information gathered previously to build an effective way to classify each document in the tweet and webpage collections. These classifiers were built with consideration of the ontology developed for the IDEAL project. To truly show the effectiveness of our work at accomplishing our intended goals, we also provide an evaluation of our methodologies. The team assigned to perform this classification work last year researched various methods and tools that could be useful in accomplishing the goals we have set forth. Last year's team developed a system that was able to accomplish similar goals to those we have set forth with a promising degree of success. Our goal for this year was to improve upon their successes using new technologies such as Apache Spark. Spark has provided us with the tools needed to build a well-optimized system capable of working with the provided small collections of tweets and webpages in a fast and efficient manner. Spark is also very scalable, and based on our results with the small collections we have confidence in the performance of our system on larger collections. Also included in this submission is our final presentation of the project as presented to the CS5604 class, professor, and GRAs. The presentation provides a high-level overview of the project requirements and our approach to them, as well as details about our implementation and evaluation. The submission also includes our source code, so that future classes can expand on the work we have done this semester.
- Classification of Arabic Documents. Elbery, Ahmed (2012-12-19). Arabic is a very rich language with complex morphology, so its structure is very different from, and more difficult than, that of other languages. It is therefore important to build an Arabic Text Classifier (ATC) to deal with this complex language. The importance of text or document classification comes from its wide variety of application domains, such as text indexing, document sorting, text filtering, and Web page categorization. Due to the immense number of Arabic documents, as well as the number of Arabic-speaking internet users, this project aims to implement an Arabic Text-Documents Classifier (ATC).
- Classification Team Project for IDEAL in CS5604, Spring 2015. Cui, Xuewen; Tao, Rongrong; Zhang, Ruide (2015-05-10). Given the tweets from the instructor and cleaned webpages from the Reducing Noise team, the planned tasks for our group were to find the best: (1) way to extract information that will be used for document representation; (2) feature selection method to construct feature vectors; and (3) way to classify each document into categories, considering the ontology developed in the IDEAL project. We have worked out an information extraction method for document representation, a feature selection method for feature vector construction, and a classification method. The categories will be associated with the documents to aid searching and browsing using Solr. Our team handles both tweets and webpages. The tweets and webpages come in the form of text files that have been produced by the Reducing Noise team. The other input is a list of the specific events that the collections are about. We are able to construct feature vectors after information extraction and feature selection using Apache Mahout. For each document, a relational version of the raw data for an appropriate feature vector is generated. We applied the Naïve Bayes classification algorithm in Apache Mahout to generate the vector file and the trained model. The feature vectors are fed into classifiers for training and testing within Mahout. However, Mahout is not able to predict class labels for new data. We therefore adopted a solution provided by Pangool.net, a low-level Java MapReduce API. This package provides a MapReduce Naïve Bayes classifier that can predict class labels for new data. After modification, the package is able to read from and write to Avro files in HDFS. The correctness of our classification algorithms, evaluated with 5-fold cross-validation, was promising. (A Naïve Bayes classification sketch appears after this list.)
- Clustering and Topic Analysis in CS 5604 Information Retrieval, Fall 2016. Bartolome, Abigail; Islam, M. D.; Vundekode, Soumya (Virginia Tech, 2016-12-08). The IDEAL (Integrated Digital Event Archiving and Library) and Global Event and Trend Archive Research (GETAR) projects aim to build a robust Information Retrieval (IR) system by retrieving tweets and webpages from social media and the World Wide Web, and indexing them so they can be easily retrieved and analyzed. The project has been divided into different segments: Classification (CLA), Collection Management (tweets, CMT, and webpages, CMW), Clustering and Topic Analysis (CTA), SOLR, and Front-End (FE). In building IR systems, documents are scored for relevance. To assist in determining a document's relevance to a query, it is useful to know which topics are associated with the documents and which other documents relate to it. We, as the CTA team, used topic analysis and clustering techniques to aid in building this IR system. Our contributions were useful in scoring which documents are most relevant to a user's query. We ran clustering and topic analysis algorithms on collections of tweets and webpages to identify the most discussed topics and grouped them into clusters along with their respective probabilities. We also labeled the topics and clusters, aiming for intuitive labels. The report and presentation cover the background, requirements, design, and implementation of our contributions to this project. We evaluated the quality of our methodologies and describe improvements and future work that could extend the project. Furthermore, we include a user manual and a developer manual to assist in any future work that may come from our efforts. (A topic-modeling sketch appears after this list.)
- Collaborative Filtering for IDEAL. Li, Tianyi; Nakate, Pranav; Song, Ziqian (2016-05-04). The students of CS5604 (Information Retrieval and Storage) have been building an Information Retrieval System based on tweet and webpage collections of the Digital Library Research Laboratory (DLRL). The students have been grouped into smaller teams, such as the Front End team, Solr team, and Collaborative Filtering team, which are building the individual subsystems of the entire project. The teams are collaborating among themselves to integrate their individual subsystems. The Collaborative Filtering (CF) team has been building a recommendation system that can recommend tweets and webpages to users based on content similarity of document pairs as well as user-pair similarity. We have finished building the recommendation system, so that when users start using the system they are recommended documents similar to those returned by their queries. As more users come in, they will also be referred to documents that similar users were interested in. (A content-similarity sketch appears after this list.)
- Collection Management for IDEAL. Ma, Yufeng; Nan, Dong (2016-05-04). The collection management portion of the information retrieval system has three major tasks. The first task is to perform incremental update of the new data flowing from the tweet MySQL database to HDFS and then to HBase. Secondly, the raw tweets coming into HBase must be cleaned: duplicated URLs should be discarded, and noise reduction must be conducted. Finally, for the cleaned tweets and webpages, we should do Named Entity Recognition (NER), extracting information such as person, organization, and location names. First, based on the existing data flow from the tweet MySQL database to HBase in the IDEAL system, we developed a Sqoop script to import new tweets from MySQL to HDFS. Then a Pig script is run to transfer them into HBase. Afterwards, for raw tweets in HBase, we run a noise reduction module to remove non-ASCII characters and to extract hashtags, mentions, and URLs from the tweet text. Similar procedures were also performed for raw webpage records provided by the GRAs for this project. All the cleaned data for the 6 small collections have been uploaded into HBase with pre-defined schemas documented in this report, so the other teams, such as classification and clustering, can consume our cleaned data. Beyond what has been done so far, it is desirable to do NER, which tries to extract structured information such as person, organization, and location from unstructured text; due to time limitations, this must be relegated to future work. Also needed is automating the webpage crawling and cleaning processes, which are essential after incremental update: expand the URLs extracted from tweets in HBase, remove invalid URLs, crawl the corresponding webpages, and store the extracted useful information back into HBase. (A tweet-cleaning sketch appears after this list.)
- Collection Management of Electronic Theses and Dissertations (CME), CS5604 Fall 2019. Kaushal, Kulendra Kumar; Kulkarni, Rutwik; Sumant, Aarohi; Wang, Chaoran; Yuan, Chenhan; Yuan, Liling (Virginia Tech, 2019-12-23). The class "CS 5604: Information Storage and Retrieval" in the fall of 2019 was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from Virginia Tech University Libraries' VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing of a custom front-end, Kibana integration, implementation, text analytics, and machine learning. The results of our work will help future researchers study the natural-language-processed data using deep learning technologies, address the challenges of extracting information from ETDs, and more. The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used advanced deep learning and other tools like GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; doing document segmentation; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. We finally developed a system that automates all the above-mentioned tasks. The system also validates the output metadata, thereby ensuring the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch. (A GROBID extraction sketch appears after this list.)
- Collection Management Tobacco Settlement Documents (CMT), CS5604 Fall 2019. Muhundan, Sushmethaa; Bendelac, Alon; Zhao, Yan; Svetovidov, Andrei; Biswas, Debasmita; Marin Thomas, Ashin (Virginia Tech, 2019-12-11). Consumption of tobacco causes health issues, both mental and physical. Despite this widely known fact, tobacco companies have sustained a huge presence in the market over the past century owing to a variety of successful marketing strategies. This report documents the work of the Collection Management Tobacco Settlement Documents (CMT) team, the data ingestion team for the tobacco documents. We deal with an archive of tobacco documents that were produced during litigation between the United States and seven major tobacco industry organizations. Our aim is to process these documents and assist Dr. David M. Townsend, an assistant professor at the Virginia Polytechnic Institute and State University (Virginia Tech) Pamplin College of Business, in his research toward understanding the marketing strategies of the tobacco companies. The team is part of a larger initiative: to build a state-of-the-art information retrieval and analysis system. We handle over 14 million tobacco settlement documents as part of this project. Our tasks include extracting the data as well as metadata from these documents. We cater to the needs of the ElasticSearch (ELS) team and the Text Analytics and Machine Learning (TML) team, providing tobacco settlement data in suitable formats to enable them to process and feed the data into the information retrieval system. We have successfully processed both the metadata and the document texts into a usable format. For the metadata, this involved collaborating with the above-mentioned teams to agree on a suitable format; we retrieved the metadata from a MySQL database and converted it into JSON for Elasticsearch ingestion. For the data, this involved lemmatization, tokenization, and text cleaning. We have supplied the entire dataset to the ELS and TML teams; both the data and the metadata of these documents were cleaned and provided. Python scripts were used to query the database and output the results in the required format. We also interacted closely with Dr. Townsend to understand his research needs in order to guide the Front-end and Kibana (FEK) team with insights about features that can be used for visualizations. This way, the information retrieval system we build will add more value for our client. (A MySQL-to-JSON sketch appears after this list.)
- Collection Management Tweets Project, Fall 2017. Khaghani, Farnaz; Zeng, Junkai; Bhuiyan, Momen; Tabassum, Anika; Bandyopadhyay, Payel (Virginia Tech, 2018-01-17). The report included in this submission documents the work of the Collection Management Tweets (CMT) team, part of the larger effort in CS5604 to build a state-of-the-art information retrieval and analysis system for the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects. The mission of the CMT team had two parts: 1) cleaning 6.2 million tweets from two 2017 event collections, "Solar Eclipse" and "Las Vegas Shooting", and loading them into HBase, an open-source, non-relational, distributed database that runs on the Hadoop distributed file system, in support of further use; and 2) building and storing a social network for the tweet data using a triple-store. For the first part, our work included: A) making use of the work done by the previous year's class group, where incremental update was done, to enable a faster development process for data collection and storage; B) improving the performance of the work done by last year's group. Previously, the cleaning steps, e.g., removing profanity plus extracting hashtags and mentions, utilized Python, which becomes very slow when the dataset scales up. We introduced parallelization into our tweet cleaning process with the help of Scala and the Hadoop cluster, and made use of different Natural Language Processing libraries for stop word and profanity removal; C) along with tweet cleaning, we also identified and stored Named Entity Recognition (NER) entries and Part-of-Speech (POS) tags with the tweets, which was not done by the previous team. The cleaned data in HBase from this task is provided to the Classification team for spam detection and to the Clustering and Topic Analysis team for topic analysis. The Collection Management Webpages team uses the URLs extracted from the tweets for further processing. Finally, after the data is indexed by the SOLR team, the Front-End team visualizes the tweets for users and provides access for searching and browsing. In addition to the aforementioned tasks, our responsibilities also included building a network of tweets. This entailed research into the types of database that are appropriate for this graph. For storing the network, we used a triple-store database to record different types of edges and relationships in the graph. We also researched methods for ascribing importance to nodes and edges in our social networks once they were constructed, and analyzed our networks using these techniques. (A mention-network triples sketch appears after this list.)
- Collection Management Webpages. Eagan, Mackenzie; Liang, Xiao; Michael, Louis; Patil, Supritha (Virginia Polytechnic Institute and State University, 2017-12-25). The Collection Management Webpages team is responsible for collecting, processing, and storing webpages from different sources. Our team worked on familiarizing ourselves with the necessary tools and data required to produce the specified output used by other teams in this class (Fall 2017 CS 5604). Input includes URLs generated by the Event Focused Crawler (EFC), URLs obtained from tweets by the Collection Management Tweets team, and webpage content from Web Archive (WARC) files from the Internet Archive or other sources. Our team fetches raw HTML from the obtained URLs and extracts HTML from WARC files. From this raw data, we obtain metadata information about the corresponding webpage. The raw data is also cleaned and processed for other teams' consumption. This processing is accomplished using various Python libraries. The cleaned information is made available in a variety of formats, including tokens, stemmed or lemmatized text, and text tagged with parts of speech. Both the raw and processed webpage data are stored in HBase, and intermediately in HDFS (Hadoop Distributed File System). Our team successfully executed all individual portions of our proposed process. We successfully ran the EFC and obtained URLs from these runs. Using these URLs, we created WARC files. We obtained the raw HTML, extracted metadata information from it, and cleaned and processed the webpage information before uploading it to HBase. We iteratively expanded the functionality of our cleaning and processing scripts in order to provide more relevant information to other groups. We processed and cleaned information from WARC files provided by the instructor in a similar manner. We have acquired webpage data from URLs obtained by the Collection Management Tweets (CMT) team; at this time, however, there is no end-to-end process in place. Due to the volume of data our team has been dealing with, we explored various methods for parallelizing and speeding up our processes. Our team used the PySpark library for obtaining information from URLs and the multiprocessing library in Python for processing information stored in WARC files. (A WARC extraction sketch appears after this list.)
- Collection Management Webpages - Fall 2016 CS5604. Dao, Tung; Wakeley, Christopher; Weigang, Liu (Virginia Tech, 2017-03-23). The Collection Management Webpages (CMW) team is responsible for collecting, processing, and storing webpages from different sources, including tweets from multiple collections and contributors, such as those related to events and trends studied in local projects like IDEAL/GETAR, and webpage archives collected by Pranav Nakate, Mohamed Farag, and others. Thus, based on webpage sources, we divide our work into the three following deliverable and manageable tasks. The first task is to fetch the webpages mentioned in the tweets that are collected by the Collection Management Tweets (CMT) team. Those webpages are then stored in WARC files, processed, and loaded into HBase. The second task is to run focused crawls for all of the events mentioned in IDEAL/GETAR to collect relevant webpages; similar to the first task, we then store the webpages in WARC files, process them, and load them into HBase. The third task is similar to the first two, except that the webpages come from archives collected by the people previously involved in the project. Since these tasks are time-consuming and sensitive to real-time processing requirements, it is essential that our approach be incremental, meaning that webpages need to be incrementally collected, processed, and stored to HBase. We have conducted multiple experiments for the first, second, and third tasks, on our local machines as well as the cluster. For the second task, we manually collected a number of seed URLs for events, namely "South China Sea Disputes", "USA President Election 2016", and "South Korean President Protest", to train the focused event crawler, and then ran the trained model on a small number of URLs that were randomly generated as well as manually collected. Encouragingly, these experiments ran successfully; however, we still need to scale up the experimental data so it can be run systematically on the cluster. The two main components to be further improved and tested are the HBase data connector and handler, and the focused event crawler. While focusing on our own tasks, the CMW team works closely with other teams whose inputs and outputs depend on ours. For example, the front-end (FE) team might use our results for their front-end content. We reached agreements with the Classification (CLA) team on filtering and noise reduction tasks, and made sure that we would get URLs in the right format from the Collection Management Tweets (CMT) team. In addition, the other two teams, Clustering and Topic Analysis (CTA) and SOLR, will use our team's outputs for topic analysis and indexing, respectively. For instance, based on the SOLR team's requests and consensus, we have finalized a schema (i.e., specific fields of information) for a webpage to be collected and stored. In this final report, we report our CMW team's overall results and progress. Essentially, this report is a revised version of our three interim reports, based on Dr. Fox's and peer reviewers' comments. Beyond this revision, we also report our ongoing work, challenges, processes, evaluations, and plans.
- CS 5604 2020: Information Storage and Retrieval, TWT - Tweet Collection Management Team. Baadkar, Hitesh; Chimote, Pranav; Hicks, Megan; Juneja, Ikjot; Kusuma, Manisha; Mehta, Ujjval; Patil, Akash; Sharma, Irith (Virginia Tech, 2020-12-16). The Tweet Collection Management (TWT) Team aims to ingest 5 billion tweets, clean this data, analyze the metadata present, extract key information, classify tweets into categories, and finally, index these tweets into Elasticsearch to browse and query. The main deliverable of this project is a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects. As a starting point, we focused on two development goals: (1) hashtag-based and (2) username-based search for tweets. For IR1, we completed extraction of two fields within our sample collection: hashtags and username. Sample code for TwiRole, a user-classification program, was investigated for use in our project. We were able to sample from multiple collections of tweets, spanning topics like COVID-19 and hurricanes. Initial work encompassed using a sample collection, provided via Google Drive. An NFS-based persistent storage was later involved to allow access to larger collections. In total, we have developed 9 services to extract key information like username, hashtags, geo-location, and keywords from tweets. We have also developed services to allow for parsing and cleaning of raw API data, and backup of data in an Apache Parquet filestore. All services are Dockerized and added to the GitLab Container Registry. The services are deployed in the CS cloud cluster to integrate services into the full search engine workflow. A service is created to convert WARC files to JSON for reading archive files into the application. Unit testing of services is complete and end-to-end tests have been conducted to improve system robustness and avoid failure during deployment. The TWT team has indexed 3,200 tweets into the Elasticsearch index. Future work could involve parallelization of the extraction of metadata, an alternative feature-flag approach, advanced geo-location inference, and adoption of the DMI-TCAT format. Key deliverables include a data body that allows for search, sort, filter, and visualization of raw tweet collections and metadata analysis; a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects; and a user guide to assist those using the system. (An Elasticsearch indexing sketch appears after this list.)
- CS 5604 INFORMATION STORAGE AND RETRIEVAL Front-End Team Fall 2016 Final Report. Kohler, Rachel; Tasooji, Reza; Sullivan, Patrick (Virginia Tech, 2016-12-08). Information Retrieval systems are a common tool for building research and disseminating knowledge. For this to be possible, these systems must be able to effectively show varying amounts of relevant information to the user. The information retrieval system is in constant interaction with the user, who can modify the direction of their search as they gain more information. The front-end of the information retrieval system is where this important communication happens. As members of Dr. Fox's class on Information Storage and Retrieval, we are tasked with understanding and making progress toward answering the question: how can we best build a state-of-the-art information retrieval and analysis system in support of the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects? As the front-end design and development team, our responsibility to this project is in creating an interface for users to explore large collections of tweet and webpage data. Our goal in this research effort is to understand how users search for information and to support these efforts with an accurate and usable interface. We support various methods of searching, such as query driven searches, faceted search and browsing, and filtering of information by topic. We implemented user management and logging to support future work in recommendations. Additionally, we integrated a framework for future efforts in providing users with insightful visualizations which will allow them to explore social network and document interrelation data.
- CS 5604: Information Storage and Retrieval - Webpages (WP) Team. Barry-Straume, Jostein; Vives, Cristian; Fan, Wentao; Tan, Peng; Zhang, Shuaicheng; Hu, Yang; Wilson, Tishauna (Virginia Tech, 2020-12-18). The first major goal of this project is to build a state-of-the-art information retrieval engine for searching webpages and for opening up access to existing and new webpage collections resulting from Digital Library Research Laboratory (DLRL) projects relating to eventsarchive.org. The task of the Webpage (WP) team was to provide the functionality of making any archived webpage accessible and indexed. The webpages can be obtained either through event focused crawlers or from collections of data, such as WARC files containing webpages or sets of tweets which contain embedded URLs. Toward completion of the project, the WP team worked on four major tasks: 1) making the contents of WARC files searchable through ElasticSearch; 2) cleaning the contents of WARC files and making the cleaned text searchable through ElasticSearch; 3) running the event focused crawler and producing WARC files; and 4) making additional extracted/derived information (e.g., dates, classes) searchable. The foundation of the software is a Docker container cluster employing Airflow, a Reasoner, and Kubernetes. The raw information content of the given webpage collections is stored using the Network File System (NFS), while Ceph is used as persistent storage for the Docker containers. Retrieval and analysis of the webpage collection are carried out with ElasticSearch, and visualization with Kibana; together these form an Elastic Stack application which serves as the vehicle with which the WP team indexes, maps, and stores the processed data and model outputs for the webpage collections. The software was co-designed by 7 Virginia Tech graduate students, all members of the same computer science class, CS 5604: Information Storage and Retrieval, taught by Professor Edward A. Fox. Dr. Fox structures the class so that his students work in a "mock" business development setting; in other words, the academic project submitted by the WP team can, for all intents and purposes, be viewed as a microcosm of software development within a corporate structure. This submission focuses on the work of the WP team, which creates and administers Docker containers such that various services are tested and deployed as a whole. Said services pertain solely to the ingestion, cleansing, analysis, extraction, classification, and indexing of webpages and their respective content.
- CS5604 (Information Retrieval) Fall 2020 Front-end (FE) Team Project. Cao, Yusheng; Mazloom, Reza; Ogunleye, Makanjuola (Virginia Tech, 2020-12-16). With the demand for and abundance of information increasing over the last two decades, generations of computer scientists have been trying to improve the whole process of information searching, retrieval, and storage. With the diversification of information sources, users' requirements for the data have also changed drastically, both in terms of usability and performance. Due to the growth of the source material and of these requirements, correctly sorting, filtering, and storing data has given rise to many new challenges in the field. With the help of all four other teams on this project, we are developing an information retrieval, analysis, and storage system to retrieve data from Virginia Tech's Electronic Thesis and Dissertation (ETD), Twitter, and Web Page archives. We seek to provide users with an appropriate research and data management tool to access specific data. The system will also give certain users the authority to manage and add more data to the system. This project's deliverable will be combined with four others to produce a system usable by Virginia Tech's library system to manage, maintain, and analyze these archives. This report introduces the system components and the design decisions behind how the system has been planned and implemented. Our team has developed a front-end web interface that is able to search, retrieve, and manage three important content collection types: ETDs, tweets, and web pages. The interface incorporates a simple hierarchical user permission system, providing different levels of access to its users. In order to facilitate the workflow with other teams, we have containerized this system and made it available on the Virginia Tech cloud server. The system also makes use of a dynamic workflow system using a KnowledgeGraph and Apache Airflow, providing high levels of functional extensibility. This allows curators and researchers to use containerized services for crawling, pre-processing, parsing, and indexing the custom corpora and collections that are available to them in the system.
- CS5604 Fall 2016 Classification Team Final Report. Williamson, Eric R.; Chakravarty, Saurabh (Virginia Tech, 2016-12-08). Content is generated on the Web at an exponential rate. The type of content varies from text on a traditional webpage to text on social media portals (e.g., social network sites and microblogs). One such example of social media is the microblogging site Twitter. Twitter is known for its high level of activity during live events, natural disasters, and events of global importance. Improving text classification results on Twitter data would pave the way to categorizing tweets into human-defined real-world events. This would allow diverse stakeholder communities to interactively collect, organize, browse, visualize, analyze, summarize, and explore content and sources related to crises, disasters, human rights, inequality, population growth, resiliency, shootings, sustainability, violence, etc. Challenges with the data in the Twitter universe include that the tweet length is limited to 140 characters. Because of this limitation, the vocabulary in the Twitter universe has taken its own form of short abbreviations of sentences, emojis, hashtags, and other non-standard usage of written language. Consequently, traditional text classification techniques are not effective on tweets. Sophisticated text processing techniques like cleaning, lemmatizing, and removal of stop words and special characters give us clean text which can be further processed to derive richer word semantic and syntactic relationships using state-of-the-art feature selection techniques like Word2Vec. Machine learning techniques using word features that capture semantic and context relationships have been shown to give state-of-the-art classification accuracy. To check the efficacy of our classifier, we compare our experimental results with an association rules (AR) classifier. This classifier composes its rules around the most discriminating words in the training data. The hierarchy of rules, along with the ability to tune the support threshold, makes it an effective classifier for scenarios involving short text. We developed a system in which we read the tweets from HBase and write the classification label back after the classification step. We use domain-oriented pre-processing on the tweets, and Word2Vec as the feature selection and transformation technique. We use a multi-class Logistic Regression algorithm for our classifier. We are able to achieve an F1 score of 0.96 when classifying a test set of 320 tweets across 9 classes; the AR classifier achieved an F1 score of 0.90 with the same data. Our system can classify collections of any size by utilizing a 20-node Hadoop cluster in a parallel fashion, through Spark. Our experiments suggest that the high accuracy score for our classifier can be primarily attributed to the pre-processing and feature selection techniques that we used. Understanding the Twitter universe vocabulary helped us frame the text cleaning and pre-processing rules used to eliminate noise from the text. The Word2Vec feature selection technique helps us capture word contexts in a low-dimensional feature space, which results in high classification accuracy and low model training time. Utilizing the Spark framework to execute our classification pipeline in a distributed fashion allows us to classify large collections without running into out-of-memory exceptions. (A Spark classification pipeline sketch appears after this list.)
- CS5604 Fall 2016 Solr Team Project Report. Li, Liuqing; Pillai, Anusha; Wang, Ye; Tian, Ke (Virginia Tech, 2016-12-07). This submission describes the work the SOLR team completed in Fall 2016. It includes the final report and presentation, as well as key relevant materials (indexing scripts and Java code). Building on the work from Spring 2016, the SOLR team improved the general search infrastructure supporting the IDEAL and GETAR projects, both funded by NSF. The main responsibility was to configure Basic Indexing and Incremental Indexing (Near Real Time, or NRT, Indexing) for the tweet and webpage collections in DLRL's Hadoop cluster. The goal of Basic Indexing was to index the big collection that contains more than 1.2 billion tweets. The idea of NRT Indexing was to monitor real-time changes in HBase and update the Solr results as appropriate. The main motivation behind the Custom Ranking was to design and implement a new scoring function to re-rank the retrieved results in Solr. Based on text similarity, a basic document recommender was also created to retrieve documents similar to a given one. Finally, new, well-written manuals make it easier for users and developers to read about and become familiar with Solr and its related tools. Throughout the semester we closely collaborated with the Collection Management Tweets (CMT), Collection Management Webpages (CMW), Classification (CLA), Clustering and Topic Analysis (CTA), and Front-End (FE) teams in getting requirements, input data, and suggestions for data visualization. (A Solr indexing sketch appears after this list.)
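The sketches below relate to several of the abstracts in this list. Each is a minimal, hedged illustration of a technique named in the corresponding abstract, not the team's actual code, and any concrete names (files, tables, indexes, URLs) are assumptions. First, for the CINETGraphCrawl abstract: building a first-degree interaction graph and expanding it to second degree with networkx, over invented reply pairs standing in for the crawled forum and blog data.

```python
# Minimal sketch: first- and second-degree interaction graphs with networkx.
# The reply pairs below are hypothetical stand-ins for crawled forum/blog data.
import networkx as nx

replies = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("eve", "alice")]

G = nx.Graph()
G.add_edges_from(replies)          # edge = one user interacted with another

seed = "alice"
first_degree = set(G.neighbors(seed))                     # direct interactions
second_degree = {v for u in first_degree for v in G.neighbors(u)} - first_degree - {seed}

print("1st degree:", first_degree)   # {'bob', 'eve'}
print("2nd degree:", second_degree)  # {'carol'}
```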
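For the Spring 2015 Classification Team abstract: the team ran Naïve Bayes at scale with Mahout and Pangool on Hadoop. As a single-machine illustration of the same idea (feature vectors fed to a Naïve Bayes classifier, scored with 5-fold cross-validation), a scikit-learn sketch over an invented labeled corpus might look like this.

```python
# Sketch: Naive Bayes text classification with 5-fold cross-validation.
# The tiny labeled corpus is invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

docs = ["flood waters rising downtown", "relief supplies arrive after the flood",
        "volunteers fill sandbags against the flood", "river flood warning issued tonight",
        "homes evacuated as flood spreads",
        "earthquake shakes the city center", "aftershocks follow the strong earthquake",
        "earthquake damage assessment begins", "buildings inspected after the earthquake",
        "tremors from the earthquake felt downtown"]
labels = ["flood"] * 5 + ["earthquake"] * 5

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(clf, docs, labels, cv=5, scoring="f1_macro")
print("mean macro-F1:", scores.mean())
```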
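For the Fall 2016 Clustering and Topic Analysis abstract: the abstract does not name a specific library, so this sketch assumes scikit-learn's LDA to extract topics and their top words from a few invented documents; the per-document topic distribution plays the role of the "respective probabilities" mentioned above.

```python
# Sketch: topic analysis with Latent Dirichlet Allocation (LDA) on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["hurricane makes landfall on the coast",
        "storm surge floods the coastal town",
        "election results announced tonight",
        "voters head to the polls for the election",
        "hurricane relief donations pour in"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}:", top)

print(lda.transform(X))   # per-document topic probabilities
```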
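For the Collaborative Filtering abstract: a minimal sketch of the content-similarity half of such a recommender, using TF-IDF cosine similarity over invented documents to recommend items similar to one a user just viewed; the user-pair similarity half would follow the same pattern over user interaction vectors.

```python
# Sketch: content-based recommendation via TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["wildfire spreads through the national park",
        "firefighters contain the wildfire overnight",
        "city council debates the new budget",
        "park reopens after wildfire containment"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
sims = cosine_similarity(tfidf)

query_doc = 0                                   # document the user just viewed
ranked = sims[query_doc].argsort()[::-1][1:3]   # top-2 most similar other documents
print("recommend:", [docs[i] for i in ranked])
```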
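For the Collection Management for IDEAL abstract: a minimal sketch of the noise-reduction step it describes (removing non-ASCII characters and extracting hashtags, mentions, and URLs from tweet text); the team's actual module ran over HBase rows, which is omitted here.

```python
# Sketch: clean one tweet by stripping non-ASCII characters and pulling out
# hashtags, mentions, and URLs, as described in the abstract above.
import re

def clean_tweet(text):
    hashtags = re.findall(r"#\w+", text)
    mentions = re.findall(r"@\w+", text)
    urls = re.findall(r"https?://\S+", text)
    cleaned = text.encode("ascii", errors="ignore").decode("ascii")   # drop non-ASCII
    cleaned = re.sub(r"https?://\S+", "", cleaned).strip()            # drop URLs from body
    return {"clean_text": cleaned, "hashtags": hashtags, "mentions": mentions, "urls": urls}

print(clean_tweet("Stay safe @bob \U0001F327 updates at https://example.org #flood"))
```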
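For the ETD Collection Management (CME) abstract: the report mentions GROBID for metadata and text extraction. Assuming a locally running GROBID service on its default port, a call to its full-text endpoint could look like the following (the PDF file name is hypothetical).

```python
# Sketch: send one ETD PDF to a local GROBID service and get TEI XML back.
# Assumes GROBID is running at localhost:8070; the file name is hypothetical.
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

with open("sample_thesis.pdf", "rb") as pdf:
    resp = requests.post(GROBID_URL, files={"input": pdf}, timeout=120)

resp.raise_for_status()
tei_xml = resp.text   # TEI-encoded full text and metadata, ready for further parsing
print(tei_xml[:200])
```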
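For the Tobacco Settlement (CMT) abstract: a minimal sketch of querying a MySQL metadata table and writing one JSON document per row for Elasticsearch ingestion; the connection settings, table, and column names are assumptions, not the team's actual schema.

```python
# Sketch: pull document metadata from MySQL and emit newline-delimited JSON
# suitable for Elasticsearch ingestion. Credentials/table/columns are hypothetical.
import json
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="reader",
                               password="secret", database="tobacco")
cur = conn.cursor(dictionary=True)
cur.execute("SELECT doc_id, title, author, doc_date FROM documents LIMIT 1000")

with open("metadata.jsonl", "w", encoding="utf-8") as out:
    for row in cur:
        row["doc_date"] = str(row["doc_date"])   # dates are not JSON-serializable as-is
        out.write(json.dumps(row) + "\n")

cur.close()
conn.close()
```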
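For the Fall 2017 Collection Management Tweets abstract: the team stored a tweet social network in a triple-store. As one simple illustration (not their actual schema), mention edges can be emitted as N-Triples using a made-up namespace.

```python
# Sketch: turn @-mentions into subject-predicate-object triples (N-Triples syntax)
# for loading into a triple store. The namespace and tweets are invented.
import re

BASE = "http://example.org/twitter/"

def mention_triples(tweets):
    for t in tweets:
        subj = f"<{BASE}user/{t['user']}>"
        for handle in re.findall(r"@(\w+)", t["text"]):
            yield f"{subj} <{BASE}mentions> <{BASE}user/{handle}> ."

tweets = [{"user": "alice", "text": "Thoughts with everyone @bob @carol #LasVegasShooting"}]
print("\n".join(mention_triples(tweets)))
```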
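For the Fall 2017 Collection Management Webpages abstract: the report says webpage text was extracted from WARC files with "various Python libraries" without naming them. Assuming warcio and BeautifulSoup as stand-ins, the extraction step might look like this (the WARC file name is hypothetical).

```python
# Sketch: iterate a WARC file, keep HTTP response records, and extract the
# page URL plus visible text. warcio/BeautifulSoup are assumed choices.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_pages(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
            yield url, text

for url, text in extract_pages("collection.warc.gz"):   # hypothetical file name
    print(url, text[:80])
```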
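For the 2020 Tweet Collection Management (TWT) abstract: a minimal sketch of the final indexing step, bulk-loading cleaned tweet documents into an Elasticsearch index with the official Python client; the index name, fields, and cluster URL are assumptions.

```python
# Sketch: bulk-index cleaned tweets into Elasticsearch. Index name, fields,
# and the cluster address are hypothetical.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

tweets = [
    {"id": "1", "username": "alice", "hashtags": ["covid19"], "text": "stay home, stay safe"},
    {"id": "2", "username": "bob", "hashtags": ["hurricane"], "text": "storm making landfall"},
]

actions = ({"_index": "tweets", "_id": t["id"], "_source": t} for t in tweets)
ok, _ = bulk(es, actions)
print(f"indexed {ok} documents")
```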
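For the Fall 2016 Classification Team abstract: a condensed sketch of the pipeline it describes (Word2Vec features feeding a multi-class Logistic Regression in Spark), with a toy in-memory DataFrame instead of the team's HBase-backed tweet collection.

```python
# Sketch: Spark ML pipeline with Word2Vec features and multinomial Logistic Regression,
# mirroring the approach described above on a toy DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, Word2Vec
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("tweet-classification-sketch").getOrCreate()

train = spark.createDataFrame([
    ("flood waters rising fast downtown", 0.0),
    ("earthquake aftershocks reported", 1.0),
    ("river flood warning issued", 0.0),
    ("strong earthquake shakes the coast", 1.0),
], ["text", "label"])

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
    Word2Vec(vectorSize=50, minCount=1, inputCol="tokens", outputCol="features"),
    LogisticRegression(maxIter=50, family="multinomial"),
])

model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
```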
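For the Fall 2016 SOLR team abstract: the team's indexing scripts targeted DLRL's Hadoop cluster; as a small stand-in, this sketch adds documents to a local Solr core and runs a query with the pysolr client, where the core name and field names are assumptions.

```python
# Sketch: index a couple of documents into a Solr core and query them back.
# The Solr URL, core name, and field names are hypothetical.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/tweets", always_commit=True, timeout=10)

solr.add([
    {"id": "tweet-1", "text_t": "hurricane relief efforts underway", "collection_s": "hurricane"},
    {"id": "tweet-2", "text_t": "solar eclipse viewing parties tonight", "collection_s": "eclipse"},
])

results = solr.search("text_t:hurricane", rows=5)
for doc in results:
    print(doc["id"], doc["text_t"])
```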