CS5604: Information Retrieval
Permanent URI for this collection
This collection contains the final projects of students in various offerings of the course Computer Science 5604: Information Retrieval, taught by Professor Ed Fox.
Analyzing, indexing, representing, storing, searching, retrieving, processing and presenting information and documents using fully automatic systems. The information may be in the form of text, hypertext, multimedia, or hypermedia. The systems are based on various models, e.g., Boolean logic, fuzzy logic, probability theory, etc., and they are implemented using inverted files, relational thesauri, special hardware, and other approaches. Evaluation of the systems' efficiency and effectiveness.
Browsing CS5604: Information Retrieval by Issue Date
Now showing 1 - 20 of 57
- Leveraging eXist-db for Efficient TEI Document Management. Schutt, Kyle; Morgan, Kyle (2012-12-10). Professor David Radcliffe has created Lord Byron and his Times (LBT), a large digital archive of works surrounding Lord Byron and his contemporaries. The original website was unusably slow due to the expensive XSLT transformations and XPath queries being applied to TEI documents. By removing the reliance on XSL, using eXist-db and XQuery, and relying on ubiquitous and well-documented CSS for client-side styling, we have been able to increase performance dramatically without sacrificing features or functionality. In this paper, Section 1 gives an overview of the project, including challenges, potential solutions, and performance metrics of the original system. Section 2 contains a user's manual detailing differences between the old and proposed systems. Section 3 contains a developer's manual with overviews of the various technologies used in the system designed by Professor Radcliffe. The fourth section describes technologies relevant to the proposed system. Finally, documentation and installation instructions are given in the fifth section. The rest of the paper contains a VTechWorks inventory, contacts for everyone involved with the LBT project, and references.
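A minimal sketch of querying an eXist-db collection with XQuery over its REST interface from Python. The host, collection path, and TEI query are placeholders for illustration, not the actual LBT configuration.

```python
import requests

# Hypothetical eXist-db REST endpoint and TEI collection path (illustrative only).
EXIST_REST = "http://localhost:8080/exist/rest/db/lbt"

# A minimal XQuery that returns the titles of TEI documents in the collection.
xquery = """
declare namespace tei = "http://www.tei-c.org/ns/1.0";
for $doc in collection('/db/lbt')//tei:TEI
return <title>{ $doc//tei:titleStmt/tei:title/text() }</title>
"""

# eXist-db's REST server accepts ad hoc queries via the _query parameter.
resp = requests.get(EXIST_REST, params={"_query": xquery}, timeout=30)
resp.raise_for_status()
print(resp.text)  # XML fragment wrapping the matching <title> elements
```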
- CINET GDS-Calculator: Graph Dynamical Systems Visualization. Wu, Sichao; Zhang, Yao (2012-12-10). This report summarizes the Graph Dynamical Systems Visualization project, a subproject under the umbrella of the CINET project. Based on the input information, we extract characteristics of the system dynamics and output corresponding diagrams and charts that reflect the dynamical properties of the system, providing an intuitive and easy way for researchers to analyze GDSs. In the introduction section, some background information about graph dynamical systems and their applications is given. Then, we present the requirements analysis, including the task of the project as well as the challenges we met. Next, the report records the system development process, including the workflow, user's manual, developer's manual, etc. Finally, we summarize the future work. This report can serve as a combination of user's manual, developer's manual, and system manual.
- Large Scale Network Visualization with Gephi. Alam, Maksudul; Arifuzzaman, S. M.; Bhuiyan, Md Hasanuzzaman (2012-12-11). The notion of graphs or networks is pervasive, since they can be used to model many types of data sources. Social, biological, and other networks capture underlying structural and relational properties. Analysis of different networks reveals interesting information about the corresponding domain or system. Network analysts thus strive to analyze various networks by applying different algorithms and try to connect the insights obtained into a unified theme, pattern, or structure. For example, analysis of a person's Facebook friend network can reveal information such as groups of highly clustered people, the most influential people in terms of connections, and the people who connect different clusters. While analyzing networks and digesting the information therein, analysts gradually form internal mental models of the people, places, events, or any other sort of entity represented in the networks. As the number of nodes grows larger, however, it becomes increasingly difficult for an investigator to track the connections between data and make sense of it all. Many researchers believe that visual representations of data can help people better examine, analyze, and understand them. Norman [Norman94] has described how visual representations can help augment people's thinking and analysis processes. The objective of the project is to develop visual representations of the nodes, edges, and labels of a network in order to help analysts search, review, and understand the network better. We seek to create interactive visualizations that highlight and identify the significance of nodes, cluster formation, etc., in networks where entities may be, for example, people, places, webpages, biological entities, dates, or organizations. Basically, we want to build visual representations of the networks that help analysts make sense of them by applying different algorithms and observing differences among nodes and edges in terms of color and size. A very important aspect of the project is the integration of the visualization module with CINET [CINET2012], a cyberinfrastructure for network science. CINET includes a set of graph algorithms and various types of networks. Analysis of networks is done by applying algorithms to those networks; results are obtained as text files containing different measures of nodes or edges. Complex workflows are intended while working with CINET, where the output of one analysis can be used as input to further analysis. Visualization is a great aid when an analyst wants to focus on particular nodes or a portion of the graph and conduct subsequent analysis on that smaller part. Though there are some existing visualization tools, e.g., Jigsaw [Jigsaw08], Sentinel Visualizer, NetLens, etc., they focus more on information representation than on graph exploration or summarization capabilities. To the best of our knowledge, our project is the only one that supports network visualization as part of a complex network analysis workflow utilizing a high-performance computing environment. In summary, this project develops a visualization component for a VT digital library containing large network graphs (e.g., social networks and transportation networks). The visualization service will get datasets from an existing DL, visualize the graphs using Gephi (a Java-based visualization library), and integrate the results within an NSF-supported cyberinfrastructure (CINET).
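Since Gephi imports GEXF files natively, one simple way to hand a network from a Python preprocessing step to Gephi is to build the graph with networkx and export it. This is an illustrative sketch under invented node and edge data, not the project's CINET integration code.

```python
import networkx as nx

# Build a small example network; in practice the nodes and edges would
# come from a dataset retrieved from the digital library.
G = nx.Graph()
G.add_edge("alice", "bob", weight=3)
G.add_edge("bob", "carol", weight=1)
G.add_edge("alice", "carol", weight=2)

# Attach a simple centrality measure as a node attribute so Gephi can
# size or color nodes by it after import.
nx.set_node_attributes(G, nx.degree_centrality(G), "degree_centrality")

# GEXF is one of the formats Gephi reads directly.
nx.write_gexf(G, "example_network.gexf")
```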
- Analyzing and Visualizing Disaster Phases from Social Media Streams. Lin, Xiao; Chen, Liangzhe; Wood, Andrew (2012-12-11). Working under the direction of CTRNet, we developed a procedure for classifying Twitter data related to natural/man-made disasters into one of the Four Phases of Emergency Management (response, recovery, mitigation, and preparedness), as well as a multi-view system for visualizing the resulting data.
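A minimal sketch of the kind of four-phase tweet classification described in this entry, using scikit-learn as a stand-in for whatever tooling the authors actually used; the training examples and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled tweets; real training data would be annotated disaster tweets.
tweets = [
    "Rescue teams are pulling people from flooded homes",
    "Rebuilding the bridge will take months after the storm",
    "New levees are being built to reduce future flood damage",
    "Residents urged to stock emergency kits before hurricane season",
]
phases = ["response", "recovery", "mitigation", "preparedness"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tweets, phases)

print(model.predict(["Volunteers distributing water to evacuees"]))
```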
- CINETGraphCrawl - Constructing graphs from blogs. Kaw, Rushi; Subbiah, Rajesh; Makkapati, Hemanth (2012-12-11). Internet forums, weblogs, social networks, and photo and video sharing websites are some forms of social media that are at the forefront of enabling communication among individuals. The rich information captured in social media has enabled a variety of behavioral research assisting domains such as marketing, finance, public health, and governance. Furthermore, social media is believed to be capable of providing valuable insights into information diffusion phenomena such as social influence, opinion formation, and rumor spread. Here, we propose a semi-automated approach, with a prototype implementation, that constructs interaction graphs to enable such behavioral studies. We construct first- and second-degree interaction graphs from Stackoverflow, a programming forum, and the CNN Political Ticker, a political news blog.
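An interaction graph of the kind described in this entry can be sketched with networkx: users become nodes, and a reply from one user to another becomes a weighted edge. The interaction data here is fabricated; the real pipeline crawls the forum or blog.

```python
import networkx as nx

# (replier, original_poster) pairs, e.g., answers or comments on a post.
interactions = [("u1", "u2"), ("u1", "u2"), ("u3", "u2"), ("u2", "u4")]

G = nx.DiGraph()
for replier, poster in interactions:
    if G.has_edge(replier, poster):
        G[replier][poster]["weight"] += 1  # repeated interactions raise the weight
    else:
        G.add_edge(replier, poster, weight=1)

# First-degree neighborhood of u2; a second-degree graph would repeat the expansion.
first_degree = set(G.predecessors("u2")) | set(G.successors("u2"))
print(sorted(first_degree))
```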
- Focused Crawling. Farag, Mohamed Magdy Gharib; Khan, Mohammed Saquib Akmal; Mishra, Gaurav; Ganesh, Prasad Krishnamurthi (2012-12-11). Finding information on the WWW is a difficult and challenging task because of its extremely large volume. Search engines can be used to facilitate this task, but it is still difficult to cover all the webpages on the WWW and to provide good results for all types of users and in all contexts. The concept of focused crawling has been developed to overcome these difficulties. There are several approaches for developing a focused crawler. Classification-based approaches use classifiers for relevance estimation. Semantic-based approaches use ontologies for domain or topic representation and for relevance estimation. Link analysis approaches use text and link structure information for relevance estimation. The main differences between these approaches are the policy taken for crawling, how the topic of interest is represented, and how the relevance of webpages visited during crawling is estimated. In this report we present a modular architecture for focused crawling. We separated the design of the main components of focused crawling into modules to facilitate the exchange and integration of different modules. We also present a classification-based focused crawler prototype based on our modular architecture.
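The skeleton of a classification-based focused crawler, sketched under simplifying assumptions: a priority queue ordered by estimated relevance and a pluggable relevance estimator. The seed URL and the keyword-overlap estimator are placeholders; a real crawler would use a trained classifier and respect robots.txt and politeness delays.

```python
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

TOPIC_TERMS = {"earthquake", "magnitude", "aftershock"}  # placeholder topic model


def relevance(text):
    """Crude stand-in for a trained classifier: fraction of topic terms present."""
    words = set(text.lower().split())
    return len(TOPIC_TERMS & words) / len(TOPIC_TERMS)


def focused_crawl(seed, max_pages=20, threshold=0.3):
    frontier = [(-1.0, seed)]          # max-heap via negated scores
    visited = set()
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        score = relevance(soup.get_text(" "))
        if score < threshold:
            continue                   # do not expand pages judged irrelevant
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return visited


# focused_crawl("https://example.com/")  # illustrative seed only
```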
- ProjOpenDSA - OpenDSA Log Support. Wei, Shiyi; Suwardiman, Victoria; Swaminathan, Anand (2012-12-11). The OpenDSA project is an online eTextbook project that includes not only literature but also other dynamic content to be used in Data Structures and Algorithms courses. OpenDSA contains exercises of various types to go along with the literature in order to provide automated self-assessment for students. The research team seeks to collect and log data regarding student interactions with these exercises, recording both the students' performance, such as scores, and their interaction with the system, such as timestamps for button clicks. To extend the current OpenDSA project, we provide visualizations of the log data in ways meaningful and helpful to all users of the system. The OpenDSA Log Support Project, as we have called it, is designed to analyze the log data and provide views for the instructors who teach the course, the students who take the course, and the developers who designed and are continually working on improving the system. Taking the various forms of log data collected from the students in the DSA course of the Fall 2012 semester, we developed three views: a teacher view, a student view, and a developer view. Each view displays the information that is most useful to its user; for example, a comprehensive table of all students, their scores, and their status in each exercise is the data a teacher will be most interested in seeing. We developed our views using the Django web framework that the OpenDSA research team is currently using, pulling our data from the database to which all of the data gets logged. Using this data, we then created online views accessible to those with accounts, namely the instructor, students, and developers. Some challenges we ran into concerned displaying the data in our views and the performance of doing so; because of the amount of data logged, it proved difficult to find efficient and readable ways to analyze and display the data. Though some solutions have been found, because this project is ongoing, future work includes optimizing each view, improving the display of each view, and adding additional views for each user.
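A minimal sketch of how one of the described views might be exposed with Django: a view function that aggregates per-student exercise scores from a hypothetical log model and returns them as JSON. The model and field names are invented; the actual OpenDSA schema differs.

```python
# views.py (illustrative; assumes a hypothetical ExerciseLog model with
# fields student, exercise, and score)
from django.db.models import Max
from django.http import JsonResponse

from .models import ExerciseLog  # hypothetical log model


def teacher_view(request):
    """Return each student's best score per exercise, for the teacher's table."""
    rows = (
        ExerciseLog.objects
        .values("student", "exercise")
        .annotate(best_score=Max("score"))
        .order_by("student", "exercise")
    )
    return JsonResponse({"scores": list(rows)})
```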
- Classification of Arabic Documents. Elbery, Ahmed (2012-12-19). Arabic is a very rich language with complex morphology, so its structure is very different from, and more difficult than, that of other languages. It is therefore important to build an Arabic Text Classifier (ATC) to deal with this complex language. The importance of text or document classification comes from its wide variety of application domains, such as text indexing, document sorting, text filtering, and Web page categorization. Due to the immense number of Arabic documents, as well as the number of Arabic-speaking internet users, this project aims to implement an Arabic Text-Documents Classifier (ATC).
- Named Entity Recognition for IDEAL. Du, Qianzhou; Zhang, Xuan (2015-05-10). The term "Named Entity", which was first introduced by Grishman and Sundheim, is widely used in Natural Language Processing (NLP). Those researchers were focusing on the information extraction task, that is, extracting structured information about company activities and defense-related activities from unstructured text, such as newspaper articles. The essential part of "Named Entity" work is to recognize information elements, such as location, person, organization, time, date, money, and percent expressions. To identify these entities in unstructured text, researchers call this sub-task of information extraction "Named Entity Recognition" (NER). NER technology has now matured, and there are good tools for implementing this task, such as the Stanford Named Entity Recognizer (SNER), Illinois Named Entity Tagger (INET), Alias-i LingPipe (LIPI), and OpenCalais (OCWS). Each of these has advantages and is designed for particular kinds of data. In this term project, our final goal is to build a NER module for the IDEAL project based on a particular NER tool, such as SNER, and to apply NER to the Twitter and web page data sets. This project report presents our work towards this goal, including the literature review, requirements, algorithm, development plan, system architecture, implementation, user manual, and development manual. Further, results are given for multiple collections, along with discussion and plans for the future.
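As one concrete way to run SNER from Python, the sketch below uses NLTK's StanfordNERTagger wrapper; the jar and model paths are placeholders that must point to a local Stanford NER download, and Java plus the NLTK "punkt" tokenizer data are required.

```python
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

# Placeholder paths to a locally downloaded Stanford NER release.
MODEL = "/path/to/english.all.3class.distsim.crf.ser.gz"
JAR = "/path/to/stanford-ner.jar"

tagger = StanfordNERTagger(MODEL, JAR, encoding="utf-8")

text = "Hurricane Sandy hit New York in October 2012, officials said."
tokens = word_tokenize(text)

# Each token is paired with a label such as PERSON, LOCATION, ORGANIZATION, or O.
for token, label in tagger.tag(tokens):
    if label != "O":
        print(token, label)
```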
- LDA Team Project in CS5604, Spring 2015: Extracting Topics from Tweets and Webpages for IDEAL. Pumma, Sarunya; Liu, Xiaoyang (2015-05-10). IDEAL, or Integrated Digital Event Archiving and Library, is a Virginia Tech project to implement a state-of-the-art event-based information retrieval system. A practice project of CS 5604 Information Retrieval is a part of the IDEAL project. The main objective of this project is to build a robust search engine on top of Solr, a general-purpose open-source search engine, and Hadoop, a big data processing platform. The search engine can provide documents, which are tweets and webpages, that are relevant to a query that a user provides. To enhance the performance of the search engine, the documents in the archive have been indexed by various approaches including LDA (Latent Dirichlet Allocation), NER (Named-Entity Recognition), Clustering, Classification, and Social Network Analysis. As CS 5604 is a problem-based learning class, teams are responsible for the implementation and development of solutions for each technique. In this report, the implementation of the LDA component is presented. LDA aids in extracting collections of topics from the documents. A topic in this context is a set of words that can be used to represent a document. Details of how LDA worked with both small and large collections are described. Once the implementation of the LDA component is integrated with the other processing and with Solr, we are confident that the performance of the information retrieval system of the IDEAL project will be enhanced.
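A minimal LDA example with gensim, assuming documents have already been tokenized and cleaned; this illustrates the technique itself, not the Mahout/Hadoop implementation the team used, and the toy corpus is invented.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy pre-tokenized documents; in the project these would be cleaned tweets and webpages.
docs = [
    ["flood", "water", "rescue", "river"],
    ["election", "vote", "candidate", "poll"],
    ["flood", "damage", "storm", "rain"],
    ["vote", "ballot", "election", "turnout"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)

# Each topic is a weighted set of words; each document gets a topic distribution.
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
print(lda.get_document_topics(corpus[0]))
```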
- Classification Team Project for IDEAL in CS5604, Spring 2015. Cui, Xuewen; Tao, Rongrong; Zhang, Ruide (2015-05-10). Given the tweets from the instructor and cleaned webpages from the Reducing Noise team, the planned tasks for our group were to find the best: (1) way to extract information that will be used for document representation; (2) feature selection method to construct feature vectors; and (3) way to classify each document into categories, considering the ontology developed in the IDEAL project. We have worked out an information extraction method for document representation, a feature selection method for feature vector construction, and a classification method. The categories will be associated with the documents to aid searching and browsing using Solr. Our team handles both tweets and webpages. The tweets and webpages come in the form of text files that have been produced by the Reducing Noise team. The other input is a list of the specific events that the collections are about. We are able to construct feature vectors after information extraction and feature selection using Apache Mahout. For each document, a relational version of the raw data for an appropriate feature vector is generated. We applied the Naïve Bayes classification algorithm in Apache Mahout to generate the vector file and the trained model. The feature vectors go into classifiers for training and testing with Mahout. However, Mahout is not able to predict class labels for new data. We finally came to a solution provided by Pangool.net, a low-level Java MapReduce API. This package provides a MapReduce Naïve Bayes classifier that can predict class labels for new data. After modification, this package is able to read from and write to Avro files in HDFS. The correctness of our classification algorithms, evaluated using 5-fold cross-validation, was promising.
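The overall pipeline (feature vectors, Naïve Bayes training, cross-validation) can be sketched compactly in scikit-learn; this stands in for the Mahout/Pangool MapReduce implementation the team actually used, and the toy documents and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "flood waters rising downtown",      "river bursts its banks",
    "candidate wins the primary vote",   "polls open for the election",
    "wildfire spreads across the hills", "firefighters contain the blaze",
]
labels = ["flood", "flood", "election", "election", "fire", "fire"]

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Cross-validation mirrors the evaluation described in the report;
# cv=2 here only because the toy set is tiny (the report used 5-fold).
scores = cross_val_score(pipeline, docs, labels, cv=2)
print(scores.mean())

pipeline.fit(docs, labels)
print(pipeline.predict(["heavy rain floods the subway"]))
```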
- Social Network Project for IDEAL in CS5604. Harb, Islam; Jin, Yilong; Cedeno, Vanessa; Mallampati, Sai Ravi Kiran; Bulusu, Bhaskara Srinivasa Bharadwaj (2015-05-11). The IDEAL (Integrated Digital Event Archiving and Library) project involves VT faculty, staff, and students, along with collaborators around the world, in archiving important events and integrating digital library and archiving approaches to support research and development related to important events. An objective of the CS5604 (Information Retrieval) course, Spring 2015, was to build a state-of-the-art information retrieval system in support of the IDEAL project. Students were divided into eight groups to become experts in a specific theme of high importance in the development of the tool. The identified themes were Classifying Types (Extraction and Feature Selection), Clustering, Hadoop, LDA, NER, Reducing Noise, Social Networks and Importance, and Solr and Lucene. Our goal as a class was to provide documents that were relevant to an arbitrary user query from within a collection of tweets and their referenced web pages. The goal of the Social Network and Importance group was to develop a query-independent importance methodology for these tweets and web pages based on social network type considerations. This report proposes a method to assign importance to the tweets and web pages using non-content features. We define two groups of features for the ranking: Twitter-specific features and account authority features. To determine the best set of features, an analysis of their individual effect on the output importance is also included. At the end, an "importance" value is associated with each document, to aid searching and browsing using Solr.
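A toy illustration of combining non-content features into a single importance value per tweet; the feature names, weights, and log scaling are assumptions for illustration, not the weighting the team derived.

```python
import math


def importance(tweet):
    """Combine Twitter-specific and account-authority features into one score.

    The weights and log scaling below are illustrative assumptions only.
    """
    twitter_specific = (
        1.0 * math.log1p(tweet["retweet_count"])
        + 0.5 * math.log1p(tweet["favorite_count"])
        + 0.5 * (1.0 if tweet["has_url"] else 0.0)
    )
    account_authority = (
        1.0 * math.log1p(tweet["followers_count"])
        + 0.5 * (1.0 if tweet["verified"] else 0.0)
    )
    return twitter_specific + account_authority


example = {"retweet_count": 120, "favorite_count": 40, "has_url": True,
           "followers_count": 5000, "verified": False}
print(importance(example))
```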
- Hadoop Project for IDEAL in CS5604. Cadena, Jose; Chen, Mengsu; Wen, Chengyuan (Virginia Tech, 2015-05-11). The Integrated Digital Event Archive and Library (IDEAL) system addresses the need for combining the best of digital library and archive technologies in support of stakeholders who are remembering and/or studying important events. It leverages and extends the capabilities of the Internet Archive to develop spontaneous event collections that can be permanently archived as well as searched and accessed. IDEAL connects the processing of tweets and web pages, combining informal and formal media to support building collections on chosen general or specific events. Integrated services include topic identification, categorization (building upon special ontologies being devised), clustering, and visualization of data, information, and context. The objective for the course is to build a state-of-the-art information retrieval system in support of the IDEAL project. Students were assigned to eight teams, each of which focused on a different part of the system to be built. These teams were Solr, Classification, Hadoop, Noise Reduction, LDA, Clustering, Social Networks, and NER. As the Hadoop team, our focus is on making the information retrieval system scalable to large datasets by taking advantage of the distributed computing capabilities of the Apache Hadoop framework. We design and put in place a general schema for storing and updating data stored in our Hadoop cluster. Throughout the project, we coordinate with other teams to help them make use of readily available machine learning software for Hadoop, and we also provide support for using MapReduce. We found that different teams were able to easily integrate their results in the design we developed and that uploading these results into a data store for communication with Solr can be done, in the best cases, in a few minutes. We conclude that Hadoop is an appropriate framework for the IDEAL project; however, we also recommend exploring the use of the Spark framework.
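To illustrate the kind of shared HBase schema described in this entry, the sketch below writes and reads one row with the happybase client; the table name, column families, and row-key layout are assumptions, not the schema the class actually adopted.

```python
import happybase

# Connects to the HBase Thrift server; the host is a placeholder.
connection = happybase.Connection("hadoop-master.example.org")
table = connection.table("ideal-tweets")  # hypothetical table name

# Hypothetical layout: row key = <collection>--<tweet id>, with column
# families for the raw text and for per-team analysis results.
row_key = b"storm_collection--573906465570631680"
table.put(row_key, {
    b"original:text": b"RT @news: heavy flooding reported downtown ...",
    b"analysis:cluster_id": b"7",
    b"analysis:ner_locations": b"downtown",
})

row = table.row(row_key)
print(row[b"analysis:cluster_id"])

connection.close()
```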
- Reducing Noise for IDEAL. Wang, Xiangwen; Chandrasekar, Prashant (2015-05-12). The corpora for which we are building an information retrieval system consist of tweets and web pages (extracted from URL links that might be included in the tweets) that have been selected based on rudimentary string matching provided by the Twitter API. As a result, the corpora are inherently noisy and contain a lot of irrelevant information. This includes documents that are non-English or off-topic, as well as other information within them such as stop words, whitespace characters, non-alphanumeric characters, icons, broken links, HTML/XML tags, scripting code, CSS style sheets, etc. In our attempt to build an efficient information retrieval system for events, through Solr, we are devising a matching system for the corpora by adding various facets and other properties to serve as dimensions for each document. These dimensions function as additional criteria that will enhance the matching and thereby the retrieval mechanism of Solr. They are metadata from classification, clustering, named entities, topic modeling, and social graph scores implemented by other teams in the class. It is of utmost importance that each of these initiatives is precise, to ensure the enhancement of the matching and retrieval system. The quality of their work depends directly or indirectly on the quality of the data that is provided to them. Noisy data will skew the results, and each team would need to perform additional tasks to get rid of it prior to executing their core functionalities. It is our role and responsibility to remove irrelevant content, or "noisy data", from the corpora. For both tweets and web pages, we cleaned entries that were written in English and discarded the rest. For tweets, we first extracted user handle information, URLs, and hashtags. We cleaned up the tweet text by removing non-ASCII character sequences and standardized the text using case folding, stemming, and stop word removal. For the scope of this project, we considered cleaning only HTML-formatted web pages and entries written in plain text file format. All other entries (or documents), such as videos, images, etc., were discarded. For the "valid" entries, we extracted the URLs within the web pages to enumerate the outgoing links. Using the Python package readability, we were able to remove advertisement, header, and footer content. We were able to organize the remaining content and extract the article text using another Python package, beautifulsoup4. We completed the cleanup by standardizing the text through removing non-ASCII characters, stemming, stop word removal, and case folding. As a result, 14 tweet collections and 9 web page collections were cleaned and indexed into Solr for retrieval.
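The web-page cleanup steps described in this entry can be sketched as follows, using readability-lxml, BeautifulSoup, and NLTK; the sample HTML is invented and the stop-word and stemming choices are simplified.

```python
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from readability import Document           # pip package: readability-lxml

html = ("<html><body><div id='ad'>Buy now!</div>"
        "<p>Flood waters rose quickly in the city.</p></body></html>")

# readability strips boilerplate (ads, headers, footers) and keeps the article body.
article_html = Document(html).summary()
text = BeautifulSoup(article_html, "html.parser").get_text(" ")

# Standardize: drop non-ASCII characters, case-fold, remove stop words, stem.
text = re.sub(r"[^\x00-\x7F]+", " ", text).lower()
stemmer = PorterStemmer()
stops = set(stopwords.words("english"))
tokens = [stemmer.stem(t) for t in re.findall(r"[a-z0-9]+", text) if t not in stops]
print(tokens)
```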
- Document Clustering for IDEAL. Thumma, Sujit Reddy; Kalidas, Rubasri; Torkey, Hanaa (2015-05-13). Document clustering is an unsupervised classification of text documents into groups (clusters). Documents with similar properties are grouped together into one cluster, while documents with dissimilar patterns are grouped into different clusters. Clustering deals with finding a structure in a collection of unlabeled data. The main goal of this project is to enhance Solr search results with the help of offline data clustering. In our project, we propose to iterate and optimize clustering results using various clustering algorithms and techniques. Specifically, we evaluate the K-Means, Streaming K-Means, and Fuzzy K-Means algorithms available in the Apache Mahout software package. Our data consist of tweet archives and web page archives related to tweets. Document clustering involves data pre-processing, data clustering using clustering algorithms, and data post-processing. The final output, which includes document ID, cluster ID, and cluster label, is stored in HBase for further indexing into the Solr search engine. Solr search recall is enhanced by boosting document relevance scores based on the clustered sets of documents. We propose three metrics to evaluate the cluster results: Silhouette scores, a confusion matrix with homogeneous labelled data, and human judgement. To optimize the clustering results, we identify various tunable parameters that are input to the clustering algorithms and demonstrate the effectiveness of those tuning parameters. Finally, we have automated the entire clustering pipeline using several scripts and deployed them on a Hadoop cluster for large-scale clustering of tweet and webpage collections.
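As an illustration of K-Means clustering and Silhouette evaluation of the kind described in this entry (using scikit-learn rather than Mahout, and toy documents):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = [
    "flood waters rise in the city",  "river flooding damages homes",
    "earthquake shakes the capital",  "strong earthquake aftershocks reported",
    "wildfire burns through forest",  "firefighters battle the wildfire",
]

X = TfidfVectorizer().fit_transform(docs)

# k is a tunable parameter, as in the report; k=3 matches the three toy topics.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_)                          # cluster ID per document
print(silhouette_score(X, km.labels_))     # one of the proposed evaluation metrics
```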
- Solr Team Project Report. Gruss, Richard; Choudhury, Ananya; Komawar, Nikhil (2015-05-13). The Integrated Digital Event Archive and Library (IDEAL) is a Digital Library project that aims to collect, index, archive, and provide access to digital content related to important events, including disasters, man-made or natural. It extracts event data mostly from social media sites such as Twitter and crawls related web pages. However, the volume of information currently on the web about any event is enormous and highly noisy, making it extremely difficult to find specific information. The objective of this course is to build a state-of-the-art information retrieval system in support of the IDEAL project. The class was divided into eight teams, each team being assigned a part of the project that, when successfully implemented, would enhance the IDEAL project's functionality. The final product, the culmination of these eight teams' efforts, is a fast and efficient search engine for events occurring around the world. This report describes the work completed by the Solr team as a contribution towards searching and retrieving the tweets and web pages archived by IDEAL. If we visualize the class project as a tree structure, then Solr is the root of the tree, which builds on all the other teams' efforts. Hence we actively interacted with all other teams to come up with a generic schema for the documents and their corresponding metadata to be indexed by Solr. As Solr interacts with HDFS via HBase, where the data is stored, we also defined an HBase schema and configured the Lily Indexer to set up fast communication between HBase and Solr. We batch-indexed 8.5 million of the 84 million tweets before encountering memory limitations on the single-node Solr installation. Focusing our efforts therefore on building a search experience around the small collections, we created a 3.4-million-tweet collection and a 12,000-webpage collection. Our custom search, which leverages the differential field weights in Solr's edismax Query Parser and two custom Query Components, achieved precision levels in excess of 90%.
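The edismax-based search with differential field weights can be illustrated with a plain HTTP query against Solr; the core name and field names (and their boosts) below are placeholders, not the schema the team defined.

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/ideal/select"   # placeholder core name

params = {
    "q": "hurricane sandy recovery",
    "defType": "edismax",                # Extended DisMax query parser
    "qf": "title^3 text^1 hashtags^2",   # differential field weights (illustrative fields)
    "rows": 10,
    "wt": "json",
}

resp = requests.get(SOLR_SELECT, params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))
```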
- CS5604: Clustering and Social Networks for IDEAL. Vishwasrao, Saket; Thorve, Swapna; Tang, Lijie (2016-05-03). The Integrated Digital Event Archiving and Library (IDEAL) project of Virginia Tech provides services for searching, browsing, analysis, and visualization of over 1 billion tweets and over 65 million webpages. The project development involved a problem-based learning approach which aims to build a state-of-the-art information retrieval system in support of IDEAL. With the primary objective of building a robust search engine on top of Solr, the entire project is divided into various segments, such as classification, clustering, and topic modeling, for improving search results. Our team focuses on two tasks: clustering and social networks. Both tasks are considered independent for now. The clustering task aims to congregate documents into groups such that documents within a cluster are as similar as possible. Documents are tweets and webpages, and we present results for different collections. The k-means algorithm is employed for clustering the documents. Two methods were employed for feature extraction, namely the TF-IDF score and the word2vec method. Evaluation of clusters is done by two methods: Within Set Sum of Squares (WSSE) and analyzing the output of the topic analysis team to extract cluster labels and find probability scores for a document. The latter strategy is a novel approach for evaluation. This strategy can be used for assessing problems of cluster labeling, the likelihood of a document belonging to a cluster, and the hierarchical distribution of topics and clusters. The social networking task extracts information from Twitter data by building graphs, applying graph theory concepts to accomplish this. Using dimensionality reduction techniques and probabilistic algorithms for clustering, as well as improving the cluster labelling and evaluation, are some of the ways our existing work can be extended in the future. Also, the clusters that we have generated can be used as an input source in Classification, Topic Analysis, and Collaborative Filtering for more accurate results.
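One of the two feature-extraction methods mentioned, word2vec, can be sketched with gensim by averaging word vectors into a document vector before running k-means; the toy corpus and parameter values are illustrative only (parameter names follow gensim 4.x).

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenized documents; the project used cleaned tweets and webpages.
docs = [
    ["flood", "water", "rescue"],
    ["flood", "rain", "storm"],
    ["election", "vote", "poll"],
    ["election", "candidate", "vote"],
]

# vector_size/window/min_count are illustrative settings.
w2v = Word2Vec(sentences=docs, vector_size=50, window=2, min_count=1, seed=1)


def doc_vector(tokens):
    """Average the word vectors of a document's tokens (a common, simple choice)."""
    return np.mean([w2v.wv[t] for t in tokens], axis=0)


X = np.vstack([doc_vector(d) for d in docs])
print(X.shape)   # one 50-dimensional feature vector per document, ready for k-means
```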
- CS5604 Front-End User Interface Team. Masiane, Moeti; Warren, Lawrence (2016-05-03). This project is part of a wider research project whose focus is developing an information retrieval and analysis system in support of the IDEAL (Integrated Digital Event Archiving and Library) project. The search engine should retrieve results relating to tweet and web page data that have been collected by Dr. E. Fox and his team of researchers from Virginia Polytechnic Institute and State University. The overall project has been broken into sub-projects, and these smaller projects have been assigned to different teams. This portion of the project has the sole focus of research and development relating to the creation of the front end of the search engine. The front end is responsible for accepting search queries, logging user activities, displaying search results, and presenting suggested content based on provided user queries and past user activity. In addition, we had to come up with ways to manipulate an established dataset to give accurate results to users of varying levels of technical background, without expecting them to learn a special system dialect beforehand. During our final presentation, our team was able to give a live demo of a working system, which used the other teams' data and methods to create a graphical and interactive user interface. We were able to manipulate the data to create the first functional user interface under the scope of this project, and have provided a base for future teams to build on and become more successful. This submission includes a full report detailing the direction and methods used to successfully create our UI, as well as the slides from the final presentation given to the complete collective team at the end of our allotted time to produce a functional system.
- Classification Project in CS5604, Spring 2016. Bock, Matthew; Cantrell, Michael; Shahin, Hossameldin L. (2016-05-04). In the grand scheme of a large Information Retrieval project, the work of our team was to perform text classification on both tweet collections and their associated webpages. In order to accomplish this task, we sought to complete three primary goals. We began by performing research to determine the best way to extract information that can be used to represent a given document. Following that, we worked to determine the best method to select features and then construct feature vectors. Our final goal was to use the information gathered previously to build an effective way to classify each document in the tweet and webpage collections. These classifiers were built with consideration of the ontology developed for the IDEAL project. To show the effectiveness of our work at accomplishing our intended goals, we also provide an evaluation of our methodologies. The team assigned to perform this classification work last year researched various methods and tools that could be useful in accomplishing the goals we have set forth. Last year's team developed a system that was able to accomplish similar goals with a promising degree of success. Our goal for this year was to improve upon their successes using new technologies such as Apache Spark. Spark has provided us with the tools needed to build a well-optimized system capable of working with the provided small collections of tweets and webpages in a fast and efficient manner. Spark is also very scalable, and based on our results with the small collections, we have confidence in the performance of our system on larger collections. Also included in this submission is our final presentation of the project as presented to the CS5604 class, professor, and GRAs. The presentation provides a high-level overview of the project requirements and our approach to them, as well as details about our implementation and evaluation. The submission also includes our source code, so that future classes can expand on the work we have done this semester.
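A compact sketch of a Spark-based text classification pipeline of the kind described in this entry, using PySpark's ML API; the toy DataFrame and pipeline stages are illustrative, not the team's actual code.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cs5604-classification-sketch").getOrCreate()

# Toy labeled documents; the real input would be cleaned tweets and webpages.
data = spark.createDataFrame([
    ("flood waters rising downtown", "flood"),
    ("river bursts its banks", "flood"),
    ("candidate wins the primary vote", "election"),
    ("polls open across the state", "election"),
], ["text", "category"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    StringIndexer(inputCol="category", outputCol="label"),
    LogisticRegression(maxIter=20),
])

model = pipeline.fit(data)
model.transform(data).select("text", "prediction").show(truncate=False)

spark.stop()
```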
- Topic Analysis Project in CS5604, Spring 2016: Extracting Topics from Tweets and Webpages for IDEAL. Mehta, Sneha; Vinayagam, Radha Krishnan (2016-05-04). The IDEAL (Integrated Digital Event Archiving and Library) project aims to ingest tweets and web-based content from social media and the web and index it for retrieval. One of the required milestones for the graduate-level course CS5604 on Information Storage and Retrieval is to implement a state-of-the-art information retrieval and analysis system in support of the IDEAL project. The overall objective of this project is to build a robust Information Retrieval system on top of Solr, a general-purpose open-source search engine. To enable the search and retrieval process we use various approaches, including Latent Dirichlet Allocation, Named-Entity Recognition, Clustering, Classification, Social Network Analysis, and a front-end interface for search. The project has been divided into various segments, and our team has been assigned Topic Analysis. A topic in this context is a set of words that can be used to represent a document. The output of our team is a well-defined set of topics that describe each document in the collections we have. The topics will facilitate a facet-based search in the front-end search interface. This submission includes the project report, final presentation, LDA code, test datasets, and results. In the project report, we introduce the relevant background, design and implementation, and the requirements to make our part functional. The developer's manual describes our approach in detail. Walk-through tutorials for related software packages have been included in the user's manual. Finally, we also provide exhaustive results and detailed evaluation methodologies for the topic quality.