CS6604: Digital Libraries
Browsing CS6604: Digital Libraries by Content Type "Software"
Now showing 1 - 6 of 6
- CS6604 Spring 2017 Global Events Team Project
  Li, Liuqing; Harb, Islam; Galad, Andrej (Virginia Tech, 2017-05-03)
  This submission describes the work the Global Events team completed in Spring 2017. It includes the final report and presentation, as well as key relevant materials (source code). Building on the reports and modules created by former teams, the Global Events team established a pipeline for processing Web ARChives supporting the IDEAL and GETAR projects, both funded by NSF. With the Internet Archive’s help, the team enhanced the Event Focused Crawler to retrieve more relevant webpages (i.e., about school shooting events) in WARC format. ArchiveSpark, an Apache Spark framework that facilitates access to Web archives, was deployed on a stand-alone server, and multiple techniques, such as parsing, Stanford NER, regular expressions, and statistical methods, were used to process and analyze the data and to describe those events. For data visualization, an integrated user interface using Gradle was designed and implemented for trend results; it can be easily used by both CS and non-CS researchers and students. Moreover, new, well-written manuals make it easier for users and developers to become familiar with ArchiveSpark, Spark, and Scala.
- IDEAL Pages
  Farghally, Mohammed; Elbery, Ahmed (2014-05-10)
  The main goal of this project is to provide a convenient Web-enabled interface to a large collection of event-related webpages, supporting the two main services of browsing and searching. We first studied the events and decided which fields were required to build the events index, based on the dataset available to us. We then configured a SolrCloud with a collection based on these fields in the Schema.xml file. Next, we built a Hadoop Map-Reduce function that works with SolrCloud to index documents related to the data about 60 events crawled from the Web. We then interfaced with the Solr server and the indexed documents through a PHP server application. Finally, we designed a convenient user interface that allows users to browse the documents by event category and event name, as well as to search the document collection for particular keywords.
- Sentiment and Topic Analysis
  Bartolome, Abigail; Bock, Matthew; Vinayagam, Radha Krishnan; Krishnamurthy, Rahul (Virginia Tech, 2017-05-03)
  The IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects have collected over 1.5 billion tweets and webpages from social media and the World Wide Web, and indexed them for easy retrieval and analysis. This gives researchers an extensive library of documents that reflect the interests and sentiments of the public in reaction to an event. By applying topic analysis to collections of tweets, researchers can learn the topics of most interest or concern to the general public. Adding a layer of sentiment analysis to those topics illustrates how the public felt about the topics that were found. The Sentiment and Topic Analysis team has designed a system that joins topic analysis and sentiment analysis for researchers who are interested in learning more about public reaction to global events. The tool runs topic analysis on a collection of tweets, and the user can select a topic of interest and assess the sentiments with regard to that topic (i.e., positive vs. negative). This submission covers the background, requirements, design, and implementation of our contributions to this project. Furthermore, we include data, scripts, source code, a user manual, and a developer manual to assist in any future work.
- Social Communities Knowledge Discovery: Approaches applied to clinical study
  Chandrasekar, Prashant (Virginia Tech, 2017-05)
  As part of the Social Interactome team's recent efforts to validate the hypotheses of the study, we have worked to make sense of the data collected during two 16-week experiments and three Amazon Mechanical Turk deployments. The complexity of the data has made it challenging to discover insights and patterns. The goal of the semester was to explore newer methods to analyze the data. Through such discovery, we can test and validate hypotheses about the data, which would provide a direction for our contextual inquiry to predict attributes and behavior of participants in the study. The report and slides highlight two possible approaches that employ statistical relational learning for structure learning and network classification. Related files include data and software used during this study; results are given from the analyses undertaken.
- Tweet Analysis and Classification: Diabetes and Heartbleed Internet Virus as Use Cases
  Karajeh, Ola; Arachie, Chidubem; Powell, Edward; Hussein, Eslam (Virginia Tech, 2019-12-24)
  The proliferation of data on social media has driven the need for researchers to develop algorithms to filter and process this data into meaningful information. In this project, we consider the task of classifying tweets relative to some topic or event and labeling them as informational or non-informational, using the features in the tweets. We focus on two collections from different domains: a diabetes dataset in the health domain and a Heartbleed dataset in the security domain. We show the performance of our method in classifying tweets in the different collections. We employ two approaches to generate features for our models: 1) a graph-based feature representation and 2) a vector space model, e.g., with TF-IDF weighting or a word embedding. The representations generated are fed into different machine learning algorithms (Logistic Regression, Naïve Bayes, and Decision Tree) to perform the classification task. We evaluate these approaches using accuracy, precision, recall, and F1-score on a held-out test dataset. Our results show that our approach generalizes to tweets across different domains.
- Twitter Metadata
  Shuffett, Michael (2014-05-10)
  A number of projects and research efforts work with collections of tweets. Of particular interest are collections of tweets related to world events. Many organizations have their own individual tweet collections regarding specific events; however, there is currently no effective support for collaboration. Metadata standards foster collaboration by allowing groups to adhere to a unified format so they can seamlessly interoperate. In part one of the Twitter Metadata project, I define a tweet-level metadata standard that leverages the Twitter API format, as well as a collection-level metadata standard that combines Dublin Core and PROV-O. By combining two diverse existing standards (Dublin Core and PROV-O) into an RDF-based specification, the proposed standard is able to capture both the descriptive metadata and the provenance of the collections. In part two of the Twitter Metadata project, I create a tool called TweetID to further foster collaboration with tweet collections. TweetID is a web application that allows its users to upload tweet collections. TweetID extracts, and provides an interface to, the underlying tweet-level and collection-level metadata. Furthermore, TweetID provides the ability to merge multiple collections, allowing researchers to compare their collections to others’, as well as potentially augment their event collections for higher recall.
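
The Tweet Analysis and Classification entry above mentions a vector space model with TF-IDF weighting. The abstract does not say which TF-IDF variant or library the authors used, so the following is only a minimal sketch of one common textbook form (length-normalized term frequency times log(N / df)). The function name `tf_idf` and the toy tweets are hypothetical and are not taken from the project's code or datasets.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy tweets, already tokenized on whitespace.
tweets = [
    "new study on diabetes diet".split(),
    "diabetes lol".split(),
    "heartbleed patch released for openssl".split(),
]
vectors = tf_idf(tweets)
```

In this sketch a term that occurs in every document receives weight 0, while a term confined to one tweet is weighted more heavily; vectors of this kind are what a classifier such as logistic regression would take as input.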