CS6604: Digital Libraries
Browsing CS6604: Digital Libraries by Content Type "Technical report"
Now showing 1 - 7 of 7
- CINET Registry
  Agashe, Aditya; Hayatnagarkar, Harshal; Joshi, Sarang (2014-05-09)
  Cyber-infrastructure for Network Science (CINET) is a computational and analytic framework for network science research and education. The cyber-infrastructure (CI) part of CINET is responsible for coordinating the interactions between the user interface, digital library, resource manager, data broker, and execution broker components. CINET uses HPC resources to service experiment execution requests and provides many realistic graphs for analysis. Galib, NetworkX, and SNAP are the computational engines that provide the capability to analyze different properties of the graphs. CINET hosts the Granite system and the graph dynamical systems calculator (GDSC) as public-use applications. Datasets used by CINET are currently cataloged in a relational database, and this project migrates them to a new digital-object-based catalog, the "Registry". The project uses the Fedora Commons repository for storing digital objects. Project Hydra, a customization of Ruby on Rails, is the abstraction layer over the Fedora Commons repository. The Hydra stack provides RESTful web services to interact with Fedora Commons and perform CRUD operations on digital objects. In addition, it manages indices of digital objects using Apache Solr and provides faceted browsing through Project Blacklight. The former implementation, based on the relational model, has limited ability to model the semantics of relationships explicitly. Our current implementation mitigates this problem because the digital object repository model closely follows the object-oriented paradigm, which makes inheritance and containment relationships more intuitive to model. The CINET Registry also provides rich services such as incentivization, memorization, and utilization for advanced data analytics.
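The Hydra layer's RESTful CRUD interface described above can be pictured with a short client sketch. This is a minimal illustration assuming a hypothetical Hydra application at localhost:3000, an invented `graph_datasets` route, and made-up field names and credentials; it is not the actual CINET Registry API.

```python
# Minimal sketch of a client creating and fetching a graph-dataset record
# through the kind of RESTful endpoint a Hydra head exposes over Fedora.
# The base URL, route, field names, and credentials are assumptions for
# illustration, not the actual CINET Registry API.
import requests

BASE = "http://localhost:3000"          # hypothetical Hydra application
AUTH = ("curator", "secret")            # hypothetical credentials

record = {
    "title": "Example collaboration network",
    "vertex_count": 12008,
    "edge_count": 118521,
    "source": "synthetic",
    "description": "Sample graph dataset registered for CINET analyses.",
}

# Create (C of CRUD): POST a new digital-object record.
resp = requests.post(f"{BASE}/graph_datasets.json", json=record, auth=AUTH)
resp.raise_for_status()
record_id = resp.json().get("id")

# Read (R of CRUD): fetch the record back. Solr-backed faceted browsing
# would be exposed separately through the Blacklight interface.
print(requests.get(f"{BASE}/graph_datasets/{record_id}.json", auth=AUTH).json())
```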
- Ensemble Classification Project
  Alabdulhadi, Mohammed H.; Kannan, Vijayasarathy; Soundarapandian, Manikandan; Hamid, Tania (2014-05-08)
  Transfer learning, unlike traditional machine learning, allows the domains, tasks, and distributions used in training and testing to differ: knowledge gained from one domain can be used to learn a completely different domain. The Ensemble computing portal is a digital library that contains resources, communities, and technologies to aid in teaching. The major objective of this project is to apply what is learned from the ACM Computing Classification System to classify educational YouTube videos so that they can be included in the Ensemble computing portal. Metadata of technical papers published by ACM are indexed in a Solr server, and we issue REST calls to retrieve the required metadata, namely title, abstract, and general terms, from which we build the features. We use the 2012 ACM Computing Classification System hierarchy to train our classifiers, building classifiers for the level-2 and level-3 categories of the classification tree to classify the educational YouTube videos. We use the YouTube Data API to search for educational videos and retrieve their metadata, including title, description, and transcripts; these become the features of our test set. We specifically search for YouTube playlists containing educational videos, because we found that neither a regular video search nor a search for videos in channels retrieves relevant educational videos. We evaluate our classifiers using 10-fold cross-validation and report their accuracy. With the classifiers built and trained on ACM metadata, we supply the metadata collected from YouTube as test data and manually evaluate the predictions. The results of our manual evaluation and the accuracy of our classifiers are also discussed. We found that the ACM Computing Classification System hierarchy is sometimes ambiguous and that YouTube metadata are not always reliable; these are the major factors contributing to the reduced accuracy of our classifiers. In the future, we hope that more sophisticated natural language processing techniques can be applied to refine the features of both the training and target data, which would help improve performance. We believe that more relevant YouTube metadata, in the form of transcripts and embedded text, could be collected using voice-to-text conversion and image retrieval algorithms, respectively. This idea of transfer learning could also be extended to classify the presentation slides available on SlideShare (http://www.slideshare.net) and certain educational blogs.
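A minimal sketch of the train-on-ACM, classify-YouTube workflow described above, using scikit-learn. The toy documents, category labels, and the TF-IDF plus Naive Bayes pipeline are illustrative assumptions; the report's actual feature set and classifiers may differ.

```python
# Sketch of the train-on-ACM, predict-on-YouTube workflow with scikit-learn.
# Toy documents and the TF-IDF + Naive Bayes pipeline are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training set: ACM title + abstract + general terms, labelled with a
# level-2 ACM CCS category (placeholder examples).
acm_docs = [
    "Shortest path algorithms on weighted graphs ...",
    "Cache coherence protocols for multicore processors ...",
    "NP-completeness proofs and polynomial-time reductions ...",
    "Branch prediction and speculative execution in pipelines ...",
]
acm_labels = [
    "Theory of computation",
    "Computer systems organization",
    "Theory of computation",
    "Computer systems organization",
]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())

# The report uses 10-fold cross-validation; reduced to 2 folds here only
# because the toy training set is tiny.
print(cross_val_score(clf, acm_docs, acm_labels, cv=2))

# Test set: YouTube title + description + transcript text.
clf.fit(acm_docs, acm_labels)
youtube_docs = ["Lecture on Dijkstra's algorithm and graph traversal ..."]
print(clf.predict(youtube_docs))
```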
- Epidemiology Network
  Sundar, Naren; Xu, Kui (2014-05-11)
  This project develops an RDF graph-building service for the Cyber-infrastructure for Network Science (CINET). The purpose of the service is to crawl the web and find digital content related to user requests; more specifically, the content collected should be related to epidemiology. Ultimately, the service should deliver an RDF network of digital content that can be stored on CINET for analysis. Simply using a search engine such as Google, or running a web crawler in an undirected way, cannot satisfy these requirements, because of the lack of organization in the results and the ambiguity of the information. Our service presents users with networks of interconnected digital objects organized by topic. In the results, all digital objects are connected as a network of related content based on a user's request; in addition, objects closer to a topic are more strongly connected within a sub-network. The topic-modeling approach we developed emulates how humans search for relevant research papers: it automatically crawls the DBLP bibliography website and constructs a network of papers based on a user query.
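A rough sketch of the DBLP-driven network construction described above, assuming the public DBLP search API (https://dblp.org/search/publ/api) and the JSON response layout it currently returns, and using rdflib to emit the RDF network. Linking papers that share an author stands in for the project's topic-based connections; the `EX` namespace and `relatedTo` predicate are invented for illustration.

```python
# Sketch: query the public DBLP search API for a topic and emit a small
# RDF network of papers with rdflib. Shared authorship is an illustrative
# stand-in for the project's topic-based links; EX is an invented namespace.
import requests
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EX = Namespace("http://example.org/epinet/")

# DBLP search API; the response structure below is assumed from its
# current JSON format and may change.
hits = requests.get(
    "https://dblp.org/search/publ/api",
    params={"q": "epidemiology network", "format": "json", "h": 20},
).json()["result"]["hits"].get("hit", [])

g = Graph()
papers = []
for h in hits:
    info = h["info"]
    node = URIRef(EX[h["@id"]])
    g.add((node, RDF.type, EX.Paper))
    g.add((node, DC.title, Literal(info["title"])))
    authors = info.get("authors", {}).get("author", [])
    if isinstance(authors, dict):
        authors = [authors]
    names = {a["text"] if isinstance(a, dict) else a for a in authors}
    papers.append((node, names))

# Connect papers that share at least one author.
for i, (n1, a1) in enumerate(papers):
    for n2, a2 in papers[i + 1:]:
        if a1 & a2:
            g.add((n1, EX.relatedTo, n2))

print(g.serialize(format="turtle"))
```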
- IDEAL Pages
  Farghally, Mohammed; Elbery, Ahmed (2014-05-10)
  The main goal of this project is to provide a convenient web-enabled interface to a large collection of event-related webpages, supporting the two main services of browsing and searching. We first studied the events and, based on the dataset available to us, decided what fields are required to build the events index. We then configured a SolrCloud collection with these fields defined in its schema.xml file. Next, we built a Hadoop MapReduce job that, together with SolrCloud, indexes documents crawled from the Web for about 60 events. We then interfaced with the Solr server and the indexed documents through a PHP server application. Finally, we designed a convenient user interface that allows users to browse the documents by event category and event name, as well as to search the document collection for particular keywords.
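The browsing and searching services described above reduce to queries against the SolrCloud collection; below is a minimal sketch of such a query issued directly over HTTP. The collection name (`ideal_events`) and the field names (`content`, `event_category`, `event_name`) are assumptions, not necessarily the fields defined in the project's schema.xml.

```python
# Sketch of the kind of request the PHP front end would send to Solr:
# a keyword search filtered by event category and event name. The
# collection name and field names are assumptions for illustration.
import requests

SOLR = "http://localhost:8983/solr/ideal_events/select"

params = {
    "q": "content:(flood AND rescue)",           # keyword search
    "fq": ['event_category:"Natural Disaster"',  # browse-style filters
           'event_name:"Typhoon Haiyan"'],
    "rows": 10,
    "fl": "id,title,url,event_name",
    "wt": "json",
}

docs = requests.get(SOLR, params=params).json()["response"]["docs"]
for d in docs:
    print(d.get("title"), "->", d.get("url"))
```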
- Qatar content classification
  Handosa, Mohamed (2014-05-09)
  This report describes a term project for the CS6604 Digital Libraries course (Spring 2014), conducted under the supervision of Prof. Edward Fox and Mr. Tarek Kanan. The goal is to develop an Arabic newspaper article classifier. We built a collection of 700 Arabic newspaper articles and 1700 Arabic full-newspaper PDF files. A stemmer, named "P-Stemmer", is proposed; evaluation has shown that P-Stemmer outperforms Larkey's widely used light stemmer. Several classification techniques were tested on the Arabic data, including SVM, Naïve Bayes, and Random Forest. We built and tested 21 multiclass classifiers, 15 binary classifiers, and 5 compound classifiers using the voting technique. Finally, we uploaded the classified instances to Apache Solr for indexing and searching.
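A minimal sketch of the voting-based compound classifier over SVM, Naïve Bayes, and Random Forest, written with scikit-learn. The toy Arabic documents and labels are placeholders, and the P-Stemmer preprocessing step is not reproduced here.

```python
# Sketch of a compound (voting) classifier over SVM, Naive Bayes, and
# Random Forest pipelines. Toy documents and labels are placeholders;
# the P-Stemmer preprocessing step is omitted.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "فاز الفريق في مباراة كرة القدم",        # sports
    "تعادل الفريقان في المباراة النهائية",    # sports
    "ارتفعت أسعار الأسهم في البورصة اليوم",   # economy
    "تراجع سعر النفط في الأسواق العالمية",    # economy
]
labels = ["sports", "sports", "economy", "economy"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
rf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))

# Hard voting: each base classifier predicts a label and the majority wins.
compound = VotingClassifier(
    estimators=[("svm", svm), ("nb", nb), ("rf", rf)],
    voting="hard",
)
compound.fit(docs, labels)
print(compound.predict(["نتيجة مباراة كرة القدم"]))
```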
- Twitter Metadata
  Shuffett, Michael (2014-05-10)
  A number of projects and research efforts work with collections of tweets. Of particular interest is the collection of tweets related to world events. Many organizations have their own individual tweet collections regarding specific events; however, there is currently no effective support for collaboration. Metadata standards foster collaboration by allowing groups to adhere to a unified format so they can seamlessly inter-operate. In part one of the Twitter Metadata project, I define a tweet-level metadata standard that leverages the Twitter API format, as well as a collection-level metadata standard which combines Dublin Core and PROV-O. By combining two diverse existing standards (Dublin Core and PROV-O) into an RDF based specification, the proposed standard is able to capture both the descriptive metadata as well as provenance of the collections. In part two of the Twitter Metadata project, I create a tool called TweetID in order to further foster collaboration with tweet collections. TweetID is a web application that allows its users to upload tweet collections. TweetID extracts, and provides an interface to, the underlying tweet-level and collection-level metadata. Furthermore, TweetID also provides the ability to merge multiple collections together, allowing researchers to compare their collections to others’, as well as potentially augment their event collections for higher recall.
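A small sketch of collection-level metadata that combines Dublin Core terms with PROV-O in an RDF graph, using rdflib. The example URIs and the particular properties chosen are illustrative and do not reproduce the exact profile defined in the report.

```python
# Sketch of collection-level metadata combining Dublin Core terms (descriptive)
# and PROV-O (provenance) in one RDF graph. URIs and property choices here
# are illustrative assumptions, not the report's exact specification.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/tweetid/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("prov", PROV)

coll = URIRef(EX["collections/hurricane-sandy"])
harvest = URIRef(EX["activities/streaming-api-harvest-42"])

# Descriptive metadata (Dublin Core terms)
g.add((coll, RDF.type, PROV.Entity))
g.add((coll, DCTERMS.title, Literal("Hurricane Sandy tweets")))
g.add((coll, DCTERMS.creator, Literal("Example Research Group")))
g.add((coll, DCTERMS.description, Literal("Tweets matching #sandy, Oct-Nov 2012")))

# Provenance metadata (PROV-O)
g.add((harvest, RDF.type, PROV.Activity))
g.add((coll, PROV.wasGeneratedBy, harvest))
g.add((harvest, PROV.endedAtTime,
       Literal("2012-11-05T00:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```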
- Unsupervised Event Extraction from News and Twitter
  Xuan, Zhang; Wei, Huang; Ji, Wang; Tianyu, Geng (2014-05-11)
  Living in the age of big data, we face massive amounts of information every day, especially from the mainstream news and social networks. Because of this enormous volume, it is hard to identify the key information that really matters, so summarizing the key information from the large amount of news and tweets becomes essential. Addressing this problem, this project explores approaches to extract key events from newswires and Twitter data in an unsupervised manner, applying Topic Modeling and Named Entity Recognition. Different methods were tried to account for the differing characteristics of news and tweets, and the relevance between news events and the corresponding Twitter events is studied as well. Tools have been developed to implement and evaluate these methods; our experiments show that they can effectively extract key events from the news and tweet data sets. The tools, documents, and data sets can be used for educational purposes and as part of the IDEAL project at Virginia Tech.
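A minimal sketch of the unsupervised topic-modeling step with scikit-learn's LDA, on toy documents that stand in for the news and tweet corpora; the named entity recognition step used alongside it is not shown.

```python
# Sketch of the unsupervised topic-modeling step with scikit-learn's LDA.
# Toy documents stand in for the news/tweet corpora; the named-entity
# recognition step used to label events is not shown here.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "earthquake strikes coastal city thousands evacuated",
    "aftershocks continue as rescue teams search rubble",
    "election results announced amid record voter turnout",
    "candidates debate economy ahead of national election",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words of each discovered topic as a rough event summary.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"topic {i}: {', '.join(top)}")
```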