CS6604: Digital Libraries

Recent Submissions

  • Classification and extraction of information from ETD documents
    Aromando, John; Banerjee, Bipasha; Ingram, William A.; Jude, Palakh; Kahu, Sampanna (Virginia Tech, 2020-01-30)
    In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well-adapted to long documents such as electronic theses and dissertations (ETDs). This report describes three areas of study aimed at improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use to perform multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning.
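The subject-category assignment described above follows a binary-relevance pattern: one independent yes/no decision per category. The sketch below illustrates that pattern in miniature, with hypothetical keyword lists standing in for the trained classifiers; it is not the authors' actual model.

```python
# Binary-relevance multi-label classification: one independent binary
# decision per subject category. The keyword lists and threshold here
# are illustrative stand-ins for trained per-category classifiers.

CATEGORY_KEYWORDS = {
    "Machine Learning": {"neural", "training", "classifier", "learning"},
    "Digital Libraries": {"library", "metadata", "archive", "collection"},
}

def predict_labels(abstract, threshold=2):
    """Assign every category whose keyword overlap meets the threshold."""
    tokens = set(abstract.lower().split())
    labels = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        if len(tokens & keywords) >= threshold:
            labels.append(category)
    return labels

text = "We train a neural classifier on metadata from a library collection"
print(predict_labels(text))
```

In the real system each binary decision would come from a trained model rather than a keyword overlap, but the document-to-many-labels shape is the same.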
  • Otrouha: Automatic Classification of Arabic ETDs
    Alotaibi, Fatimah; Abdelrahman, Eman (Virginia Tech, 2020-01-23)
    ETDs are an emerging genre of documents that is highly valuable and worth preserving. This has resulted in a sustained need to build effective tools to facilitate retrieving ETD collections. While Arabic ETDs have gained increasing attention, many challenges remain due to the scarcity of resources and the complexity of information retrieval in the Arabic language. Therefore, this project focuses on making Arabic ETDs more accessible by facilitating browsing and searching. The aim is to build an automated classifier that categorizes an Arabic ETD based on its abstract. Our raw dataset was obtained by crawling the AskZad digital library website. We then applied pre-processing techniques to the dataset to make it suitable for our classification process. We developed automatic classification methods using various classifiers: Support Vector Machines (SVC), Random Forest, and Decision Trees. We then used an ensemble of the two classifiers that achieved the highest accuracy. We applied commonly used evaluation techniques, including 10-fold cross-validation. The results show better performance for binary classification, with an average accuracy of 68% per category, whereas multiclass classification performed poorly, with an average accuracy of 24%.
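The ensemble step can be sketched as soft voting over the two strongest base classifiers: average their per-class probabilities and pick the argmax. The probability tables below are illustrative stubs, not the trained SVC and Random Forest models.

```python
# Soft-voting ensemble: average the per-class probabilities produced by
# the two best base classifiers and predict the class with the highest
# average. Both base "classifiers" are stubbed for illustration.

def svm_probs(doc):      # stand-in for the trained SVC model
    return {"science": 0.6, "humanities": 0.4}

def forest_probs(doc):   # stand-in for the trained Random Forest model
    return {"science": 0.7, "humanities": 0.3}

def ensemble_predict(doc, members=(svm_probs, forest_probs)):
    classes = svm_probs(doc).keys()
    avg = {c: sum(m(doc)[c] for m in members) / len(members)
           for c in classes}
    return max(avg, key=avg.get)

print(ensemble_predict("some abstract text"))
```

Here the averaged probabilities are 0.65 for "science" and 0.35 for "humanities", so the ensemble predicts "science".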
  • Toward an Intelligent Crawling Scheduler for Archiving News Websites Using Reinforcement Learning
    Wang, Xinyue; Ahuja, Naman; Llorens, Nathaniel; Bansal, Ritesh; Dhar, Siddharth (Virginia Tech, 2019-12-03)
    Web crawling is one of the fundamental activities of many web technology organizations and companies, such as the Internet Archive and Google. While companies like Google often focus on content delivery for users, web archiving organizations such as the Internet Archive pay more attention to the accurate preservation of the web. Crawling accuracy and efficiency are major concerns in this task. An ideal crawling module should be able to keep up with the changes in the target website with minimal crawling frequency, to maximize routine crawling efficiency. In this project, we investigate using information from web archives' history to aid the crawling process, within the scope of news websites. We aim to build a smart crawling module that can accurately predict web content change, at both the web page and website structure level, using modern machine learning algorithms and deep learning architectures. By the end of the project, we have collected and processed raw web archive collections from Archive.org and from our own frequent crawling jobs; developed methods to extract identical copies of web page content and website structure from the web archive data; implemented baseline models, based on supervised machine learning algorithms, for predicting web page content change and website structure change; and implemented two different reinforcement learning models for generating a web page crawling plan: a continuous prediction model and a sparse prediction model. Our results show that the reinforcement learning model has the potential to work as an intelligent web crawling scheduler.
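In the spirit of the continuous prediction model, a crawl scheduler can maintain a per-page estimate of change probability, update it after each crawl, and spend its crawl budget on the pages most likely to have changed. This is a minimal sketch under assumed page names, prior, and learning rate, not the project's actual reinforcement learning implementation.

```python
import random

# Minimal crawl-scheduling sketch: a running estimate of each page's
# change rate, epsilon-greedy selection of the next page to crawl, and
# an update rule that moves the estimate toward each observed outcome.

class CrawlScheduler:
    def __init__(self, pages, epsilon=0.1, seed=0):
        self.rate = {p: 0.5 for p in pages}   # prior change probability
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def next_page(self):
        if self.rng.random() < self.epsilon:          # explore
            return self.rng.choice(list(self.rate))
        return max(self.rate, key=self.rate.get)      # exploit

    def observe(self, page, changed, lr=0.2):
        # Move the estimate toward the observation (1.0 if page changed).
        self.rate[page] += lr * ((1.0 if changed else 0.0) - self.rate[page])

sched = CrawlScheduler(["/news", "/about"])
sched.observe("/news", changed=True)    # estimate rises above the prior
sched.observe("/about", changed=False)  # estimate falls below the prior
print(sched.next_page())
```

A full reinforcement learning scheduler would learn a policy over richer state (crawl history, site structure), but the estimate-then-prioritize loop above is the core scheduling idea.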
  • Tweet Analysis and Classification: Diabetes and Heartbleed Internet Virus as Use Cases
    Karajeh, Ola; Arachie, Chidubem; Powell, Edward; Hussein, Eslam (Virginia Tech, 2019-12-24)
    The proliferation of data on social media has driven the need for researchers to develop algorithms to filter and process this data into meaningful information. In this project, we consider the task of classifying tweets relative to some topic or event and labeling them as informational or non-informational, using the features in the tweets. We focus on two collections from different domains: a diabetes dataset in the health domain and a heartbleed dataset in the security domain. We show the performance of our method in classifying tweets in the different collections. We employ two approaches to generate features for our models: 1) a graph-based feature representation and 2) a vector space model, e.g., with TF-IDF weighting or a word embedding. The generated representations are fed into different machine learning algorithms (Logistic Regression, Naïve Bayes, and Decision Tree) to perform the classification task. We evaluate these approaches using standard metrics (accuracy, precision, recall, and F1-score) on a held-out test dataset. Our results show that our approach generalizes to tweets across different domains.
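The evaluation metrics named above are standard; computed from scratch for the binary informational / non-informational labeling, they look like this:

```python
# Precision, recall, and F1 for a binary labeling task, computed from
# true-positive, false-positive, and false-negative counts.

def prf1(y_true, y_pred, positive=1):
    tp = sum(t == positive == p for t, p in zip(y_true, y_pred))
    fp = sum(p == positive != t for t, p in zip(y_true, y_pred))
    fn = sum(t == positive != p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 1 = informational, 0 = non-informational (toy labels for illustration)
p, r, f = prf1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With 2 true positives, 1 false positive, and 1 false negative, all three metrics come out to 2/3.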
  • Cross-Platform Data Collection and Analysis for Online Hate Groups
    Chandaluri, Rohit Kumar; Phadke, Shruti (Virginia Tech, 2019-12-26)
    Hate groups have made increasing use of online social media over the last decade. The online audience of hate groups is exposed to material with a hateful agenda and underlying propaganda. The presence of hate across multiple social media platforms poses an important question for the research community: how do hate groups use different social media platforms differently? As a first step towards answering this question, we propose HateCorp: a cross-platform dataset of online hate group communication. In this project, we first identify various online hate groups and their Twitter, Facebook, and YouTube accounts. We then retrospectively collect their data over six months and present selected linguistic, social engagement, and informational trends. In the future, we aim to expand this dataset in real time and create a publicly accessible hate communication monitoring platform that could be useful to other researchers and social media policymakers.
  • ACM Venue Recommendation System
    Kumar, Harinni Kodur; Tyagi, Tanya (Virginia Tech, 2019-12-23)
    A frequent goal of researchers is to publish their work in appropriate conferences and journals. With a large number of venue options in the microdomains of every research discipline, the importance of selecting a suitable publication venue cannot be overstated. Further, the venues diversify themselves in the form of workshops, symposiums, and challenges. Several publishers, such as IEEE and Springer, have recognized the need to address this issue and have developed journal recommenders. In this project, the goal is to design and develop a similar recommendation system for the ACM dataset. The conventional approach to building such a recommendation system is to utilize the content features in a dataset through content-based and collaborative approaches and offer suggestions. An alternative is to view this recommendation problem from a classification perspective. With the recent success of deep learning classifiers and their pervasiveness in several domains, our goal is to recommend conference and journal venues by applying deep learning methodologies to information about a submission such as its title, keywords, and abstract. The dataset used for the project is ACM Digital Library metadata, which includes descriptive and textual information for research papers submitted to various conferences and journals over the past 60 years. Our current system offers recommendations based on 80 binary classifiers. From our results, we observed that, for past submissions, our system precisely recommends the ground-truth venues. In subsequent iterations of the project, we aim to improve the performance of the individual classifiers and thereby offer better recommendations.
  • Generating Synthetic Healthcare Records Using Convolutional Generative Adversarial Networks
    Torfi, Amirsina; Beyki, Mohammadreza (Virginia Tech, 2019-12-20)
    Deep learning models have demonstrated high-quality performance in areas such as image classification and speech processing. However, creating a deep learning model using electronic health record (EHR) data requires addressing particular privacy challenges that are unique to researchers in this domain. This focuses attention on generating realistic synthetic data to preserve privacy. Existing methods for artificial data generation suffer from various limitations, such as being bound to particular use cases. Furthermore, their generalizability to real-world problems is questionable given the uncertainties in defining and measuring key realistic characteristics. Hence, there is a need to establish insightful metrics to measure the validity of synthetic data, as well as quantitative criteria regarding privacy restrictions. We propose the use of Generative Adversarial Networks to help satisfy requirements for realistic characteristics and acceptable values of privacy metrics simultaneously. The present study makes several unique contributions to synthetic data generation in the healthcare domain. First, utilizing 1-D Convolutional Neural Networks (CNNs), we devise a new approach to capturing the correlation between adjacent diagnosis records. Second, we employ convolutional autoencoders to map between discrete and continuous values. Finally, we devise a new approach to measure the similarity between real and synthetic data, and a means to measure the fidelity of the synthetic data and its associated privacy risks.
  • Social Communities Knowledge Discovery: Approaches applied to clinical study
    Chandrasekar, Prashant (Virginia Tech, 2017-05)
    In recent efforts by the Social Interactome team to validate the hypotheses of the study, we have worked to make sense of the data collected during two 16-week experiments and three Amazon Mechanical Turk deployments. The complexity of the data has made it challenging to discover insights and patterns. The goal of the semester was to explore newer methods for analyzing the data. Through such discovery, we can test and validate hypotheses about the data, providing a direction for our contextual inquiry to predict attributes and behavior of participants in the study. The report and slides highlight two possible approaches that employ statistical relational learning for structure learning and network classification. Related files include data and software used during this study; results are given from the analyses undertaken.
  • Sentiment and Topic Analysis
    Bartolome, Abigail; Bock, Matthew; Vinayagam, Radha Krishnan; Krishnamurthy, Rahul (Virginia Tech, 2017-05-03)
    The IDEAL (Integrated Digital Event Archiving and Library) and Global Event and Trend Archive Research (GETAR) projects have collected over 1.5 billion tweets and webpages from social media and the World Wide Web and indexed them to be easily retrieved and analyzed. This gives researchers an extensive library of documents that reflect the interests and sentiments of the public in reaction to an event. By applying topic analysis to collections of tweets, researchers can learn the topics of most interest or concern to the general public. Adding a layer of sentiment analysis to those topics illustrates how the public felt in relation to the topics that were found. The Sentiment and Topic Analysis team has designed a system that joins topic analysis and sentiment analysis for researchers who are interested in learning more about public reaction to global events. The tool runs topic analysis on a collection of tweets, and the user can select a topic of interest and assess the sentiments with regard to that topic (i.e., positive vs. negative). This submission covers the background, requirements, design, and implementation of our contributions to this project. Furthermore, we include data, scripts, source code, a user manual, and a developer manual to assist in any future work.
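The topic-then-sentiment pipeline can be sketched as follows, with tweets pre-grouped by topic (the real system derives the groups from topic analysis) and polarity scored against a tiny illustrative lexicon; both the lexicon and the tweets are made-up examples.

```python
# Sketch of scoring per-topic sentiment: for each topic, count positive
# and negative lexicon hits across its tweets and report the overall
# polarity. Lexicon and tweets are illustrative, not the project's data.

POSITIVE = {"relief", "hope", "recover", "support"}
NEGATIVE = {"damage", "fear", "loss", "tragic"}

def topic_sentiment(tweets_by_topic):
    scores = {}
    for topic, tweets in tweets_by_topic.items():
        score = 0
        for tweet in tweets:
            words = set(tweet.lower().split())
            score += len(words & POSITIVE) - len(words & NEGATIVE)
        scores[topic] = ("positive" if score > 0
                         else "negative" if score < 0 else "neutral")
    return scores

print(topic_sentiment({
    "storm": ["tragic loss and damage", "fear in the streets"],
    "rebuild": ["hope and support", "communities recover"],
}))
```

A production system would use a trained sentiment model rather than a fixed lexicon, but the per-topic aggregation is the same shape.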
  • ETDseer Concept Paper
    Ma, Yufeng; Jiang, Tingting; Shrestha, Chandani (Virginia Tech, 2017-05-03)
    ETDSeer (electronic thesis and dissertation digital library connected with SeerSuite) will build on 15 years of collaboration between teams at Virginia Tech (VT) and Penn State University (PSU), since both have been leaders in the worldwide digital library (DL) community. VT helped launch the national and international efforts for ETDs more than 20 years ago, which have been led by the Networked Digital Library of Theses and Dissertations (NDLTD, directed by PI Fox); its Union Catalog has increased to 5 million records. PSU hosts CiteSeerX, which co-PI Giles launched almost 20 years ago, and which is connected with a wide variety of research results under the SeerSuite family. ETDs, typically in PDF, are a largely untapped international resource. Digital libraries with advanced services can effectively address the broad needs to discover and utilize ETDs of interest. Our research will leverage SeerSuite methods that have been applied mostly to short documents, plus a variety of exploratory studies at VT, and will yield a “web of graduate research”, rich knowledge bases, and a digital library with effective interfaces. References will be analyzed and converted to canonical forms, figures and tables will be recognized and re-represented for flexible searching, small sections (acknowledgments, biographical sketches) will be mined, and aids for researchers will be built especially from literature reviews and discussions of future work. Entity recognition and disambiguation will facilitate flexible use of a large graph of linked open data.
  • CS6604 Spring 2017 Global Events Team Project
    Li, Liuqing; Harb, Islam; Galad, Andrej (Virginia Tech, 2017-05-03)
    This submission describes the work the Global Events team completed in Spring 2017. It includes the final report and presentation, as well as key relevant materials (source code). Building on the previous reports and the different modules created by former teams, the Global Events team established a pipeline for processing Web ARChives, supporting the IDEAL and GETAR projects, both funded by NSF. With the Internet Archive's help, the Global Events team enhanced the Event Focused Crawler to retrieve more relevant webpages (i.e., about school shooting events) in WARC format. ArchiveSpark, an Apache Spark framework that facilitates access to web archives, was deployed on a stand-alone server, and multiple techniques, such as parsing, Stanford NER, regular expressions, and statistical methods, were leveraged to process and analyze the data and describe those events. For data visualization, an integrated user interface built with Gradle was designed and implemented for trend results, which can be easily used by both CS and non-CS researchers and students. Moreover, newly written manuals make it easier for users and developers to become familiar with ArchiveSpark, Spark, and Scala.
  • Epidemiology Network
    Sundar, Naren; Xu, Kui (2014-05-11)
    This project aims to develop an RDF graph building service for Cyber Infrastructure for Network Science (CINET). The purpose of this service is to crawl the web and find digital content related to user requests. More specifically, the content collected should be related to epidemiology. Eventually, the service should deliver an RDF network of digital content that can be stored on CINET for analysis. Simply using a search engine such as Google, or using a web crawler in an undirected way, cannot satisfy the requirements of this problem, due to the lack of organization in the results and the ambiguity of the information. Our service presents to users networks of interconnected digital objects, organized based on their topics. In the results, all digital objects are connected as a network of related content based on the user's request. In addition, objects closer to a topic are more strongly connected within a sub-network. The developed topic modeling approach emulates human behavior when searching for relevant research papers. It automatically crawls the DBLP bibliography website and constructs a network of papers based on a user query.
  • Unsupervised Event Extraction from News and Twitter
    Xuan, Zhang; Wei, Huang; Ji, Wang; Tianyu, Geng (2014-05-11)
    Living in the age of big data, we face massive amounts of information every day, especially from mainstream news and social networks. Due to its gigantic volume, one may get frustrated when trying to identify the key information that really matters. Thus, how to summarize the key information from the enormous amount of news and tweets becomes essential. Addressing this problem, this project explores approaches to extract key events from newswires and Twitter data in an unsupervised manner, applying Topic Modeling and Named Entity Recognition. Various methods have been tried to account for the different traits of news and tweets. The relevance between the news events and the corresponding Twitter events is studied as well. Tools have been developed to implement and evaluate these methods. Our experiments show that these tools can effectively extract key events from the news and tweet data sets. The tools, documents, and data sets can be used for educational purposes and as part of the IDEAL project of Virginia Tech.
  • IDEAL Pages
    Farghally, Mohammed; Elbery, Ahmed (2014-05-10)
    The main goal of this project is to provide a convenient Web-enabled interface to a large collection of event-related webpages, supporting the two main services of browsing and searching. We first studied the events and decided what fields are required to build the event index, based on the dataset available to us. We then configured a SolrCloud collection based on these fields in the Schema.xml file. Next, we built a Hadoop MapReduce job that works with SolrCloud to index documents related to about 60 events crawled from the Web. We then interfaced with the Solr server and the indexed documents through a PHP server application. Finally, we designed a convenient user interface that allows users to browse the documents by event category and event name, as well as to search the document collection for particular keywords.
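The browse and search services ultimately translate into HTTP requests against Solr's select handler. The sketch below shows that mapping, with assumed field names (event_category, text) standing in for the project's actual Schema.xml fields.

```python
from urllib.parse import urlencode

# Build a Solr select-handler URL for a keyword search, optionally
# filtered by event category and faceted on category for browsing.
# The field names here are assumed for illustration.

def solr_select_url(base, keywords=None, category=None, rows=10):
    params = {"q": keywords or "*:*", "wt": "json", "rows": rows,
              "facet": "true", "facet.field": "event_category"}
    if category:
        params["fq"] = f"event_category:{category}"
    return f"{base}/select?{urlencode(params)}"

url = solr_select_url("http://localhost:8983/solr/events",
                      keywords="text:earthquake", category="disaster")
print(url)
```

The PHP application in the project would issue an equivalent HTTP request and render the JSON response as browse and search results.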
  • Twitter Metadata
    Shuffett, Michael (2014-05-10)
    A number of projects and research efforts work with collections of tweets. Of particular interest are collections of tweets related to world events. Many organizations have their own individual tweet collections regarding specific events; however, there is currently no effective support for collaboration. Metadata standards foster collaboration by allowing groups to adhere to a unified format so they can seamlessly interoperate. In part one of the Twitter Metadata project, I define a tweet-level metadata standard that leverages the Twitter API format, as well as a collection-level metadata standard that combines Dublin Core and PROV-O. By combining these two diverse existing standards into an RDF-based specification, the proposed standard is able to capture both the descriptive metadata and the provenance of the collections. In part two of the Twitter Metadata project, I create a tool called TweetID to further foster collaboration with tweet collections. TweetID is a web application that allows its users to upload tweet collections. TweetID extracts, and provides an interface to, the underlying tweet-level and collection-level metadata. Furthermore, TweetID also provides the ability to merge multiple collections, allowing researchers to compare their collections to others', as well as potentially augment their event collections for higher recall.
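A minimal sketch of the combined collection-level standard: Dublin Core terms for descriptive fields and PROV-O for provenance, serialized by hand as N-Triples. The example URIs and the particular property choices are illustrative, not the full specification defined in the project.

```python
# Emit collection-level metadata as N-Triples: Dublin Core terms for
# description, PROV-O for provenance. Subject and object URIs below are
# hypothetical examples.

DC = "http://purl.org/dc/terms/"
PROV = "http://www.w3.org/ns/prov#"

def collection_triples(uri, title, creator, derived_from):
    return [
        f'<{uri}> <{DC}title> "{title}" .',
        f'<{uri}> <{DC}creator> "{creator}" .',
        f'<{uri}> <{PROV}wasDerivedFrom> <{derived_from}> .',
    ]

for t in collection_triples("http://example.org/coll/42",
                            "Hurricane tweets",
                            "Event Archiving Team",
                            "http://example.org/coll/41"):
    print(t)
```

The `wasDerivedFrom` triple is what lets a merged collection (part two of the project) record which source collections it came from.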
  • CINET Registry
    Agashe, Aditya; Hayatnagarkar, Harshal; Joshi, Sarang (2014-05-09)
    Cyber-infrastructure for Network Science (CINET) is a computational and analytic framework for network science research and education. The cyber-infrastructure (CI) part of CINET is responsible for coordinating the interactions between the user interface, digital library, resource manager, data broker, and execution broker components. CINET uses HPC resources to service experiment execution requests and provides many realistic graphs for analysis. Galib, NetworkX, and SNAP are the computational engines that provide the capability to analyze different properties of the graphs. CINET hosts the Granite system and the graph dynamical systems calculator (GDSC) as public-use applications. Datasets used by CINET were previously cataloged in a relational database, and this project migrates them to a new digital-object-based catalog, the 'Registry'. The project uses the Fedora Commons repository for storing digital objects. Project Hydra, a customization of Ruby on Rails, serves as the abstraction layer over the Fedora Commons repository. The Hydra stack provides RESTful web services to interact with Fedora Commons and perform CRUD operations on digital objects. In addition, it manages indices of digital objects using Apache Solr and provides a faceted browsing capability through Project Blacklight. The former implementation, based on the relational model, has limitations in explicitly modelling semantics about relationships. Our current implementation mitigates this problem, as the digital object repository model closely follows the object-oriented paradigm. This helps in modelling inheritance and containment relationships in a more intuitive manner. The CINET Registry also provides rich services such as incentivization, memorization, and utilization for advanced data analytics.
  • Qatar content classification
    Handosa, Mohamed (2014-05-09)
    This report describes a term project for the CS6604 Digital Libraries course (Spring 2014). The project was conducted under the supervision of Prof. Edward Fox and Mr. Tarek Kanan. The goal is to develop an Arabic newspaper article classifier. We have built a collection of 700 Arabic newspaper articles and 1700 Arabic full-newspaper PDF files. A stemmer, named “P-Stemmer”, is proposed. Evaluation has shown that P-Stemmer outperforms Larkey’s widely used light stemmer. Several classification techniques were tested on the Arabic data, including SVM, Naïve Bayes, and Random Forest. We built and tested 21 multiclass classifiers, 15 binary classifiers, and 5 compound classifiers using the voting technique. Finally, we uploaded the classified instances to Apache Solr for indexing and searching purposes.
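P-Stemmer itself is not specified here, but it belongs to the family of Arabic light stemmers, which strip common prefixes and suffixes rather than reducing words to their roots. A sketch of that general technique, with a small illustrative affix list (not P-Stemmer's actual rules):

```python
# Light stemming for Arabic, sketched: remove at most one common prefix
# and one common suffix, keeping a minimum stem length. The affix lists
# are a small illustrative subset.

PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ها", "ات", "ون", "ين", "ية", "ه", "ة"]

def light_stem(word, min_len=3):
    for p in PREFIXES:          # longest prefixes listed first
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والكتاب"))  # strips the prefix "وال", leaving "كتاب"
```

Listing longer affixes before their sub-strings (e.g., "وال" before "و") matters, since only the first matching affix on each side is removed.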
  • Ensemble Classification Project
    Alabdulhadi, Mohammed H.; Kannan, Vijayasarathy; Soundarapandian, Manikandan; Hamid, Tania (2014-05-08)
    Transfer learning, unlike traditional machine learning, is a technique that allows the domains, tasks, and distributions used in training and testing to be different. Knowledge gained from one domain can be utilized to learn a completely different domain. The Ensemble computing portal is a digital library that contains resources, communities, and technologies to aid in teaching. The major objective of this project is to apply the learning gained from the ACM Computing Classification System to classify educational YouTube videos so that they can be included in the Ensemble computing portal. Metadata of technical papers published by the ACM are indexed in a Solr server, and we issue REST calls to retrieve the required metadata, viz. title, abstract, and general terms, which we use to build the features. We make use of the 2012 ACM Computing Classification System's hierarchy to train our classifiers. We build classifiers for the level-2 and level-3 categories in the classification tree to help classify the educational YouTube videos. We utilize the YouTube Data API to search for educational videos on YouTube and retrieve their metadata, including the title, description, and transcripts of the videos. These become the features of our test set. We specifically search for YouTube playlists that contain educational videos, as we found from experience that neither a regular video search nor a search for videos in channels retrieves relevant educational videos. We evaluate our classifiers using 10-fold cross-validation and present their accuracy in this report. With the classifiers built and trained using ACM metadata, we provide them the metadata that we collected from YouTube as test data and manually evaluate the predictions. The results of our manual evaluation and the accuracy of our classifiers are also discussed. We identified that the ACM Computing Classification System's hierarchy is sometimes ambiguous and that YouTube metadata are not always reliable. These are the major factors that contribute to the reduced accuracy of our classifiers. In the future, we hope sophisticated natural language processing techniques can be applied to refine the features of both the training and target data, which would help improve performance. We believe that more relevant metadata from YouTube, in the form of transcripts and embedded text, can be collected using sophisticated voice-to-text conversion and image retrieval algorithms, respectively. This idea of transfer learning can also be extended to classify the presentation slides available on SlideShare (http://www.slideshare.net) and certain educational blogs.
  • Knowledge Building and Sharing: A Metamodel for Guided Research, Learning, and Application
    Zeitz, Kimberly; Frisina, Chris (2014-05-07)
    Specific field methodology and models cannot be an afterthought when designing, developing, or administering any kind of technology or system. However, the mass amount of techniques and options can be both overwhelming and confusing leading to the selection of incorrect or insufficient techniques. For an example in the security field, choosing an inadequate methodology can have harmful repercussions including everything from cyber-attacks to illegal data access and retrieval of private information. The solution is a metamodel that combines the most recent techniques and options categorized by common fields and concerns and presented to allow for a user to weigh the benefits, negatives, and particular circumstances needed to meet the unique needs of the user's system or environment. This metamodel would be of particular use for teaching and the sharing of knowledge. Contrary to some models which only present a high level overview, MOSAIC, is our example section of such a metamodel that will guide the user through the learning of and selection of analysis techniques and new security mechanisms. We provide the background and format for such a metamodel, our process for the selection of the security areas we focused on, and the example proof of concept, MOSAIC, Model of Securing Application Information Confidentiality.