CS6604: Digital Libraries
Browsing CS6604: Digital Libraries by Content Type "Report"
- ACM Venue Recommendation System
  Kumar, Harinni Kodur; Tyagi, Tanya (Virginia Tech, 2019-12-23)
  A frequent goal of a researcher is to publish his/her work in appropriate conferences and journals. With a large number of venue options in the microdomains of every research discipline, the issue of selecting suitable venues for publishing cannot be underestimated. Further, the venues diversify themselves in the form of workshops, symposiums, and challenges. Several publishers, such as IEEE and Springer, have recognized the need to address this issue and have developed journal recommenders. In this project, the goal is to design and develop a similar recommendation system for the ACM dataset. The conventional approach to building such a recommendation system is to utilize the content features in a dataset through content-based and collaborative approaches and proffer suggestions. An alternative is to view this recommendation problem from a classification perspective. With the recent success of deep learning classifiers and their pervasiveness in several domains, our goal is to recommend conference and journal venues by applying deep learning methodologies to information about the submission, such as its title, keywords, and abstract. The dataset used for the project is the ACM Digital Library metadata, which includes metadata and textual information for research papers and journal articles submitted to various conferences and journals over the past 60 years. Our current system offers recommendations based on 80 binary classifiers. Our results show that, for past submissions, our system recommends the ground-truth venues precisely. In subsequent iterations of the project, we aim to improve the performance of the individual classifiers and thereby offer better recommendations.
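The one-vs-rest setup described above can be sketched in a few lines. The keyword scorers, venue names, and threshold below are illustrative stand-ins for the report's 80 trained deep binary classifiers.

```python
# Hypothetical sketch of a one-vs-rest venue recommender: one binary
# scorer per venue, recommending venues whose score clears a threshold.
# Venue names and keywords are illustrative, not from the ACM dataset.

def make_venue_classifier(venue_keywords):
    """Return a toy binary scorer: fraction of venue keywords found in the text."""
    def score(text):
        words = set(text.lower().split())
        hits = sum(1 for kw in venue_keywords if kw in words)
        return hits / len(venue_keywords)
    return score

# One classifier per venue (the real system trains 80 deep classifiers).
classifiers = {
    "SIGIR": make_venue_classifier(["retrieval", "search", "ranking"]),
    "KDD": make_venue_classifier(["mining", "clustering", "graphs"]),
}

def recommend(title_abstract, threshold=0.3):
    scores = {v: clf(title_abstract) for v, clf in classifiers.items()}
    return sorted((v for v, s in scores.items() if s >= threshold),
                  key=lambda v: -scores[v])

print(recommend("neural ranking models for ad hoc retrieval and search"))
```

A real system would replace each keyword scorer with a trained classifier over title, keyword, and abstract features, but the decision rule is the same.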
- Classification and extraction of information from ETD documents
  Aromando, John; Banerjee, Bipasha; Ingram, William A.; Jude, Palakh; Kahu, Sampanna (Virginia Tech, 2020-01-30)
  In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well adapted to long documents such as electronic theses and dissertations (ETDs). This report describes three areas of study aimed at improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use for multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning.
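The reference-parsing task can be illustrated with a toy regex baseline. The report itself uses deep neural networks for this step; the pattern below is an assumption about a simple "Author. (Year). Title." style and would not cover real bibliographic variety.

```python
import re

# Toy regex baseline for splitting a reference string into fields
# (author, year, title). The actual system described in the report
# uses deep neural networks; this only illustrates the task.
REF = re.compile(r"(?P<author>.+?)\s+\((?P<year>\d{4})\)\.\s+(?P<title>.+?)\.")

def parse_reference(ref):
    """Return a dict of fields, or None if the string does not match."""
    m = REF.match(ref)
    return m.groupdict() if m else None

print(parse_reference("Fox, E. A. (1999). Networked digital libraries."))
```

Neural approaches win precisely because real reference strings vary far more than any single pattern can capture.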
- Cross-Platform Data Collection and Analysis for Online Hate Groups
  Chandaluri, Rohit Kumar; Phadke, Shruti (Virginia Tech, 2019-12-26)
  Hate groups have increasingly used online social media over the last decade. The online audiences of hate groups are exposed to material with a hateful agenda and underlying propaganda. The presence of hate across multiple social media platforms poses an important question for the research community: how do hate groups use different social media platforms differently? As a first step toward answering this question, we propose HateCorp, a cross-platform dataset of online hate group communication. In this project, we first identify various online hate groups and their Twitter, Facebook, and YouTube accounts. Then we retrospectively collect data over six months and present selected linguistic, social engagement, and informational trends. In the future, we aim to expand this dataset in real time and create a publicly accessible hate communication monitoring platform that could be useful to other researchers and to social media policymakers.
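A cross-platform record of the kind HateCorp collects might be organized as below. The schema and field names are hypothetical illustrations, not the project's actual data format.

```python
from dataclasses import dataclass, field

# Illustrative schema for a cross-platform group record: one group
# linked to its per-platform accounts and collected posts.
# Field names are assumptions, not HateCorp's real schema.
@dataclass
class GroupRecord:
    group: str
    accounts: dict = field(default_factory=dict)  # platform -> handle
    posts: list = field(default_factory=list)     # (platform, timestamp, text)

    def posts_by_platform(self, platform):
        """All collected posts from one platform, for per-platform trend analysis."""
        return [p for p in self.posts if p[0] == platform]

rec = GroupRecord("example-group",
                  accounts={"twitter": "@example", "youtube": "ExampleChannel"})
rec.posts.append(("twitter", "2019-06-01", "sample post"))
print(len(rec.posts_by_platform("twitter")))
```

Keying posts by platform is what enables the cross-platform comparison the abstract asks about: the same group's communication can be sliced per platform and per time window.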
- CS6604 Spring 2017 Global Events Team Project
  Li, Liuqing; Harb, Islam; Galad, Andrej (Virginia Tech, 2017-05-03)
  This submission describes the work the Global Events team completed in Spring 2017. It includes the final report and presentation, as well as key relevant materials (source code). Building on the reports and modules created by former teams, the Global Events team established a pipeline for processing Web ARChives in support of the IDEAL and GETAR projects, both funded by NSF. With the Internet Archive's help, the team enhanced the Event Focused Crawler to retrieve more relevant webpages (i.e., about school shooting events) in WARC format. ArchiveSpark, an Apache Spark framework that facilitates access to web archives, was deployed on a stand-alone server, and multiple techniques, such as parsing, Stanford NER, regular expressions, and statistical methods, were leveraged to process and analyze the data and describe those events. For data visualization, an integrated user interface built with Gradle was designed and implemented for trend results, so that it can be easily used by both CS and non-CS researchers and students. Moreover, newly written manuals make it easier for users and developers to become familiar with ArchiveSpark, Spark, and Scala.
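The filter-and-extract stage of such a pipeline can be mimicked in plain Python. The real pipeline runs on ArchiveSpark with Stanford NER over WARC records; this sketch applies only a keyword filter and a simple date regex to (url, text) pairs, and the example records are made up.

```python
import re

# Toy stand-in for the filter-and-extract stage of the team's
# ArchiveSpark pipeline: keep event-relevant archived pages and pull
# out simple patterns (the real pipeline uses Stanford NER on WARCs).
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def process(records, keyword):
    """Filter (url, text) records by keyword and extract ISO dates."""
    out = []
    for url, text in records:
        if keyword in text.lower():
            out.append({"url": url, "dates": DATE.findall(text)})
    return out

records = [("http://a.example", "School shooting reported on 2017-01-15."),
           ("http://b.example", "Weather news.")]
print(process(records, "shooting"))
```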
- ETDSeer Concept Paper
  Ma, Yufeng; Jiang, Tingting; Shrestha, Chandani (Virginia Tech, 2017-05-03)
  ETDSeer (an electronic thesis and dissertation digital library connected with SeerSuite) will build on 15 years of collaboration between teams at Virginia Tech (VT) and Penn State University (PSU), both leaders in the worldwide digital library (DL) community. VT helped launch the national and international efforts for ETDs more than 20 years ago, efforts led by the Networked Digital Library of Theses and Dissertations (NDLTD, directed by PI Fox); its Union Catalog has grown to 5 million records. PSU hosts CiteSeerX, which co-PI Giles launched almost 20 years ago and which is connected with a wide variety of research results under the SeerSuite family. ETDs, typically in PDF, are a largely untapped international resource. Digital libraries with advanced services can effectively address the broad need to discover and utilize ETDs of interest. Our research will leverage SeerSuite methods that have been applied mostly to short documents, plus a variety of exploratory studies at VT, and will yield a "web of graduate research", rich knowledge bases, and a digital library with effective interfaces. References will be analyzed and converted to canonical forms; figures and tables will be recognized and re-represented for flexible searching; small sections (acknowledgments, biographical sketches) will be mined; and aids for researchers will be built, especially from literature reviews and discussions of future work. Entity recognition and disambiguation will facilitate flexible use of a large graph of linked open data.
- Generating Synthetic Healthcare Records Using Convolutional Generative Adversarial Networks
  Torfi, Amirsina; Beyki, Mohammadreza (Virginia Tech, 2019-12-20)
  Deep learning models have demonstrated high-quality performance in several areas such as image classification and speech processing. However, creating a deep learning model from electronic health record (EHR) data requires addressing particular privacy challenges that make this issue unique to researchers in this domain. This concern focuses attention on generating realistic synthetic data to protect privacy. Existing methods for artificial data generation suffer from various limitations, such as being bound to particular use cases. Furthermore, their generalizability to real-world problems is debatable, given the uncertainties in defining and measuring key realistic characteristics. Hence, there is a need to establish insightful metrics to measure the validity of synthetic data, as well as quantitative criteria regarding privacy restrictions. We propose the use of Generative Adversarial Networks to satisfy requirements for realistic characteristics and acceptable values of privacy metrics simultaneously. The present study makes several unique contributions to synthetic data generation in the healthcare domain. First, utilizing 1-D Convolutional Neural Networks (CNNs), we devise a new approach to capturing the correlation between adjacent diagnosis records. Second, we employ convolutional autoencoders to map the discrete-continuous values. Finally, we devise a new approach to measuring the similarity between real and synthetic data, and a means to measure the fidelity of the synthetic data and its associated privacy risks.
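One simple sanity check for synthetic records, dimension-wise comparison of feature means, can be sketched as below. This is a common baseline metric for binary diagnosis-code data, not the similarity measure the authors devise, and the tiny records are invented for illustration.

```python
# Dimension-wise probability check: for binary diagnosis-code records,
# compare the per-feature mean (occurrence rate) of real vs. synthetic
# data. A common baseline sanity metric, not the authors' own measure.
def feature_means(records):
    """Per-dimension mean over a list of equal-length binary records."""
    n = len(records)
    dims = len(records[0])
    return [sum(r[d] for r in records) / n for d in range(dims)]

def mean_abs_diff(real, synthetic):
    """Average absolute gap between real and synthetic feature means."""
    rm, sm = feature_means(real), feature_means(synthetic)
    return sum(abs(a - b) for a, b in zip(rm, sm)) / len(rm)

real = [[1, 0, 1], [1, 1, 0]]   # toy binary diagnosis records
synth = [[1, 0, 1], [1, 0, 0]]
print(mean_abs_diff(real, synth))
```

A small gap says only that marginal code frequencies match; it says nothing about the cross-code correlations the 1-D CNN approach targets, which is why richer metrics are needed.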
- Sentiment and Topic Analysis
  Bartolome, Abigail; Bock, Matthew; Vinayagam, Radha Krishnan; Krishnamurthy, Rahul (Virginia Tech, 2017-05-03)
  The IDEAL (Integrated Digital Event Archiving and Library) and Global Event and Trend Archive Research (GETAR) projects have collected over 1.5 billion tweets and webpages from social media and the World Wide Web and indexed them so they can be easily retrieved and analyzed. This gives researchers an extensive library of documents that reflect the interests and sentiments of the public in reaction to an event. By applying topic analysis to collections of tweets, researchers can learn the topics of most interest or concern to the general public. Adding a layer of sentiment analysis to those topics illustrates how the public felt about the topics that were found. The Sentiment and Topic Analysis team has designed a system that joins topic analysis and sentiment analysis for researchers who are interested in learning more about public reaction to global events. The tool runs topic analysis on a collection of tweets; the user can then select a topic of interest and assess the sentiments with regard to that topic (i.e., positive vs. negative). This submission covers the background, requirements, design, and implementation of our contributions to this project. Furthermore, we include data, scripts, source code, a user manual, and a developer manual to assist in any future work.
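The topic-plus-sentiment workflow can be illustrated with a minimal lexicon-based sketch. The keyword "topics" and the tiny sentiment lexicons below are stand-ins for the project's actual topic modeling and sentiment analysis.

```python
# Minimal sketch of layering sentiment over topics: assign tweets to
# topics by keyword, score each with a tiny lexicon, and average per
# topic. Lexicons and topics are illustrative stand-ins only.
POS = {"great", "love", "happy"}
NEG = {"bad", "sad", "terrible"}

def sentiment(text):
    """Positive-minus-negative lexicon hits; sign gives the polarity."""
    words = set(text.lower().split())
    return len(words & POS) - len(words & NEG)

def topic_sentiment(tweets, topics):
    """Average sentiment per topic over the tweets mentioning it."""
    result = {t: [] for t in topics}
    for tweet in tweets:
        for t in topics:
            if t in tweet.lower():
                result[t].append(sentiment(tweet))
    return {t: sum(v) / len(v) for t, v in result.items() if v}

tweets = ["Great response to the flood relief", "Terrible flood damage"]
print(topic_sentiment(tweets, ["flood"]))
```

In the described system, the topic assignment would come from topic analysis over the tweet collection rather than a keyword match, but the join between the two layers works the same way.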
- Social Communities Knowledge Discovery: Approaches applied to clinical study
  Chandrasekar, Prashant (Virginia Tech, 2017-05)
  In recent efforts by the Social Interactome team to validate the study's hypotheses, we have worked to make sense of the data collected during two 16-week experiments and three Amazon Mechanical Turk deployments. The complexity of the data has made it challenging to discover insights and patterns. The goal of the semester was to explore newer methods to analyze the data. Through such discovery, we can test and validate hypotheses about the data, providing a direction for our contextual inquiry to predict attributes and behavior of participants in the study. The report and slides highlight two possible approaches that employ statistical relational learning for structure learning and network classification. Related files include data and software used during this study; results are given from the analyses undertaken.
- Toward an Intelligent Crawling Scheduler for Archiving News Websites Using Reinforcement Learning
  Wang, Xinyue; Ahuja, Naman; Llorens, Nathaniel; Bansal, Ritesh; Dhar, Siddharth (Virginia Tech, 2019-12-03)
  Web crawling is one of the fundamental activities of many web technology organizations and companies, such as the Internet Archive and Google. While companies like Google often focus on content delivery for users, web archiving organizations such as the Internet Archive pay more attention to the accurate preservation of the web. Crawling accuracy and efficiency are major concerns in this task. An ideal crawling module should be able to keep up with changes to the target web site with minimal crawling frequency, maximizing routine crawling efficiency. In this project, we investigate using information from web archives' history to aid the crawling process, within the scope of news websites. We aim to build a smart crawling module that can accurately predict web content change, at both the web page and web site structure level, using modern machine learning algorithms and deep learning architectures. At the end of the project, we have:
  - collected and processed raw web archive collections from Archive.org and from our frequent crawling jobs;
  - developed methods to extract identical copies of web page content and web site structure from the web archive data;
  - implemented baseline models for predicting web page content change and web site structure change with supervised machine learning algorithms;
  - implemented two different reinforcement learning models for generating a web page crawling plan: a continuous prediction model and a sparse prediction model.
  Our results show that the reinforcement learning model has the potential to work as an intelligent web crawling scheduler.
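The scheduling idea, crawl a page only when it is likely to have changed, can be sketched with a simple running estimate of change probability. This toy agent only illustrates the feedback loop the report's reinforcement learning models exploit; the class, thresholds, and update rule are assumptions, not the project's architecture.

```python
# Toy crawl scheduler: keep a per-page estimate of change probability,
# crawl when the estimate is high, and update the estimate from what
# each crawl observed. An illustration of the feedback loop only.
class CrawlScheduler:
    def __init__(self, alpha=0.5):
        self.alpha = alpha      # learning rate for the running estimate
        self.change_prob = {}   # page -> estimated change probability

    def should_crawl(self, page, threshold=0.5):
        """Crawl unseen pages, and known pages estimated likely to change."""
        return self.change_prob.get(page, 1.0) >= threshold

    def update(self, page, changed):
        """Move the estimate toward the observed outcome (1=changed, 0=not)."""
        old = self.change_prob.get(page, 1.0)
        self.change_prob[page] = old + self.alpha * (changed - old)

sched = CrawlScheduler()
for changed in [0, 0, 0]:       # page observed unchanged three times
    sched.update("news/front", changed)
print(sched.should_crawl("news/front"))  # -> False
```

After three unchanged observations, the estimate decays below the threshold and the scheduler skips the page, which is exactly the efficiency gain an intelligent scheduler targets.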
- Tweet Analysis and Classification: Diabetes and Heartbleed Internet Virus as Use Cases
  Karajeh, Ola; Arachie, Chidubem; Powell, Edward; Hussein, Eslam (Virginia Tech, 2019-12-24)
  The proliferation of data on social media has driven the need for researchers to develop algorithms to filter and process this data into meaningful information. In this project, we consider the task of classifying tweets relative to some topic or event, labeling them as informational or non-informational using the features in the tweets. We focus on two collections from different domains: a diabetes dataset in the health domain and a heartbleed dataset in the security domain. We show the performance of our method in classifying tweets in the different collections. We employ two approaches to generate features for our models: 1) a graph-based feature representation, and 2) a vector space model, e.g., with TF-IDF weighting or a word embedding. The representations generated are fed into different machine learning algorithms (Logistic Regression, Naïve Bayes, and Decision Tree) to perform the classification task. We evaluate these approaches using standard metrics (accuracy, precision, recall, and F1-score) on a held-out test dataset. Our results show that our approach generalizes to tweets across different domains.
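The TF-IDF feature representation mentioned above can be computed from scratch in a few lines (an illustrative sketch using toy documents, not the project's code):

```python
import math

# Bare-bones TF-IDF: term frequency within each document, weighted by
# the log inverse document frequency across the collection. Output is
# one sparse dict per document, usable as classifier features.
def tf_idf(docs):
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = {}                       # document frequency per term
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for toks in tokenized:
        vec = {}
        for t in toks:
            tf = toks.count(t) / len(toks)
            vec[t] = tf * math.log(n / df[t])
        vectors.append(vec)
    return vectors

docs = ["diabetes insulin news", "heartbleed security news"]
vecs = tf_idf(docs)
print(round(vecs[0]["diabetes"], 3))
```

Note how "news", appearing in both toy documents, gets weight zero: TF-IDF downweights terms shared across the collection, which is what makes it useful for separating informational from non-informational tweets.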