Reports, Digital Library Research Laboratory

Recent Submissions

  • Utilizing Docker and Kafka for Highly Scalable Bulk Processing of Electronic Theses and Dissertations (ETDs)
    Dinesh, Dhanush (Virginia Tech, 2023-05-09)
    This report discusses the use of Docker and Kafka for bulk processing of Electronic Thesis and Dissertation (ETD) data. Docker, a containerization platform, was used to create portable, platform-agnostic images that can be deployed on any platform. However, managing a large infrastructure of interconnected Docker containers can be complicated. To address this, Kafka, an open-source distributed message streaming platform, was incorporated into the pipeline to make each service independent and scalable. The report provides a comprehensive discussion of how a pipeline was developed to maximize resource utilization and create a highly scalable infrastructure using Docker and Kafka. Multiple Kafka brokers were deployed to ensure high availability and fault tolerance, and ZooKeeper was used to track the status of Kafka nodes. Rancher, which employs Kubernetes to manage deployments and services, was used to deploy the infrastructure on the cloud. The report also highlights the advantages of the current setup over previous workflow automation in terms of processing time and parallel processing of data. The system design includes a Kafka producer that publishes ETD IDs to be processed, and a segmentation container that acts as a consumer and polls the Kafka broker. Once an ETD ID is received, the container starts processing, and the segmented chapters are stored in a shared Ceph file space. The process continues until all of the ETDs are processed. (A minimal producer/consumer sketch appears after this list.) This integration has the potential to benefit researchers who need large amounts of ETD data processed at a scale that was previously infeasible, enabling them to draw more robust, data-driven conclusions.
  • SOLR supported search on an OpenStack metadata service
    Komawar, Nikhil (Virginia Tech, 2017-08-26)
    In cloud computing, the use of databases, particularly the MySQL database system, is common practice. While MySQL has advantages such as consistency and transaction support, some software architects believe that indexed search systems such as SOLR give better read performance than traditionally deployed database servers. We propose an architecture that leverages the advantages of both systems. To study this, we created a test bed that behaves like a real-world cloud metadata service, and compared the results of searching the metadata using a MySQL database with those of searching the same data using SOLR. We found that indexing the data using SOLR, although expensive in terms of disk space, gives read performance orders of magnitude better than the MySQL database. (A comparative query sketch appears after this list.) These results may encourage cloud operators to try using SOLR to serve users’ search requests, thereby avoiding API timeouts and slowness.
  • ArchiveSpark - MS Independent Study Final Submission
    Galad, Andrej (Virginia Tech, 2016-12-13)
    This project expands upon the work at the Internet Archive of researcher Vinay Goel and of Jefferson Bailey (co-PI on two NSF-funded collaborative projects with Virginia Tech: IDEAL, GETAR) on ArchiveSpark, a framework for efficient Web archive access, extraction, and derivation. The main goal of the project is to evaluate ArchiveSpark quantitatively and qualitatively against mainstream Web archive processing solutions, and to extend it as needed for processing the test collections. This also relates to an IMLS-funded project. This report describes the efforts and contributions made as part of this project. Their primary focus is a comprehensive evaluation of ArchiveSpark against existing archive-processing solutions (pure Apache Spark with pre-installed Warcbase tools, and HBase) in a variety of environments and setups, in order to analyze the performance improvements ArchiveSpark offers and to understand the shortcomings and tradeoffs of its use under varying scenarios.
  • NSF 3rd Year Report: CTRnet: Integrated Digital Library Support for Crisis, Tragedy, and Recovery
    Fox, Edward A.; Shoemaker, Donald J.; Sheetz, Steven D.; Kavanaugh, Andrea L.; Ramakrishnan, Naren (2012-07-01)
    The Crisis, Tragedy and Recovery (CTR) network, or CTRnet, is a human and digital library network for providing a range of services relating to different kinds of tragic events, including broad collaborative studies related to Egypt, Tunisia, Mexico, and Arlington, Virginia. Through this digital library, we collect and archive different types of CTR-related information, and apply advanced information analysis methods to this domain. It is hoped that services provided through CTRnet can help communities as they heal and recover from tragic events. We have taken several major steps toward our goal of building a digital library for CTR events. Different strategies for collecting comprehensive information surrounding various CTR events have been explored, initially using school shooting events as a testbed. Many gigabytes of related data have been collected using the web crawling tools and methodologies we developed. Several different methods for removing non-relevant pages (noise) from the crawled data have been explored. A focused crawler is being developed with the aim of letting users build high-quality collections for CTR events focused on their interests. Use of social media for CTRnet-related research is being explored. Software to integrate the popular social networking site Facebook with the CTRnet digital library has been prototyped and is being developed further. Integration of the popular micro-blogging site Twitter with the CTRnet digital library has proceeded well, and is being further automated, becoming a key part of our methodology.
  • Why Students Use Social Networking Sites After Crisis Situations
    Sheetz, Steven D.; Fox, Edward A.; Fitzgerald, Andrew; Palmer, Sean; Shoemaker, Donald J.; Kavanaugh, Andrea L. (2011)
    Communities respond to tragedy by making virtuous use of social networking sites for a variety of purposes. We asked students to describe why they used a social networking site after the tragic shootings at Virginia Tech, then evaluated their responses using content analysis. Students went predominantly to Facebook (99%). Most (59%) of the 426 students who responded went there because their friends were already there, and to find out if their friends were OK (28%) (and to let them know they were OK). Ideas related to relationships occurred more frequently in the responses than ideas related to the website's features. However, the ease of use of the website was mentioned often (22%). The results suggest this emergent phenomenon will recur.
  • NSF 2nd Year Report: CTRnet: Integrated Digital Library Support for Crisis, Tragedy, and Recovery
    Fox, Edward A.; Shoemaker, Donald J.; Sheetz, Steven D.; Kavanaugh, Andrea L.; Ramakrishnan, Naren (2011-07-01)
    One of the important parts of this project is to collect and archive as much information as possible about various events related to crises, tragedies, and recovery (CTR). For long-term archiving of information, we have worked with the Internet Archive (IA), a non-profit organization whose goal is to archive the Internet. IA provides access to web crawlers that can be used to selectively crawl and archive webpages. In disaster situations, it is well known that people use micro-blogging sites such as Twitter to reach their family and friends, especially when their cell phones are not working due to the high volume of traffic on the cell phone network. For this reason, tweet posts sometimes report CTR events faster than the mainstream news media. Such tweets often contain more detailed information as well, reported by the affected people on site. We have been archiving tweets (i.e., posts from Twitter.com) for both man-made and natural disaster events. Collected tweets can be exported in various formats, including XLS, JSON, and HTML, to be analyzed later using software tools.
  • Integrated Digital Event Archiving and Library (IDEAL): Preview of Award 1319578 - Annual Project Report
    Fox, Edward A.; Hanna, Kristine; Kavanaugh, Andrea L.; Sheetz, Steven D.; Shoemaker, Donald J. (2014-07-09)
    The goals of this project are to ingest tweets and Web-based content from social media and the general Web, including news and governmental information. In addition to archiving materials found, the project team will build an information system that includes related metadata and knowledge bases, consistent with the 5S (Societies, Scenarios, Spaces, Structures, Streams) framework, along with results from our intelligent focused crawler, to support comprehensive access to event-related content. With the support of key partners, the IDEAL team will undertake important research, education, and dissemination efforts to achieve three complementary objectives:
    1. Collecting: The project team will spot, identify, and make sense of interesting events. We also will accept specific or general requests about types of events. Given resource and sampling constraints, we will integrate methods to identify appropriate URLs as seeds, and specify when to start and stop crawling for each event or sub-event. We will integrate focused crawling and filtering approaches in order to ingest content and generate new collections with high precision and recall.
    2. Archiving & Accessing: Permanent archiving, and access to those archives, will be ensured by our partner, the Internet Archive (IA). Immediate access to ingested content will be facilitated through big data software built on top of our new Hadoop cluster.
    3. Analyzing & Visualizing: We will provide a wide range of integrated services beyond the usual (faceted) browsing and searching, including classification, clustering, summarization, text mining, theme and topic identification, and visualization.
  • CTRnet Final Report
    Fox, Edward A.; Shoemaker, Donald J.; Sheetz, Steven D.; Kavanaugh, Andrea L.; Ramakrishnan, Naren (2013-08-26)
    The CTRnet project team has been developing a digital library including many webpage archives and tweet archives related to disasters, in collaboration with the Internet Archive. The goals of the CTRnet project are to provide such archived data sets for analysis, including by researchers who are seeking deep insights about those events, and to support a range of services and infrastructure regarding those tragic events for the various stakeholders and the general public, allowing them to study and learn.
  • Microblogging in Crisis Situations: Mass Protests in Iran, Tunisia, Egypt
    Kavanaugh, Andrea L.; Yang, Seungwon; Li, Lin Tzy; Sheetz, Steven D.; Fox, Edward A. (2011-05-01)
    In this paper we briefly examine the use of Twitter in Iran, Tunisia, and Egypt during the mass political demonstrations and protests in June 2009, December 2010, and January 2011, respectively. We compare this usage with methods and findings from other studies on the use of Twitter in emergency situations, such as natural and man-made disasters. We draw on the first author's own experiences and participant observations as an eyewitness in Iran, and on Twitter data from Tunisia and Egypt. In these three cases, Twitter at least partially filled a unique technology and communication gap. We summarize suggested directions for future research, with a view to placing this work in the larger context of social media use in conditions of crisis or social convergence.
  • Indexing Large Collections of Small Text Records for Ranked Retrieval
    France, Robert K.; Fox, Edward A. (1993)
    The MARIAN online public access catalog system at Virginia Tech has been developed to apply advanced information retrieval methods and object-oriented technology to the needs of library patrons. We describe our data model, design, processing, data representations, and retrieval operation. By identifying objects of interest during the indexing process, storing them according to our "information graph" model, and applying weighting schemes that seem appropriate for this large collection of small text records, we hope to better serve user needs. Since every text word is important in this domain, we employ opportunistic matching algorithms and a mix of data structures to support searching that will give good performance for a large campus community, even though MARIAN runs on a distributed collection of small workstations. An initial small experiment indicates that our new ad hoc weighting scheme is more effective than a more standard approach.
  • MARIAN Design
    France, Robert K.; Cline, Ben E.; Fox, Edward A. (1995-02-14)
    MARIAN (Multiple Access Retrieval of library Information with ANotations) is an online library catalog information system. Intended for library end-users rather than catalogers, it provides controlled search by author, subject entry, and imprint; keyword search by title, subject, and other MARC text fields; feedback, locating the closest books to a relevant book or books; and user annotations of books.
  • The Academy: A Community of Information Retrieval Agents
    France, Robert K. (1994-09-06)
    We commonly picture text as a sequence of words; or alternatively as a sequence of paragraphs, each of which is composed of a sequence of sentences, each of which is itself a sequence of words. It is also worth noting that text is not so much a sequence of words as a sequence of terms, including most commonly words, but also including names, numbers, code sequences, and a variety of other $#*&)&@^ tokens. Just as we commonly simplify text into a sequence of words, so too it is common in information retrieval to regard documents as single texts. Nothing is less common, though, than a document with only a single part, and that unstructured text. Search and retrieval in such a universe involves new questions: Where does a document begin and end? How can we decide how much to show to a user? When does a query need to be matched by a single node in a hypertext, and when may partial matches in several nodes count?
  • Weights and Measures: An Axiomatic Model for Similarity Computations
    France, Robert K. (1994)
    This paper proposes a formal model for similarity functions, first over arbitrary objects, then over sets and the sorts of weighted sets that are found in text retrieval systems. Using a handful of axioms and constraints, we are able to make statements about the behavior of such functions in reference to set overlap and to noise. The model is then used to analyze, and we hope illuminate, several popular text similarity functions. (One such function is written out after this list.)
  • Extending Retrieval with Stepping Stones and Pathways
    Fox, Edward A. (2003-08-01)
    This project researches an alternative interpretation of user queries and presentation of the results. Instead of returning a ranked list of documents, the result of a query is a connected network of chains of evidence. Each chain is made of a sequence of additional concepts (stepping stones). Each concept in the sequence is logically connected to the next and previous one, and the chains provide a rationale (a pathway) for the connection between the two original concepts. To increase the user's understanding of the chain, it is desirable that the stepping stones be justified by concrete documents, along with the connections (relationships) among those documents.
  • Building the CODER Lexicon: The Collins English Dictionary and its Adverb Definitions
    Fox, Edward A.; Wohlwend, Robert C.; Sheldon, Phyllis R.; Chen, Qi-Fan; France, Robert K. (1986-10-01)
    The CODER (COmposite Document Expert/extended/effective Retrieval) project is an investigation of the applicability of artificial intelligence techniques to the information retrieval task of analyzing, storing, and retrieving heterogeneous collections of "composite documents." In order to support some of the processing desired, and to allow experimentation in information retrieval and natural language processing, a lexicon was constructed from the machine-readable Collins Dictionary of the English Language. After giving background, motivation, and a survey of related work, the Collins lexicon is discussed. Following is a description of the conversion process, the format of the resulting Prolog database, and characteristics of the dictionary and relations. To illustrate what is present and to explain how it relates to the files produced from Webster's Seventh New Collegiate Dictionary, a number of comparative charts are given. Finally, a grammar for adverb definitions is presented, together with a description of defining formulas that usually indicate the type of the adverb. Ultimately it is hoped that definitions for adverbs and other words will be parsed so that the relational lexicon being constructed will include many additional relationships and other knowledge about words and their usage.
  • Information Interactions: User Interface Objects for CODER, INCARD, and MARIAN, v. 2.5
    France, Robert K. (1992-08-24)
    Any information system needs a user interface: a program or program module that eases the communication between the system's users and the underlying search and storage software. This document describes (part of) the specifications for the user interface to a family of information systems current at Virginia Tech: the experimental platform CODER, a specialized version of CODER dealing with medical information called INCARD for INformation about CARDiology, and a library catalog system named MARIAN.
  • When Stopping Rules Don't Stop
    France, Robert K. (1995)
    Performing ranked retrieval on large document collections can be slow. The method of stopping rules has been proposed to make it more efficient. Stopping rules, which terminate search when the highest-ranked documents have been determined to some degree of likelihood, are attractive and have proven useful in clustering, but have not worked well in retrieval experiments. This paper presents a statistical analysis of why they have failed and where they can be expected to continue failing. (An illustrative stopping rule is sketched after this list.)
  • Open Archives: Distributed Services for Physicists and Graduate Students (OAD)
    Fox, Edward A.; Stamerjohanns, Heinrich; Hilf, Eberhard R.; Mittler, Elmar; Zia, Royce K. P. (2001)
    This 2001-2002 report evaluates the research done to improve distributed digital library services for two user communities: physicists and graduate students.
  • Multiple Metadata / Best Metadata Return
    Suleman, Hussein; Nelson, Michael (2001-10-19)
    The OAI protocol currently supports a simple mapping of metadata names to metadata formats, whereby a metadata record can be requested for exactly one item in exactly one format in a single GetRecord request. In the case of ListRecords, all records within a set and/or date range may be requested, but there is still the restriction of a single metadata format. This is usually sufficient for simple harvesting, where the intention is to transfer a stream of metadata records from the source archive to a service provider. However, in some cases it may be desirable to obtain the most complete metadata format, or a set of metadata formats, for an identifier. Accomplishing this currently requires submitting multiple requests with different parameters, which is inefficient. (The extra round trips are sketched after this list.)
  • Set Orthogonality
    Suleman, Hussein; Zubair, Mohammad (2001-10-19)
    There is no way to determine all the sets that an identifier belongs to. This is typically referred to as set orthogonality, because the protocol allows a harvester to find out which identifiers belong to a particular set but not vice versa. This is not as much of a problem for a flat space of archives, but organizations like NDLTD and NCSTRL have already started to create hierarchical catalogs based on OAI, and existing set information is lost at the very first level. Also, the Internet2 Distributed Storage Initiative wants to work on replication of OAs (open archives); this will mean harvesting every set and dealing with duplicates. Can we do this in a way that is more efficient, without adding to the complexity? (A membership-inversion sketch appears after this list.)
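
Illustrative Sketches

The Docker/Kafka ETD report above describes a producer that publishes ETD IDs and segmentation containers that consume them from the broker. Below is a minimal sketch of that flow using the kafka-python client; the broker address, topic name, and the segment() stub are illustrative assumptions, not details taken from the report.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"  # assumed broker address
TOPIC = "etd-ids"          # assumed topic name

def segment(etd_id: str) -> list[str]:
    """Stand-in for the segmentation container's real logic."""
    return [f"{etd_id}-chapter-{i}" for i in range(3)]

def produce_etd_ids(etd_ids: list[str]) -> None:
    """Publish each ETD ID for the segmentation workers to pick up."""
    producer = KafkaProducer(bootstrap_servers=BROKER)
    for etd_id in etd_ids:
        producer.send(TOPIC, value=etd_id.encode("utf-8"))
    producer.flush()  # ensure every ID reaches the broker before returning

def consume_and_segment() -> None:
    """Poll the broker and process each ETD ID as it arrives."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        group_id="segmentation",      # scale out by adding consumers to this group
        auto_offset_reset="earliest",
    )
    for message in consumer:
        etd_id = message.value.decode("utf-8")
        chapters = segment(etd_id)
        # The report stores segmented chapters in a shared Ceph file space;
        # this sketch just reports them.
        print(etd_id, chapters)
```

Because all consumers share one group id, Kafka assigns each topic partition to at most one of them, which is what makes adding segmentation containers a safe way to scale the pipeline.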
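
The SOLR-vs-MySQL report compares two ways of reading cloud metadata. A minimal sketch of the two read paths follows; the core name and table layout are assumptions for illustration, not the report's test bed.

```python
import requests  # query a running Solr instance over HTTP

SOLR_SELECT = "http://localhost:8983/solr/metadata/select"  # assumed core name

def solr_search(field: str, value: str) -> list[dict]:
    """Indexed lookup: Solr answers from its inverted index."""
    params = {"q": f"{field}:{value}", "wt": "json"}
    response = requests.get(SOLR_SELECT, params=params).json()
    return response["response"]["docs"]

# For contrast, the relational read path issues a query like this against
# a (hypothetical) key/value metadata table, typically hitting a B-tree
# index on one column or scanning rows:
MYSQL_EQUIVALENT = (
    "SELECT * FROM instance_metadata WHERE `key` = %s AND value = %s;"
)
```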
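
The "Weights and Measures" abstract analyzes text similarity functions over weighted sets. As a concrete member of that family (not the paper's own axioms, which are not reproduced here), the cosine measure over weighted term sets can be written:

```latex
% Cosine similarity between a document d and a query q represented as
% weighted term sets, with w_{d,t} the weight of term t in d.
\[
  \operatorname{sim}(d, q) =
  \frac{\sum_{t} w_{d,t}\, w_{q,t}}
       {\sqrt{\sum_{t} w_{d,t}^{2}}\,\sqrt{\sum_{t} w_{q,t}^{2}}}
\]
% Overlap behavior: sim = 1 when the weight vectors are parallel, and
% sim = 0 when the sets share no terms (every cross term vanishes).
```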
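
The stopping-rules paper concerns rules that end a ranked search once the top documents are (probably) determined. As a sketch of the general idea, though not of the specific rules the paper analyzes, here is a deterministic max-score-style rule: stop scoring once no document outside the current top k can still catch up.

```python
from collections import defaultdict

def top_k_with_stopping(term_postings, k):
    """term_postings: (term_weight, {doc_id: doc_weight}) pairs sorted by
    descending term_weight, with doc weights assumed in [0, 1].
    Returns the top-k doc ids, stopping early once that set is settled."""
    scores = defaultdict(float)
    for i, (term_weight, postings) in enumerate(term_postings):
        for doc, w in postings.items():
            scores[doc] += term_weight * w
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) >= k:
            kth = ranked[k - 1]
            runner_up = ranked[k] if len(ranked) > k else 0.0
            # No document can gain more than the sum of unprocessed term weights.
            remaining = sum(tw for tw, _ in term_postings[i + 1:])
            if kth > runner_up + remaining:
                break  # nothing outside the top k can catch up
    return sorted(scores, key=scores.get, reverse=True)[:k]

postings = [
    (0.9, {"d1": 1.0, "d2": 0.4}),
    (0.5, {"d2": 0.8, "d3": 0.6}),
    (0.1, {"d4": 1.0}),
]
print(top_k_with_stopping(postings, k=2))  # ['d1', 'd2'], skipping the last term
```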
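
The "Multiple Metadata" note observes that harvesting every available format for one item takes several OAI requests. A sketch of those round trips follows, using OAI-PMH 2.0 request conventions for concreteness; the repository URL is hypothetical, and resumption/error handling is omitted.

```python
import requests
import xml.etree.ElementTree as ET

BASE = "http://example.org/oai"  # hypothetical OAI repository
NS = "{http://www.openarchives.org/OAI/2.0/}"

def formats_for(identifier: str) -> list[str]:
    """One round trip just to learn which formats the item supports."""
    r = requests.get(BASE, params={"verb": "ListMetadataFormats",
                                   "identifier": identifier})
    return [el.text for el in ET.fromstring(r.content).iter(NS + "metadataPrefix")]

def all_formats(identifier: str) -> dict[str, bytes]:
    """Then one GetRecord per format: n + 1 requests in total, where a
    multi-format GetRecord would need only one."""
    return {fmt: requests.get(BASE, params={"verb": "GetRecord",
                                            "identifier": identifier,
                                            "metadataPrefix": fmt}).content
            for fmt in formats_for(identifier)}
```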
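
The "Set Orthogonality" note points out that the protocol maps sets to identifiers but not identifiers to sets. The only general workaround is to harvest every set and invert the mapping, as sketched below; the repository URL is hypothetical, and resumption tokens are omitted for brevity.

```python
import requests
import xml.etree.ElementTree as ET
from collections import defaultdict

BASE = "http://example.org/oai"  # hypothetical OAI repository
NS = "{http://www.openarchives.org/OAI/2.0/}"

def identifiers_in(setspec: str) -> list[str]:
    """One ListIdentifiers request per set (first page only, for brevity)."""
    r = requests.get(BASE, params={"verb": "ListIdentifiers",
                                   "set": setspec,
                                   "metadataPrefix": "oai_dc"})
    return [el.text for el in ET.fromstring(r.content).iter(NS + "identifier")]

def sets_by_identifier(all_sets: list[str]) -> dict[str, list[str]]:
    """Invert set membership the hard way: harvest every set, then group
    by identifier, duplicates and all."""
    membership = defaultdict(list)
    for setspec in all_sets:
        for ident in identifiers_in(setspec):
            membership[ident].append(setspec)
    return dict(membership)
```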