Reports, Digital Library Research Laboratory

Permanent URI for this collection

https://hdl.handle.net/10919/18734

The Academy: A Community of Information Retrieval Agents
France, Robert K. (1994-09-06)
We commonly picture text as a sequence of words; or alternatively as a sequence of paragraphs, each of which is composed of a sequence of sentences, each of which is itself a sequence of words. It is also worth noting that text is not so much a sequence of words as a sequence of terms, including most commonly words, but also including names, numbers, code sequences, and a variety of other $#*&)&@^ tokens. Just as we commonly simplify text into a sequence of words, so too it is common in information retrieval to regard documents as single texts. Nothing is less common, though, than a document with only a single part, and that unstructured text. Search and retrieval in such a universe involves new questions: Where does a document begin and end? How can we decide how much to show to a user? When does a query need to be matched by a single node in a hypertext, and when may partial matches in several nodes count?
ArchiveSpark - MS Independent Study Final Submission
Galad, Andrej (Virginia Tech, 2016-12-13)
This project expands upon the work at the Internet Archive of researcher Vinay Goel and of Jefferson Bailey (co-PI on two NSF-funded collaborative projects with Virginia Tech: IDEAL, GETAR) on the ArchiveSpark project - a framework for efficient Web archive access, extraction, and derivation. The main goal of the project is to quantitatively and qualitatively evaluate ArchiveSpark against mainstream Web archive processing solutions and extend it as necessary with regard to the processing of testing collections. This also relates to an IMLS-funded project. This report describes the efforts and contributions made as part of this project. The primary focus of these efforts lies in the comprehensive evaluation of ArchiveSpark against existing archive-processing solutions (pure Apache Spark with pre-installed Warcbase tools and HBase) in a variety of environments and setups, in order to comparatively analyze the performance improvements that ArchiveSpark brings and to understand the shortcomings and tradeoffs of its usage under varying scenarios.
Building the CODER Lexicon: The Collins English Dictionary and its Adverb Definitions
Fox, Edward A.; Wohlwend, Robert C.; Sheldon, Phyllis R.; Chen, Qi-Fan; France, Robert K. (1986-10-01)
The CODER (COmposite Document Expert/extended/effective Retrieval) project is an investigation of the applicability of artificial intelligence techniques to the information retrieval task of analyzing, storing, and retrieving heterogeneous collections of "composite documents." In order to support some of the processing desired, and to allow experimentation in information retrieval and natural language processing, a lexicon was constructed from the machine readable Collins Dictionary of the English Language. After giving background, motivation, and a survey of related work, the Collins lexicon is discussed. Following is a description of the conversion process, the format of the resulting Prolog database, and characteristics of the dictionary and relations. To illustrate what is present and to explain how it relates to the files produced from Webster's Seventh New Collegiate Dictionary, a number of comparative charts are given. Finally, a grammar for adverb definitions is presented, together with a description of defining formulae that usually indicate the type of the adverb. Ultimately it is hoped that definitions for adverbs and other words will be parsed so that the relational lexicon being constructed will include many additional relationships and other knowledge about words and their usage.
CTRnet Final Report
Fox, Edward A.; Shoemaker, Donald J.; Sheetz, Steven D.; Kavanaugh, Andrea L.; Ramakrishnan, Naren (2013-08-26)
The CTRnet project team has been developing a digital library including many webpage archives and tweet archives related to disasters, in collaboration with the Internet Archive. The goals of the CTRnet project are to provide such archived data sets for analysis, including by researchers who are seeking deep insights about those events, and to support a range of services and infrastructure regarding those tragic events for the various stakeholders and the general public, allowing them to study and learn.
CTRnet: Project Proposal to NSF
Fox, Edward A.; Shoemaker, Donald J.; Kavanaugh, Andrea L.; Ramakrishnan, Naren (2009)
Crises and tragedies are, regrettably, part of life; a recent sample, showing the small number of collections preserved at the Internet Archive, is shown in Table 1. While always difficult, recovery from tragic events may be increasingly facilitated and supported by information and communication technology (ICT). Individuals, groups, and communities are using ICT in innovative ways to learn from these events and recover more quickly and more effectively. During and after a crisis, individuals and communities face a confusing plethora of data and information, and strive to make sense by way of that data [114]. They seek to carry out their usual activities, but want to be informed by new insights. They work to help others, or to receive help, but the context and technologies involved in communication today (e.g., Internet, WWW, online communities, mobile devices) make it exceedingly difficult to integrate content, community, and services. Accordingly, individuals and communities respond by attempting to meet their needs with the tools they have, e.g., creating a Facebook group to quickly inform members who is OK, and other groups to share pictures, comments, and additional contributions.
A Database Driven Initial Ontology for Crisis, Tragedy, and Recovery
Sheetz, Steven D. (2011-05-01)
Many databases and supporting software have been developed to track the occurrences of natural disasters, manmade disasters, and combinations of the two. Each of the databases developed in this context defines its own representation of a disaster, describing the nature of the disaster and the data elements to be tracked for each type of disaster. The elements selected are not the same for the different databases, yet they are substantively similar. One capability common to many ontology development efforts is to describe data from diverse sources. Thus, we began our ontology development process by identifying several existing databases currently tracking disasters and derived the "ontology in situ" of each database. That is, we identified how the designers of the databases classify the types of disasters in their systems. We then merged these individual ontologies to identify an ontology that includes all of the classifications from the databases. Several aspects of disasters from the databases were highly consistent and therefore fit well together, e.g., the types of natural disasters, while others, e.g., geographic descriptions, were idiosyncratic and did not fit together seamlessly. The resulting ontology consists of 185 elements and has the potential to support data sharing/aggregation across the databases considered.
Extending Retrieval with Stepping Stones and Pathways
Fox, Edward A. (2003-08-01)
This project researches an alternative interpretation of user queries and presentation of the results. Instead of returning a ranked list of documents, the result of a query is a connected network of chains of evidence. Each chain is made of a sequence of additional concepts (stepping stones). Each concept in the sequence is logically connected to the next and previous one, and the chains provide a rationale (a pathway) for the connection between the two original concepts. To increase the user's understanding of the chain, it is desirable that the stepping stones be justified by concrete documents, along with the connections (relationships) among those documents.
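As a rough illustration of the chain idea in the abstract above, the following sketch finds one chain of intermediate concepts between two query concepts using breadth-first search over a toy concept graph. The graph, the concept names, and the function `find_pathway` are hypothetical and chosen only for illustration; the abstract does not say how the project actually builds or ranks its chains.

```python
from collections import deque


def find_pathway(graph, start, goal):
    """Return one shortest chain of concepts from start to goal, or None.

    graph maps each concept to the concepts it is directly connected to.
    Plain breadth-first search is an assumption made for this sketch,
    not the method described in the report.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical concept graph: two query concepts linked by two stepping stones.
concept_graph = {
    "concept_a": ["stone_1", "stone_3"],
    "stone_1": ["stone_2"],
    "stone_2": ["concept_b"],
    "stone_3": [],
}

pathway = find_pathway(concept_graph, "concept_a", "concept_b")
print(pathway)
```

In this sketch the intermediate entries of the returned list play the role of stepping stones; justifying each link with concrete documents, as the abstract proposes, is outside its scope.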
High Performance Interoperable Digital Libraries in the Open Archives Initiative
Fox, Edward A.; Sanchez, J. Alfredo; Garza-Salazar, David (2004-01-31)
The scope of this project is high performance mechanisms for interoperable distributed digital repositories. We apply Open Archives Initiative ideas and concepts to the storage and retrieval of electronic theses and dissertations (ETDs), and work to make these more available to students by means of visualization tools.
Indexing Large Collections of Small Text Records for Ranked Retrieval
France, Robert K.; Fox, Edward A. (1993)
The MARIAN online public access catalog system at Virginia Tech has been developed to apply advanced information retrieval methods and object-oriented technology to the needs of library patrons. We give a description of our data model, design, processing, data representations, and retrieval operation. By identifying objects of interest during the indexing process, storing them according to our "information graph" model, and applying weighting schemes that seem appropriate for this large collection of small text records, we hope to better serve user needs. Since every text word is important in this domain, we employ opportunistic matching algorithms and a mix of data structures to support searching, that will give good performance for a large campus community, even though MARIAN runs on a distributed collection of small workstations. An initial small experiment indicates that our new ad hoc weighting scheme is more effective than a more standard approach.
Information Interactions: User Interface Objects for CODER, INCARD, and MARIAN, v. 2.5
France, Robert K. (1992-08-24)
Any information system needs a user interface: a program or program module that eases the communication between the system's users and the underlying search and storage software. This document describes (part of) the specifications for the user interface to a family of information systems current at Virginia Tech: the experimental platform CODER, a specialized version of CODER dealing with medical information called INCARD for INformation about CARDiology, and a library catalog system named MARIAN.
Integrated Digital Event Archiving and Library (IDEAL): Preview of Award 1319578 - Annual Project Report
Fox, Edward A.; Hanna, Kristine; Kavanaugh, Andrea L.; Sheetz, Steven D.; Shoemaker, Donald J. (2014-07-09)
The goals of this project are to ingest tweets and Web-based content from social media and the general Web, including news and governmental information. In addition to archiving materials found, the project team will build an information system that includes related metadata and knowledge bases, consistent with the 5S (Societies, Scenarios, Spaces, Structures, Streams) framework, along with results from our intelligent focused crawler, to support comprehensive access to event related content. With the support of key partners, the IDEAL team will undertake important research, education, and dissemination efforts, to achieve three complementary objectives: 1. Collecting: The project team will spot, identify, and make sense of interesting events. We also will accept specific or general requests about types of events. Given resource and sampling constraints, we will integrate methods to identify appropriate URLs as seeds, and specify when to start crawling and when to stop, with regard to each event or sub-event. We will integrate focused crawling and filtering approaches in order to ingest content and generate new collections, with high precision and recall. 2. Archiving & Accessing: Permanent archiving, and access to those archives, will be ensured by our partner, Internet Archive (IA). Immediate access to ingested content will be facilitated through big data software built on top of our new Hadoop cluster. 3. Analyzing & Visualizing: We will provide a wide range of integrated services beyond the usual (faceted) browsing and searching, including: classification, clustering, summarization, text mining, theme and topic identification, and visualization.
MARIAN Design
France, Robert K.; Cline, Ben E.; Fox, Edward A. (1995-02-14)
MARIAN (Multiple Access Retrieval of library Information with ANotations) is an online library catalog information system. Intended for library end-users rather than catalogers, it provides controlled search by author, subject entry, and imprint; keyword search by title, subject, and other MARC text fields; feedback, locating the closest books to a relevant book or books; and user annotations of books.
Microblogging in Crisis Situations: Mass Protests in Iran, Tunisia, Egypt
Kavanaugh, Andrea L.; Yang, Seungwon; Li, Lin Tzy; Sheetz, Steven D.; Fox, Edward A. (2011-05-01)
In this paper we briefly examine the use of Twitter in Iran, Tunisia and Egypt during the mass political demonstrations and protests in June 2009, December 2010 and January 2011 respectively. We compare this usage with methods and findings from other studies on the use of Twitter in emergency situations, such as natural and man-made disasters. We draw on our own experiences and participant-observations as an eyewitness in Iran, and on Twitter data from Tunisia and Egypt. In these three cases, Twitter filled a unique technology and communication gap at least partially. We summarize suggested directions for future research with a view of placing this work in the larger context of social media use in conditions of crisis or social convergence.
Multiple Metadata / Best Metadata Return
Suleman, Hussein; Nelson, Michael (2001-10-19)
The OAI protocol currently supports a simple mapping of metadata names to metadata formats, whereby a metadata record can be requested for exactly one record in exactly one format in a single GetRecord request. In the case of ListRecords, all records within a set and/or date range may be requested but there is still the restriction of a single metadata format. This is usually sufficient for simple harvesting with the intention of transferring a stream of metadata records from the source archive to a service provider. However, in some cases, it may be desirable to obtain the most complete metadata format or a set of metadata formats for an identifier. In order to accomplish this, it is currently necessary to submit multiple requests with different parameters, which is not the most efficient approach.
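To make the one-format-per-request restriction concrete, here is a minimal sketch that builds one GetRecord request URL per metadata format for a single identifier. The base URL, the identifier, and the second prefix `example_rich` are hypothetical placeholders; `verb`, `identifier`, and `metadataPrefix` are the request arguments the OAI protocol defines for GetRecord. No network request is made.

```python
from urllib.parse import urlencode


def get_record_urls(base_url, identifier, metadata_prefixes):
    """Build one GetRecord URL per metadata format.

    Each request can name only a single metadataPrefix, so asking for
    N formats of the same record means issuing N separate requests.
    """
    return [
        base_url + "?" + urlencode({
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": prefix,
        })
        for prefix in metadata_prefixes
    ]


# Hypothetical archive and record; "example_rich" is a made-up format name.
urls = get_record_urls(
    "http://example.org/oai",
    "oai:example.org:1234",
    ["oai_dc", "example_rich"],
)
for url in urls:
    print(url)
```

The list grows by one request for every additional format, which is the overhead the abstract describes.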
NSF 2nd Year Report: CTRnet: Integrated Digital Library Support for Crisis, Tragedy, and Recovery
Fox, Edward A.; Shoemaker, Donald J.; Sheetz, Steven D.; Kavanaugh, Andrea L.; Ramakrishnan, Naren (2011-07-01)
One of the important parts of this project is to collect and archive as much information as possible about various events that are related to crises, tragedies, and recovery (CTR). In order to do long-term archiving of information, we have worked with the Internet Archive (IA), a non-profit organization whose goal is to archive the Internet. IA provides access to web crawlers that can be used to selectively crawl and archive webpages. In disaster situations, it is well known that people use micro-blogging sites such as Twitter to reach their family and friends, especially when their cell phones are not working due to a high volume of traffic on the cell phone network. For this reason, tweet posts sometimes report CTR events faster than the mainstream news media. Those tweets often contain more detailed information, too, reported by the affected people on the site. We have been archiving tweets (i.e., posts from Twitter.com) for both man-made and natural disaster events. Collected tweets can be exported in various formats, including XSL, JSON, and HTML, to be analyzed later using software tools.
NSF 3rd Year Report: CTRnet: Integrated Digital Library Support for Crisis, Tragedy, and Recovery
Fox, Edward A.; Shoemaker, Donald J.; Sheetz, Steven D.; Kavanaugh, Andrea L.; Ramakrishnan, Naren (2012-07-01)
The Crisis, Tragedy and Recovery (CTR) network, or CTRnet, is a human and digital library network for providing a range of services relating to different kinds of tragic events, including broad collaborative studies related to Egypt, Tunisia, Mexico, and Arlington, Virginia. Through this digital library, we collect and archive different types of CTR related information, and apply advanced information analysis methods to this domain. It is hoped that services provided through CTRnet can help communities, as they heal and recover from tragic events. We have taken several major steps towards our goal of building a digital library for CTR events. Different strategies for collecting comprehensive information surrounding various CTR events have been explored, initially using school shooting events as a testbed. Many GBs worth of related data has been collected using the web crawling tools and methodologies we developed. Several different methods for removing non-relevant pages (noise) from the crawled data have been explored. A focused crawler is being developed with the aim of providing users the ability to build high quality collections for CTR events focused on their interests. Use of social media for CTRnet related research is being explored. Software to integrate the popular social networking site Facebook with the CTRnet digital library has been prototyped, and is being developed further. Integration of the popular micro-blogging site Twitter with the CTRnet digital library has proceeded well, and is being further automated, becoming a key part of our methodology.
NSF Year 1 Report for CTRnet: Integrated Digital Library Support for Crisis, Tragedy, and Recovery
Fox, Edward A.; Shoemaker, Donald J.; Sheetz, Steven D.; Kavanaugh, Andrea L.; Ramakrishnan, Naren (2010-07-08)
The Crisis, Tragedy and Recovery network, or CTRnet, is a human and digital library network for providing a range of services relating to different kinds of tragic events. Through this digital library, we will collect and archive different types of CTR related information, and apply advanced information analysis methods to this domain. It is hoped that services provided through CTRnet can help communities, as they heal and recover from tragic events. We have taken several major steps towards our goal of building a digital library for CTR events. Different strategies for collecting comprehensive information surrounding various CTR events have been explored, using school shooting events as a testbed. Several GBs worth of school shootings related data has been collected using the web crawling tools and methodologies we developed. Several different methods for removing non-relevant pages (noise) from the crawled data have been explored. A focused crawler is being developed with the aim of providing users the ability to build high quality collections for CTR events focused on their interests. Use of social media for CTRnet related research is being explored. Software to integrate the popular social networking site Facebook with the CTRnet digital library has been prototyped, and is being developed further. Integration of the popular micro-blogging site Twitter with the CTRnet digital library is being explored.
Open Archives: Distributed Services for Physicists and Graduate Students (OAD)
Fox, Edward A.; Stamerjohanns, Heinrich; Hilf, Eberhard R.; Mittler, Elmar; Zia, Royce K. P. (2001)
This 2001-2002 report evaluates the research done to improve distributed digital library services for two user communities: physicists and graduate students.
Open Archives: Distributed Services for Physicists and Graduate Students (OAD)
Fox, Edward A.; Stamerjohanns, Heinrich; Hilf, Eberhard R.; Mittler, Elmar; Zia, Royce K. P. (2002)
The objective of this project - "Open Archives: Distributed Services for Physicists and Graduate Students (OAD)" - is to improve the quality of resources and distributed digital library services, aimed at two communities: physicists and graduate students. The approach is to apply Open Archives Initiative (OAI) ideas and concepts to the physics community and the Networked Digital Library of Theses and Dissertations (NDLTD).
Open Archives: Distributed Services for Physicists and Graduate Students (OAD)
Fox, Edward A.; Stamerjohanns, Heinrich; Hilf, Eberhard R.; Mittler, Elmar; Zia, Royce K. P. (2003)
This 2003 report evaluates the research done to improve distributed digital library services for two user communities: physicists and graduate students.