Browsing by Author "Choudhury, Ananya"
Now showing 1 - 2 of 2
- Solr Team Project Report. Gruss, Richard; Choudhury, Ananya; Komawar, Nikhil (2015-05-13). The Integrated Digital Event Archive and Library (IDEAL) is a digital library project that aims to collect, index, archive, and provide access to digital content related to important events, including disasters, man-made or natural. It extracts event data mostly from social media sites such as Twitter and crawls related web pages. However, the volume of information currently on the web about any event is enormous and highly noisy, making it extremely difficult to find specific information. The objective of this course is to build a state-of-the-art information retrieval system in support of the IDEAL project. The class was divided into eight teams, each assigned a part of the project that, when successfully implemented, will enhance the IDEAL project’s functionality. The final product, which will be the culmination of these eight teams’ efforts, is a fast and efficient search engine for events occurring around the world. This report describes the work completed by the Solr team as a contribution towards searching and retrieving the tweets and web pages archived by IDEAL. If we visualize the class project as a tree structure, then Solr is the root of the tree, which builds on all the other teams’ efforts. Hence we actively interacted with all the other teams to come up with a generic schema for the documents and their corresponding metadata to be indexed by Solr. As Solr interacts with HDFS via HBase, where the data is stored, we also defined an HBase schema and configured the Lily Indexer to set up fast communication between HBase and Solr. We batch-indexed 8.5 million of the 84 million tweets before encountering memory limitations on the single-node Solr installation. Focusing our efforts therefore on building a search experience around the small collections, we created a 3.4-million-tweet collection and a 12,000-webpage collection. 
Our custom search, which leverages the differential field weights in Solr’s edismax Query Parser and two custom Query Components, achieved precision levels in excess of 90%.
- WiSDM: a platform for crowd-sourced data acquisition, analytics, and synthetic data generation. Choudhury, Ananya (Virginia Tech, 2016-08-15). Human behavior is a key factor influencing the spread of infectious diseases. Individuals adapt their daily routine and typical behavior during the course of an epidemic -- the adaptation is based on their perception of the risk of contracting the disease and its impact. As a result, it is desirable to collect behavioral data before and during a disease outbreak. Such data can help in creating better computer models that can, in turn, be used by epidemiologists and policy makers to better plan for and respond to infectious disease outbreaks. However, traditional data collection methods are not well suited to acquiring information about human behavior, especially as it pertains to epidemic planning and response. Internet-based methods are an attractive complementary mechanism for collecting behavioral information. Systems such as Amazon Mechanical Turk (MTurk) and online survey tools provide simple ways to collect such information. This thesis explores new methods for information acquisition, especially of behavioral information, that leverage this recent technology. Here, we present the design and implementation of a crowd-sourced surveillance data acquisition system -- WiSDM. WiSDM is a web-based application and can be used by anyone with access to the Internet and a browser. Furthermore, it is designed to leverage online survey tools and MTurk; WiSDM can be embedded within MTurk in an iFrame. 
WiSDM has a number of novel features, including (i) the ability to support a model-based abductive reasoning loop: a flexible and adaptive information acquisition scheme driven by causal models of epidemic processes, (ii) question routing: an important feature to increase data acquisition efficacy and reduce survey fatigue, and (iii) integrated surveys: interactive surveys to provide additional information on the survey topic and improve user motivation. We evaluate the framework's performance using Apache JMeter and present our results. We also discuss three other extensions of WiSDM: API Adapter, Synthetic Data Generator, and WiSDM Analytics. The API Adapter is an ETL extension of WiSDM that enables extracting data from disparate data sources and loading it into the WiSDM database. The Synthetic Data Generator allows epidemiologists to build synthetic survey data using NDSSL's Synthetic Population as agents. WiSDM Analytics empowers users to perform analysis on the data by writing simple Python code using Versa APIs. We also propose a data model that is conducive to survey data analysis.