Browsing by Author "Gupta, Sandeep"
Now showing 1 - 7 of 7
- Efficient Spatio-Temporal Network Analytics in Epidemiological Studies using Distributed Databases
  Khan, Mohammed Saquib Akmal (Virginia Tech, 2015-01-26)
  Real-time spatio-temporal analytics has become an integral part of epidemiological studies. The size of spatio-temporal data has been increasing tremendously over the years, gradually evolving into Big Data. Processing in such domains is highly data- and compute-intensive, and high-performance computing resources are actively being used to handle such workloads over massive datasets. This confluence of high-performance computing and datasets with Big Data characteristics poses great challenges for data handling and processing: the resource management of supercomputers is in conflict with the data-intensive nature of spatio-temporal analytics, which is further exacerbated by the fact that data management is decoupled from the computing resources. Problems of this nature have provided great opportunities for the growth and development of tools and concepts centered around MapReduce-based solutions. However, we believe that advanced relational concepts can still be employed to provide an effective solution to these issues and challenges. In this study, we explore distributed databases to efficiently handle spatio-temporal Big Data for epidemiological studies. We propose DiceX (Data Intensive Computational Epidemiology using supercomputers), which couples high-performance, Big Data, and relational computing by embedding distributed data storage and processing engines within the supercomputer. It is characterized by scalable strategies for data ingestion, a unified framework to set up and configure various processing engines, and the ability to pause, materialize, and restore images of a data session. In addition, we have successfully configured DiceX to support approximation algorithms from the MADlib Analytics Library [54], primarily the Count-Min (CM) Sketch [33][34][35]. DiceX enables a new style of Big Data processing, centered around clustered databases, that exploits supercomputing resources. It can effectively exploit the cores, memory, and compute nodes of supercomputers to scale the processing of spatio-temporal queries on datasets of large volume, providing a scalable and efficient tool for the management and processing of spatio-temporal data. Although DiceX has been designed for computational epidemiology, it can easily be extended to other data-intensive domains facing similar issues and challenges. We thank our external collaborators and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments. This work has been partially supported by DTRA CNIMS Contract HDTRA1-11-D-0016-0001, DTRA Validation Grant HDTRA1-11-1-0016, NSF Network Science and Engineering Grant CNS-1011769, and NIH/NIGMS Models of Infectious Disease Agent Study Grant 5U01GM070694-11. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.
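The Count-Min Sketch cited above answers approximate frequency queries in sub-linear space. As a rough illustration of the idea (a minimal Python sketch, not MADlib's or DiceX's implementation; the class and variable names are invented):

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: approximate frequency counts in
    sub-linear space. Estimates may overcount, never undercount."""

    def __init__(self, width=2000, depth=5):
        self.width = width                      # counters per row
        self.depth = depth                      # independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One bucket per row, derived from a salted hash of the item.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The true count is at most the minimum across rows.
        return min(self.table[row][col] for row, col in self._cells(item))

# Approximate per-county case counts from a large event stream.
sketch = CountMinSketch()
for county in ["Montgomery", "Fairfax", "Montgomery"]:
    sketch.add(county)
print(sketch.estimate("Montgomery"))            # 2 (possibly an overcount)
```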
- A High Performance C++ Generic Benchmark for Computational Epidemiology
  Pugaonkar, Aniket Narayan (Virginia Tech, 2015-01-31)
  Contagion diffusion simulations are an effective tool used by planners and policy makers in public health, such as the Centers for Disease Control (CDC), to curtail the spread of infectious diseases over a given population. These simulations model the relevant characteristics of the population (age, gender, income, etc.) and the disease (attack rate, etc.) and compute the spread under various configurations and plausible intervention strategies (such as vaccinations, school closures, etc.). The model and the computation thus form a complex agent-based system and are highly compute- and resource-intensive. In this work, we design a benchmark consisting of several kernels which capture the essential compute, communication, and data access patterns of such applications. For each kernel, the benchmark provides different evaluation strategies. The goal is to (a) derive alternative implementations for computing the contagion by combining different implementations of the kernels, and (b) evaluate which combination of implementation, runtime, and hardware is most effective in running large-scale contagion diffusion simulations. Our proposed benchmark is designed using C++ generic programming primitives, lifting sequential strategies to parallel computations. Together, these lead to a succinct description of the benchmark and significant code reuse when deriving strategies for new hardware. This aspect is crucial for the benchmark to be effective, because the potential combinations of hardware and runtimes are growing rapidly, making it infeasible to write an optimized strategy for the complete contagion diffusion from the ground up for each compute system.
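The core kernel such a benchmark exercises is the per-timestep diffusion sweep over a contact network. A hedged sketch of that access pattern (in Python for brevity, though the benchmark itself is C++; all names here are invented):

```python
import random

def diffusion_step(contacts, infected, attack_rate, rng):
    """One timestep of a toy contagion kernel: each infected node
    independently transmits to each susceptible contact with
    probability `attack_rate`. Illustrates the irregular
    graph-traversal access pattern such kernels capture."""
    newly_infected = set()
    for node in infected:
        for neighbor in contacts.get(node, ()):
            if neighbor not in infected and rng.random() < attack_rate:
                newly_infected.add(neighbor)
    return infected | newly_infected

# Toy contact network: person id -> ids of daily contacts.
contacts = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
state = {0}                                     # initially infected
rng = random.Random(42)
for day in range(3):
    state = diffusion_step(contacts, state, attack_rate=0.5, rng=rng)
print(sorted(state))
```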
- Modeling and Computation of Complex Interventions in Large-scale Epidemiological Simulations using SQL and Distributed Database
  Kaw, Rushi (Virginia Tech, 2014-08-30)
  Scalability is an important problem in epidemiological applications that simulate complex intervention scenarios over large datasets. Indemics is one such interactive, data-intensive framework for high-performance computing (HPC) based large-scale epidemic simulations. In the Indemics framework, interventions are supplied from an external, standalone database, which proved to be an effective way of implementing them. Although this setup performs well for simple interventions and small datasets, performance and scalability remain an issue for complex interventions and large datasets. In this thesis, we present IndemicsXC, a scalable and massively parallel high-performance data engine for Indemics in a supercomputing environment. IndemicsXC can implement complex interventions over large datasets. Our distributed database solution retains the simplicity of Indemics by using the same SQL query interface for expressing interventions. We show that our solution implements the most complex interventions by intelligently offloading them to the supercomputer nodes and processing them in parallel. We present an extensive performance evaluation of our database engine through intervention case studies over synthetic population datasets. The evaluation of our parallel and distributed database framework illustrates its scalability over a standalone database. Our results show that the distributed data engine is a parallel, scalable, and cost-efficient means of implementing interventions. The cost model proposed in this thesis can approximate intervention query execution time with reasonable accuracy. Our distributed database framework could help public health officials make fast, accurate, and sensible decisions during an outbreak. Finally, we discuss the considerations for using distributed databases to drive large-scale simulations.
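To make "interventions expressed in SQL" concrete, here is a hedged sketch of the pattern (using Python's built-in sqlite3 so the example is self-contained; the table names, columns, and thresholds are invented, and IndemicsXC runs on a distributed engine, not SQLite):

```python
import sqlite3

# Hypothetical, simplified schema standing in for a synthetic population.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (pid INTEGER PRIMARY KEY, age INTEGER, county TEXT);
    CREATE TABLE health (pid INTEGER, day INTEGER, state TEXT);
    INSERT INTO person VALUES (1, 9, 'Montgomery'), (2, 34, 'Montgomery');
    INSERT INTO health VALUES (1, 10, 'infectious'), (2, 10, 'susceptible');
""")

# An intervention written declaratively as a query: select school-age
# children in counties whose infectious count on the given day meets a
# threshold, to be targeted for vaccination by the simulation.
targets = conn.execute("""
    SELECT p.pid
    FROM person p
    WHERE p.age BETWEEN 5 AND 18
      AND p.county IN (
          SELECT p2.county
          FROM person p2 JOIN health h ON h.pid = p2.pid
          WHERE h.day = ? AND h.state = 'infectious'
          GROUP BY p2.county
          HAVING COUNT(*) >= ?)
""", (10, 1)).fetchall()
print(targets)    # person ids selected for the vaccination intervention
```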
- Relational Computing Using HPC Resources: Services and Optimizations
  Soundarapandian, Manikandan (Virginia Tech, 2015-09-15)
  Computational epidemiology involves processing, analyzing, and managing large volumes of data. Such massive datasets cannot be handled efficiently by traditional standalone database management systems, owing to their limited computational efficiency and bandwidth for scaling to large volumes of data. In this thesis, we address the management and processing of large volumes of data for modeling, simulation, and analysis in epidemiological studies. Traditionally, compute-intensive tasks are processed using high-performance computing resources and supercomputers, whereas data-intensive tasks are delegated to standalone databases and custom programs. The DiceX framework is a one-stop solution for distributed database management and processing; its main mission is to leverage supercomputing resources for data-intensive computing, in particular relational data processing. While standalone databases are always on and a user can submit queries at any time, supercomputing resources must be acquired and are available only for a limited time period; they are relinquished either upon completion of execution or at the expiration of the allocated period. This reservation-based usage style poses critical challenges, including building and launching a distributed data engine on the supercomputer, saving the engine and resuming from the saved image, devising efficient optimization upgrades to the data engine, and enabling other applications to seamlessly access the engine. These challenges and requirements align our approach closely with the cloud computing paradigms of Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). In this thesis, we propose cloud-computing-like workflows that instead use supercomputing resources to manage and process relational data-intensive tasks. We propose and implement several services that support these workflows, including database freeze, migrate, and resume; ad-hoc resource addition; and table redistribution. We also propose an optimization upgrade to the query planning module of Postgres-XC, the core relational data processing engine of the DiceX framework. Using knowledge of domain semantics, we have devised a more robust data distribution strategy that forcefully pushes the most time-consuming SQL operations down to the Postgres-XC data nodes, bypassing the query planner's default shippability criteria without compromising correctness. Forcing query push-down reduces query processing time by roughly 40%-60% for certain complex spatio-temporal queries on our epidemiology datasets. As part of this work, a generic broker service has also been implemented, which acts as an interface to the DiceX framework by exposing RESTful APIs that applications can use to submit queries and retrieve results, irrespective of programming language or environment.
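A client of such a broker would interact with it over plain HTTP. A minimal sketch of what that could look like (the endpoint URL, payload shape, and response format are assumptions; the abstract does not specify the broker's actual REST contract):

```python
import requests  # third-party: pip install requests

# Hypothetical broker endpoint; replace with the deployed address.
BROKER_URL = "http://dicex-broker.example:8080/api/query"

def run_query(sql):
    """Submit a SQL query to the broker and return the decoded rows.
    Assumes a JSON request/response shape, which is an invention here."""
    resp = requests.post(BROKER_URL, json={"sql": sql}, timeout=300)
    resp.raise_for_status()
    return resp.json()["rows"]

rows = run_query(
    "SELECT county, COUNT(*) AS cases FROM infections "
    "WHERE day BETWEEN 10 AND 17 GROUP BY county")
```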
- A Semantic Web-Based Digital Library Infrastructure to Facilitate Computational Epidemiology
  Hasan, S. M. Shamimul (Virginia Tech, 2017-09-15)
  Computational epidemiology generates and utilizes massive amounts of data, in two primary categories: reported and synthetic. Reported data include epidemic data published by organizations (e.g., WHO, CDC, and national ministries and departments of health) during and following actual outbreaks, while synthetic datasets comprise spatially explicit synthetic populations, labeled social contact networks, multi-cell statistical experiments, and output data generated by computer simulation experiments. Computational epidemiology faces numerous challenges because of the size, volume, and dynamic nature of both types of datasets. In this dissertation, we present semantic web-based schemas to organize diverse reported and synthetic computational epidemiology datasets. The schemas have three layers: conceptual, logical, and physical. The conceptual layer provides data abstraction by exposing common entities and properties to the end user; the logical layer captures data fragmentation and linking aspects of the datasets; and the physical layer covers storage. Mapping files can be created from the schemas, which are flexible and can grow. The schemas include data linking approaches that connect large-scale and widely varying epidemic datasets. This linked data leads to an integrated knowledge base, enabling an epidemiologist to ask complex queries that employ multiple datasets. We demonstrate the utility of our knowledge base by developing a query bank, which represents typical analyses carried out by an epidemiologist while planning for or responding to an epidemic. By running queries with different data mapping techniques, we compare the performance of various tools. The empirical results show that leveraging semantic web technology is an effective strategy for reasoning over multiple datasets simultaneously, developing network queries pertinent to epidemic analysis, and conducting realistic studies of the kind undertaken in an epidemic investigation. Query performance varies with the choice of hardware, underlying database, and Resource Description Framework (RDF) engine. We provide application programming interfaces (APIs) on top of our linked datasets, which an epidemiologist can use for information retrieval without detailed knowledge of the underlying datasets. The proposed semantic web-based digital library infrastructure can be highly beneficial to epidemiologists as they work to understand disease propagation for timely outbreak detection and efficient disease control activities.
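Queries against such a linked knowledge base are typically written in SPARQL. A hedged sketch using the rdflib Python library (the vocabulary, predicates, and data file are invented stand-ins; the dissertation's actual schemas differ):

```python
from rdflib import Graph  # third-party: pip install rdflib

# Load a small slice of linked epidemic data serialized as Turtle.
g = Graph()
g.parse("epidemic_slice.ttl", format="turtle")

# A query-bank-style question: reported case counts per country, by date.
results = g.query("""
    PREFIX epi: <http://example.org/epi#>
    SELECT ?country ?date ?cases
    WHERE {
        ?report epi:country ?country ;
                epi:date    ?date ;
                epi:cases   ?cases .
    }
    ORDER BY ?date
""")
for country, date, cases in results:
    print(country, date, cases)
```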
- Stability and Loads Validation of an Ocean Current Turbine
  Swales, Henry; Coackley, Dave; Gupta, Sandeep; Way, Stephen (2014-04)
  The design of a moored ocean current turbine presents many engineering challenges; among them is accurately predicting the stability and loads of the device. To validate computational loads and stability prediction tools, Aquantis Inc. designed, built, and tested a 1/25th-scale model of their 'C-Plane' dual-rotor moored ocean current turbine. This effort was conducted in cooperation with the US Naval Surface Warfare Center at the David Taylor Model Basin and was funded in part under a grant awarded to Dehlsen Associates by the U.S. Department of Energy. The multi-stage testing effort included both a captured single-rotor test and a dynamic, moored test of the complete dual-rotor C-Plane. The test data were subsequently used to validate a variety of stability and loads simulations, including the Navy's DCAB Code and Tidal Bladed v4.4. Specialized testing methodologies were developed for this purpose, and the results are compared with computational model predictions. The testing effort investigates many aspects of moored ocean current turbine design. The captured test was essential to characterize rotor loads and stability coefficients at various blade pitch and cone angles, as well as to measure rotational stall delay and unsteady rotor loads due to upstream structure wakes. The dynamic test validated stability and loads predictions for all anticipated modes of deployment and operation, depth keeping and loads avoidance, yawed-flow behavior, and various failure modes. An extensive suite of sensors is employed on the C-Plane test model, including 6-degree-of-freedom (DOF) load cells, 6-DOF inertial measurement and heading sensors, rotor torque, rotor rpm, rotor position, static pressure/depth, tow speed, and mooring tension. These sensors provide a comprehensive understanding of the C-Plane's motion and essential loads during testing. A 400 Hz sample rate is used to accurately capture transient events. The model rotors have a high degree of controllability, including ramp-up/ramp-down, counter-rotating synchronization and phase shift, and constant tip-speed-ratio regulation. Many challenging aspects of testing a moored ocean current turbine have been addressed in this effort, such as very low Reynolds number scaled rotor design and fabrication, development of a mooring test rig capable of yawed flow, and simulation of the motions of a dual-rotor moored device. This test program has proven that the C-Plane design has a high degree of stability in a wide range of flow conditions and that computational models are capable of accurately predicting C-Plane behavior.
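Constant tip-speed-ratio regulation holds TSR = omega * R / V fixed as the current speed varies. A small illustrative calculation (a sketch only; the numbers are invented, not taken from the Aquantis test program):

```python
import math

def rotor_rpm_for_tsr(tsr, flow_speed_ms, rotor_radius_m):
    """RPM required to hold a constant tip-speed ratio,
    TSR = omega * R / V, at a given flow speed."""
    omega = tsr * flow_speed_ms / rotor_radius_m    # rad/s
    return omega * 60.0 / (2.0 * math.pi)

# E.g., holding TSR = 5 in a 1.5 m/s current with a 0.2 m model rotor:
print(round(rotor_rpm_for_tsr(5.0, 1.5, 0.2), 1))   # ~358.1 rpm
```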
- WiSDM: a platform for crowd-sourced data acquisition, analytics, and synthetic data generation
  Choudhury, Ananya (Virginia Tech, 2016-08-15)
  Human behavior is a key factor influencing the spread of infectious diseases. Individuals adapt their daily routine and typical behavior during the course of an epidemic, based on their perception of the risk of contracting the disease and its impact. As a result, it is desirable to collect behavioral data before and during a disease outbreak. Such data can help in creating better computer models that can, in turn, be used by epidemiologists and policy makers to better plan for and respond to infectious disease outbreaks. However, traditional data collection methods are not well suited to acquiring human behavior related information, especially as it pertains to epidemic planning and response. Internet-based methods are an attractive complementary mechanism: systems such as Amazon Mechanical Turk (MTurk) and online survey tools provide simple ways to collect behavioral information. This thesis explores new methods for information acquisition, especially of behavioral information, that leverage this technology. We present the design and implementation of a crowd-sourced surveillance data acquisition system, WiSDM. WiSDM is a web-based application that can be used by anyone with access to the Internet and a browser. It is designed to leverage online survey tools and MTurk, and can be embedded within MTurk in an iFrame. WiSDM has a number of novel features, including (i) support for a model-based abductive reasoning loop, a flexible and adaptive information acquisition scheme driven by causal models of epidemic processes; (ii) question routing, an important feature to increase data acquisition efficacy and reduce survey fatigue; and (iii) integrated surveys, interactive surveys that provide additional information on the survey topic and improve user motivation. We evaluate the framework's performance using Apache JMeter and present our results. We also discuss three extensions of WiSDM: the API Adapter, the Synthetic Data Generator, and WiSDM Analytics. The API Adapter is an ETL extension of WiSDM that extracts data from disparate data sources and loads it into the WiSDM database. The Synthetic Data Generator allows epidemiologists to build synthetic survey data using NDSSL's Synthetic Population as agents. WiSDM Analytics empowers users to analyze the data by writing simple Python code using Versa APIs. We also propose a data model that is conducive to survey data analysis.
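Question routing, feature (ii) above, skips questions whose preconditions are not met by earlier answers. A toy sketch of the idea (field names and routing rules are invented, not WiSDM's actual schema):

```python
def route_questions(answers, question_bank):
    """Return the questions to ask next: keep a question only if it is
    unconditional or its precondition matches an earlier answer."""
    routed = []
    for q in question_bank:
        condition = q.get("ask_if")          # e.g. ("felt_ill", "yes")
        if condition is None or answers.get(condition[0]) == condition[1]:
            routed.append(q["text"])
    return routed

bank = [
    {"text": "Did you feel ill this week?"},
    {"text": "Did you visit a doctor?", "ask_if": ("felt_ill", "yes")},
    {"text": "Did you avoid public transit?", "ask_if": ("felt_ill", "yes")},
]
# A respondent who reported no illness skips the follow-up questions.
print(route_questions({"felt_ill": "no"}, bank))
```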