Browsing by Author "Leman, Scotland C."
Showing items 1-20 of 46.
- Adaptive Threshold Method for Monitoring Rates in Public Health Surveillance
  Gan, Linmin (Virginia Tech, 2010-04-30)
  We examine some of the methodologies implemented by the Centers for Disease Control and Prevention's (CDC) BioSense program. The program uses data from hospitals and public health departments to detect outbreaks using the Early Aberration Reporting System (EARS). The EARS method W2 allows one to monitor syndrome counts (W2count) from each source and the proportion of counts of a particular syndrome relative to the total number of visits (W2rate). We investigate the performance of the W2r method designed using an empiric recurrence interval (RI) in this dissertation research. An adaptive threshold monitoring method is introduced based on fitting sample data to the underlying distributions, then converting the current value to a Z-score through a p-value. We compare the upper thresholds on the Z-scores required to obtain given values of the recurrence interval for different sets of parameter values. We then simulate one-week outbreaks in our data and calculate the proportion of times these methods correctly signal an outbreak using Shewhart and exponentially weighted moving average (EWMA) charts. Our results indicate the adaptive threshold method gives more consistent statistical performance across different parameter sets and amounts of baseline historical data used for computing the statistics. For the power analysis, the EWMA chart is superior to its Shewhart counterpart in nearly all cases, and the adaptive threshold method tends to outperform the W2 rate method. Two modified W2r methods proposed in the dissertation also tend to outperform the W2r method in terms of the RI threshold functions and in the power analysis.
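
As a rough illustration of the adaptive-threshold idea summarized above, the sketch below fits baseline counts to a hypothetical Poisson model, converts the current count to a Z-score via a p-value, and monitors the Z-scores with an EWMA chart. The Poisson choice, smoothing weight `lam`, and control limit `h` are illustrative assumptions, not the dissertation's exact settings.

```python
# Minimal sketch of the adaptive threshold method: p-value under a fitted
# baseline distribution -> Z-score -> EWMA monitoring.
import numpy as np
from scipy import stats

def adaptive_z(baseline_counts, current_count):
    """Convert the current count to a Z-score via a p-value under a fitted Poisson."""
    rate = np.mean(baseline_counts)                              # fitted Poisson mean
    p_value = 1.0 - stats.poisson.cdf(current_count - 1, rate)   # P(X >= current)
    p_value = np.clip(p_value, 1e-12, 1 - 1e-12)                 # keep the inverse CDF finite
    return stats.norm.ppf(1.0 - p_value)                         # large Z = unusually high count

def ewma_signal(z_scores, lam=0.2, h=2.5):
    """EWMA chart on the Z-scores; returns the time indices that signal."""
    ewma, signals = 0.0, []
    for t, z in enumerate(z_scores):
        ewma = lam * z + (1 - lam) * ewma
        if ewma > h:
            signals.append(t)
    return signals

# Example: a simulated one-week outbreak on top of a stable baseline.
rng = np.random.default_rng(0)
baseline = rng.poisson(20, size=56)                   # 8 weeks of daily baseline counts
week = rng.poisson(35, size=7)                        # elevated outbreak week
zs = [adaptive_z(baseline, c) for c in np.concatenate([baseline[-14:], week])]
print(ewma_signal(zs))
```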
- Advances in the Use of Finite-Set Statistics for Multitarget Tracking
  Jimenez, Jorge Gabriel (Virginia Tech, 2021-10-27)
  In this dissertation, we seek to improve and advance the use of the finite-set statistics (FISST) approach to multitarget tracking. We consider a subsea multitarget tracking application that poses several challenges due to factors such as clutter/environmental noise, joint target- and sensor-state dependent measurement uncertainty, target-measurement association ambiguity, and sub-optimal sensor placement. The specific application that we consider is that of an underwater mobile sensor that measures the relative angle (i.e., bearing angle) to sources of acoustic noise in order to track one or more ships (targets) in a noisy environment. However, our contributions are generalizable to a variety of multitarget tracking applications. We build upon existing algorithms and address the problem of improving tracking performance for multiple maneuvering targets by incorporating several target motion models into a FISST tracking algorithm known as the probability hypothesis density filter. Moreover, we develop a novel method for associating measurements with targets using the Bayes factor, which improves tracking performance for FISST methods as well as other approaches to multitarget tracking. Further, we derive a novel formulation of Bayes risk for use with set-valued random variables and develop a real-time planner for sensor motion that avoids the local minima that arise in myopic approaches to sensor motion planning. The effectiveness of our contributions is evaluated through a mixture of real-world and simulated data.
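
A hedged sketch of the Bayes-factor association idea mentioned above: a bearing measurement is compared under two hypotheses, "originates from this predicted target" versus "clutter," and the measurement is associated only when the ratio of likelihoods is large. The Gaussian bearing noise, uniform clutter density, and threshold of 10 are illustrative assumptions, not the dissertation's exact models.

```python
import numpy as np
from scipy.stats import norm

def bearing(sensor_xy, target_xy):
    """Bearing angle from the sensor to the target, in radians."""
    dx, dy = np.asarray(target_xy) - np.asarray(sensor_xy)
    return np.arctan2(dy, dx)

def bayes_factor(z, sensor_xy, target_xy, sigma=np.radians(3.0)):
    """Likelihood ratio of 'z comes from this target' vs. uniform clutter over [-pi, pi)."""
    predicted = bearing(sensor_xy, target_xy)
    innovation = np.angle(np.exp(1j * (z - predicted)))   # wrap angle difference to (-pi, pi]
    target_lik = norm.pdf(innovation, scale=sigma)
    clutter_lik = 1.0 / (2.0 * np.pi)
    return target_lik / clutter_lik

# Associate the measurement only when the evidence clears a threshold.
z = np.radians(41.0)
bf = bayes_factor(z, sensor_xy=(0.0, 0.0), target_xy=(3.0, 2.6))
print("associate" if bf > 10.0 else "treat as clutter", round(bf, 2))
```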
- Air quality economics: Three essays
  Yao, Zhenyu (Virginia Tech, 2022-06-17)
  This dissertation consists of three separate research projects. Each paper uses a different applied econometric technique to investigate problems related to air quality economics. The first chapter is a general introduction to all three studies. The second chapter explores adopting an environmentally-friendly public transportation system in Europe. The Bayesian econometric methods show that willingness to pay for a new public transportation system is primarily driven by improvements to public goods, such as air quality and greenhouse gas emission reduction. The third chapter uses red tide-related stated experience and satellite imagery of chlorophyll-a concentration, as well as field data on respiratory irritation. This chapter illustrates that ancillary scientific information can be efficiently combined with choice experiment data. The fourth chapter uses panel fixed-effect models to investigate the short-term effect of air pollution on students' cognitive performance in China. It is shown that PM2.5 has a significantly negative impact on students' exam performance.
- Anomaly Detection in Aeroacoustic Wind Tunnel Experiments
  Defreitas, Aaron Chad (Virginia Tech, 2021-10-27)
  Wind tunnel experiments often employ a wide variety and large number of sensor systems. Anomalous measurements occurring without the knowledge of the researcher can be devastating to the success of costly experiments; therefore, anomaly detection is of great interest to the wind tunnel community. Currently, anomaly detection in wind tunnel data is a manual procedure. A researcher will analyze the quality of measurements, such as monitoring for pressure measurements outside of an expected range or additional variability in a time-averaged quantity. More commonly, the raw data must be fully processed to obtain near-final results during the experiment for an effective review. Rapid anomaly detection methods are desired to ensure the quality of a measurement and reduce the load on the researcher. While there are many effective methodologies for anomaly detection used throughout the wider engineering research community, they have not been demonstrated in wind tunnel experiments. Wind tunnel experimentation is unique in the sense that repeat measurements are not typical; they usually occur only after an anomaly has been identified. Since most anomaly detection methodologies rely on well-resolved knowledge of a measurement to uncover the expected uncertainties, they can be difficult to apply in the wind tunnel setting. First, the analysis will focus on pressure measurements around an airfoil and its wake. Principal component analysis (PCA) will be used to build a measurement expectation by linear estimation. A covariance matrix will be constructed from experimental data to be used in the PCA scheme. This covariance matrix represents both the strong deterministic relations dependent on experimental configuration and random uncertainty. Through principles of ideal flow, a method to normalize geometrical changes to improve measurement expectations will be demonstrated. Measurements from a microphone array, another common system employed in aeroacoustic wind tunnels, will be analyzed similarly through evaluation of the cross-spectral matrix of microphone data, with minimal repeat measurements. A spectral projection method will be proposed that identifies unexpected acoustic source distributions. Analysis of good and anomalous measurements shows this methodology is effective. Finally, a machine learning technique will be investigated for an experimental situation where repeat measurements of a known event are readily available. A convolutional neural network for feature detection will be shown in the context of audio detection. This dissertation presents techniques for anomaly detection in sensor systems commonly used in wind tunnel experiments. The presented work suggests that these anomaly identification techniques can be easily introduced into aeroacoustic experiment methodology, minimizing tunnel down time and reducing cost.
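
A minimal PCA-style anomaly check in the spirit of the abstract above, not the dissertation's exact procedure: a covariance matrix built from prior pressure scans supplies a linear expectation, and a new scan is flagged when its reconstruction error from the leading principal components is unusually large. The dimensions, component count, and 3-sigma threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for historical pressure scans (200 runs x 32 ports).
history = rng.normal(size=(200, 32)) @ rng.normal(size=(32, 32)) * 0.1 + 1.0
mean = history.mean(axis=0)
cov = np.cov(history - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
components = eigvecs[:, -5:]                       # leading 5 principal directions

def reconstruction_error(scan):
    """Distance between a scan and its projection onto the leading components."""
    centered = scan - mean
    projected = components @ (components.T @ centered)
    return np.linalg.norm(centered - projected)

baseline_errors = np.array([reconstruction_error(s) for s in history])
threshold = baseline_errors.mean() + 3 * baseline_errors.std()

new_scan = history[0] + np.concatenate([np.zeros(28), 5.0 * np.ones(4)])  # simulated stuck sensors
print(reconstruction_error(new_scan) > threshold)
```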
- Arm-specific dynamics of chromosome evolution in malaria mosquitoes
  Sharakhova, Maria V.; Xia, Ai; Leman, Scotland C.; Sharakhov, Igor V. (Biomed Central, 2011-04-07)
  Background: The malaria mosquito species of subgenus Cellia have rich inversion polymorphisms that correlate with environmental variables. Polymorphic inversions tend to cluster on the chromosomal arms 2R and 2L but not on X, 3R and 3L in Anopheles gambiae and homologous arms in other species. However, it is unknown whether polymorphic inversions on homologous chromosomal arms of distantly related species from subgenus Cellia nonrandomly share similar sets of genes. It is also unclear if the evolutionary breakage of inversion-poor chromosomal arms is under constraints. Results: To gain a better understanding of the arm-specific differences in the rates of genome rearrangements, we compared gene orders and established syntenic relationships among Anopheles gambiae, Anopheles funestus, and Anopheles stephensi. We provided evidence that polymorphic inversions on the 2R arms in these three species nonrandomly captured similar sets of genes. This nonrandom distribution of genes was not only a result of preservation of ancestral gene order but also an outcome of extensive reshuffling of gene orders that created new combinations of homologous genes within independently originated polymorphic inversions. The statistical analysis of distribution of conserved gene orders demonstrated that the autosomal arms differ in their tolerance to generating evolutionary breakpoints. The fastest evolving 2R autosomal arm was enriched with gene blocks conserved between only a pair of species. In contrast, all identified syntenic blocks were preserved on the slowly evolving 3R arm of An. gambiae and on the homologous arms of An. funestus and An. stephensi. Conclusions: Our results suggest that natural selection favors specific gene combinations within polymorphic inversions when distant species are exposed to similar environmental pressures. This knowledge could be useful for the discovery of genes responsible for an association of inversion polymorphisms with phenotypic variations in multiple species. Our data support the chromosomal arm specificity in rates of gene order disruption during mosquito evolution. We conclude that the distribution of breakpoint regions is evolutionarily conserved on slowly evolving arms and tends to be lineage-specific on rapidly evolving arms.
- Bayesian Approach Dealing with Mixture Model Problems
  Zhang, Huaiye (Virginia Tech, 2012-04-23)
  In this dissertation, we focus on two research topics related to mixture models. The first topic is Adaptive Rejection Metropolis Simulated Annealing for Detecting Global Maximum Regions, and the second topic is Bayesian Model Selection for Nonlinear Mixed Effects Models. In the first topic, we consider a finite mixture model, which is used to fit data from heterogeneous populations in many applications. The Expectation Maximization (EM) algorithm and Markov Chain Monte Carlo (MCMC) are two popular methods for estimating parameters in a finite mixture model. However, both methods may converge to local maximum regions rather than the global maximum when multiple local maxima exist. In this dissertation, we propose a new approach, Adaptive Rejection Metropolis Simulated Annealing (ARMS annealing), to improve the EM algorithm and MCMC methods. Combining simulated annealing (SA) and adaptive rejection metropolis sampling (ARMS), ARMS annealing generates a set of proper starting points which help to reach all possible modes. ARMS uses a piecewise linear envelope function for a proposal distribution. Under the SA framework, we start with a set of proposal distributions, which are constructed by ARMS, and this method finds a set of proper starting points, which help to detect separate modes. We refer to this approach as ARMS annealing. By combining ARMS annealing with the EM algorithm and with the Bayesian approach, respectively, we propose two approaches: an EM ARMS annealing algorithm and a Bayesian ARMS annealing approach. EM ARMS annealing implements the EM algorithm using a set of starting points proposed by ARMS annealing. ARMS annealing also helps MCMC approaches determine starting points. Both approaches capture the global maximum region and estimate the parameters accurately. An illustrative example uses survey data on the number of charitable donations. The second topic is related to the nonlinear mixed effects model (NLME). Typically a parametric NLME model requires strong assumptions which make the model less flexible and often are not satisfied in real applications. To allow the NLME model to have more flexible assumptions, we present three semiparametric Bayesian NLME models, constructed with Dirichlet process (DP) priors. Dirichlet process models often refer to an infinite mixture model. We propose a unified approach, the penalized posterior Bayes factor, for the purpose of model comparison. Using simulation studies, we compare the performance of two of the three semiparametric hierarchical Bayesian approaches with that of the parametric Bayesian approach. Simulation results suggest that our penalized posterior Bayes factor is a robust method for comparing hierarchical parametric and semiparametric models. An application to gastric emptying studies is used to demonstrate the advantage of our estimation and evaluation approaches.
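
The sketch below is a simplified stand-in for the starting-point idea described above, not ARMS annealing itself: widely dispersed random starting values are drawn for a two-component Gaussian mixture and the EM fit with the highest log-likelihood is kept, which is the basic defense against local maxima that ARMS annealing refines. The use of scikit-learn's GaussianMixture and the specific mixture are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two-component data with well-separated modes.
data = np.concatenate([rng.normal(-4, 1, 300), rng.normal(5, 2, 700)]).reshape(-1, 1)

best_fit, best_ll = None, -np.inf
for _ in range(20):
    starts = rng.uniform(data.min(), data.max(), size=(2, 1))   # dispersed initial means
    gm = GaussianMixture(n_components=2, means_init=starts, max_iter=200).fit(data)
    ll = gm.score(data) * len(data)                             # total log-likelihood
    if ll > best_ll:
        best_fit, best_ll = gm, ll

print(best_fit.means_.ravel(), best_fit.weights_)
```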
- Bayesian Hierarchical Latent Model for Gene Set Analysis
  Chao, Yi (Virginia Tech, 2009-04-29)
  A pathway is a set of genes which are predefined and serve a particular cellular or physiological function. Ranking pathways relevant to a particular phenotype can help researchers focus on a few sets of genes in pathways. In this thesis, a Bayesian hierarchical latent model was proposed using a generalized linear random effects model. The advantage of the approach was that it can easily incorporate prior knowledge when the sample size is small and the number of genes is large. For the covariance matrix of a set of random variables, two Gaussian random processes were considered to construct the dependencies among genes in a pathway. One was based on the polynomial kernel and the other was based on the Gaussian kernel. These two kernels were then compared with a constant covariance matrix of the random effect by using a ratio based on the joint posterior distribution with respect to each model. For mixture models, log-likelihood values were computed at different values of the mixture proportion and compared among mixtures of the selected kernels and a point-mass density (or constant covariance matrix). The approach was applied to a data set (Mootha et al., 2003) containing the expression profiles of type II diabetes, where the motivation was to identify pathways that can discriminate between normal patients and patients with type II diabetes.
- Bayesian Visual Analytics: Interactive Visualization for High Dimensional Data
  Han, Chao (Virginia Tech, 2012-12-07)
  In light of advancements made in data collection techniques over the past two decades, data mining has become common practice to summarize large, high dimensional datasets, in hopes of discovering noteworthy data structures. However, one concern is that most data mining approaches rely upon strict criteria that may mask information in data that analysts may find useful. We propose a new approach called Bayesian Visual Analytics (BaVA) which merges Bayesian Statistics with Visual Analytics to address this concern. The BaVA framework enables experts to interact with the data and the feature discovery tools by modeling the "sense-making" process using Bayesian Sequential Updating. In this paper, we use the BaVA idea to enhance high dimensional visualization techniques such as Probabilistic PCA (PPCA). However, for real-world datasets, important structures can be arbitrarily complex and a single data projection such as the PPCA technique may fail to provide useful insights. One way to visualize such a dataset is to characterize it by a mixture of local models. For example, Tipping and Bishop [Tipping and Bishop, 1999] developed an algorithm called Mixture Probabilistic PCA (MPPCA) that extends PCA to visualize data via a mixture of projectors. Based on MPPCA, we developed a new visualization algorithm called Covariance-Guided MPPCA which groups similar covariance-structured clusters together to provide more meaningful and cleaner visualizations. Another way to visualize a very complex dataset is to use nonlinear projection methods such as the Generative Topographic Mapping algorithm (GTM). We developed an interactive version of GTM to discover interesting local data structures. We demonstrate the performance of our approaches using both synthetic and real datasets and compare our algorithms with existing ones.
- The Cauchy-Net Mixture Model for Clustering with Anomalous Data
  Slifko, Matthew D. (Virginia Tech, 2019-09-11)
  We live in the data explosion era. The unprecedented amount of data offers a potential wealth of knowledge but also brings about concerns regarding ethical collection and usage. Mistakes stemming from anomalous data have the potential for severe, real-world consequences, such as when building prediction models for housing prices. To combat anomalies, we develop the Cauchy-Net Mixture Model (CNMM). The CNMM is a flexible Bayesian nonparametric tool that employs a mixture between a Dirichlet Process Mixture Model (DPMM) and a Cauchy distributed component, which we call the Cauchy-Net (CN). Each portion of the model offers benefits, as the DPMM eliminates the limitation of requiring a fixed number of components and the CN captures observations that do not belong to the well-defined components by leveraging its heavy tails. Through isolating the anomalous observations in a single component, we simultaneously identify the observations in the net as warranting further inspection and prevent them from interfering with the formation of the remaining components. The result is a framework that allows for simultaneously clustering observations and making predictions in the face of anomalous data. We demonstrate the usefulness of the CNMM in a variety of experimental situations and apply the model for predicting housing prices in Fairfax County, Virginia.
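
A rough sketch of the Cauchy-Net intuition above, not the full Bayesian nonparametric model: well-defined clusters are represented by Gaussian densities and a single wide Cauchy component acts as the "net," so an observation far from every cluster receives most of its posterior weight from the net. The component locations, scales, and weights are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, cauchy

clusters = [(-3.0, 0.5), (0.0, 0.7), (4.0, 0.6)]     # (mean, sd) of well-defined components
net_loc, net_scale = 0.0, 10.0                        # heavy-tailed Cauchy "net"
weights = np.array([0.3, 0.3, 0.3, 0.1])              # last weight belongs to the net

def net_responsibility(x):
    """Posterior probability that x belongs to the Cauchy net rather than a cluster."""
    dens = np.array([norm.pdf(x, m, s) for m, s in clusters] + [cauchy.pdf(x, net_loc, net_scale)])
    post = weights * dens
    return post[-1] / post.sum()

for x in [0.2, 4.1, 25.0]:
    print(x, round(net_responsibility(x), 3))         # the outlier at 25 lands in the net
```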
- Cure Rate Model with Spline Estimated Components
  Wang, Lu (Virginia Tech, 2010-07-13)
  In some survival analyses of medical studies, there are often long term survivors who can be considered as permanently cured. The goals in these studies are to estimate the cure probability of the whole population and the hazard rate of the noncured subpopulation. The existing methods for cure rate models have been limited to parametric and semiparametric models. More specifically, the hazard function part is estimated by a parametric or semiparametric model where the effect of the covariate takes a parametric form, and the cure rate part is often estimated by a parametric logistic regression model. We introduce a non-parametric model employing smoothing splines. It provides non-parametric smooth estimates for both the hazard function and the cure rate. By introducing a latent cure status variable, we implement the method using a smooth EM algorithm. Louis' formula for covariance estimation in an EM algorithm is generalized to yield point-wise confidence intervals for both functions. A simple model selection procedure based on the Kullback-Leibler geometry is derived for the proposed cure rate model. Numerical studies demonstrate excellent performance of the proposed method in estimation, inference and model selection. The application of the method is illustrated by the analysis of a melanoma study.
- Dimension Reduction and Clustering for Interactive Visual Analytics
  Wenskovitch Jr, John Edward (Virginia Tech, 2019-09-06)
  When exploring large, high-dimensional datasets, analysts often utilize two techniques for reducing the data to make exploration more tractable. The first technique, dimension reduction, reduces the high-dimensional dataset into a low-dimensional space while preserving high-dimensional structures. The second, clustering, groups similar observations while simultaneously separating dissimilar observations. Existing work presents a number of systems and approaches that utilize these techniques; however, these techniques can cooperate or conflict in unexpected ways. The core contribution of this work is the systematic examination of the design space at the intersection of dimension reduction and clustering when building intelligent, interactive tools in visual analytics. I survey existing techniques for dimension reduction and clustering algorithms in visual analytics tools, and I explore the design space for creating projections and interactions that include dimension reduction and clustering algorithms in the same visual interface. Further, I implement and evaluate three prototype tools that implement specific points within this design space. Finally, I run a cognitive study to understand how analysts perform dimension reduction (spatialization) and clustering (grouping) operations. Contributions of this work include surveys of existing techniques, three interactive tools and usage cases demonstrating their utility, design decisions for implementing future tools, and a presentation of complex human organizational behaviors.
- A Dirichlet process model for classifying and forecasting epidemic curves
  Nsoesie, Elaine O.; Leman, Scotland C.; Marathe, M. V. (Biomed Central, 2014-01-09)
- Efficient formulation and implementation of ensemble based methods in data assimilation
  Nino Ruiz, Elias David (Virginia Tech, 2016-01-11)
  Ensemble-based methods have gained widespread popularity in the field of data assimilation. An ensemble of model realizations encapsulates information about the error correlations driven by the physics and the dynamics of the numerical model. This information can be used to obtain improved estimates of the state of non-linear dynamical systems such as the atmosphere and/or the ocean. This work develops efficient ensemble-based methods for data assimilation. A major bottleneck in ensemble Kalman filter (EnKF) implementations is the solution of a linear system at each analysis step. To alleviate it, an EnKF implementation based on an iterative Sherman-Morrison formula is proposed. The rank deficiency of the ensemble covariance matrix is exploited in order to efficiently compute the analysis increments during the assimilation process. The computational effort of the proposed method is comparable to that of the best EnKF implementations found in the current literature. The stability of the new algorithm is theoretically proven based on the positiveness of the data error covariance matrix. In order to improve the background error covariance matrices in ensemble-based data assimilation, we explore the use of shrinkage covariance matrix estimators from ensembles. The resulting filter has attractive features in terms of both memory usage and computational complexity. Numerical results show that it performs better than traditional EnKF formulations. In geophysical applications the correlations between errors corresponding to distant model components decrease rapidly with the distance. We propose a new and efficient implementation of the EnKF based on a modified Cholesky decomposition for inverse covariance matrix estimation. This approach exploits the conditional independence of background errors between distant model components with regard to a predefined radius of influence. Consequently, sparse estimators of the inverse background error covariance matrix can be obtained. This implies huge memory savings during the assimilation process under realistic weather forecast scenarios. Rigorous error bounds for the resulting estimator in the context of data assimilation are theoretically proved. The conclusion is that the resulting estimator converges to the true inverse background error covariance matrix when the ensemble size is of the order of the logarithm of the number of model components. We explore high-performance implementations of the proposed EnKF algorithms. When the observational operator can be locally approximated for different regions of the domain, efficient parallel implementations of the EnKF formulations presented in this dissertation can be obtained. The parallel computation of the analysis increments is performed making use of domain decomposition. Local analysis increments are computed on (possibly) different processors. Once all local analysis increments have been computed, they are mapped back onto the global domain to recover the global analysis. Tests performed with an atmospheric general circulation model at a T-63 resolution, and varying the number of processors from 96 to 2,048, reveal that the assimilation time can be decreased multiple fold for all the proposed EnKF formulations. Ensemble-based methods can be used to reformulate strong-constraint four-dimensional variational data assimilation so as to avoid the construction of adjoint models, which can be complicated for operational models. We propose a trust-region approach based on ensembles in which the analysis increments are computed in the space of an ensemble of snapshots. The quality of the resulting increments in the ensemble space is compared against the gains in the full space. Decisions on whether to accept or reject solutions rely on trust-region updating formulas. Results based on an atmospheric general circulation model with a T-42 resolution reveal that this methodology can improve the analysis accuracy.
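
The sketch below illustrates only the rank-one ingredient behind the iterative Sherman-Morrison idea mentioned above, not the dissertation's full filter: the innovation covariance R + S S^T / (N - 1) is inverted by applying one Sherman-Morrison update per ensemble anomaly instead of factorizing the full matrix. The diagonal R and toy dimensions are illustrative assumptions.

```python
import numpy as np

def sherman_morrison_inverse(R_diag, S):
    """Return (diag(R_diag) + S @ S.T / (N - 1))^{-1} built from rank-one updates."""
    n, N = S.shape
    inv = np.diag(1.0 / R_diag)                      # exact inverse of the diagonal R
    for i in range(N):
        u = S[:, i] / np.sqrt(N - 1)
        inv_u = inv @ u
        inv -= np.outer(inv_u, inv_u) / (1.0 + u @ inv_u)   # Sherman-Morrison step
    return inv

rng = np.random.default_rng(3)
n_obs, N = 50, 20
S = rng.normal(size=(n_obs, N))                      # observed ensemble anomalies
R_diag = np.full(n_obs, 0.5)                         # observation error variances

direct = np.linalg.inv(np.diag(R_diag) + S @ S.T / (N - 1))
print(np.allclose(sherman_morrison_inverse(R_diag, S), direct))
```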
- Expert-Guided Generative Topographical Modeling with Visual to Parametric Interaction
  Han, Chao; House, Leanna L.; Leman, Scotland C. (PLOS, 2016-02-23)
  Introduced by Bishop et al. in 1996, Generative Topographic Mapping (GTM) is a powerful nonlinear latent variable modeling approach for visualizing high-dimensional data. It has proven useful when typical linear methods fail. However, GTM still suffers from drawbacks. Its complex parameterization of the data makes GTM hard to fit and sensitive to slight changes in the model. For this reason, we extend GTM to a visual analytics framework so that users may guide the parameterization and assess the data from multiple GTM perspectives. Specifically, we develop the theory and methods for Visual to Parametric Interaction (V2PI) with data using GTM visualizations. The result is a dynamic version of GTM that fosters data exploration. We refer to the new version as V2PI-GTM. In this paper, we develop V2PI-GTM in stages and demonstrate its benefits within the context of a text mining case study.
- Exploring the Landscape of Big Data Analytics Through Domain-Aware Algorithm Design
  Dash, Sajal (Virginia Tech, 2020-08-20)
  Experimental and observational data emerging from various scientific domains necessitate fast, accurate, and low-cost analysis of the data. While exploring the landscape of big data analytics, multiple challenges arise from three characteristics of big data: the volume, the variety, and the velocity. High volume and velocity of the data warrant a large amount of storage, memory, and compute power, while a large variety of data demands cognition across domains. Addressing domain-intrinsic properties of data can help us analyze the data efficiently through the frugal use of high-performance computing (HPC) resources. In this thesis, we present our exploration of the data analytics landscape with domain-aware approximate and incremental algorithm design. We propose three guidelines targeting three properties of big data for domain-aware big data analytics: (1) explore geometric and domain-specific properties of high dimensional data for succinct representation, which addresses the volume property, (2) design domain-aware algorithms through mapping of domain problems to computational problems, which addresses the variety property, and (3) leverage incremental arrival of data through incremental analysis and invention of problem-specific merging methodologies, which addresses the velocity property. We demonstrate these three guidelines through the solution approaches of three representative domain problems. We present Claret, a fast and portable parallel weighted multi-dimensional scaling (WMDS) tool, to demonstrate the application of the first guideline. It combines algorithmic concepts extended from the stochastic force-based multi-dimensional scaling (SF-MDS) and Glimmer. Claret computes approximate weighted Euclidean distances by combining a novel data mapping called stretching with the Johnson-Lindenstrauss lemma to reduce the complexity of WMDS from O(f(n)d) to O(f(n) log d). In demonstrating the second guideline, we map the problem of identifying multi-hit combinations of genetic mutations responsible for cancers to the weighted set cover (WSC) problem by leveraging the semantics of cancer genomic data obtained from cancer biology. Solving the mapped WSC with an approximate algorithm, we identified a set of multi-hit combinations that differentiate between tumor and normal tissue samples. To identify three- and four-hit combinations, which require orders of magnitude larger computational power, we have scaled out the WSC algorithm on a hundred nodes of the Summit supercomputer. In demonstrating the third guideline, we developed a tool, iBLAST, to perform incremental sequence similarity search. Developing new statistics to combine search results over time makes incremental analysis feasible. iBLAST performs (1+δ)/δ times faster than NCBI BLAST, where δ represents the fraction of database growth. We also explored various approaches to mitigate catastrophic forgetting in incremental training of deep learning models.
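
A small illustration of the Johnson-Lindenstrauss ingredient referenced above: projecting high-dimensional points onto a modest number of random Gaussian directions approximately preserves pairwise Euclidean distances, which is what allows distance computations to work in a much lower-dimensional space. The sizes here are illustrative, and Claret's "stretching" step for weighted distances is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 200, 5000, 128
X = rng.normal(size=(n, d))                           # high-dimensional points

P = rng.normal(size=(d, k)) / np.sqrt(k)              # JL-style random projection matrix
Y = X @ P                                             # low-dimensional embedding

i, j = 3, 77
true_dist = np.linalg.norm(X[i] - X[j])
approx_dist = np.linalg.norm(Y[i] - Y[j])
print(round(true_dist, 2), round(approx_dist, 2))     # close with high probability
```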
- Generalized Likelihood Uncertainty Estimation and Markov Chain Monte Carlo Simulation to Prioritize TMDL Pollutant Allocations
  Mishra, Anurag; Ahmadisharaf, Ebrahim; Benham, Brian L.; Wolfe, Mary Leigh; Leman, Scotland C.; Gallagher, Daniel L.; Reckhow, Kenneth H.; Smith, Eric P. (2018-12)
  This study presents a probabilistic framework that considers both the water quality improvement capability and reliability of alternative total maximum daily load (TMDL) pollutant allocations. Generalized likelihood uncertainty estimation and Markov chain Monte Carlo techniques were used to assess the relative uncertainty and reliability of two alternative TMDL pollutant allocations that were developed to address a fecal coliform (FC) bacteria impairment in a rural watershed in western Virginia. The allocation alternatives, developed using the Hydrological Simulation Program-FORTRAN, specified differing levels of FC bacteria reduction from different sources. While both allocations met the applicable water-quality criteria, the approved TMDL allocation called for less reduction in the FC source that produced the greatest uncertainty (cattle directly depositing feces in the stream), suggesting that it would be less reliable than the alternative, which called for a greater reduction from that same source. The approach presented in this paper illustrates a method to incorporate uncertainty assessment into TMDL development, thereby enabling stakeholders to engage in more informed decision making.
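
A bare-bones GLUE sketch in the spirit of the study above: parameter sets are sampled, each simulation is scored with an informal likelihood measure (Nash-Sutcliffe efficiency here), non-behavioral sets are discarded, and the survivors' likelihood weights combine the retained simulations. The toy "watershed model," the 0.5 behavioral threshold, and the sinusoidal record are illustrative stand-ins, not the HSPF setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
observed = np.sin(np.linspace(0, 6, 60)) + 1.5        # stand-in for an observed record

def toy_model(amplitude, offset):
    """Toy two-parameter simulator standing in for the watershed model."""
    return amplitude * np.sin(np.linspace(0, 6, 60)) + offset

def nse(sim, obs):
    """Nash-Sutcliffe efficiency used as an informal GLUE likelihood."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

samples = rng.uniform([0.2, 0.5], [2.0, 2.5], size=(5000, 2))   # Monte Carlo parameter sets
scores = np.array([nse(toy_model(a, b), observed) for a, b in samples])

behavioral = scores > 0.5                             # GLUE behavioral threshold
weights = scores[behavioral] / scores[behavioral].sum()
predictions = np.array([toy_model(a, b) for a, b in samples[behavioral]])
weighted_mean = weights @ predictions                 # likelihood-weighted prediction
print(behavioral.sum(), round(float(weighted_mean[0]), 2), round(float(observed[0]), 2))
```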
- Genome Landscape and Evolutionary Plasticity of Chromosomes in Malaria Mosquitoes
  Xia, Ai; Sharakhova, Maria V.; Leman, Scotland C.; Tu, Zhijian Jake; Bailey, Jeffrey A.; Smith, Christopher D.; Sharakhov, Igor V. (PLOS, 2010-05-12)
  Background: Nonrandom distribution of rearrangements is a common feature of eukaryotic chromosomes that is not well understood in terms of genome organization and evolution. In the major African malaria vector Anopheles gambiae, polymorphic inversions are highly nonuniformly distributed among five chromosomal arms and are associated with epidemiologically important adaptations. However, it is not clear whether the genomic content of the chromosomal arms is associated with inversion polymorphism and fixation rates. Methodology/Principal Findings: To better understand the evolutionary dynamics of chromosomal inversions, we created a physical map for an Asian malaria mosquito, Anopheles stephensi, and compared it with the genome of An. gambiae. We also developed and deployed novel Bayesian statistical models to analyze genome landscapes in individual chromosomal arms of An. gambiae. Here, we demonstrate that, despite the paucity of inversion polymorphisms on the X chromosome, this chromosome has the fastest rate of inversion fixation and the highest density of transposable elements, simple DNA repeats, and GC content. The highly polymorphic and rapidly evolving autosomal 2R arm had an overrepresentation of genes involved in cellular response to stress, supporting the role of natural selection in maintaining adaptive polymorphic inversions. In addition, the 2R arm had the highest density of regions involved in segmental duplications that clustered in the breakpoint-rich zone of the arm. In contrast, the slower evolving 2L, 3R, and 3L arms were enriched with matrix-attachment regions that potentially contribute to chromosome stability in the cell nucleus. Conclusions/Significance: These results highlight fundamental differences in evolutionary dynamics of the sex chromosome and autosomes and revealed the strong association between characteristics of the genome landscape and rates of chromosomal evolution. We conclude that a unique combination of various classes of genes and repetitive DNA in each arm, rather than a single type of repetitive element, is likely responsible for arm-specific rates of rearrangements.
- Genome mapping and characterization of the Anopheles gambiae heterochromatin
  Sharakhova, Maria V.; George, Phillip; Brusentsova, Irina V.; Leman, Scotland C.; Bailey, Jeffrey A.; Smith, Christopher D.; Sharakhov, Igor V. (Biomed Central, 2010-08-04)
  Background: Heterochromatin plays an important role in chromosome function and gene regulation. Despite the availability of polytene chromosomes and genome sequence, the heterochromatin of the major malaria vector Anopheles gambiae has not been mapped and characterized. Results: To determine the extent of heterochromatin within the An. gambiae genome, genes were physically mapped to the euchromatin-heterochromatin transition zone of polytene chromosomes. The study found that a minimum of 232 genes reside in 16.6 Mb of mapped heterochromatin. Gene ontology analysis revealed that heterochromatin is enriched in genes with DNA-binding and regulatory activities. Immunostaining of the An. gambiae chromosomes with antibodies against Drosophila melanogaster heterochromatin protein 1 (HP1) and the nuclear envelope protein lamin Dm0 identified the major invariable sites of the proteins' localization in all regions of pericentric heterochromatin, diffuse intercalary heterochromatin, and euchromatic region 9C of the 2R arm, but not in the compact intercalary heterochromatin. To better understand the molecular differences among chromatin types, novel Bayesian statistical models were developed to analyze genome features. The study found that heterochromatin and euchromatin differ in gene density and the coverage of retroelements and segmental duplications. The pericentric heterochromatin had the highest coverage of retroelements and tandem repeats, while intercalary heterochromatin was enriched with segmental duplications. We also provide evidence that the diffuse intercalary heterochromatin has a higher coverage of DNA transposable elements, minisatellites, and satellites than does the compact intercalary heterochromatin. The investigation of the 42-Mb assembly of unmapped genomic scaffolds showed that it has molecular characteristics similar to cytologically mapped heterochromatin. Conclusions: Our results demonstrate that Anopheles polytene chromosomes and whole-genome shotgun assembly render the mapping and characterization of a significant part of heterochromatic scaffolds a possibility. These results reveal the strong association between characteristics of the genome features and morphological types of chromatin. Initial analysis of the An. gambiae heterochromatin provides a framework for its functional characterization and comparative genomic analyses with other organisms.
- Gradient-Based Sensitivity Analysis with Kernels
  Wycoff, Nathan Benjamin (Virginia Tech, 2021-08-20)
  Emulation of computer experiments via surrogate models can be difficult when the number of input parameters determining the simulation grows any greater than a few dozen. In this dissertation, we explore dimension reduction in the context of computer experiments. The active subspace method is a linear dimension reduction technique which uses the gradients of a function to determine important input directions. Unfortunately, we cannot expect to always have access to the gradients of our black-box functions. We thus begin by developing an estimator for the active subspace of a function using kernel methods to indirectly estimate the gradient. We then demonstrate how to deploy the learned input directions to improve the predictive performance of local regression models by "undoing" the active subspace. Finally, we develop notions of sensitivities which are local to certain parts of the input space, which we then use to develop a Bayesian optimization algorithm which can exploit locally important directions.
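
A sketch of the active subspace construction referenced above. The dissertation estimates gradients indirectly with kernel methods; for brevity, this toy uses analytic gradients of a hypothetical ridge function f(x) = sin(w . x), so the eigendecomposition of the average outer product of gradients should recover the single important direction w.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 10
w = rng.normal(size=d)
w /= np.linalg.norm(w)                                # the one active direction

def grad_f(x):
    # f(x) = sin(w . x)  =>  grad f(x) = cos(w . x) * w
    return np.cos(w @ x) * w

X = rng.uniform(-1, 1, size=(500, d))                 # sampled inputs
C = np.mean([np.outer(g, g) for g in map(grad_f, X)], axis=0)   # E[grad f grad f^T]
eigvals, eigvecs = np.linalg.eigh(C)

leading = eigvecs[:, -1]                              # dominant eigenvector
print(round(abs(leading @ w), 3))                     # ~1.0: recovered active direction
```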
- Hierarchical Gaussian Processes for Spatially Dependent Model Selection
  Fry, James Thomas (Virginia Tech, 2018-07-18)
  In this dissertation, we develop a model selection and estimation methodology for nonstationary spatial fields. Large, spatially correlated data often cover a vast geographical area. However, local spatial regions may have different mean and covariance structures. Our methodology accomplishes three goals: (1) cluster locations into small regions with distinct, stationary models, (2) perform Bayesian model selection within each cluster, and (3) correlate the model selection and estimation in nearby clusters. We utilize the Conditional Autoregressive (CAR) model and Ising distribution to provide intra-cluster correlation on the linear effects and model inclusion indicators, while modeling inter-cluster correlation with separate Gaussian processes. We apply our model selection methodology to a dataset involving the prediction of Brook trout presence in subwatersheds across Pennsylvania. We find that our methodology outperforms the stationary spatial model and that different regions in Pennsylvania are governed by separate Gaussian process regression models.
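
A small sketch of the Conditional Autoregressive (CAR) ingredient described above: given a cluster adjacency matrix W, the proper-CAR precision tau * (D - rho * W) induces correlation between effects in neighboring clusters. The four-cluster chain and the tau and rho values are toy illustrations, not the Brook trout application, and the Ising component for inclusion indicators is not reproduced.

```python
import numpy as np

W = np.array([[0, 1, 0, 0],                           # four clusters arranged in a chain
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))                            # diagonal matrix of neighbor counts
tau, rho = 2.0, 0.9
Q = tau * (D - rho * W)                               # proper-CAR precision matrix
cov = np.linalg.inv(Q)

rng = np.random.default_rng(7)
effects = rng.multivariate_normal(np.zeros(4), cov)   # spatially correlated cluster effects
corr = cov / np.sqrt(np.outer(np.diag(cov), np.diag(cov)))
print(np.round(effects, 2))
print(np.round(corr, 2))                              # neighbors are more strongly correlated
```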