Browsing by Author "Zhu, Hongxiao"
- Analysis of Bat Biosonar Beampatterns: Biodiversity and Dynamics
Caspers, Philip Bryan (Virginia Tech, 2017-01-24)
Across species, bats exhibit widely disparate noseleaf and pinnae shapes. Within the Rhinolophid and Hipposiderid families, bats actively deform their pinnae and noseleaf during biosonar operation. Both the pinnae and noseleaf act as acoustic baffles which interact with the outgoing and incoming sound; thus, they form an important interface between the bat and its environment. Beampatterns describe this interface as joint time-frequency transfer functions which vary across spatial direction. This dissertation considers bat biosonar shape diversity and shape dynamics manifest as beampatterns. In the first part, the seemingly disparate set of functional properties resulting from diverse pinnae and noseleaf shape adaptations are considered. The questions posed in this part are as follows: (i) what are the common properties between species beampatterns? and (ii) how are beampatterns aligned to a common direction for meaningful analysis? Hence, a quantitative interspecific analysis of the beampattern biodiversity was undertaken wherein: (i) 267 different pinnae and noseleaf beampatterns were rotationally aligned to a common direction and (ii) decomposed using principal component analysis (PCA). The first three principal components, termed eigenbeams, affect beamwidth around the single-lobe, symmetric mean beampattern. Dynamic shape adaptations to the pinnae and noseleaf of the greater horseshoe bat (Rhinolophus ferrumequinum) are also considered. However, the underlying dynamic sensing principles in use are not clear. Hence, this work developed a biomimetic substrate to explore the emission and reception dynamics of the horseshoe bat as a sonar device.
The question posed in this part was as follows: how do local features on the noseleaf and pinnae interact individually and when combined together to generate peak dynamic change to the incoming sonar information? Flexible noseleaf and pinnae baffles with different combinations of local shape features were developed. These baffles were then mounted to platforms to biomimetically actuate the noseleaf and pinnae during pulse emission and reception. Motions of the baffle surfaces were synchronized to the incoming and outgoing sonar waveform, and the time-frequency properties of the emission and reception baffles were characterized across spatial direction. Different feature combinations of the noseleaf and pinnae local shape features were ranked for overall dynamic effect.
- Assessment of Penalized Regression for Genome-wide Association Studies
Yi, Hui (Virginia Tech, 2014-08-27)
The data from genome-wide association studies (GWAS) in humans are still predominantly analyzed using single marker association methods. As an alternative to Single Marker Analysis (SMA), all or subsets of markers can be tested simultaneously. This approach requires a form of Penalized Regression (PR) as the number of SNPs is much larger than the sample size. Here we review PR methods in the context of GWAS, extend them to perform penalty parameter and SNP selection by False Discovery Rate (FDR) control, and assess their performance (including penalties incorporating linkage disequilibrium) in comparison with SMA. PR methods were compared with SMA on realistically simulated GWAS data consisting of genotype data from single and multiple chromosomes and a continuous phenotype and on real data. Based on our comparisons, our analytic FDR criterion may currently be the best approach to SNP selection using PR for GWAS. We found that PR with FDR control provides substantially more power than SMA with genome-wide type-I error control but somewhat less power than SMA with Benjamini-Hochberg FDR control. PR controlled the FDR conservatively while SMA-BH may not achieve FDR control in all situations. Differences among PR methods seem quite small when the focus is on variable selection with FDR control. Incorporating LD into PR by adapting penalties developed for covariates measured on graphs can improve power but also generate more false positives or wider regions for follow-up. We recommend using the Elastic Net with a mixing weight for the Lasso penalty near 0.5 as the best method.
- Association testing for binary trees-A Markov branching process approach
Wu, Xiaowei; Zhu, Hongxiao (Wiley, 2022-03-09)
We propose a new approach to test associations between binary trees and covariates. In this approach, binary-tree structured data are treated as sample paths of binary fission Markov branching processes (bMBP). We propose a generalized linear regression model and develop inference procedures for association testing, including variable selection and estimation of covariate effects. Simulation studies show that these procedures are able to accurately identify covariates that are associated with the binary tree structure by impacting the rate parameter of the bMBP. The problem of association testing on binary trees is motivated by modeling hierarchical clustering dendrograms of pixel intensities in biomedical images. By using semi-synthetic data generated from a real brain-tumor image, our simulation studies show that the bMBP model is able to capture the characteristics of dendrogram trees in brain-tumor images. Our final analysis of the glioblastoma multiforme brain-tumor data from The Cancer Imaging Archive identified multiple clinical and genetic variables that are potentially associated with brain-tumor heterogeneity.
- A Bayesian Analysis of Copy Number Variations in Array Comparative Genomic Hybridization Data
Wu, Xiaowei; Zhu, Hongxiao (OMICS International, 2015-09-25)
Array Comparative Genomic Hybridization (CGH) has been widely used for detecting genomic copy number variations (CNVs). The central goal of array CGH data analysis is to accurately detect homogeneous regions of log intensity ratios which represent relative changes in DNA copy number. Various methods have been proposed in recent years. Most methods, however, do not consider correlations of neighboring probe measurements, and are usually designed for analysis at the single-sample level rather than for detecting common or recurrent CNVs among multiple samples. We propose a Bayesian segment-based approach for efficient analysis of array CGH data. The proposed method is based on simple assumptions but is general enough to accommodate various spatial correlations among probe measurements. It also allows for multiple samples with recurrent CNVs, and is therefore able to borrow strength across samples. In contrast to another probe-based approach developed in the same Bayesian framework, the segment-based approach parameterizes the mean log intensity ratios in a more appropriate way, which leads to a posterior sampling scheme based on reversible-jump Markov chain Monte Carlo. We perform a simulation study to compare these two approaches with the commonly used circular binary segmentation method and the Bayesian hidden Markov model method. The segment-based approach achieves better estimation accuracy and higher computational efficiency compared to the probe-based approach, and also provides improved results compared to the other two methods, especially for data with relatively low signal-to-noise ratio and high correlation. The segment-based approach is further applied to the Coriell cell lines data and Pancreatic Adenocarcinoma data.
- Bayesian Graphical Models for Multivariate Functional Data
Zhu, Hongxiao; Strawn, Nate; Dunson, David B. (2016-11-28)
Graphical models express conditional independence relationships among variables. Although methods for vector-valued data are well established, functional data graphical models remain underdeveloped. By functional data, we refer to data that are realizations of random functions varying over a continuum (e.g., images, signals). We introduce a notion of conditional independence between random functions, and construct a framework for Bayesian inference of undirected, decomposable graphs in the multivariate functional data context. This framework is based on extending Markov distributions and hyper Markov laws from random variables to random processes, providing a principled alternative to naive application of multivariate methods to discretized functional data. Markov properties facilitate the composition of likelihoods and priors according to the decomposition of a graph. Our focus is on Gaussian process graphical models using orthogonal basis expansions. We propose a hyper-inverse-Wishart-process prior for the covariance kernels of the infinite coefficient sequences of the basis expansion, and establish its existence and uniqueness. We also prove the strong hyper Markov property and the conjugacy of this prior under a finite rank condition of the prior kernel parameter. Stochastic search Markov chain Monte Carlo algorithms are developed for posterior inference, assessed through simulations, and applied to a study of brain activity and alcoholism.
- Bayesian Modeling of Complex High-Dimensional Data
Huo, Shuning (Virginia Tech, 2020-12-07)
With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional complex data in different forms, such as medical images and genomics measurements. However, acquisition of more data does not automatically lead to better knowledge discovery. One needs efficient and reliable analytical tools to extract useful information from complex datasets. The main objective of this dissertation is to develop innovative Bayesian methodologies to enable effective and efficient knowledge discovery from complex high-dimensional data. It contains two parts: the development of computationally efficient functional mixed models and the modeling of data heterogeneity via the Dirichlet Diffusion Tree. The first part focuses on tackling the computational bottleneck in Bayesian functional mixed models. We propose a computational framework called the variational functional mixed model (VFMM). This new method facilitates efficient data compression and high-performance computing in basis space. We also propose a new multiple testing procedure in basis space, which can be used to detect significant local regions. The effectiveness of the proposed model is demonstrated through two datasets, a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer's disease study. The second part is about modeling data heterogeneity by using Dirichlet Diffusion Trees. We propose a Bayesian latent tree model that incorporates covariates of subjects to characterize the heterogeneity and uncover the latent tree structure underlying data. This innovative model may reveal the hierarchical evolution process through branch structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the model through a simulation study and a real brain tumor dataset.
- Bridging Machine Learning and Experimental Design for Enhanced Data Analysis and Optimization
Guo, Qing (Virginia Tech, 2024-07-19)
Experimental design is a powerful tool for gathering highly informative observations using a small number of experiments. The demand for smart data collection strategies is increasing due to the need to save time and budget, especially in online experiments and machine learning. However, the traditional experimental design method falls short in systematically assessing changing variables' effects. Specifically within Artificial Intelligence (AI), the challenge lies in assessing the impacts of model structures and training strategies on task performances with a limited number of trials. This shortfall underscores the necessity for the development of novel approaches. On the other side, the optimal design criterion has typically been model-based in classic design literature, which leads to restricting the flexibility of experimental design strategies. However, machine learning's inherent flexibility can empower the estimation of metrics efficiently using nonparametric and optimization techniques, thereby broadening the horizons of experimental design possibilities. In this dissertation, the aim is to develop a set of novel methods to bridge the merits between these two domains: 1) applying ideas from statistical experimental design to enhance data efficiency in machine learning, and 2) leveraging powerful deep neural networks to optimize experimental design strategies. This dissertation consists of 5 chapters. Chapter 1 provides a general introduction to mutual information, fractional factorial design, hyper-parameter tuning, multi-modality, etc. In Chapter 2, I propose a new mutual information estimator FLO by integrating techniques from variational inference (VAE), contrastive learning, and convex optimization. I apply FLO to broad data science applications, such as efficient data collection, transfer learning, fair learning, etc.
Chapter 3 introduces a new design strategy called multi-layer sliced design (MLSD) with the application of AI assurance. It focuses on exploring the effects of hyper-parameters under different models and optimization strategies. Chapter 4 investigates classic vision challenges via multimodal large language models by implicitly optimizing mutual information and thoroughly exploring training strategies. Chapter 5 concludes this proposal and discusses several future research topics.
- Change Detection and Analysis of Data with Heterogeneous Structures
Chu, Shuyu (Virginia Tech, 2017-07-28)
Heterogeneous data with different characteristics are ubiquitous in the modern digital world. For example, the observations collected from a process may change in mean or variance. In numerous applications, data are often of mixed types including both discrete and continuous variables. Heterogeneity also commonly arises in data when underlying models vary across different segments. Besides, the underlying pattern of data may change in different dimensions, such as in time and space. The diversity of heterogeneous data structures makes statistical modeling and analysis challenging. Detection of change-points in heterogeneous data has attracted great attention from a variety of application areas, such as quality control in manufacturing, protest event detection in social science, purchase likelihood prediction in business analytics, and organ state change in biomedical engineering. However, due to the extraordinary diversity of the heterogeneous data structures and complexity of the underlying dynamic patterns, the change-detection and analysis of such data is quite challenging. This dissertation aims to develop novel statistical modeling methodologies to analyze four types of heterogeneous data and to find change-points efficiently. The proposed approaches have been applied to solve real-world problems and can be potentially applied to a broad range of areas.
- A computational model for biosonar echoes from foliage
Ming, Chen; Gupta, Anupam Kumar; Lu, Ruijin; Zhu, Hongxiao; Müller, Rolf (PLOS, 2017-08-17)
Since many bat species thrive in densely vegetated habitats, echoes from foliage are likely to be of prime importance to the animals’ sensory ecology, be it as clutter that masks prey echoes or as sources of information about the environment. To better understand the characteristics of foliage echoes, a new model for the process that generates these signals has been developed. This model takes leaf size and orientation into account by representing the leaves as circular disks of varying diameter. The two added leaf parameters are of potential importance to the sensory ecology of bats, e.g., with respect to landmark recognition and flight guidance along vegetation contours. The full model is specified by a total of three parameters: leaf density, average leaf size, and average leaf orientation. It assumes that all leaf parameters are independently and identically distributed. Leaf positions were drawn from a uniform probability density function, sizes and orientations each from a Gaussian probability function. The model was found to reproduce the first-order amplitude statistics of measured example echoes and showed time-variant echo properties that depended on foliage parameters. Parameter estimation experiments using lasso regression have demonstrated that a single foliage parameter can be estimated with high accuracy if the other two parameters are known a priori. If only one parameter is known a priori, the other two can still be estimated, but with a reduced accuracy. Lasso regression did not support simultaneous estimation of all three parameters. Nevertheless, these results demonstrate that foliage echoes contain accessible information on foliage type and orientation that could play a role in supporting sensory tasks such as landmark identification and contour following in echolocating bats.
- Corporate Default Predictions and Methods for Uncertainty Quantifications
Yuan, Miao (Virginia Tech, 2016-08-01)
Regarding quantifying uncertainties in prediction, two projects with different perspectives and application backgrounds are presented in this dissertation. The goal of the first project is to predict the corporate default risks based on large-scale time-to-event and covariate data in the context of controlling credit risks. Specifically, we propose a competing risks model to incorporate exits of companies due to default and other reasons. Because of the stochastic and dynamic nature of the corporate risks, we incorporate both company-level and market-level covariate processes into the event intensities. We propose a parsimonious Markovian time series model and a dynamic factor model (DFM) to efficiently capture the mean and correlation structure of the high-dimensional covariate dynamics. For estimating parameters in the DFM, we derive an expectation maximization (EM) algorithm in explicit forms under necessary constraints. For multi-period default risks, we consider both the corporate-level and the market-level predictions. We also develop prediction interval (PI) procedures that synthetically take uncertainties in the future observation, parameter estimation, and the future covariate processes into account. In the second project, to quantify the uncertainties in the maximum likelihood (ML) estimators and compute the exact tolerance interval (TI) factors regarding the nominal confidence level, we propose algorithms for two-sided control-the-center and control-both-tails TI for complete or Type II censored data following the (log)-location-scale family of distributions. Our approaches are based on pivotal properties of ML estimators of parameters for the (log)-location-scale family and utilize Monte Carlo simulations. For Type I censored data, however, only approximate pivotal quantities exist, so an adjusted procedure is developed to compute the approximate factors. The observed coverage probability (CP) is shown to be asymptotically accurate by our simulation study. Our proposed methods are illustrated using real-data examples.
- 'Cut from the same cloth': Shared microsatellite variants among cancers link to ectodermal tissues-neural tube and crest cells
Karunasena, Enusha; McIver, Lauren J.; Bavarva, Jasmin H.; Wu, Xiaowei; Zhu, Hongxiao; Garner, Harold R. (Impact Journals, 2015-09-08)
- Dynamic Emission Baffle Inspired by Horseshoe Bat Noseleaves
Fu, Yanqing (Virginia Tech, 2016-03-04)
The evolution of bats is characterized by a combination of two key innovations - powered flight and biosonar - that are unique among mammals. Bats still outperform engineered systems in both capabilities by a large margin. Bat biosonar stands out for its ability to encode and extract sensory information using various mechanisms such as adaptive beam width control, dynamic sound emission and reception, as well as cognitive processes. Due to the highly integrated and sophisticated design of their active sonar system, bats can survive in complex and dense environments using just a few simple smart acoustic elements. On the sound emission side, significant features that distinguish bats from current man-made sonar systems are the time-variant shapes of the noseleaves. Noseleaves are baffles that surround the nostrils in bats with nasal pulse emission such as horseshoe bats and can undergo non-rigid deformations large enough to affect their acoustic properties significantly. Behavioral studies have shown that these movements are not random byproducts, but are due to specific muscular action. To understand the underlying physical and engineering principles of the dynamic sensing in horseshoe bats, two experimental prototypes, i.e., an intact noseleaf and a simplified noseleaf, have been used. We have integrated techniques of data acquisition, instrument control, additive manufacturing, signal processing, airborne acoustics, 3D modeling and image processing to facilitate this research. 3D models of horseshoe bat noseleaves were obtained by tomographic imaging, reconstructed, and modified in the digital domain to meet the needs of additive manufacturing of the prototype. Nostrils and anterior leaf were abstracted as an elliptical outlet and a concave baffle in the other prototype. As a reference, a circular outlet and a straight baffle were also designed.
A data acquisition and instrument control system has been developed and integrated with transducers to characterize the dynamic emission system acoustically as well as actuators for recreating the dynamics of the horseshoe bat noseleaf. A conical horn and tube waveguide was designed to couple the loudspeaker to the outlet of the bat noseleaf and the simplified baffles. A pan-tilt unit was used to characterize the acoustic properties of the deforming prototypes over direction. By using those techniques, the dynamic effect of the noseleaf was reproduced and characterized. The results suggested that the lancet rotation induced both beam-gain and beamwidth changes. The narrow outlet produced an isotropic beampattern, and the concave baffle had a significant time-variant and frequency-variant effect with just a small displacement. These results shed light on the possible functions of the biological morphology and provide new ideas for the design of engineered devices.
- Estimate the Unknown Environment with Biosonar Echoes—A Simulation Study
Tanveer, Muhammad Hassan; Thomas, Antony; Ahmed, Waqar; Zhu, Hongxiao (MDPI, 2021-06-18)
Unmanned aerial vehicles (UAVs) have shown great potential in various applications such as surveillance, search and rescue. To perform safe and efficient navigation, it is vitally important for a UAV to evaluate the environment accurately and promptly. In this work, we present a simulation study for the estimation of foliage distribution as a UAV equipped with biosonar navigates through a forest. Based on a simulated forest environment, foliage echoes are generated by using a bat-inspired biosonar simulator. These biosonar echoes are then used to estimate the spatial distribution of both sparsely and densely distributed tree leaves. While a simple batch processing method is able to estimate sparsely distributed leaf locations well, a wavelet scattering technique coupled with a support vector machine (SVM) classifier is shown to be effective in estimating densely distributed leaves. Our approach is validated by using multiple setups of leaf distributions in the simulated forest environment. Ninety-seven percent accuracy is obtained while estimating thickly distributed foliage.
- Evaluating and Improving Performance of Bisulfite Short Reads Alignment and the Identification of Differentially Methylated Sites
Tran, Hong Thi Thanh (Virginia Tech, 2018-01-18)
Large-scale bisulfite treatment and short reads sequencing technology allows comprehensive estimation of methylation states of Cs in the genomes of different tissues, cell types, and developmental stages. Accurate characterization of DNA methylation is essential for understanding genotype-phenotype association, gene and environment interaction, diseases, and cancer. The thesis work first evaluates the performance of several commonly used bisulfite short read mappers and investigates how pre-processing data might affect the performance. Aligning bisulfite short reads to a reference genome remains a challenging task. In practice, only a limited proportion of bisulfite treated DNA reads can be mapped uniquely (around 50-70%) while a significant proportion of reads (called multireads) are aligned to multiple genomic locations. The thesis outlines a strategy to improve the mapping efficiencies of the existing bisulfite short reads software by finding unique locations for multireads. Analyses of both simulated data and real hairpin bisulfite sequencing data show that our strategy can effectively assign approximately 70% of the multireads to their best locations with up to 90% accuracy, leading to a significant increase in the overall mapping efficiency. The most common and essential downstream task in DNA methylation analysis is to detect differentially methylated cytosines (DMCs). Although many statistical methods have been applied to detect DMCs, inconsistency in detecting differentially methylated sites among statistical tools remains. We adapt the wavelet-based functional mixed models (WFMM) to detect DMCs.
Analyses of simulated Arabidopsis data show that WFMM has higher sensitivities and specificities in detecting DMCs compared to existing methods, especially when methylation differences are small. Analyses of data from monozygotic twins with different pain sensitivity also show that WFMM can find more relevant DMCs related to pain sensitivity compared to methylKit. In addition, we provide a strategy to modify the default settings in both WFMM and methylKit to be more tailored to a given methylation profile, thus improving the accuracy of detecting DMCs. Population growth and climate change leave billions of people around the world living in water scarcity conditions. Therefore, utility of reclaimed water (treated wastewater) is pivotal for water sustainability. Recently, researchers discovered microbial regrowth problems in reclaimed water distribution systems (RWDs). The third part of the thesis involves: 1) identifying fundamental conditions that affect proliferation of antibiotic resistance genes (ARGs), 2) identifying the effect of water chemistry and water age on microbial regrowth, and 3) characterizing co-occurrence of ARGs and/or mobile genetic elements (MGEs), i.e., plasmids in simulated RWDs. Analyses of preliminary results from simulated RWDs show that biofilms, bulk water environment, temperature, and disinfectant types have significant influence on shaping antibiotic resistant bacteria (ARB) communities. In particular, biofilms create a favorable environment for ARGs to diversify but with lower total ARG populations. ARGs are the least diverse at 30°C and the most diverse at 22°C. Disinfectants reduce ARG populations as well as ARG diversity. Chloramines keep ARG populations and diversity at the lowest rate. Disinfectants work better in the bulk water environment than in biofilms in terms of shaping the resistome. Network analysis on assembly data is done to determine which ARG pairs co-occur most often.
The Bayesian network is more consistent with the co-occurrence network constructed from assembly data than the network based on Spearman's correlation of ARG abundance profiles.
- Foliage Echoes and Sensing in Natural Environments
Ming, Chen (Virginia Tech, 2017-09-07)
Foliage is a very common feature in the habitats of echolocating bats and thus its echoes constitute the major input of bats' sensory systems. Acquiring useful information from vegetation echoes significantly aids the bats in their navigation and foraging behaviors. To better understand the foliage echoes, in this dissertation, a computer model was constructed to simulate foliage echoes with the following simplifications: approximating leaves as circular disks, leaving out shading effects between leaves, and distributing leaves uniformly in space. One tree can then be described with three parameters in the model: leaf radius, orientation, and leaf density, where the first two determine the beampattern of each leaf. Compared with echoes collected from real trees, the simulated echoes are qualitatively accurate, i.e., they match in waveforms and also first-order statistics. Since the ground truth is known in the model, the three parameters were estimated with a lasso model by selecting 40 features from each echo. The results have shown that estimation of one parameter with the other two known is usually successful with a coefficient of determination close to one, and the classification still has reasonable accuracy when the number of known parameters is reduced to one. Besides, the three simplifications were examined with both experimental and simulation approaches. To assess the acoustic impact of leaf geometry on individual leaves, experiments were carried out by ensonifying leaves from both a single and different species. How the leaves' impulse responses change according to their equivalent radii was investigated. The simulation model of disks fits the experiments done with real leaves within one species and across species reasonably well. A shading effect was found to exist locally when two disks were 25 cm apart and both lay in the pulse direction.
In addition, an inhomogeneous distribution of leaves was introduced by using the branching patterns of an L-system. The evaluation of inhomogeneity in echoes produced with the two distributions shows that there is always inhomogeneity in echoes, and that the L-system model does bring more inhomogeneity, but not to the same extent as changes in the relative orientation between sonar beam and foliage do.
- Identification of Differentially Methylated Sites with Weak Methylation Effects
Tran, Hong T.; Zhu, Hongxiao; Wu, Xiaowei; Kim, Gunjune; Clarke, Christopher R.; Larose, Hailey; Haak, David C.; Askew, Shawn D.; Barney, Jacob; Westwood, James H.; Zhang, Liqing (MDPI, 2018-02-08)
Deoxyribonucleic acid (DNA) methylation is an epigenetic alteration crucial for regulating stress responses. Identifying large-scale DNA methylation at single nucleotide resolution is made possible by whole genome bisulfite sequencing. An essential task following the generation of bisulfite sequencing data is to detect differentially methylated cytosines (DMCs) among treatments. Most statistical methods for DMC detection do not consider the dependency of methylation patterns across the genome, thus possibly inflating type I error. Furthermore, small sample sizes and weak methylation effects among different phenotype categories make it difficult for these statistical methods to accurately detect DMCs. To address these issues, the wavelet-based functional mixed model (WFMM) was introduced to detect DMCs. To further examine the performance of WFMM in detecting weak differential methylation events, we used both simulated and empirical data and compared WFMM performance to a popular DMC detection tool, methylKit. Analyses of simulated data that replicated the effects of the herbicide glyphosate on DNA methylation in Arabidopsis thaliana show that WFMM results in higher sensitivity and specificity in detecting DMCs compared to methylKit, especially when the methylation differences among phenotype groups are small. Moreover, the performance of WFMM is robust with respect to small sample sizes, making it particularly attractive considering the current high costs of bisulfite sequencing.
Analysis of empirical Arabidopsis thaliana data under varying glyphosate dosages, and the analysis of monozygotic (MZ) twins who have different pain sensitivities—both datasets have weak methylation effects of <1%—show that WFMM can identify more relevant DMCs related to the phenotype of interest than methylKit. Differentially methylated regions (DMRs) are genomic regions with different DNA methylation status across biological samples. DMRs and DMCs are essentially the same concepts, with the only difference being how methylation information across the genome is summarized. If methylation levels are determined by grouping neighboring cytosine sites, then they are DMRs; if methylation levels are calculated based on single cytosines, they are DMCs.
- Identifying Transcriptional Regulatory Modules Among Different Chromatin States in Mouse Neural Stem CellsBanerjee, Sharmi; Zhu, Hongxiao; Tang, Man; Feng, Wu-chun; Wu, Xiaowei; Xie, Hehuang David (Frontiers, 2019-01-15)Gene expression regulation is a complex process involving the interplay between transcription factors and chromatin states. Significant progress has been made toward understanding the impact of chromatin states on gene expression. Nevertheless, the mechanism of transcription factors binding combinatorially in different chromatin states to enable selective regulation of gene expression remains an interesting research area. We introduce a nonparametric Bayesian clustering method for inhomogeneous Poisson processes to detect heterogeneous binding patterns of multiple proteins including transcription factors to form regulatory modules in different chromatin states. We applied this approach to ChIP-seq data for mouse neural stem cells containing 21 proteins and observed different groups or modules of proteins clustered within different chromatin states. These chromatin-state-specific regulatory modules were found to have significant influence on gene expression. We also observed different motif preferences for certain TFs between different chromatin states. Our results reveal a degree of interdependency between chromatin states and combinatorial binding of proteins in the complex transcriptional regulatory process. The software package is available on GitHub at https://github.com/BSharmi/DPM-LGCP.
- Nonparametric Bayesian clustering to detect bipolar methylated genomic lociWu, Xiaowei; Sun, Ming-an; Zhu, Hongxiao; Xie, Hehuang (BioMed Central, 2015-01-16)Background: With recent developments in sequencing technology, a large number of genome-wide DNA methylation studies have generated massive amounts of bisulfite sequencing data. The analysis of DNA methylation patterns helps researchers understand epigenetic regulatory mechanisms. Highly variable methylation patterns reflect stochastic fluctuations in DNA methylation, whereas well-structured methylation patterns imply deterministic methylation events. Among these methylation patterns, bipolar patterns are important as they may originate from allele-specific methylation (ASM) or cell-specific methylation (CSM). Results: Utilizing nonparametric Bayesian clustering followed by hypothesis testing, we have developed a novel statistical approach to identify bipolar methylated genomic regions in bisulfite sequencing data. Simulation studies demonstrate that the proposed method achieves good performance in terms of specificity and sensitivity. We used the method to analyze data from mouse brain and human blood methylomes. The bipolar methylated segments detected are highly consistent with the differentially methylated regions identified by using purified cell subsets. Conclusions: Bipolar DNA methylation often indicates epigenetic heterogeneity caused by ASM or CSM. With allele-specific events filtered out or appropriately taken into account, our proposed approach sheds light on the identification of cell-specific genes/pathways under strong epigenetic control in a heterogeneous cell population.
- Numerical analysis of bat noseleaf dynamics and its impact on the encoding of sensory informationGupta, Anupam Kumar (Virginia Tech, 2017-02-06)Horseshoe bats possess a sophisticated biosonar system that helps them to negotiate complex unstructured environments by relying primarily on sound as the far sense. For this, the bats emit brief ultrasonic pulses and listen to incoming echoes to learn about the environment. The sites of emission and reception in these bats are surrounded by baffle structures called "noseleaves" and "pinnae (outer ears)". These are the only places in the biosonar system where direction-dependent information gets encoded. Unlike engineered systems such as megaphones, these baffle structures have complex static geometry and can undergo fast deformations at the time of pulse emission/reception. However, the functional significance of the baffle motions in the biosonar system is not known. The current work primarily focuses on: i) the study of the impact of noseleaf dynamics on the outgoing sound waves, and ii) the study of the impact of baffle dynamics on the encoding of sensory information and the localization performance of bats. For this, we take a numerical approach where we use computer-animated digital models of bat noseleaves that mimic noseleaf dynamics as observed in bats. The shapes are acoustically characterized (beampatterns) numerically using a finite element implementation. These beampatterns are then analyzed using an information-theoretic approach. The following findings were obtained: i) noseleaf dynamics altered the spatial distribution of energy, ii) baffle dynamics result in the encoding of new sensory information, and iii) the new sensory information encoded due to baffle dynamics significantly improves the performance of the biosonar system on the two target localization tasks evaluated here -- direction resolution and direction estimation accuracy. 
These results affirm the importance of dynamics in the biosonar system of horseshoe bats and point to the possibility of biosonar dynamics as a key factor behind the astounding sensory capabilities of these animals that are not yet matched by engineering systems. Thus, these biosonar dynamic principles can help improve man-made sensing systems and help close the performance gap between active sensing in biology and in engineering.
- Scalable Estimation and Testing for Complex, High-Dimensional DataLu, Ruijin (Virginia Tech, 2019-08-22)With modern high-throughput technologies, scientists can now collect high-dimensional data of various forms, including brain images, medical spectrum curves, engineering signals, etc. These data provide a rich source of information on disease development, cell evolvement, engineering systems, and many other scientific phenomena. To achieve a clearer understanding of the underlying mechanism, one needs a fast and reliable analytical approach to extract useful information from the wealth of data. The goal of this dissertation is to develop novel methods that enable scalable estimation, testing, and analysis of complex, high-dimensional data. It contains three parts: parameter estimation based on complex data, powerful testing of functional data, and the analysis of functional data supported on manifolds. The first part focuses on a family of parameter estimation problems in which the relationship between data and the underlying parameters cannot be explicitly specified using a likelihood function. We introduce a wavelet-based approximate Bayesian computation approach that is likelihood-free and computationally scalable. This approach is applied to two problems: estimating mutation rates of a generalized birth-death process based on fluctuation experimental data and estimating the parameters of targets based on foliage echoes. The second part focuses on functional testing. We consider using multiple testing in basis-space via p-value guided compression. Our theoretical results demonstrate that, under regularity conditions, the Westfall-Young randomization test in basis space achieves strong control of family-wise error rate and asymptotic optimality. Furthermore, appropriate compression in basis space leads to improved power as compared to point-wise testing in the data domain or basis-space testing without compression. 
The effectiveness of the proposed procedure is demonstrated through two applications: the detection of regions of spectral curves associated with pre-cancer using 1-dimensional fluorescence spectroscopy data and the detection of disease-related regions using 3-dimensional Alzheimer's Disease neuroimaging data. The third part focuses on analyzing data measured on the cortical surfaces of monkeys' brains during their early development, and subjects are measured on misaligned time markers. In this analysis, we examine the asymmetric patterns and increase/decrease trend in the monkeys' brains across time.