Browsing by Author "Wu, Xiaowei"
- Antimicrobial Resistance Mitigation [ARM] Concept Paper
  Vikesland, Peter J.; Alexander, Kathleen A.; Badgley, Brian D.; Krometis, Leigh-Anne H.; Knowlton, Katharine F.; Gohlke, Julia M.; Hall, Ralph P.; Hawley, Dana M.; Heath, Lenwood S.; Hession, W. Cully; Hull, Robert Bruce IV; Moeltner, Klaus; Ponder, Monica A.; Pruden, Amy; Schoenholtz, Stephen H.; Wu, Xiaowei; Xia, Kang; Zhang, Liqing (Virginia Tech, 2017-05-15)
  The development of viable solutions to the global threat of antimicrobial resistance requires a transdisciplinary approach that simultaneously considers the clinical, biological, social, economic, and environmental drivers responsible for this emerging threat. The vision of the Antimicrobial Resistance Mitigation (ARM) group is to build upon and leverage the present strengths of Virginia Tech in ARM research and education using a multifaceted systems approach. Such a framework will empower our group to recognize the interconnectedness and interdependent nature of this threat and enable the delineation, development, and testing of resilient approaches for its mitigation. We seek to develop innovative and sustainable approaches that radically advance detection, characterization, and prevention of antimicrobial resistance emergence and dissemination in human-dominated and natural settings...
- Association testing for binary trees-A Markov branching process approach
  Wu, Xiaowei; Zhu, Hongxiao (Wiley, 2022-03-09)
  We propose a new approach to test associations between binary trees and covariates. In this approach, binary-tree structured data are treated as sample paths of binary fission Markov branching processes (bMBP). We propose a generalized linear regression model and develop inference procedures for association testing, including variable selection and estimation of covariate effects. Simulation studies show that these procedures are able to accurately identify covariates that are associated with the binary tree structure by impacting the rate parameter of the bMBP. The problem of association testing on binary trees is motivated by modeling hierarchical clustering dendrograms of pixel intensities in biomedical images. By using semi-synthetic data generated from a real brain-tumor image, our simulation studies show that the bMBP model is able to capture the characteristics of dendrogram trees in brain-tumor images. Our final analysis of the glioblastoma multiforme brain-tumor data from The Cancer Imaging Archive identified multiple clinical and genetic variables that are potentially associated with brain-tumor heterogeneity.
- A Bayesian Analysis of Copy Number Variations in Array Comparative Genomic Hybridization Data
  Wu, Xiaowei; Zhu, Hongxiao (OMICS International, 2015-09-25)
  Array Comparative Genomic Hybridization (CGH) has been widely used for detecting genomic copy number variations (CNVs). The central goal of array CGH data analysis is to accurately detect homogeneous regions of log intensity ratios which represent relative changes in DNA copy number. Various methods have been proposed in recent years. Most methods, however, do not consider correlations of neighboring probe measurements, and are usually designed for analysis at the single-sample level rather than for detecting common or recurrent CNVs among multiple samples. We propose a Bayesian segment-based approach for efficient analysis of array CGH data. The proposed method is based on simple assumptions but is general enough to accommodate various spatial correlations among probe measurements. It also allows for multiple samples with recurrent CNVs, and is therefore able to borrow strength across samples. In contrast to another probe-based approach developed in the same Bayesian framework, the segment-based approach parameterizes the mean log intensity ratios in a more appropriate way, which leads to a posterior sampling scheme based on reversible-jump Markov chain Monte Carlo. We perform a simulation study to compare these two approaches with the commonly used circular binary segmentation method and the Bayesian hidden Markov model method. The segment-based approach achieves better estimation accuracy and higher computational efficiency compared to the probe-based approach, and also provides improved results compared to the other two methods, especially for data with relatively low signal-to-noise ratio and high correlation. The segment-based approach is further applied to the Coriell cell lines data and Pancreatic Adenocarcinoma data.
- Computational Approaches to Predict Effect of Epigenetic Modifications on Transcriptional Regulation of Gene Expression
  Banerjee, Sharmi (Virginia Tech, 2019-10-07)
  This dissertation presents applications of machine learning and statistical approaches to infer protein-DNA bindings in the presence of epigenetic modifications. Epigenetic modifications are alterations to the DNA resulting in gene expression regulation where the structure of the DNA remains unaltered. It is a heritable and reversible modification and often involves addition or deletion of certain chemical compounds to the DNA. Histone modification is an epigenetic change that involves alteration of the histone proteins – thus changing the chromatin (DNA wound around histone proteins) structure – or addition of methyl-groups to the Cytosine base adjacent to a Guanine base. Epigenetic factors often interfere in gene expression regulation by promoting or inhibiting protein-DNA bindings. Such proteins are known as transcription factors. Transcription is the first step of gene expression where a particular segment of DNA is copied into the messenger-RNA (mRNA). Transcription factors orchestrate gene activity and are crucial for normal cell function in any organism. For example, deletion/mutation of certain transcription factors such as MEF2 has been associated with neurological disorders such as autism and schizophrenia. In this dissertation, different computational pipelines are described that use mathematical models to explain how the protein-DNA bindings are mediated by histone modifications and DNA-methylation affecting different regions of the brain at different stages of development. Multi-layer Markov models and inhomogeneous Poisson analyses are applied to brain data to show the impact of epigenetic factors on protein-DNA bindings.
Such data-driven approaches reinforce the importance of epigenetic factors in governing brain cell differentiation into different neuron types, regulation of memory, and promotion of normal brain development at the early stages of life.
- COREMIC: a web-tool to search for a niche associated CORE MICrobiome
  Rodrigues, Richard R.; Rodgers, Nyle C.; Wu, Xiaowei; Williams, Mark A. (PeerJ, 2018-02-15)
  Microbial diversity on earth is extraordinary, and soils alone harbor thousands of species per gram of soil. Understanding how this diversity is sorted and selected into habitat niches is a major focus of ecology and biotechnology but remains only vaguely understood. A systems-biology approach was used to mine information from databases to show how it can be used to answer questions related to the core microbiome of habitat-microbe relationships. By making use of the burgeoning growth of information from databases, our tool "COREMIC" meets a great need in the search for understanding niche partitioning and habitat-function relationships. Furthermore, the work is unique because it provides a user-friendly, statistically robust web-tool (http://coremic2.appspot.com or http://core-mic.com), developed using Google App Engine, to help in the process of database mining to identify the "core microbiome" associated with a given habitat. A case study is presented using data from 31 switchgrass rhizosphere community habitats across a diverse set of soil and sampling environments. The methodology utilizes an outgroup of 28 non-switchgrass (other grasses and forbs) to identify a core switchgrass microbiome. Even across a diverse set of soils (five environments) and under conservative statistical criteria (presence in more than 90% of samples and FDR q-val <0.05% for Fisher's exact test), a core set of bacteria associated with switchgrass was observed. These included, among others, closely related taxa from Lysobacter spp., Mesorhizobium spp., and Chitinophagaceae. These bacteria have been shown to have functions related to the production of bacterial and fungal antibiotics and plant growth promotion.
COREMIC can be used as a hypothesis-generating or confirmatory tool that shows great potential for identifying taxa that may be important to the functioning of a habitat (e.g. host plant). In conclusion, the case study shows that COREMIC can identify key habitat-specific microbes across diverse samples, using currently available databases and unique, freely available software.
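The two criteria named in the COREMIC abstract (presence in more than 90% of ingroup samples, plus Fisher's exact test against an outgroup) can be illustrated with a short sketch. This is not the COREMIC implementation: the function names, the data layout, and the plain `alpha` cutoff are assumptions made for illustration, and the FDR correction mentioned in the abstract is deliberately left out.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a/b: ingroup samples with/without the taxon,
    c/d: outgroup samples with/without the taxon.
    Returns P(X >= a) under the hypergeometric null.
    """
    n_in, n_with, total = a + b, a + c, a + b + c + d
    upper = min(n_in, n_with)
    tail = sum(
        comb(n_with, k) * comb(total - n_with, n_in - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, n_in)


def core_taxa(ingroup, outgroup, min_presence=0.9, alpha=0.05):
    """Return taxa that pass both the presence and the enrichment criterion.

    ingroup/outgroup map a taxon name to a list of per-sample presence flags.
    NOTE: hypothetical helper; no multiple-testing (FDR) correction is applied.
    """
    core = []
    for taxon, present_in in ingroup.items():
        present_out = outgroup.get(taxon, [])
        a = sum(present_in)
        b = len(present_in) - a
        c = sum(present_out)
        d = len(present_out) - c
        enriched = fisher_exact_greater(a, b, c, d) < alpha
        if a / len(present_in) > min_presence and enriched:
            core.append(taxon)
    return core


# Toy data shaped like the case study: 31 ingroup and 28 outgroup samples.
ingroup = {"taxonA": [True] * 30 + [False], "taxonB": [True] * 31}
outgroup = {"taxonA": [True] * 2 + [False] * 26, "taxonB": [True] * 28}
core = core_taxa(ingroup, outgroup)
```

In this toy input, `taxonB` is present everywhere, so it passes the presence filter but is not enriched relative to the outgroup; only `taxonA` satisfies both criteria.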
- 'Cut from the same cloth': Shared microsatellite variants among cancers link to ectodermal tissues-neural tube and crest cells
  Karunasena, Enusha; McIver, Lauren J.; Bavarva, Jasmin H.; Wu, Xiaowei; Zhu, Hongxiao; Garner, Harold R. (Impact Journals, 2015-09-08)
- Detecting clusters of transcription factors based on a nonhomogeneous Poisson process model
  Wu, Xiaowei; Liu, Shicheng; Liang, Guanying (2022-12-09)
  Background: Rapidly growing genome-wide ChIP-seq data have provided unprecedented opportunities to explore transcription factor (TF) binding under various cellular conditions. Despite the rich resources, development of analytical methods for studying the interaction among TFs in gene regulation still lags behind. Results: In order to address cooperative TF binding and detect TF clusters with coordinative functions, we have developed novel computational methods based on clustering the sample paths of nonhomogeneous Poisson processes. Simulation studies demonstrated the capability of these methods to accurately detect TF clusters and uncover the hierarchy of TF interactions. A further application to the multiple-TF ChIP-seq data in mouse embryonic stem cells (ESCs) showed that our methods identified the cluster of core ESC regulators reported in the literature and provided new insights on functional implications of transcriptional regulatory modules. Conclusions: Effective analytical tools are essential for studying protein-DNA relations. Information derived from this research will help us better understand the orchestration of transcription factors in gene regulation processes.
- Evaluating and Improving Performance of Bisulfite Short Reads Alignment and the Identification of Differentially Methylated Sites
  Tran, Hong Thi Thanh (Virginia Tech, 2018-01-18)
  Large-scale bisulfite treatment and short reads sequencing technology allow comprehensive estimation of methylation states of Cs in the genomes of different tissues, cell types, and developmental stages. Accurate characterization of DNA methylation is essential for understanding genotype-phenotype association, gene and environment interaction, diseases, and cancer. The thesis work first evaluates the performance of several commonly used bisulfite short read mappers and investigates how pre-processing data might affect the performance. Aligning bisulfite short reads to a reference genome remains a challenging task. In practice, only a limited proportion of bisulfite treated DNA reads can be mapped uniquely (around 50-70%) while a significant proportion of reads (called multireads) are aligned to multiple genomic locations. The thesis outlines a strategy to improve the mapping efficiencies of the existing bisulfite short reads software by finding unique locations for multireads. Analyses of both simulated data and real hairpin bisulfite sequencing data show that our strategy can effectively assign approximately 70% of the multireads to their best locations with up to 90% accuracy, leading to a significant increase in the overall mapping efficiency. The most common and essential downstream task in DNA methylation analysis is to detect differentially methylated cytosines (DMCs). Although many statistical methods have been applied to detect DMCs, inconsistency in detecting differentially methylated sites among statistical tools remains. We adapt the wavelet-based functional mixed models (WFMM) to detect DMCs.
Analyses of simulated Arabidopsis data show that WFMM has higher sensitivities and specificities in detecting DMCs compared to existing methods, especially when methylation differences are small. Analyses of data from monozygotic twins with different pain sensitivity also show that WFMM can find more relevant DMCs related to pain sensitivity compared to methylKit. In addition, we provide a strategy to modify the default settings in both WFMM and methylKit to be more tailored to a given methylation profile, thus improving the accuracy of detecting DMCs. Population growth and climate change leave billions of people around the world living in water scarcity conditions. Therefore, the use of reclaimed water (treated wastewater) is pivotal for water sustainability. Recently, researchers discovered microbial regrowth problems in reclaimed water distribution systems (RWDs). The third part of the thesis involves: 1) identifying fundamental conditions that affect proliferation of antibiotic resistance genes (ARGs), 2) identifying the effect of water chemistry and water age on microbial regrowth, and 3) characterizing co-occurrence of ARGs and/or mobile genetic elements (MGEs), i.e., plasmids, in simulated RWDs. Analyses of preliminary results from simulated RWDs show that biofilms, bulk water environment, temperature, and disinfectant types have significant influence on shaping antibiotic resistant bacteria (ARB) communities. In particular, biofilms create a favorable environment for ARGs to diversify but with lower total ARG populations. ARGs are the least diverse at 30°C and the most diverse at 22°C. Disinfectants reduce ARG populations as well as ARG diversity. Chloramines keep ARG populations and diversity at the lowest levels. Disinfectants work better in the bulk water environment than in biofilms in terms of shaping the resistome. Network analysis on assembly data is done to determine which ARG pairs co-occur most often.
A Bayesian network is more consistent with the co-occurrence network constructed from assembly data than a network based on Spearman's correlation of ARG abundance profiles.
- Evolutionary Genomics of Populus trichocarpa (Western Poplar)
  Bawa, Rajesh Kumar (Virginia Tech, 2017-08-15)
  Forest trees are an important pool of biodiversity at the gene, individual, and ecosystem levels. This variation is a result of complex environmental interactions, as well as neutral and selective forces acting on populations. Patterns of standing genetic variation are the result of adaptation to past and contemporary climate change, but also historical demographic events, and disentangling the role of these forces is a central problem in population genomics. The overall goal of this study is to characterize the relative effects of demography and selection in the genome of Populus trichocarpa, a riparian deciduous tree species of North America. Specifically, I used a variety of methods to summarize patterns of genetic diversity and population structure in P. trichocarpa, and to reconstruct its demographic history. I subsequently incorporated these demographic insights to guide the application of several methods to identify genome-wide targets of natural selection within and among rangewide populations adapted to heterogeneous selection regimes. Results of this study provide insights into the history of divergence and differentiation in P. trichocarpa populations and help us identify the functional genetic variants contributing to phenotypic divergence and fitness of the individuals in it.
- Identification of Differentially Methylated Sites with Weak Methylation Effects
  Tran, Hong T.; Zhu, Hongxiao; Wu, Xiaowei; Kim, Gunjune; Clarke, Christopher R.; Larose, Hailey; Haak, David C.; Askew, Shawn D.; Barney, Jacob; Westwood, James H.; Zhang, Liqing (MDPI, 2018-02-08)
  Deoxyribonucleic acid (DNA) methylation is an epigenetic alteration crucial for regulating stress responses. Identifying large-scale DNA methylation at single nucleotide resolution is made possible by whole genome bisulfite sequencing. An essential task following the generation of bisulfite sequencing data is to detect differentially methylated cytosines (DMCs) among treatments. Most statistical methods for DMC detection do not consider the dependency of methylation patterns across the genome, thus possibly inflating type I error. Furthermore, small sample sizes and weak methylation effects among different phenotype categories make it difficult for these statistical methods to accurately detect DMCs. To address these issues, the wavelet-based functional mixed model (WFMM) was introduced to detect DMCs. To further examine the performance of WFMM in detecting weak differential methylation events, we used both simulated and empirical data and compared WFMM performance to a popular DMC detection tool, methylKit. Analyses of simulated data that replicated the effects of the herbicide glyphosate on DNA methylation in Arabidopsis thaliana show that WFMM results in higher sensitivity and specificity in detecting DMCs compared to methylKit, especially when the methylation differences among phenotype groups are small. Moreover, the performance of WFMM is robust with respect to small sample sizes, making it particularly attractive considering the current high costs of bisulfite sequencing.
Analysis of empirical Arabidopsis thaliana data under varying glyphosate dosages, and the analysis of monozygotic (MZ) twins who have different pain sensitivities—both datasets have weak methylation effects of <1%—show that WFMM can identify more relevant DMCs related to the phenotype of interest than methylKit. Differentially methylated regions (DMRs) are genomic regions with different DNA methylation status across biological samples. DMRs and DMCs are essentially the same concepts, with the only difference being how methylation information across the genome is summarized. If methylation levels are determined by grouping neighboring cytosine sites, then they are DMRs; if methylation levels are calculated based on single cytosines, they are DMCs.
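The distinction drawn above between DMCs and DMRs comes down to how methylation levels are summarized. The sketch below shows both summaries on the same read counts. The input layout (position mapped to methylated and total read counts) and the fixed-width windows used to group neighboring cytosines are assumptions chosen for illustration; they are not taken from WFMM or methylKit.

```python
def site_levels(counts):
    """Per-cytosine methylation level: methylated reads / total reads."""
    return {pos: m / t for pos, (m, t) in counts.items() if t > 0}


def region_levels(counts, window):
    """Pool neighboring cytosines into fixed-width windows (an assumed grouping)."""
    pooled = {}
    for pos, (m, t) in counts.items():
        start = (pos // window) * window  # left edge of the window holding pos
        pm, pt = pooled.get(start, (0, 0))
        pooled[start] = (pm + m, pt + t)
    return {start: m / t for start, (m, t) in pooled.items() if t > 0}


# position -> (methylated reads, total reads); toy numbers
counts = {101: (8, 10), 105: (2, 10), 230: (5, 5)}
sites = site_levels(counts)
regions = region_levels(counts, window=100)
```

Here the two cytosines at 101 and 105 have very different single-site levels (0.8 and 0.2) but contribute one pooled value (0.5) to the region starting at 100, which is the sense in which the two concepts differ only in summarization.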
- Identifying and Analyzing Indel Variants in the Human Genome Using Computational Approaches
  Hasan, Mohammad Shabbir (Virginia Tech, 2019-07-01)
  Insertion and deletion (indel), a common form of genetic variation, has been shown to cause or contribute to human genetic diseases and cancer. Despite this importance and being the second most abundant variant type in the human genome, indels have not been studied as much as the single nucleotide polymorphism (SNP). With the advance of next-generation sequencing technology, many indel calling tools have been developed. However, performance comparison of commonly used tools has shown that (1) the tools have limited power in identifying indels and a significant number of indels go undetected, and (2) there is significant disagreement among the indel sets produced by the tools. These findings indicate the necessity of improving the existing tools or developing new algorithms to achieve reliable and consistent indel calling results. Two indels are biologically equivalent if the resulting sequences are the same. Storing biologically equivalent indels as distinct entries in databases causes data redundancy and misleads downstream analysis. It is thus desirable to have a unified system for identifying and representing equivalent indels. This dissertation describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system. Results show that UPS-indel identifies more redundant indels than existing algorithms. While mapping short reads to the reference genome, a significant number of short reads are unmapped and excluded from downstream analyses, thereby causing information loss in the subsequent variant calling. This dissertation describes Genesis-indel, a computational pipeline that explores the unmapped reads to identify missing novel indels.
Results from analyzing sequence alignments of 30 breast cancer patients show that Genesis-indel identifies many novel indels that also show significant enrichment in oncogenes and tumor suppressor genes, demonstrating the importance of rescuing indels hidden in the unmapped reads in cancer and disease studies. Somatic mutations play a vital role in transforming healthy cells into cancer cells. Therefore, accurate identification of somatic mutations is essential. Many somatic mutation callers are available with different strengths and weaknesses. An ensemble approach integrating the power of the callers is warranted. This dissertation describes SomaticHunter, an ensemble of two callers, namely Platypus and VarDict. Results on synthetic tumor data show that for both SNPs and indels, SomaticHunter achieves recall comparable to the state-of-the-art somatic mutation callers and the highest precision, resulting in the highest F1 score.
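The definition used in the abstract above, that two indels are biologically equivalent if the resulting sequences are the same, can be checked directly by applying each indel to the reference. The sketch below does only that brute-force check; it is not the UPS-indel coordinate system, and the tuple encoding of an indel (`("DEL", pos, length)` or `("INS", pos, seq)`, 0-based) is an assumption made for this example.

```python
def apply_indel(ref, indel):
    """Apply one indel to a reference string and return the altered sequence.

    indel is ("DEL", pos, length) or ("INS", pos, inserted_sequence),
    with 0-based positions (an encoding assumed for this sketch).
    """
    kind, pos, payload = indel
    if kind == "DEL":
        return ref[:pos] + ref[pos + payload:]
    if kind == "INS":
        return ref[:pos] + payload + ref[pos:]
    raise ValueError(f"unknown indel type: {kind}")


def equivalent(ref, indel_a, indel_b):
    """Two indels are equivalent if they yield the same altered sequence."""
    return apply_indel(ref, indel_a) == apply_indel(ref, indel_b)


ref = "GATTTTCA"
# Deleting any single T from the homopolymer run gives the same sequence.
same_deletion = equivalent(ref, ("DEL", 2, 1), ("DEL", 5, 1))
# Deleting the A at position 1 gives a different sequence.
different_deletion = equivalent(ref, ("DEL", 1, 1), ("DEL", 2, 1))
# Inserting a T before or after the run is also indistinguishable.
same_insertion = equivalent(ref, ("INS", 2, "T"), ("INS", 6, "T"))
```

A database that stored `("DEL", 2, 1)` and `("DEL", 5, 1)` as separate entries would hold two records for one variant, which is the redundancy the abstract describes.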
- Identifying Transcriptional Regulatory Modules Among Different Chromatin States in Mouse Neural Stem Cells
  Banerjee, Sharmi; Zhu, Hongxiao; Tang, Man; Feng, Wu-chun; Wu, Xiaowei; Xie, Hehuang David (Frontiers, 2019-01-15)
  Gene expression regulation is a complex process involving the interplay between transcription factors and chromatin states. Significant progress has been made toward understanding the impact of chromatin states on gene expression. Nevertheless, the mechanism of transcription factors binding combinatorially in different chromatin states to enable selective regulation of gene expression remains an interesting research area. We introduce a nonparametric Bayesian clustering method for inhomogeneous Poisson processes to detect heterogeneous binding patterns of multiple proteins including transcription factors to form regulatory modules in different chromatin states. We applied this approach on ChIP-seq data for mouse neural stem cells containing 21 proteins and observed different groups or modules of proteins clustered within different chromatin states. These chromatin-state-specific regulatory modules were found to have significant influence on gene expression. We also observed different motif preferences for certain TFs between different chromatin states. Our results reveal a degree of interdependency between chromatin states and combinatorial binding of proteins in the complex transcriptional regulatory process. The software package is available on GitHub at https://github.com/BSharmi/DPM-LGCP.
- The impact of spatial correlation on methylation entropy with application to mouse brain methylome
  Wu, Xiaowei; Choi, Joung Min (2023-02-04)
  Background: With the advance of bisulfite sequencing technologies, massive amounts of methylation data have been generated, which provide unprecedented opportunities to study the epigenetic mechanism and its relationship to other biological processes. A commonly seen feature of the methylation data is the correlation between nearby CpG sites. Although such a spatial correlation was utilized in several epigenetic studies, its interaction with other characteristics of the methylation data has not been fully investigated. Results: We filled this research gap from an information theoretic perspective, by exploring the impact of the spatial correlation on the methylation entropy (ME). With the spatial correlation taken into account, we derived the analytical relation between the ME and another key parameter, the methylation probability. By comparing it to the empirical relation between the two corresponding statistics, the observed ME and the mean methylation level, genomic loci under strong epigenetic control can be identified, which may serve as potential markers for cell-type specific methylation. The proposed method was validated by simulation studies, and applied to analyze a published dataset of mouse brain methylome. Conclusions: Compared to other sophisticated methods developed in literature, the proposed method provides a simple but effective way to detect CpG segments under strong epigenetic control (e.g., with bipolar methylation pattern). Findings from this study shed light on the identification of cell-type specific genes/pathways based on methylation data from a mixed cell population.
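The two statistics compared in the abstract above, the observed ME and the mean methylation level, can be computed from read-level methylation patterns as in the sketch below. The abstract does not state the ME formula, so the definition used here (Shannon entropy of the observed patterns over a CpG segment, divided by the number of CpG sites) is an assumption based on common usage, and the analytical relation that accounts for spatial correlation is not reproduced.

```python
from collections import Counter
from itertools import product
from math import log2


def methylation_entropy(patterns):
    """Entropy of read-level methylation patterns, normalized per CpG site.

    patterns: equal-length strings over {"0", "1"}, one per read,
    where "1" marks a methylated CpG. Definition assumed, see lead-in.
    """
    n_sites = len(patterns[0])
    total = len(patterns)
    counts = Counter(patterns)
    entropy_bits = sum((c / total) * log2(total / c) for c in counts.values())
    return entropy_bits / n_sites


def mean_methylation(patterns):
    """Fraction of methylated CpG calls across all reads and sites."""
    methylated = sum(p.count("1") for p in patterns)
    return methylated / (len(patterns) * len(patterns[0]))


# Bipolar segment: half the reads fully methylated, half fully unmethylated.
bipolar = ["1111"] * 8 + ["0000"] * 8
# Maximally disordered segment: every 4-CpG pattern observed once.
disordered = ["".join(bits) for bits in product("01", repeat=4)]

bipolar_me = methylation_entropy(bipolar)
disordered_me = methylation_entropy(disordered)
```

Both toy segments have a mean methylation level of 0.5, yet their entropies differ (0.25 versus 1.0), which illustrates why comparing ME against the mean level can single out bipolar segments.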
- Integrative single-cell omics analyses reveal epigenetic heterogeneity in mouse embryonic stem cells
  Luo, Yanting; He, Jianlin; Xu, Xiguang; Sun, Ming-an; Wu, Xiaowei; Lu, Xuemei; Xie, Hehuang David (PLOS, 2018-03)
  Embryonic stem cells (ESCs) consist of a population of self-renewing cells displaying extensive phenotypic and functional heterogeneity. Research towards the understanding of the epigenetic mechanisms underlying the heterogeneity among ESCs is still in its initial stage. Key issues, such as how to identify cell-subset specifically methylated loci and how to interpret the biological meanings of methylation variations remain largely unexplored. To fill in the research gap, we implemented a computational pipeline to analyze single-cell methylome and to perform an integrative analysis with single-cell transcriptome data. According to the origins of variation in DNA methylation, we determined the genomic loci associated with allelic-specific methylation or asymmetric DNA methylation, and explored a beta mixture model to infer the genomic loci exhibiting cell-subset specific methylation (CSM). We observed that the putative CSM loci in ESCs are significantly enriched in CpG island (CGI) shelves and regions with histone marks for promoter and enhancer, and the genes hosting putative CSM loci show wide-ranging expression among ESCs. More interestingly, the putative CSM loci may be clustered into co-methylated modules enriching the binding motifs of distinct sets of transcription factors. Taken together, our study provided a novel tool to explore single-cell methylome and transcriptome to reveal the underlying transcriptional regulatory networks associated with epigenetic heterogeneity of ESCs.
- Jaccard distance based weighted sparse representation for coarse-to-fine plant species recognition
  Zhang, Shanwen; Wu, Xiaowei; You, Zhuhong (PLOS, 2017-06-07)
  Leaf based plant species recognition plays an important role in ecological protection; however, its application to large and modern leaf databases has been a long-standing obstacle due to the computational cost and feasibility. Recognizing such limitations, we propose a Jaccard distance based sparse representation (JDSR) method which adopts a two-stage, coarse to fine strategy for plant species recognition. In the first stage, we use the Jaccard distance between the test sample and each training sample to coarsely determine the candidate classes of the test sample. The second stage includes a Jaccard distance based weighted sparse representation based classification (WSRC), which aims to approximately represent the test sample in the training space, and classify it by the approximation residuals. Since the training model of our JDSR method involves much fewer but more informative representatives, this method is expected to overcome the limitation of high computational and memory costs in traditional sparse representation based classification. Comparative experimental results on a public leaf image database demonstrate that the proposed method outperforms other existing feature extraction and SRC based plant recognition methods in terms of both accuracy and computational speed.
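The first, coarse stage described above ranks training samples by Jaccard distance to the test sample and keeps only the closest classes. A minimal sketch of that stage follows. It assumes each leaf image has already been reduced to a set of discrete feature identifiers, which the abstract does not specify, and the class labels, the `k` cutoff, and the function names are hypothetical; the second stage (weighted sparse representation) is not shown.

```python
def jaccard_distance(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def candidate_classes(test_features, training, k=2):
    """Return the k classes whose nearest training sample is closest to the test.

    training: list of (class_label, feature_set) pairs.
    """
    best = {}
    for label, features in training:
        d = jaccard_distance(test_features, features)
        best[label] = min(d, best.get(label, 1.0))
    return sorted(best, key=best.get)[:k]


training = [
    ("oak", {1, 2, 3, 4}),
    ("maple", {3, 4, 5, 6}),
    ("pine", {7, 8, 9}),
]
test_sample = {1, 2, 3, 5}
candidates = candidate_classes(test_sample, training, k=2)
```

Restricting the second stage to the classes in `candidates` is what shrinks the training dictionary, which is the source of the computational savings the abstract claims.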
- Mapping Bisulfite-Treated Short DNA Reads
  Porter, Jacob Stuart (Virginia Tech, 2018-04-23)
  Epigenetics are stable heritable traits that are not a result of the DNA sequence. Epigenetic modification of DNA cytosine plays a role in development and disease. The covalent bonding of a methyl group or a hydroxymethyl group to the 5-carbon of cytosine epigenetically modifies cytosine to 5-methylcytosine or 5-hydroxymethylcytosine. Upon PCR amplification, the bisulfite treatment of DNA converts unmethylated cytosine to thymine, while 5-methylcytosine, 5-hydroxymethylcytosine, and other bases remain unchanged. The resulting sequences can be mapped to a reference genome; however, this can be challenging due to sequencing technology complexity, low sequence complexity, and biases and errors introduced with bisulfite treatment. Once the short read is mapped, the identity of 5-methylcytosine or 5-hydroxymethylcytosine can be determined by comparing the mapped read to the aligned reference genome. Bisulfite DNA read mapping is characterized by mapping performance as low as 40%. This research improves bisulfite short read mapping quality. First, reads generated from the bisulfite hairpin PCR protocol are used to study mapping failure and solutions. A read may not map to the genome; it may map uniquely, or it may map to multiple locations. Sequence complexity correlates with these mapping categories. The hairpin protocol allows for a recovery, in some cases, of the original untreated read, and mapping this read with the regular read mapper Bowtie2 improved mapper performance by 10%. New bisulfite read mapping software called BisPin was created that calls BFAST (BLAT-like Fast Accurate Search Tool) for mapping. BisPin resolves ambiguously mapped reads with a rescoring strategy, which yields a statistically significant improvement.
BFAST-Gap for Ion Torrent reads was developed, since Ion Torrent machines are less expensive than Illumina machines and since Ion Torrent reads are longer. There are few mappers for Ion Torrent data. BFAST-Gap uses homopolymer run length for contextual gap penalty functions, since homopolymer runs cause errors in Ion Torrent reads. In conjunction with BisPin, this software performed well on real and simulated bisulfite Ion Torrent data and Illumina data. InfoTrim, a read trimmer with an entropy term, was developed with competitive results.
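The conversion rule stated in the abstract above (unmethylated cytosine reads as thymine after bisulfite treatment and PCR, while methylated cytosine stays cytosine) and the read-versus-reference comparison used to call methylation can be illustrated in a few lines. This is a forward-strand, error-free toy model with hypothetical function names; it is not how BisPin or BFAST-Gap work internally and it ignores mismatches, gaps, and the reverse strand.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on the forward strand.

    Unmethylated C becomes T; C at a methylated position is left unchanged.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read, offset=0):
    """Compare an aligned read to the reference at every reference cytosine.

    Returns {reference_position: True if methylated, False if converted}.
    Positions where the read shows neither C nor T are skipped.
    """
    calls = {}
    for i, base in enumerate(read):
        pos = offset + i
        if reference[pos] != "C":
            continue
        if base == "C":
            calls[pos] = True
        elif base == "T":
            calls[pos] = False
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
```

The converted read (`ACGTTTGA`) no longer matches the reference exactly, which is one reason mapping bisulfite reads is harder than mapping untreated reads.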
- Modeling Neutral Evolution Using an Infinite-Allele Markov Branching Process
  Wu, Xiaowei; Kimmel, Marek (Hindawi, 2013-03-17)
  We consider an infinite-allele Markov branching process (IAMBP). Our main focus is the frequency spectrum of this process, that is, the proportion of alleles having a given number of copies at a specified time point. We derive the variance of the frequency spectrum, which is useful for interval estimation and hypothesis testing for process parameters. In addition, for a class of special IAMBP with birth and death offspring distribution, we show that the mean of its limiting frequency spectrum has an explicit form in terms of the hypergeometric function. We also derive an asymptotic expression for convergence rate to the limit. Simulations are used to illustrate the results for the birth and death process.
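The frequency spectrum defined above, the proportion of alleles having a given number of copies at one time point, is straightforward to compute from a snapshot of a population. The sketch below assumes the snapshot is a list with one allele label per individual, which is an illustrative data layout; it computes the empirical spectrum only and does not simulate the branching process or reproduce the variance and limit results of the paper.

```python
from collections import Counter


def frequency_spectrum(population):
    """Map copy number k to the proportion of alleles present in exactly k copies.

    population: iterable with one allele label per individual (assumed layout).
    """
    copies_per_allele = Counter(population)
    n_alleles = len(copies_per_allele)
    alleles_per_copy_number = Counter(copies_per_allele.values())
    return {
        k: count / n_alleles
        for k, count in sorted(alleles_per_copy_number.items())
    }


# Four distinct alleles: "b" and "d" once, "a" twice, "c" three times.
snapshot = ["a", "a", "b", "c", "c", "c", "d"]
spectrum = frequency_spectrum(snapshot)
```

For this snapshot, half of the alleles are singletons, so the spectrum puts 0.5 on one copy and 0.25 each on two and three copies.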
- Nonparametric Bayesian clustering to detect bipolar methylated genomic loci
  Wu, Xiaowei; Sun, Ming-an; Zhu, Hongxiao; Xie, Hehuang (Biomed Central, 2015-01-16)
  Background: With recent development in sequencing technology, a large number of genome-wide DNA methylation studies have generated massive amounts of bisulfite sequencing data. The analysis of DNA methylation patterns helps researchers understand epigenetic regulatory mechanisms. Highly variable methylation patterns reflect stochastic fluctuations in DNA methylation, whereas well-structured methylation patterns imply deterministic methylation events. Among these methylation patterns, bipolar patterns are important as they may originate from allele-specific methylation (ASM) or cell-specific methylation (CSM). Results: Utilizing nonparametric Bayesian clustering followed by hypothesis testing, we have developed a novel statistical approach to identify bipolar methylated genomic regions in bisulfite sequencing data. Simulation studies demonstrate that the proposed method achieves good performance in terms of specificity and sensitivity. We used the method to analyze data from mouse brain and human blood methylomes. The bipolar methylated segments detected are found highly consistent with the differentially methylated regions identified by using purified cell subsets. Conclusions: Bipolar DNA methylation often indicates epigenetic heterogeneity caused by ASM or CSM. With allele-specific events filtered out or appropriately taken into account, our proposed approach sheds light on the identification of cell-specific genes/pathways under strong epigenetic control in a heterogeneous cell population.
- Novel Statistical Methods for Multiple-variant Genetic Association Studies with Related IndividualsGuan, Ting (Virginia Tech, 2018-07-09)Genetic association studies usually include related individuals. Meanwhile, high-throughput sequencing technologies produce data of multiple genetic variants. Due to linkage disequilibrium (LD) and familial relatedness, the genotype data from such studies often carries complex correlations. Moreover, missing values in genotype usually lead to loss of power in genetic association tests. Also, repeated measurements of phenotype and dynamic covariates from longitudinal studies bring in more opportunities but also challenges in the discovery of disease-related genetic factors. This dissertation focuses on developing novel statistical methods to address some challenging questions remaining in genetic association studies due to the aforementioned reasons. So far, a lot of methods have been proposed to detect disease-related genetic regions (e.g., genes, pathways). However, with multiple-variant data from a sample with relatedness, it is critical to account for the complex genotypic correlations when assessing genetic contribution. Recognizing the limitations of existing methods, in the first work of this dissertation, the Adaptive-weight Burden Test (ABT) --- a score test between a quantitative trait and the genotype data with complex correlations --- is proposed. ABT achieves higher power by adopting data-driven weights, which make good use of the LD and relatedness. Because the null distribution has been successfully derived, the computational simplicity of ABT makes it a good fit for genome-wide association studies. Genotype missingness commonly arises due to limitations in genotyping technologies. Imputation of the missing values in genotype usually improves quality of the data used in the subsequent association test and thus increases power. 
Complex correlations, though troublesome, provide an opportunity for proper handling of genotypic missingness. In the second part of this dissertation, a genotype imputation method is developed, which can impute the missingness in multiple genetic variants via the LD and the relatedness. The popularity of longitudinal studies in genetics and genomics calls for methods deliberately designed for repeated measurements. Therefore, a multiple-variant genetic association test for a longitudinal trait on samples with relatedness is developed, which treats the longitudinal measurements as observations of functions and thus takes into account the time factor properly.
- Performance evaluation of indel calling tools using real short-read dataHasan, Mohammad Shabbir; Wu, Xiaowei; Zhang, Liqing (Biomed Central, 2015-08-19)Background: Insertion and deletion (indel), a common form of genetic variation, has been shown to cause or contribute to human genetic diseases and cancer. With the advance of next-generation sequencing technology, many indel calling tools have been developed; however, evaluation and comparison of these tools using large-scale real data are still scant. Here we evaluated seven popular and publicly available indel calling tools, GATK Unified Genotyper, VarScan, Pindel, SAMtools, Dindel, GATK HaplotypeCaller, and Platypus, using low-coverage data for 78 human genomes from the 1000 Genomes project. Results: Comparing indels called by these tools with a known set of indels, we found that Platypus outperforms other tools. In addition, a high percentage of known indels still remain undetected and the number of common indels called by all seven tools is very low. Conclusion: All these findings indicate the necessity of improving the existing tools or developing new algorithms to achieve reliable and consistent indel calling results.