Evaluating and Improving Performance of Bisulfite Short Reads Alignment and the Identification of Differentially Methylated Sites

dc.contributor.author: Tran, Hong Thi Thanh
dc.contributor.committeechair: Zhang, Liqing
dc.contributor.committeemember: Wu, Xiaowei
dc.contributor.committeemember: Zhu, Hongxiao
dc.contributor.committeemember: Xie, Hehuang David
dc.contributor.department: Genetics, Bioinformatics, and Computational Biology
dc.date.accessioned: 2018-01-19T09:00:46Z
dc.date.available: 2018-01-19T09:00:46Z
dc.date.issued: 2018-01-18
dc.description.abstract: Large-scale bisulfite treatment and short-read sequencing technology allow comprehensive estimation of the methylation states of cytosines (Cs) in the genomes of different tissues, cell types, and developmental stages. Accurate characterization of DNA methylation is essential for understanding genotype-phenotype association, gene-environment interaction, diseases, and cancer. The thesis first evaluates the performance of several commonly used bisulfite short-read mappers and investigates how pre-processing the data might affect their performance. Aligning bisulfite short reads to a reference genome remains a challenging task. In practice, only about 50-70% of bisulfite-treated DNA reads can be mapped uniquely, while a significant proportion of reads (called multireads) align to multiple genomic locations. The thesis outlines a strategy to improve the mapping efficiency of existing bisulfite short-read software by finding unique locations for multireads. Analyses of both simulated data and real hairpin bisulfite sequencing data show that our strategy can assign approximately 70% of the multireads to their best locations with up to 90% accuracy, leading to a significant increase in overall mapping efficiency. The most common and essential downstream task in DNA methylation analysis is to detect differentially methylated cytosines (DMCs). Although many statistical methods have been applied to detect DMCs, statistical tools remain inconsistent in the sites they report. We adapt wavelet-based functional mixed models (WFMM) to detect DMCs. Analyses of simulated Arabidopsis data show that WFMM has higher sensitivity and specificity in detecting DMCs than existing methods, especially when methylation differences are small.
Analyses of data from monozygotic twins with different pain sensitivity also show that WFMM finds more DMCs relevant to pain sensitivity than methylKit does. In addition, we provide a strategy for modifying the default settings of both WFMM and methylKit so that they are better tailored to a given methylation profile, thus improving the accuracy of DMC detection. Population growth and climate change leave billions of people around the world living in water-scarce conditions. The use of reclaimed water (treated wastewater) is therefore pivotal for water sustainability. Recently, researchers discovered microbial regrowth problems in reclaimed water distribution systems (RWDs). The third part of the thesis involves: 1) identifying fundamental conditions that affect the proliferation of antibiotic resistance genes (ARGs), 2) identifying the effect of water chemistry and water age on microbial regrowth, and 3) characterizing the co-occurrence of ARGs and/or mobile genetic elements (MGEs), i.e., plasmids, in simulated RWDs. Analyses of preliminary results from simulated RWDs show that biofilms, the bulk water environment, temperature, and disinfectant type significantly influence the shaping of antibiotic-resistant bacteria (ARB) communities. In particular, biofilms create a favorable environment for ARGs to diversify, but with lower total ARG populations. ARGs are the least diverse at 30°C and the most diverse at 22°C. Disinfectants reduce both ARG populations and ARG diversity. Chloramines keep ARG populations and diversity at the lowest levels. Disinfectants shape the resistome more effectively in the bulk water environment than in biofilms. Network analysis of assembly data is used to determine which ARG pairs co-occur most often. The Bayesian network is more consistent with the co-occurrence network constructed from assembly data than is the network based on Spearman's correlation of ARG abundance profiles.
dc.description.degree: Ph. D.
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:13563
dc.identifier.uri: http://hdl.handle.net/10919/81861
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Sequence Analysis
dc.subject: Methylation
dc.subject: Bisulfite short next generation sequence mapping
dc.subject: Bayesian statistics
dc.title: Evaluating and Improving Performance of Bisulfite Short Reads Alignment and the Identification of Differentially Methylated Sites
dc.type: Dissertation
thesis.degree.discipline: Genetics, Bioinformatics, and Computational Biology
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: doctoral
thesis.degree.name: Ph. D.

Files

Original bundle
Name: Tran_HT_D_2018.pdf
Size: 4.75 MB
Format: Adobe Portable Document Format