Statistical methods for variant discovery and functional genomic analysis using next-generation sequencing data

dc.contributor.authorTang, Manen
dc.contributor.committeechairWu, Xiaoweien
dc.contributor.committeememberZhang, Liqingen
dc.contributor.committeememberKim, Inyoungen
dc.contributor.committeememberFranck, Christopher Thomasen
dc.contributor.departmentStatisticsen
dc.date.accessioned2021-06-27T06:00:18Zen
dc.date.available2021-06-27T06:00:18Zen
dc.date.issued2020-01-03en
dc.description.abstractThe development of high-throughput next-generation sequencing (NGS) techniques produces massive amount of data, allowing the identification of biomarkers in early disease diagnosis and driving the transformation of most disciplines in biology and medicine. A greater concentration is needed in developing novel, powerful, and efficient tools for NGS data analysis. This dissertation focuses on modeling ``omics'' data in various NGS applications with a primary goal of developing novel statistical methods to identify sequence variants, find transcription factor (TF) binding patterns, and decode the relationship between TF and gene expression levels. Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in NGS applications. Existing methods for calling these variants often make simplified assumption of positional independence and fail to leverage the dependence of genotypes at nearby loci induced by linkage disequilibrium. We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short read data. Simulation experiments show that, under various sequencing depths, vi-HMM outperforms existing methods in terms of sensitivity and F1 score. When applied to the human whole genome sequencing data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs. One important NGS application is chromatin immunoprecipitation followed by sequencing (ChIP-seq), which characterizes protein-DNA relations through genome-wide mapping of TF binding sites. Multiple TFs, binding to DNA sequences, often show complex binding patterns, which indicate how TFs with similar functionalities work together to regulate the expression of target genes. To help uncover the transcriptional regulation mechanism, we propose a novel nonparametric Bayesian method to detect the clustering pattern of multiple-TF bindings from ChIP-seq datasets. Simulation study demonstrates that our method performs best with regard to precision, recall, and F1 score, in comparison to traditional methods. We also apply the method on real data and observe several TF clusters that have been recognized previously in mouse embryonic stem cells. Recent advances in ChIP-seq and RNA sequencing (RNA-Seq) technologies provides more reliable and accurate characterization of TF binding sites and gene expression measurements, which serves as a basis to study the regulatory functions of TFs on gene expression. We propose a log Gaussian cox process with wavelet-based functional model to quantify the relationship between TF binding site locations and gene expression levels. Through the simulation study, we demonstrate that our method performs well, especially with large sample size and small variance. It also shows a remarkable ability to distinguish real local feature in the function estimates.en
dc.description.abstractgeneralThe development of high-throughput next-generation sequencing (NGS) techniques produces massive amount of data and bring out innovations in biology and medicine. A greater concentration is needed in developing novel, powerful, and efficient tools for NGS data analysis. In this dissertation, we mainly focus on three problems closely related to NGS and its applications: (1) how to improve variant calling accuracy, (2) how to model transcription factor (TF) binding patterns, and (3) how to quantify of the contribution of TF binding on gene expression. We develop novel statistical methods to identify sequence variants, find TF binding patterns, and explore the relationship between TF binding and gene expressions. We expect our findings will be helpful in promoting a better understanding of disease causality and facilitating the design of personalized treatments.en
dc.description.degreeDoctor of Philosophyen
dc.format.mediumETDen
dc.identifier.othervt_gsexam:23707en
dc.identifier.urihttp://hdl.handle.net/10919/104039en
dc.publisherVirginia Techen
dc.rightsIn Copyrighten
dc.rights.urihttp://rightsstatements.org/vocab/InC/1.0/en
dc.subjectnext-generation sequencingen
dc.subjecthidden Markov modelen
dc.subjectvariant callingen
dc.subjecttranscription factoren
dc.subjectnonparametric Bayesianen
dc.subjectlog Gaussian Cox processen
dc.subjectDirichlet process mixtureen
dc.subjectgene expressionen
dc.subjectwavelet-based functional modelen
dc.titleStatistical methods for variant discovery and functional genomic analysis using next-generation sequencing dataen
dc.typeDissertationen
thesis.degree.disciplineStatisticsen
thesis.degree.grantorVirginia Polytechnic Institute and State Universityen
thesis.degree.leveldoctoralen
thesis.degree.nameDoctor of Philosophyen

Files

Original bundle
Now showing 1 - 2 of 2
Loading...
Thumbnail Image
Name:
Tang_M_D_2020.pdf
Size:
2.05 MB
Format:
Adobe Portable Document Format
Loading...
Thumbnail Image
Name:
Tang_M_D_2020_support_1.pdf
Size:
72.21 KB
Format:
Adobe Portable Document Format
Description:
Supporting documents