Statistical methods for variant discovery and functional genomic analysis using next-generation sequencing data

Tang, Man

Statistical methods for variant discovery and functional genomic analysis using next-generation sequencing data

dc.contributor.author	Tang, Man	en
dc.contributor.committeechair	Wu, Xiaowei	en
dc.contributor.committeemember	Zhang, Liqing	en
dc.contributor.committeemember	Kim, Inyoung	en
dc.contributor.committeemember	Franck, Christopher Thomas	en
dc.contributor.department	Statistics	en
dc.date.accessioned	2021-06-27T06:00:18Z	en
dc.date.available	2021-06-27T06:00:18Z	en
dc.date.issued	2020-01-03	en
dc.description.abstract	The development of high-throughput next-generation sequencing (NGS) techniques produces massive amount of data, allowing the identification of biomarkers in early disease diagnosis and driving the transformation of most disciplines in biology and medicine. A greater concentration is needed in developing novel, powerful, and efficient tools for NGS data analysis. This dissertation focuses on modeling ``omics'' data in various NGS applications with a primary goal of developing novel statistical methods to identify sequence variants, find transcription factor (TF) binding patterns, and decode the relationship between TF and gene expression levels. Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in NGS applications. Existing methods for calling these variants often make simplified assumption of positional independence and fail to leverage the dependence of genotypes at nearby loci induced by linkage disequilibrium. We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short read data. Simulation experiments show that, under various sequencing depths, vi-HMM outperforms existing methods in terms of sensitivity and F1 score. When applied to the human whole genome sequencing data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs. One important NGS application is chromatin immunoprecipitation followed by sequencing (ChIP-seq), which characterizes protein-DNA relations through genome-wide mapping of TF binding sites. Multiple TFs, binding to DNA sequences, often show complex binding patterns, which indicate how TFs with similar functionalities work together to regulate the expression of target genes. To help uncover the transcriptional regulation mechanism, we propose a novel nonparametric Bayesian method to detect the clustering pattern of multiple-TF bindings from ChIP-seq datasets. Simulation study demonstrates that our method performs best with regard to precision, recall, and F1 score, in comparison to traditional methods. We also apply the method on real data and observe several TF clusters that have been recognized previously in mouse embryonic stem cells. Recent advances in ChIP-seq and RNA sequencing (RNA-Seq) technologies provides more reliable and accurate characterization of TF binding sites and gene expression measurements, which serves as a basis to study the regulatory functions of TFs on gene expression. We propose a log Gaussian cox process with wavelet-based functional model to quantify the relationship between TF binding site locations and gene expression levels. Through the simulation study, we demonstrate that our method performs well, especially with large sample size and small variance. It also shows a remarkable ability to distinguish real local feature in the function estimates.	en
dc.description.abstractgeneral	The development of high-throughput next-generation sequencing (NGS) techniques produces massive amount of data and bring out innovations in biology and medicine. A greater concentration is needed in developing novel, powerful, and efficient tools for NGS data analysis. In this dissertation, we mainly focus on three problems closely related to NGS and its applications: (1) how to improve variant calling accuracy, (2) how to model transcription factor (TF) binding patterns, and (3) how to quantify of the contribution of TF binding on gene expression. We develop novel statistical methods to identify sequence variants, find TF binding patterns, and explore the relationship between TF binding and gene expressions. We expect our findings will be helpful in promoting a better understanding of disease causality and facilitating the design of personalized treatments.	en
dc.description.degree	Doctor of Philosophy	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:23707	en
dc.identifier.uri	http://hdl.handle.net/10919/104039	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	next-generation sequencing	en
dc.subject	hidden Markov model	en
dc.subject	variant calling	en
dc.subject	transcription factor	en
dc.subject	nonparametric Bayesian	en
dc.subject	log Gaussian Cox process	en
dc.subject	Dirichlet process mixture	en
dc.subject	gene expression	en
dc.subject	wavelet-based functional model	en
dc.title	Statistical methods for variant discovery and functional genomic analysis using next-generation sequencing data	en
dc.type	Dissertation	en
thesis.degree.discipline	Statistics	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Doctor of Philosophy	en

Files

Original bundle

Now showing 1 - 2 of 2

Name:: Tang_M_D_2020.pdf
Size:: 2.05 MB
Format:: Adobe Portable Document Format

Download

Name:: Tang_M_D_2020_support_1.pdf
Size:: 72.21 KB
Format:: Adobe Portable Document Format
Description:: Supporting documents

Download

Collections

Doctoral Dissertations