Machine Learning to Interrogate High-throughput Genomic Data: Theory and Applications

Yu, Guoqiang

dc.contributor.author	Yu, Guoqiang	en
dc.contributor.committeechair	Wang, Yue J.	en
dc.contributor.committeecochair	Xuan, Jianhua	en
dc.contributor.committeemember	Baumann, William T.	en
dc.contributor.committeemember	Lu, Chang-Tien	en
dc.contributor.committeemember	Wang, Ge	en
dc.contributor.committeemember	Clarke, Robert	en
dc.contributor.department	Electrical and Computer Engineering	en
dc.date.accessioned	2014-03-14T20:16:16Z	en
dc.date.adate	2011-09-19	en
dc.date.available	2014-03-14T20:16:16Z	en
dc.date.issued	2011-09-07	en
dc.date.rdate	2011-09-19	en
dc.date.sdate	2011-09-14	en
dc.description.abstract	The missing heritability in genome-wide association studies (GWAS) is an intriguing open scientific problem that has attracted great recent interest. Interaction effects among risk factors, both genetic and environmental, are hypothesized to be one of the main sources of missing heritability. Moreover, the detection of multilocus interaction effects may have great implications for revealing disease and biological mechanisms, for accurate risk prediction, for personalized clinical management, and for targeted drug design. However, current analysis of GWAS largely ignores interaction effects, partly because of the lack of tools that meet the statistical and computational challenges posed by taking interaction effects into account. Here, we propose a novel statistically based framework, Significant Conditional Association (SCA), for systematically exploring interaction effects, assessing their significance, and detecting them. Further, our SCA work has revealed new theoretical results and insights on interaction detection, as well as theoretical performance bounds. Using in silico data, we show that the new approach has detection power significantly better than that of peer methods, while keeping the running time within a permissible range. More importantly, we applied our methods to several real data sets, confirming well-validated interactions with more convincing evidence (generating smaller p-values and requiring fewer samples) than that obtained through conventional methods, eliminating inconsistent results in the original reports, and making novel discoveries that are otherwise undetectable. The proposed methods provide a useful tool to mine new knowledge from existing GWAS and to generate new hypotheses for further research.
Microarray gene expression studies provide new opportunities for the molecular characterization of heterogeneous diseases. Multiclass gene selection is an imperative task for identifying phenotype-associated mechanistic genes and achieving accurate diagnostic classification. Most existing multiclass gene selection methods rely heavily on the direct extension of two-class gene selection methods. However, simple extensions of binary discriminant analysis to multiclass gene selection are suboptimal and not well matched to the unique characteristics of the multi-category classification problem. We report a strategy for multicategory classification of heterogeneous diseases that is simpler yet more accurate than previous work. Our method selects the union of one-versus-everyone phenotypic up-regulated genes (OVEPUGs) and matches this gene selection with a one-versus-rest support vector machine. Our approach provides even-handed gene resources for discriminating both neighboring and well-separated classes, and is intended to ensure the statistical reproducibility and biological plausibility of the selected genes. We evaluated the fold changes of OVEPUGs and found that only a small number of high-ranked genes were required to achieve superior accuracy for multicategory classification. We tested the proposed OVEPUG method on six real microarray gene expression data sets (five public benchmarks and one in-house data set) and two simulation data sets, observing significantly improved performance with lower error rates, fewer marker genes, and higher performance sustainability, as compared to several widely adopted gene selection and classification methods.	en
dc.description.degree	Ph. D.	en
dc.identifier.other	etd-09142011-112421	en
dc.identifier.sourceurl	http://scholar.lib.vt.edu/theses/available/etd-09142011-112421/	en
dc.identifier.uri	http://hdl.handle.net/10919/28980	en
dc.publisher	Virginia Tech	en
dc.relation.haspart	Yu_G_D_2011.pdf	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Gene-Environment Interaction	en
dc.subject	Gene-Gene Interaction	en
dc.subject	Multi-category gene selection	en
dc.subject	Genome-wide Association Study	en
dc.title	Machine Learning to Interrogate High-throughput Genomic Data: Theory and Applications	en
dc.type	Dissertation	en
thesis.degree.discipline	Electrical and Computer Engineering	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Ph. D.	en
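
The gene selection step summarized in the abstract (taking the union of one-versus-everyone phenotypic up-regulated genes) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the scoring rule used here (the mean expression in one class divided by the largest mean among all other classes), the function name `ovepug_select`, the `top_n` parameter, and the toy data are all assumptions made for illustration. The one-versus-rest support vector machine that the abstract pairs with this selection is omitted.

```python
from statistics import mean


def ovepug_select(expr, labels, top_n=1):
    """Pick, for each class, genes up-regulated against every other class.

    expr   : dict mapping gene name -> list of positive expression values,
             one value per sample (hypothetical input layout).
    labels : list of class labels, one per sample.
    top_n  : number of top-ranked genes kept per class (assumed parameter).
    """
    classes = sorted(set(labels))

    # Mean expression of every gene within every class.
    class_means = {
        gene: {
            c: mean(v for v, lab in zip(values, labels) if lab == c)
            for c in classes
        }
        for gene, values in expr.items()
    }

    selected = {}
    for k in classes:
        scores = []
        for gene, means in class_means.items():
            # Assumed score: class-k mean over the strongest competing class,
            # so a gene scores above 1 only if it beats *every* other class.
            strongest_other = max(m for c, m in means.items() if c != k)
            scores.append((means[k] / strongest_other, gene))
        scores.sort(reverse=True)
        selected[k] = [gene for fc, gene in scores[:top_n] if fc > 1.0]

    # The final marker set is the union of the per-class selections.
    union = sorted({g for genes in selected.values() for g in genes})
    return selected, union


# Toy data: three classes, two samples each (values are invented).
labels = ["A", "A", "B", "B", "C", "C"]
expr = {
    "g1": [10, 12, 2, 2, 3, 3],   # high only in class A
    "g2": [1, 1, 8, 10, 2, 2],    # high only in class B
    "g3": [2, 2, 1, 1, 9, 11],    # high only in class C
    "g4": [5, 5, 5, 5, 5, 5],     # flat across classes
}

selected, union = ovepug_select(expr, labels, top_n=1)
print(selected)
print(union)
```

In this toy example each class contributes its own marker gene, and the flat gene `g4` is never selected because its score does not exceed 1 for any class. Taking the ratio against the strongest competing class, rather than against the pooled remaining samples, is one way to read "one-versus-everyone"; the dissertation itself should be consulted for the exact definition.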

Files

Original bundle

Name: Yu_G_D_2011.pdf
Size: 3.89 MB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations