Machine Learning to Interrogate High-throughput Genomic Data: Theory and Applications

dc.contributor.author: Yu, Guoqiang [en]
dc.contributor.committeechair: Wang, Yue J. [en]
dc.contributor.committeecochair: Xuan, Jianhua [en]
dc.contributor.committeemember: Baumann, William T. [en]
dc.contributor.committeemember: Lu, Chang-Tien [en]
dc.contributor.committeemember: Wang, Ge [en]
dc.contributor.committeemember: Clarke, Robert [en]
dc.contributor.department: Electrical and Computer Engineering [en]
dc.date.accessioned: 2014-03-14T20:16:16Z [en]
dc.date.adate: 2011-09-19 [en]
dc.date.available: 2014-03-14T20:16:16Z [en]
dc.date.issued: 2011-09-07 [en]
dc.date.rdate: 2011-09-19 [en]
dc.date.sdate: 2011-09-14 [en]
dc.description.abstract: The missing heritability in genome-wide association studies (GWAS) is an intriguing open scientific problem that has attracted great recent interest. The interaction effects among risk factors, both genetic and environmental, are hypothesized to be one of the main sources of missing heritability. Moreover, detection of multilocus interaction effects may also have great implications for revealing disease and biological mechanisms, for accurate risk prediction, for personalized clinical management, and for targeted drug design. However, current analysis of GWAS largely ignores interaction effects, partly because of the lack of tools that meet the statistical and computational challenges posed by taking interaction effects into account. Here, we propose a novel statistically based framework (Significant Conditional Association, SCA) for systematically exploring interaction effects, assessing their significance, and detecting them. Further, our SCA work has also revealed new theoretical results and insights on interaction detection, as well as theoretical performance bounds. Using in silico data, we show that the new approach has detection power significantly better than that of peer methods, while keeping the running time within a permissible range. More importantly, we applied our methods to several real data sets, confirming well-validated interactions with more convincing evidence (generating smaller p-values and requiring fewer samples) than that obtained through conventional methods, eliminating inconsistent results in the original reports, and making novel discoveries that are otherwise undetectable. The proposed methods provide a useful tool to mine new knowledge from existing GWAS and generate new hypotheses for further research.
Microarray gene expression studies provide new opportunities for the molecular characterization of heterogeneous diseases. Multiclass gene selection is an imperative task for identifying phenotype-associated mechanistic genes and achieving accurate diagnostic classification. Most existing multiclass gene selection methods rely heavily on the direct extension of two-class gene selection methods. However, simple extensions of binary discriminant analysis to multiclass gene selection are suboptimal and not well matched to the unique characteristics of the multi-category classification problem. We report a strategy for multicategory classification of heterogeneous diseases that is simpler and yet more accurate than previous work. Our method selects the union of one-versus-everyone phenotypic up-regulated genes (OVEPUGs) and matches this gene selection with a one-versus-rest support vector machine. Our approach provides even-handed gene resources for discriminating both neighboring and well-separated classes, and is intended to ensure the statistical reproducibility and biological plausibility of the selected genes. We evaluated the fold changes of OVEPUGs and found that only a small number of high-ranked genes were required to achieve superior accuracy for multicategory classification. We tested the proposed OVEPUG method on six real microarray gene expression data sets (five public benchmarks and one in-house data set) and two simulation data sets, observing significantly improved performance with lower error rates, fewer marker genes, and higher performance sustainability, as compared to several widely adopted gene selection and classification methods. [en]
dc.description.degree: Ph. D. [en]
dc.identifier.other: etd-09142011-112421 [en]
dc.identifier.sourceurl: http://scholar.lib.vt.edu/theses/available/etd-09142011-112421/ [en]
dc.identifier.uri: http://hdl.handle.net/10919/28980 [en]
dc.publisher: Virginia Tech [en]
dc.relation.haspart: Yu_G_D_2011.pdf [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Gene-Environment Interaction [en]
dc.subject: Gene-Gene Interaction [en]
dc.subject: Multi-category gene selection [en]
dc.subject: Genome-wide Association Study [en]
dc.title: Machine Learning to Interrogate High-throughput Genomic Data: Theory and Applications [en]
dc.type: Dissertation [en]
thesis.degree.discipline: Electrical and Computer Engineering [en]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en]
thesis.degree.level: doctoral [en]
thesis.degree.name: Ph. D. [en]

Files

Original bundle
Name: Yu_G_D_2011.pdf
Size: 3.89 MB
Format: Adobe Portable Document Format