Dimensionality Reduction, Feature Selection and Visualization of Biological Data

Ha, Sook Shin

dc.contributor.author	Ha, Sook Shin	en
dc.contributor.committeechair	Xuan, Jianhua Jason	en
dc.contributor.committeemember	Yang, Yaling	en
dc.contributor.committeemember	Kim, Inyoung	en
dc.contributor.committeemember	Lu, Chang-Tien	en
dc.contributor.committeemember	Wang, Yue J.	en
dc.contributor.department	Electrical and Computer Engineering	en
dc.date.accessioned	2017-04-06T15:43:22Z	en
dc.date.adate	2012-09-14	en
dc.date.available	2017-04-06T15:43:22Z	en
dc.date.issued	2012-08-08	en
dc.date.rdate	2016-09-30	en
dc.date.sdate	2012-08-21	en
dc.description.abstract	Due to the high dimensionality of most biological data, it is difficult to analyze, model and visualize the data directly to gain biological insight. Dimensionality reduction is therefore an essential pre-processing step in analyzing and visualizing high-dimensional biological data. The two major approaches to dimensionality reduction in genomic analysis and biomarker identification studies are feature extraction, which creates new features by combining existing ones based on a mapping technique; and feature selection, which chooses an optimal subset of all features based on an objective function. In this dissertation, we show how our reduction schemes effectively reduce the dimensionality of DNA gene expression data to extract biologically interpretable and relevant features, which in turn enhance the biomarker identification process. To construct biologically interpretable features and facilitate muscular dystrophy (MD) subtype classification, we extract molecular features from MD microarray data by constructing sub-networks using a novel integrative scheme that uses protein-protein interaction (PPI) network data, functional gene set information and mRNA profiling data. The workflow includes three major steps: first, by combining PPI network structure and gene-gene co-expression relationships into a new distance metric, we apply affinity propagation clustering (APC) to build gene sub-networks; second, we further incorporate functional gene set knowledge to complement the physical interaction information; finally, based on the constructed sub-network and gene set features, we apply a multi-class support vector machine (MSVM) for MD subtype classification and highlight the biomarkers contributing to the subtype prediction. The experimental results show that our scheme could construct sub-networks that are more relevant to MD than those constructed by the conventional approach.
Furthermore, our integrative strategy substantially improved the prediction accuracy, especially for the 'hard-to-classify' subtypes. Conventionally, pathway-based analysis assumes that genes in a pathway contribute equally to a biological function, and thus assigns uniform weight to genes. However, this assumption has been proven incorrect, and applying uniform weight in pathway analysis may not be adequate for tasks like molecular classification of diseases, as genes in a functional group may have different differential power. Hence, we propose using different weights in pathway analysis, which resulted in the development of four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Weighting changes pathway scoring and brings up some new significant pathways, leading to the detection of disease-related genes that are missed under uniform weight. To help us understand our MD expression data better and derive scientific insight from it, we have explored a suite of visualization tools. In particular, for selected top-performing MD sub-networks, we displayed the network view using Cytoscape; functional annotations using the IPA and DAVID functional analysis tools; expression patterns using heat maps and parallel coordinates plots; and MD-associated pathways using KEGG pathway diagrams. We also performed weighted MD pathway analysis and identified overlapping sub-networks across different weighting schemes and different MD subtypes using Venn diagrams, which resulted in the identification of a new sub-network significantly associated with MD. All of these graphically displayed data and information helped us understand our MD data and the MD subtypes better, resulting in the identification of several potentially MD-associated biomarker pathways and genes.	en
dc.description.degree	Ph. D.	en
dc.identifier.other	etd-08212012-154849	en
dc.identifier.sourceurl	http://scholar.lib.vt.edu/theses/available/etd-08212012-154849/	en
dc.identifier.uri	http://hdl.handle.net/10919/77169	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Gene Expression	en
dc.subject	Feature Selection	en
dc.subject	Dimensionality Reduction	en
dc.subject	PPI network	en
dc.subject	Pathways	en
dc.subject	Visualization	en
dc.subject	Weight	en
dc.title	Dimensionality Reduction, Feature Selection and Visualization of Biological Data	en
dc.type	Dissertation	en
dc.type.dcmitype	Text	en
thesis.degree.discipline	Electrical and Computer Engineering	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Ph. D.	en

Files

Original bundle

Name: etd-08212012-154849_HA_SOOK_S_D_2012.pdf
Size: 6.16 MB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations