From network to pathway: integrative network analysis of genomic data
The advent of various types of high-throughput genomic data has enabled researchers to investigate complex biological systems at a systems level and has started to shed light on the molecular mechanisms underlying cancers. Analyzing huge amounts of genomic data requires effective statistical and machine learning tools; more importantly, it requires integrative approaches that combine different types of genomic data into a network or pathway view of biological systems. Motivated by these needs, this dissertation develops an integrative framework for pathway analysis. Specifically, we dissect the molecular pathway into two parts: the protein-DNA interaction network and the protein-protein interaction network. We propose several novel approaches that integrate gene expression data with various forms of biological knowledge, such as protein-DNA and protein-protein interactions, for reliable molecular network identification. The first part of this dissertation seeks to infer condition-specific transcriptional regulatory networks by integrating gene expression data with protein-DNA binding information. Protein-DNA binding information provides initial relationships between transcription factors (TFs) and their target genes, and it is essential for deriving biologically meaningful integrative algorithms. Depending on the availability of this information, we discuss the inference task in two situations. (a) Protein-DNA binding information is available for multiple TFs. From the protein-DNA data of multiple TFs, derived by sequence analysis of DNA motifs against gene promoter regions, we can construct an initial connection matrix and solve the network inference problem with a constrained least-squares approach named motif-guided network component analysis (mNCA). However, the connection matrix usually contains a considerable number of false positives and false negatives, which makes the inference results questionable.
To circumvent this problem, we propose a knowledge-based stability analysis (kSA) approach that tests the conditional relevance of individual TFs by checking the discrepancy among multiple estimates of transcription factor activity under different perturbations of the connections. The rationale behind stability analysis is that the consistency between the observed gene expression and the true network connections should remain stable after small perturbations are applied to the initial connection matrix. With condition-specific TFs prioritized by kSA, we further propose using multivariate regression to highlight condition-specific target genes. Through simulation studies against several competing methods, we show that the proposed schemes are more sensitive in detecting relevant TFs and target genes for network inference. Experimentally, we have applied stability analysis to a yeast cell cycle experiment and to a series of anti-estrogen breast cancer studies. In both cases, biologically relevant regulators are highlighted and condition-specific transcriptional regulatory networks are constructed, which could provide further insight into the corresponding cellular mechanisms. (b) Protein-DNA binding information is available for only a single TF. This happens when the protein-DNA binding relationship of an individual TF is measured experimentally. Because the original mNCA requires a complete connection matrix to perform estimation, it cannot be applied when only the incomplete knowledge of a single TF is available. Moreover, binding information derived from experiments can still be inconsistent with gene expression levels. To overcome these limitations, we propose a linear extraction scheme called regulatory component analysis (RCA), which can infer the underlying regulatory relationships even with partial biological knowledge.
Numerical simulations show that RCA significantly improves on traditional methods in identifying target genes, not only at low signal-to-noise ratios but also when the given biological knowledge is incomplete and inconsistent with the data. Furthermore, experiments on Escherichia coli regulatory network inference provide a fair comparison with traditional methods and confirm the effectiveness and superior performance of RCA. The second part of the dissertation moves from the protein-DNA interaction network up to the protein-protein interaction (PPI) network, identifying dysregulated protein sub-networks by integrating gene expression data with protein-protein interaction information. Specifically, we propose a statistically principled method, Metropolis random walk on graph (MRWOG), to highlight condition-specific PPI sub-networks in a probabilistic way. The method builds on Markov chain Monte Carlo (MCMC) theory to generate a series of samples that eventually converges to a desired equilibrium distribution; each sample indicates the selection of one particular sub-network during the Metropolis random walk. The central idea of MRWOG is that whether a gene should be included in a sub-network depends not only on its expression but also on its topological importance. In contrast to most existing methods, which construct sub-networks deterministically and therefore lack a relevance score for each protein, MRWOG assesses the importance of each individual protein node globally, reflecting its individual association with clinical outcome and also indicating its topological role (hub, bridge) in connecting other important proteins. Moreover, each protein node is associated with a sampling frequency score, which enables statistical justification of each individual node and flexible scaling of sub-network results.
Based on the MRWOG approach, we further propose two strategies: bootstrapping, used to assess the statistical confidence of detected sub-networks, and graphic division, used to separate a large sub-network into several smaller sub-networks to facilitate interpretation. MRWOG is easy to use, with only two parameters to adjust: the beta value for performing the random walk and the quantile level for calculating the truncated posterior mean. Through extensive simulations, we show that the proposed scheme is insensitive to these two parameters over a relatively wide range. We also compare MRWOG with deterministic approaches for identifying sub-networks and prioritizing topologically important proteins; in both cases MRWOG outperforms existing methods in terms of both precision and recall. Using the node/edge sampling frequency generated by MRWOG, which is the posterior mean of the corresponding protein node or interaction edge, we illustrate that condition-specific nodes and interactions can be prioritized better than with schemes based on the scores of individual nodes or interactions. Experimentally, we have applied MRWOG to a yeast knockout experiment on galactose utilization pathways to reveal important components of the corresponding biological functions; we have also applied MRWOG to breast cancer patient prognosis, where the sub-network analysis could lead to an understanding of the molecular mechanisms of antiestrogen resistance in breast cancer. Finally, we conclude this dissertation with a summary of the original contributions and an outline of future work on deepening the theoretical justification of the proposed methods and broadening their potential biological applications, such as cancer studies.
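The general idea of a Metropolis random walk over sub-networks can be illustrated with a minimal sketch. This is not the dissertation's MRWOG implementation: the toy graph, the per-node scores, the single-node toggle proposal, the target distribution proportional to exp(beta * total score), and the function names are all assumptions made for illustration, and the quantile-based truncated posterior mean is not implemented. Only the role of beta and the use of node sampling frequencies follow the abstract.

```python
import math
import random
from collections import deque


def is_connected(nodes, adj):
    """Return True if `nodes` is non-empty and induces a connected subgraph."""
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)


def metropolis_subnetwork(adj, score, beta=1.0, n_steps=20000,
                          burn_in=2000, seed=0):
    """Sample connected sub-networks and return per-node sampling frequencies.

    Assumed target: P(S) proportional to exp(beta * sum of scores in S),
    restricted to non-empty connected node sets S (hypothetical choice).
    """
    rng = random.Random(seed)
    all_nodes = sorted(adj)
    current = {rng.choice(all_nodes)}      # start from a single random node
    counts = {v: 0 for v in all_nodes}
    kept = 0
    for step in range(n_steps):
        v = rng.choice(all_nodes)          # symmetric proposal: toggle one node
        proposal = current ^ {v}
        if is_connected(proposal, adj):    # invalid proposals are rejected
            delta = score[v] if v in proposal else -score[v]
            # Metropolis rule: accept with probability min(1, exp(beta * delta))
            if delta >= 0 or rng.random() < math.exp(beta * delta):
                current = proposal
        if step >= burn_in:                # record the state after burn-in
            kept += 1
            for u in current:
                counts[u] += 1
    return {v: counts[v] / kept for v in all_nodes}


# Toy path graph a-b-c-d-e; "b" scores poorly by itself but bridges "a" and "c".
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
score = {"a": 2.0, "b": -0.5, "c": 2.0, "d": -2.0, "e": -2.0}
freq = metropolis_subnetwork(adj, score, beta=1.0)
for node in sorted(freq):
    print(node, round(freq[node], 3))
```

In this toy setting the bridge node "b" is sampled far more often than "d", even though both have negative scores of their own, because "b" connects two high-scoring nodes. That is the kind of topological effect the sampling frequency is meant to capture; a real analysis would derive the scores from expression data and use a PPI network in place of the toy graph.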