<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1471-2105-11-162</ui><ji>1471-2105</ji><fm>
<dochead>Methodology article</dochead>
<bibl>
<title>
<p>Knowledge-guided gene ranking by coordinative component analysis</p>
</title>
<aug>
<au id="A1"><snm>Wang</snm><fnm>Chen</fnm><insr iid="I1"/><email>topsoil@vt.edu</email></au>
<au ca="yes" id="A2"><snm>Xuan</snm><fnm>Jianhua</fnm><insr iid="I1"/><email>xuan@vt.edu</email></au>
<au id="A3"><snm>Li</snm><fnm>Huai</fnm><insr iid="I2"/><email>huaili@mail.nih.gov</email></au>
<au id="A4"><snm>Wang</snm><fnm>Yue</fnm><insr iid="I1"/><email>yuewang@vt.edu</email></au>
<au id="A5"><snm>Zhan</snm><fnm>Ming</fnm><insr iid="I2"/><email>zhanmi@grc.nia.nih.gov</email></au>
<au id="A6"><snm>Hoffman</snm><mi>P</mi><fnm>Eric</fnm><insr iid="I3"/><email>ehoffman@cnmcresearch.org</email></au>
<au id="A7"><snm>Clarke</snm><fnm>Robert</fnm><insr iid="I4"/><email>clarker@georgetown.edu</email></au>
</aug>
<insg>
<ins id="I1"><p>Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA</p></ins>
<ins id="I2"><p>Bioinformatics Unit, Research Resources Branch, National Institute on Aging, NIH, Baltimore, MD, USA</p></ins>
<ins id="I3"><p>Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC, USA</p></ins>
<ins id="I4"><p>Departments of Oncology and Physiology &amp; Biophysics, Georgetown University School of Medicine, Washington, DC, USA</p></ins>
</insg>
<source>BMC Bioinformatics</source>
<issn>1471-2105</issn>
<pubdate>2010</pubdate>
<volume>11</volume>
<issue>1</issue>
<fpage>162</fpage>
<url>http://www.biomedcentral.com/1471-2105/11/162</url>
<xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-11-162</pubid><pubid idtype="pmpid">20353603</pubid></pubidlist></xrefbib>
</bibl>
<history><rec><date><day>8</day><month>10</month><year>2009</year></date></rec><acc><date><day>30</day><month>3</month><year>2010</year></date></acc><pub><date><day>30</day><month>3</month><year>2010</year></date></pub></history>
<cpyrt><year>2010</year><collab>Wang et al; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<sec>
<st>
<p>Background</p>
</st>
<p>In cancer, gene networks and pathways often exhibit dynamic behavior, particularly during the process of carcinogenesis. Thus, it is important to prioritize those genes that are strongly associated with the functionality of a network. Traditional statistical methods are often ill-suited to identifying biologically relevant member genes, motivating researchers to incorporate biological knowledge into gene ranking methods. However, current integration strategies are often heuristic and fail to capture fully the true interplay between biological knowledge and gene expression data.</p>
</sec>
<sec>
<st>
<p>Results</p>
</st>
<p>To improve knowledge-guided gene ranking, we propose a novel method called coordinative component analysis (COCA). COCA explicitly captures those genes within a specific biological context that are likely to be expressed in a coordinative manner. Formulated as an optimization problem to maximize the coordinative effort, COCA is designed to first extract the coordinative components based on partial guidance from knowledge genes and then rank the genes according to their participation strengths. An embedded bootstrapping procedure is implemented to improve the statistical robustness of the solutions. COCA was initially tested on simulation data and then on published gene expression microarray data to demonstrate its improved performance as compared to traditional statistical methods. Finally, the COCA approach was applied to stem cell data to identify biologically relevant genes in signaling pathways. As a result, the COCA approach uncovers novel pathway members that may shed light on pathway deregulation in cancers.</p>
</sec>
<sec>
<st>
<p>Conclusion</p>
</st>
<p>We have developed a new integrative strategy to combine biological knowledge and microarray data for gene ranking. The method uses knowledge genes as guidance to first extract coordinative components, and then ranks the genes according to their contribution to a network or pathway. The experimental results show that such a knowledge-guided strategy can provide context-specific gene ranking with improved performance in pathway member identification.</p>
</sec>
</sec>
</abs>
</fm><bdy>
<sec>
<st>
<p>Background</p>
</st>
<p>It is of great interest to identify genes strongly associated with the functionality of gene networks or signal transduction pathways, particularly from gene expression microarray data. Two of the earliest approaches to identify such genes are fold-change and multiple t-testing; each aims to rank the genes in the order of their differential expression under various experimental conditions. Many improvements to the original t-test method have been proposed for microarray data analysis. For example, significance analysis of microarrays (SAM) <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> uses a modified t-statistic with an added estimator for gene ranking in which the false discovery rate (FDR) is estimated by a permutation procedure. A bootstrapped p-value approach was introduced in <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp> to address the inherent variability in small sample studies. Prior studies have shown that fold-change is more robust than t-test with respect to the reproducibility of gene rankings <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>, while other researchers argue that better reproducibility does not guarantee the accuracy of gene ranking <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp>. Nonetheless, both methods are severely limited because they neglect the interaction among genes, prioritizing gene relevance only based on individual gene expression values.</p>
<p>To address the above-mentioned problem, several gene ranking methods have been proposed to either consider the joint effect of genes or to explore the expression pattern in time-course data. For instance, Opgen-Rhein &amp; Strimmer <abbrgrp>
<abbr bid="B5">5</abbr>
</abbrgrp> introduced the "shrinkage t" statistic that is based on a novel and model-free shrinkage estimate of the variance vector across genes. Storey <it>et al. </it>
<abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp> proposed a method (EDGE) to first fit the time-course expression pattern by splines, and then rank genes by hypothesis testing on the spline parameters. Furlanello <it>et al. </it>
<abbrgrp>
<abbr bid="B7">7</abbr>
</abbrgrp> proposed a classification-based feature elimination scheme to rank genes by iteratively discarding chunks of genes showing least contribution to the classifier.</p>
<p>In contrast, other investigators have proposed incorporating biological knowledge for gene ranking. GeneRank <abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp> ranks genes by integrating gene expression and network structure derived from gene annotations. Ma <it>et al. </it>
<abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp> proposed a strategy to combine gene expression and protein-protein interaction (PPI) knowledge, ranking genes by their association with phenotype calibrated by the PPI information. However, such data integration, while widely adopted, is usually done in a heuristic way and lacks an objective estimate of the true interplay between biological knowledge and gene expression data.</p>
<p>In this paper, we propose a knowledge-guided gene ranking scheme, namely a coordinative component analysis (COCA) algorithm, to model explicitly those genes that are most likely to be expressed in a coordinative manner within a specific biological context. We consider the genes that belong to a pathway or a network as a whole, rather than treating genes as independent or individual measures. To enhance the biological relevance of gene ranking, we require that the intrinsic coordination among the genes be defined by biological knowledge. Specifically, biological knowledge, which could be the gene sets within a biological pathway or sub-network derived from relevant biological databases, is used to guide the algorithm. Thus, we can address the conditional specificity of biological context, for example, where the deregulation of a network only occurs under specific conditions. We rank each individual gene by evaluating its participation or involvement in the pathway of interest, when projected onto the coordinative direction learned by the COCA algorithm. In COCA, a bootstrapping procedure is also implemented to improve the statistical robustness of the ranking results. Using simulation data and published gene expression microarray data, including yeast cell cycle data and stem cell time-course data, we demonstrate that the COCA approach can provide improved performance compared with traditional statistical methods, indicating its effectiveness for incorporating biological knowledge into gene ranking.</p>
</sec>
<sec>
<st>
<p>Methods</p>
</st>
<p>A flowchart of the proposed approach is shown in Figure <figr fid="F1">1</figr>. Given a gene expression microarray data set, multiple data sets are first generated through bootstrap resampling of the genes in the array. The bootstrapping procedure is used to overcome the over-fitting problem associated with a small sample size relative to the very high dimensionality of the primary data <abbrgrp>
<abbr bid="B10">10</abbr>
<abbr bid="B11">11</abbr>
</abbrgrp>. Each bootstrap sampled data set is then analyzed by the proposed COCA algorithm. COCA aims to learn a coordinative direction by integrating biological knowledge and gene expression data, such that the knowledge genes are maximally aligned along this direction. The involvement of each gene in the knowledge network or pathway is estimated from a projection onto the coordinative direction. Finally, multiple bootstrapped estimates of the involvement are merged to create the gene ranking. Note that the COCA software package is available at the following link: <url>http://www.cbil.ece.vt.edu/software.htm</url>.</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>A flowchart of the proposed approach, namely knowledge-guided coordinative component analysis (COCA), for gene ranking</p></caption><text>
   <p><b>A flowchart of the proposed approach, namely knowledge-guided coordinative component analysis (COCA), for gene ranking</b>. A bootstrapping procedure is designed to increase the confidence in estimating the coordinative component (W) and participation vector (A).</p>
</text><graphic file="1471-2105-11-162-1"/></fig>
<sec>
<st>
<p>Coordinative component analysis (COCA)</p>
</st>
<p>Linear latent variable models are widely used in microarray data analysis owing to their simplicity and parsimony <abbrgrp>
<abbr bid="B12">12</abbr>
</abbrgrp>. In a linear model, gene expressions are represented as the sum of a relatively small number of biological functions (biological processes or signaling pathways or networks) <abbrgrp>
<abbr bid="B9">9</abbr>
<abbr bid="B13">13</abbr>
</abbrgrp>:</p>
<p>
<display-formula id="M1">
<graphic file="1471-2105-11-162-i1.gif"/>
</display-formula>
</p>
<p>where <b>X </b>&#8712; &#8477;<sup>
<it>N </it>&#215; <it>M </it>
</sup>is the mRNA expression matrix consisting of <it>M </it>microarray samples with <it>N </it>genes. <b>A </b>&#8712; &#8477;<sup>
<it>N </it>&#215; <it>L </it>
</sup>is the participation or involvement matrix in which each element <it>a</it>
<sub>
<it>ji </it>
</sub>represents the participation relationship from gene <it>j </it>to biological process <it>i </it>(i.e., how likely gene <it>j </it>is involved in biological process <it>i</it>). <b>T </b>&#8712; &#8477;<sup>
<it>L </it>&#215; <it>M </it>
</sup>contains the latent or hidden activities of biological processes. Given the model as in Eq. (1), several decomposition methods have been proposed to infer <b>A </b>and <b>T </b>from the mRNA expression profile <b>X </b>under certain statistical assumptions <abbrgrp>
<abbr bid="B9">9</abbr>
<abbr bid="B14">14</abbr>
<abbr bid="B15">15</abbr>
<abbr bid="B16">16</abbr>
</abbrgrp> or biological knowledge constraints <abbrgrp>
<abbr bid="B17">17</abbr>
<abbr bid="B18">18</abbr>
<abbr bid="B19">19</abbr>
<abbr bid="B20">20</abbr>
</abbrgrp>. For example, nonnegative matrix factorization (NMF) imposes the non-negativity constraint on both <b>A </b>and <b>T </b>for gene module identification <abbrgrp>
<abbr bid="B14">14</abbr>
<abbr bid="B15">15</abbr>
</abbrgrp>; independent component analysis (ICA) assumes the independence of biological processes for a sparse decomposition of gene expression <abbrgrp>
<abbr bid="B9">9</abbr>
<abbr bid="B16">16</abbr>
<abbr bid="B21">21</abbr>
</abbrgrp>; network component analysis (NCA) incorporates the protein-DNA binding information to constrain the network topology for a reliable estimation of <b>A </b>and <b>T </b>
<abbrgrp>
<abbr bid="B17">17</abbr>
<abbr bid="B18">18</abbr>
</abbrgrp>. Despite some apparent success, it remains a difficult task to infer <it>biologically plausible </it>
<b>A </b>and <b>T </b>from <b>X</b>, mainly due to the complexity of biological systems, the noise in gene expression data <b>X</b>, and the incompleteness of current biological knowledge. For example, while the DNA binding of transcription factors (TFs) with high affinity is a more reliable predictor of TF activity than low affinity binding (which is often ignored), studies have also shown that low affinity TF-DNA binding can be both evolutionarily and functionally important <abbrgrp>
<abbr bid="B22">22</abbr>
</abbrgrp>.</p>
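<p>As an illustrative sketch only (it is not part of the published COCA software), the linear model of Eq. (1) can be written out numerically from the dimensions given above, i.e. as the product of an <it>N </it>&#215; <it>L </it>participation matrix and an <it>L </it>&#215; <it>M </it>activity matrix. The matrix sizes, the sparsity level and the Gaussian entries below are assumptions chosen purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): genes, microarray samples, latent biological processes.
N, M, L = 1000, 8, 5

# Participation matrix A (N x L); made sparse here so each gene joins few processes.
A = rng.normal(size=(N, L)) * (rng.random((N, L)) > 0.9)

# Latent activity matrix T (L x M): one activity profile per biological process.
T = rng.normal(size=(L, M))

# Expression matrix X (N x M) under the linear model.
X = A @ T

# The profile of gene j is a weighted sum of the L process activities.
j = 0
row_from_sum = sum(A[j, i] * T[i] for i in range(L))
```

<p>In this sketch, <b>X </b>has rank at most <it>L</it>, which is the sense in which a small number of biological functions accounts for the expression of all genes.</p>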
<p>In this paper, we address the above-mentioned problem from a different perspective in the context of gene ranking, where network or pathway knowledge is incorporated to guide a COCA approach for inferring the involvement of member genes. In COCA, we apply a linear filtering procedure to extract a particular column of the involvement matrix <b>A </b>from <b>X </b>by <it>A</it>
<sub>
<it>i </it>
</sub>= <b>X</b>
<it>W</it>
<sub>
<it>i</it>
</sub>. As designed, <it>A</it>
<sub>
<it>i </it>
</sub>&#8712; &#8477;<sup>
<it>N </it>
</sup>denotes a participation vector of the <it>i</it>-th biological function (a term that can be referred to as biological process, network or pathway in this paper), and its element <it>a</it>
<sub>
<it>j </it>
</sub>represents the relationship of biological function <it>i </it>to gene <it>j</it>. We want to find an optimal <it>W</it>
<sub>
<it>i </it>
</sub>such that <it>A</it>
<sub>
<it>i </it>
</sub>is coordinately expressed with the pathway or network knowledge genes. To optimize the linear filter <it>W</it>
<sub>
<it>i </it>
</sub>for a specific pathway or network, the following cost function is used to fulfill the requirement of achieving maximum coordination of member genes:</p>
<p>
<display-formula id="M2">
<graphic file="1471-2105-11-162-i2.gif"/>
</display-formula>
</p>
<p>where <inline-formula>
<graphic file="1471-2105-11-162-i3.gif"/>
</inline-formula> is the <it>j</it>-th row vector of <b>X</b>, and subscript <it>p </it>refers to the <it>p</it>-norm. <it>W</it>
<sub>
<it>i </it>
</sub>&#8712; &#8477;<sup>
<it>M </it>
</sup>can be interpreted conceptually as the coordinative direction of the <it>i</it>-th biological function.</p>
<p>To incorporate prior knowledge in Eq. (2), we define a positive masking vector <inline-formula>
<graphic file="1471-2105-11-162-i4.gif"/>
</inline-formula> for the <it>i</it>-th biological function, where <inline-formula>
<graphic file="1471-2105-11-162-i5.gif"/>
</inline-formula> indicates that the <it>j</it>-th gene is likely to be involved in the <it>i</it>-th biological function, and <inline-formula>
<graphic file="1471-2105-11-162-i6.gif"/>
</inline-formula> suggests otherwise. Conversely, <inline-formula>
<graphic file="1471-2105-11-162-i7.gif"/>
</inline-formula> is a negative masking vector, where <inline-formula>
<graphic file="1471-2105-11-162-i8.gif"/>
</inline-formula> suggests there is no evidence for the <it>j</it>-th gene's involvement in the <it>i</it>-th biological function, and <inline-formula>
<graphic file="1471-2105-11-162-i9.gif"/>
</inline-formula> suggests otherwise.</p>
<p>Note that different settings for the parameter <it>p </it>in Eq. (2) can lead to different versions of COCA; for example, the norm-2 case (<it>p </it>= 2) emphasizes the coordinative behavior of member genes in terms of their energy, while the norm-1 case (<it>p </it>= 1) uses their absolute amplitude. From our experiments with microarray data, norm-1 is generally less affected by outliers than norm-2, whereas norm-2 tends to amplify the influence of outliers. Therefore, we use the norm-1 version of Eq. (2) as our default COCA approach. Rewriting Eq. (2) in the norm-1 form, we have the following cost function of a linear projection <it>W</it>
<sub>
<it>i </it>
</sub>to maximize:</p>
<p>
<display-formula id="M3">
<graphic file="1471-2105-11-162-i10.gif"/>
</display-formula>
</p>
<p>We can maximize the cost function <it>J</it>
<sub>1</sub>(<it>W</it>
<sub>
<it>i</it>
</sub>) using a gradient-based learning approach, specifically, by updating <it>W</it>
<sub>
<it>i </it>
</sub>to follow its gradient direction:</p>
<p>
<display-formula id="M4">
<graphic file="1471-2105-11-162-i11.gif"/>
</display-formula>
</p>
<p>Recall that <it>W</it>
<sub>
<it>i </it>
</sub>is a vector of size <it>M </it>(the number of microarray samples); we write <it>W</it>
<sub>
<it>i </it>
</sub>explicitly in vector form as <it>W</it>
<sub>
<it>i</it>
</sub>(<it>n</it>) = [<it>w</it>
<sub>1<it>i</it>
</sub>(<it>n</it>), &#8943;, <it>w</it>
<sub>
<it>ki</it>
</sub>(<it>n</it>), &#8943;, <it>w</it>
<sub>
<it>Mi</it>
</sub>(<it>n</it>)]<sup>
<it>T</it>
</sup>.</p>
<p>Then, the gradient of <it>J</it>
<sub>1</sub>(<it>W</it>
<sub>
<it>i</it>
</sub>) can be calculated by the following equation:</p>
<p>
<display-formula id="M5">
<graphic file="1471-2105-11-162-i12.gif"/>
</display-formula>
</p>
<p>Since it is mathematically difficult to obtain the analytical form of Eq. (5), we use a simultaneous perturbation technique to approximate the gradient <abbrgrp>
<abbr bid="B23">23</abbr>
</abbrgrp>:</p>
<p>
<display-formula id="M6">
<graphic file="1471-2105-11-162-i13.gif"/>
</display-formula>
</p>
<p>In Eq. (6), <it>c </it>is a small positive constant controlling the degree of perturbation, and <it>S</it>(<it>n</it>) = [<it>s</it>
<sub>1 </sub>(<it>n</it>), &#8943;, <it>s</it>
<sub>
<it>k </it>
</sub>(<it>n</it>), &#8943;, <it>s</it>
<sub>
<it>M </it>
</sub>(<it>n</it>)]<sup>
<it>T </it>
</sup>is a simultaneous perturbation vector. Each element of <it>S(n) </it>is drawn independently from a binary discrete random distribution taking the value +1 or -1, each with a probability of 0.5. The gradient form in Eq. (6) is also known as the "stochastic gradient", which is particularly useful when there is no analytical form for the derivative of a cost function. Moreover, when multiple local maxima (or "peak" points) exist in the solution space, the stochastic gradient can help the learning algorithm escape from undesirable solution points that may trap a deterministic gradient method.</p>
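<p>The perturbation scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the two-sided difference form that is common in the simultaneous perturbation literature, which may differ in detail from Eq. (6), and the objective <it>J_toy</it>, the step size, the value of <it>c </it>and the number of iterations are hypothetical placeholders rather than the cost function of Eq. (3) or the settings used in COCA.</p>

```python
import numpy as np

def spsa_gradient(J, W, c, rng):
    """Two-sided simultaneous perturbation estimate of the gradient of J at W."""
    # Each element of S is +1 or -1 with probability 0.5, drawn independently.
    S = rng.choice([-1.0, 1.0], size=W.shape)
    # Only two evaluations of J are needed, whatever the dimension of W.
    diff = J(W + c * S) - J(W - c * S)
    # Dividing by s_k equals multiplying by s_k because s_k is +1 or -1.
    return diff / (2.0 * c) * S

def spsa_ascent(J, W0, step=0.1, c=1e-2, n_iter=500, seed=0):
    """Maximise J by repeatedly stepping along the estimated gradient."""
    rng = np.random.default_rng(seed)
    W = np.asarray(W0, dtype=float).copy()
    for _ in range(n_iter):
        W = W + step * spsa_gradient(J, W, c, rng)
    return W

# Toy concave objective with a known maximiser (a placeholder, not Eq. (3)).
target = np.array([1.0, -2.0, 0.5, 3.0])
J_toy = lambda W: -np.sum((W - target) ** 2)

W_hat = spsa_ascent(J_toy, W0=np.zeros(4))
```

<p>Because the estimate needs only two evaluations of the objective per iteration, its cost does not grow with the number of samples <it>M</it>; in an actual COCA run the placeholder objective would be replaced by <it>J</it><sub>1</sub>.</p>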
</sec>
<sec>
<st>
<p>Bootstrapping the COCA approach for variability analysis</p>
</st>
<p>In practice, a knowledge gene set typically contains a few hundred genes, far fewer than the several thousand background genes in microarray data. One concern with such an imbalanced comparison is that it will almost inevitably lead to over-fitting. To address this problem, we incorporated a bootstrapping procedure into the COCA approach (see Figure <figr fid="F1">1</figr>). Bootstrapping is a computer-intensive method to generate many 'virtual' samples (called bootstrap samples) by resampling with replacement. By applying an estimator to these bootstrap samples, one can calculate a number of statistics of this estimator, such as its confidence interval and standard error. Moreover, averaging the estimates obtained on bootstrap samples can also improve the stability of a model and reduce over-fitting. This strategy is known as bootstrap aggregating ('bagging') <abbrgrp>
<abbr bid="B24">24</abbr>
</abbrgrp> and has been widely used in many machine learning applications such as classification <abbrgrp>
<abbr bid="B25">25</abbr>
</abbrgrp> and clustering <abbrgrp>
<abbr bid="B26">26</abbr>
</abbrgrp>. Here, we mainly use the 'bagging' scheme to reduce the variance of COCA estimation. In practice, the background genes are re-sampled multiple times to form bootstrap samples, each of a size comparable to that of the knowledge gene set. For each bootstrap sample <b>X</b>*<sup>
<it>b</it>
</sup>, <it>b </it>= 1, &#8943;, <it>B</it>, where <it>B </it>is the total number of bootstrap samples, COCA was applied to estimate the corresponding coordinative direction <it>W</it>*<sup><it>b</it></sup> and participation vector <it>A</it>*<sup><it>b</it></sup> = <b>X</b>
<it>W</it>*<sup><it>b</it></sup>. After ambiguity correction (see Additional file <supplr sid="S1">1</supplr>: Section S3 for more details), we can obtain 'bagging' aggregated estimations of <it>W </it>and <it>A </it>using {<it>W</it>*<sup>
<it>b</it>
</sup>}<sub>
<it>b </it>= 1, &#8943;, <it>B </it>
</sub>and {<it>A</it>*<sup>
<it>b</it>
</sup>}<sub>
<it>b </it>= 1, &#8943;, <it>B</it>
</sub>, respectively. Finally, we used the absolute value of 'bagging' aggregated participation vector to rank genes. The larger the absolute participation value of a gene, the higher the gene was ranked.</p>
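<p>The bagging procedure described above can be sketched as follows, under several assumptions that should be read as hypothetical. The function <it>contrast_direction </it>is a simple placeholder for the COCA estimator (it is not the COCA algorithm), the ambiguity correction is assumed here to be a sign flip towards the first bootstrap estimate (the actual correction is described in Additional file 1 and may differ), and the toy data, <it>B </it>= 50 and all names are chosen for illustration only.</p>

```python
import numpy as np

def contrast_direction(X_know, X_back):
    """Placeholder estimator (NOT COCA): top eigenvector of a covariance contrast."""
    C = X_know.T @ X_know / len(X_know) - X_back.T @ X_back / len(X_back)
    _, vecs = np.linalg.eigh(C)          # eigenvalues are returned in ascending order
    return vecs[:, -1]                   # unit-norm direction, sign is arbitrary

def bagged_ranking(X, knowledge_idx, estimate_direction, B=50, seed=0):
    """Average B bootstrap estimates of W, then rank genes by |A| with A = X W."""
    rng = np.random.default_rng(seed)
    knowledge_idx = np.asarray(knowledge_idx)
    background = np.setdiff1d(np.arange(X.shape[0]), knowledge_idx)
    W_all = []
    for _ in range(B):
        # Resample background genes with replacement, matching the knowledge set size.
        boot = rng.choice(background, size=knowledge_idx.size, replace=True)
        W_b = estimate_direction(X[knowledge_idx], X[boot])
        # Assumed ambiguity correction: flip W_b so that it agrees with the first estimate.
        agrees = not W_all or np.dot(W_b, W_all[0]) >= 0
        W_all.append(W_b if agrees else -W_b)
    W_bag = np.mean(W_all, axis=0)
    A_bag = X @ W_bag                        # bagged participation vector
    ranking = np.argsort(-np.abs(A_bag))     # largest absolute participation first
    return ranking, A_bag, W_bag

# Toy data: genes 0-59 follow one latent profile t; genes 0-19 are the knowledge genes.
rng = np.random.default_rng(1)
N, M = 300, 8
t = 2.0 * np.linspace(-1.0, 1.0, M)
X = 0.3 * rng.normal(size=(N, M))
X[:60] += rng.uniform(1.0, 2.0, size=(60, 1)) * t
ranking, A_bag, W_bag = bagged_ranking(X, np.arange(20), contrast_direction)
```

<p>In this toy setting the genes at the head of <it>ranking </it>should be dominated by the 60 genes that share the latent profile, including the 40 that were not supplied as knowledge genes.</p>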
<suppl id="S1">
<title>
<p>Additional file 1</p>
</title>
<text>
<p>
<b>Supplementary information of the COCA method</b>. The supplementary information includes a geometrical interpretation of the method, the concept of linear extraction and ambiguity correction, and supplementary results of yeast cell cycle and stem cell studies.</p>
</text>
<file name="1471-2105-11-162-S1.PDF">
   <p>Click here for file</p>
</file>
</suppl>
</sec>
</sec>
<sec>
<st>
<p>Results</p>
</st>
<sec>
<st>
<p>Simulation data</p>
</st>
<p>We first applied the proposed COCA approach to simulation data to assess its feasibility. The performance of COCA in gene ranking was compared with that of other methods to demonstrate the improvement. In the one-condition simulation, 8 samples were generated according to Eq. (1) with 5 biological processes, each sample consisting of expression measurements of 5,000 genes. For partial knowledge guidance, we input 50 genes to the COCA algorithm, randomly selected from the 200 top ranked genes (called 'ground truth' genes hereafter) of one biological process. In this way, COCA incorporated the partial knowledge (from the 50 genes) and was set to find the other true knowledge genes (i.e., the remaining 150 'ground truth' genes). We further added a noise component to Eq. (1) to simulate measurements with different signal-to-noise ratios (SNRs), with SNR decreasing gradually from 10 dB to -10 dB. Performance of the algorithm was evaluated by its accuracy in finding the genes regulated by the biological process; accuracy is defined as the ratio of the number of 'ground truth' genes identified by the algorithm to the total number of 'ground truth' genes, when each method selects as many top-ranked genes as there are 'ground truth' genes. Experimental results from this simulation study are shown in Figure <figr fid="F2">2(a)</figr>, which includes a performance comparison with variance-based ranking (VR), an unsupervised method that ranks genes according to their variances. The proposed COCA outperforms VR when SNR is relatively large. When SNR is low (-6 dB to -10 dB), performance converges to that of a random guess.</p>
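<p>The evaluation protocol described above can be sketched in a few lines. This is an illustration under assumptions: SNR is taken here as the ratio of mean signal power to mean noise power in decibels with white Gaussian noise, which the text does not specify, and the function names are hypothetical.</p>

```python
import numpy as np

def add_noise_at_snr(X, snr_db, rng):
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) is about snr_db."""
    signal_power = np.mean(X ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return X + rng.normal(scale=np.sqrt(noise_power), size=X.shape)

def top_k_accuracy(scores, truth_idx):
    """Fraction of ground-truth genes found among the top-K scores, with K = number of truths."""
    truth = set(int(i) for i in truth_idx)
    K = len(truth)
    top = np.argsort(-np.asarray(scores, dtype=float))[:K]
    return len(truth.intersection(int(i) for i in top)) / K

def variance_scores(X):
    """Variance-based ranking (VR): score each gene by the variance of its profile."""
    return np.var(X, axis=1)

# Five genes, two ground-truth genes (0 and 1); only gene 0 is among the top two scores.
acc_toy = top_k_accuracy([0.9, 0.1, 0.8, 0.3, 0.7], [0, 1])   # 0.5
```

<p>Under this definition, a random ranking has an expected accuracy of <it>K</it>/<it>N</it>, where <it>K </it>is the number of 'ground truth' genes and <it>N </it>is the number of genes (for instance 200/5,000 = 0.04 if all 200 'ground truth' genes are counted), which gives a sense of the random-guess baseline.</p>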
<fig id="F2"><title><p>Figure 2</p></title><caption><p>Performance comparison using simulation data as measured by accuracy vs. signal-to-noise ratio (SNR)</p></caption><text>
   <p><b>Performance comparison using simulation data as measured by accuracy vs. signal-to-noise ratio (SNR)</b>. (a) Comparison of COCA and variance-based ranking (VR) for one-condition case, showing random guess as a baseline. (b) Comparison of COCA, fold-change and SAM for two-condition case, taking random guess as a baseline.</p>
</text><graphic file="1471-2105-11-162-2"/></fig>
<p>Simulations of the two-condition case were also performed. For each condition, 20 samples were generated according to a linear model (Eq. (1)) with 5 biological processes, each sample consisting of expression measurements of 10,000 genes. The difference between the two conditions is that 100 genes, regulated by one biological process in the first condition, were taken out or eliminated in the second condition. Mathematically, let us denote the participation matrices under two conditions as <b>A</b>
<sub>
<it>cond</it>1 </sub>= [<it>A</it>
<sub>1</sub>, <it>A</it>
<sub>2</sub>, <it>A</it>
<sub>3</sub>, <it>A</it>
<sub>4</sub>, <it>A</it>
<sub>5</sub>] and <b>A</b>
<sub>
<it>cond</it>2 </sub>= [<inline-formula>
<graphic file="1471-2105-11-162-i14.gif"/>
</inline-formula>, <it>A</it>
<sub>2</sub>, <it>A</it>
<sub>3</sub>, <it>A</it>
<sub>4</sub>, <it>A</it>
<sub>5</sub>], respectively; except that 100 non-zero items in <it>A</it>
<sub>1 </sub>were set to be zero in <inline-formula>
<graphic file="1471-2105-11-162-i14.gif"/>
</inline-formula>, the items in <it>A</it>
<sub>1 </sub>are the same as those in <inline-formula>
<graphic file="1471-2105-11-162-i14.gif"/>
</inline-formula>. Therefore, these 100 'ground-truth' genes are the targets to be detected by the algorithm. For COCA, 50 knowledge genes (not including any of the 100 'ground-truth' genes) are randomly chosen to provide guidance for the algorithm to find the 100 'ground-truth' genes. As in the one-condition case, SNR is gradually decreased from 10 dB to -10 dB. Again, performance of the algorithm was evaluated in terms of its accuracy in finding the 'ground-truth' genes; accuracy is defined as the number of detected 'ground-truth' genes among the top ranked 100 genes divided by the total number of 'ground-truth' genes (100 in this case). Figure <figr fid="F2">2(b)</figr> shows the detection accuracies for COCA, fold-change and SAM <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>, respectively. COCA outperforms both fold-change and SAM when SNR is higher than -6 dB. When SNR is below -6 dB, the performance of all three approaches converges to that of a random guess. It is worth noting that our COCA approach is designed to detect changes occurring at the latent level (i.e., the biological process level), while the fold-change and SAM approaches are intended mainly to detect changes at the observation level (i.e., the gene expression level). This major difference can also be appreciated from this simulation study; as seen in Figure <figr fid="F2">2(b)</figr>, the performance of COCA remains superior as SNR decreases from 10 dB to 0 dB, while the performance of fold-change or SAM degrades substantially.</p>
</sec>
<sec>
<st>
<p>Yeast cell cycle data</p>
</st>
<p>We then applied the COCA approach to yeast cell cycle data to identify the genes involved in the cell cycle. The yeast cell cycle microarray experiment was performed using fluorescently labeled cDNA arrays, measuring the expression levels of 6178 genes of wild-type <it>S. cerevisiae </it>cultures. The cell cycle was synchronized by three independent methods: first, &#945;-pheromone (&#945;-factor) was used to arrest the cells in G1 phase; second, centrifugal elutriation was used to obtain small G1 cells; third, a temperature-sensitive mutation <it>cdc15-2 </it>was used to arrest cells in mitosis. In our study, we used 59 cDNA samples from these three synchronization experiments <abbrgrp>
<abbr bid="B27">27</abbr>
</abbrgrp>. About 800 genes were identified to be periodically expressed during the cell cycle, which can be further grouped into five subsets related to cell cycle phases M/G1, G1, S, G2 and M <abbrgrp>
<abbr bid="B27">27</abbr>
</abbrgrp>. In this study, we used these five subsets of genes to further demonstrate the importance of coordinative components in the COCA approach. The total numbers of genes in five subsets (corresponding to M/G1, G1, S, G2 and M) are 113, 120, 196, 300 and 71, respectively. For each phase, 20 genes were randomly selected as knowledge genes to guide the COCA approach. After finding the coordinative component, gene expressions of all genes were projected onto the component for ranking.</p>
<p>To objectively evaluate the performance, receiver operating characteristic (ROC) analysis was conducted to obtain the sensitivity and specificity of the algorithm. Two other approaches were also implemented for comparison: the first is the VR approach, which ranks the genes according to their variances; the second is a supervised approach, which uses principal component analysis (PCA) to first find the principal component of the given knowledge genes and then ranks all the genes according to their absolute correlations with that principal component. The comparison results are shown in Figure <figr fid="F3">3</figr> for G1 and M phases; the complete results for all the phases can be found in the supplemental figures [Additional file <supplr sid="S1">1</supplr>: Figures S1 - S3]. The areas under ROC curves (AUCs) are summarized in Table <tblr tid="T1">1</tblr> for all the cell cycle phases under different synchronization methods. Both COCA and PCA-based approaches substantially outperform VR. The VR approach lacks knowledge guidance and hence performs poorly. More importantly, the COCA approach outperforms the PCA-based approach for all cell cycle phases, since it is the coordinative component (not the principal component) that reflects the underlying regulatory mechanism in the yeast cell cycle.</p>
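<p>For readers who wish to reproduce this style of comparison, the AUC and the PCA-based baseline can be sketched as below. This is a hedged illustration: the AUC is computed through the standard pairwise (Mann-Whitney) identity, and <it>pca_based_scores </it>reflects only one plausible reading of the PCA-based approach described above (per-gene centring and the leading sample-space component are assumptions). All function names are hypothetical.</p>

```python
import numpy as np

def roc_auc(scores, is_member):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    pos, neg = scores[is_member], scores[~is_member]
    # Count member/non-member pairs that are ordered correctly; ties count one half.
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (pos.size * neg.size)

def pca_based_scores(X, knowledge_idx):
    """Assumed PCA baseline: |correlation| of each gene with the knowledge genes' first PC.

    Assumes that no gene is constant across samples (otherwise a norm is zero).
    """
    Xc = X - X.mean(axis=1, keepdims=True)            # centre each gene across samples
    _, _, Vt = np.linalg.svd(Xc[knowledge_idx], full_matrices=False)
    pc = Vt[0] - Vt[0].mean()                         # leading profile across samples
    norms = np.linalg.norm(Xc, axis=1) * np.linalg.norm(pc)
    return np.abs(Xc @ pc) / norms

# Toy check: three of the four member/non-member pairs are ordered correctly.
auc_toy = roc_auc([0.9, 0.8, 0.3, 0.2], [True, False, True, False])   # 0.75
```

<p>An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect one, which is the scale on which the values in Table 1 should be read.</p>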
<fig id="F3"><title><p>Figure 3</p></title><caption><p>Receiver Operating Characteristic (ROC) curves of COCA to rank yeast cell cycle-related genes in (a) G1 phase and (b) M phase as synchronized by CDC15</p></caption><text>
   <p><b>Receiver Operating Characteristic (ROC) curves of COCA to rank yeast cell cycle-related genes in (a) G1 phase and (b) M phase as synchronized by CDC15</b>. The ROC curves of other phases can be found in the Additional files [Additional file <supplr sid="S1">1</supplr>].</p>
</text><graphic file="1471-2105-11-162-3"/></fig>
<tbl id="T1"><title><p>Table 1</p></title><caption><p>Performance comparison of COCA, PCA-based and variance-based ranking (VR) approaches.</p></caption><tblbdy cols="10">
      <r>
         <c>
            <p/>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>Alpha-factor arrest</b>
            </p>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>CDC15 arrest</b>
            </p>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>CDC28 arrest</b>
            </p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c cspan="9">
            <hr/>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>COCA</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>PCA-based</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>VR</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>COCA</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>PCA-based</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>VR</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>COCA</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>PCA-based</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>VR</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>M/G1</p>
         </c>
         <c ca="center">
            <p>
               <b>0.8477</b>
            </p>
         </c>
         <c ca="center">
            <p>0.8277</p>
         </c>
         <c ca="center">
            <p>0.5685</p>
         </c>
         <c ca="center">
            <p>
               <b>0.9045</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7594</p>
         </c>
         <c ca="center">
            <p>0.5854</p>
         </c>
         <c ca="center">
            <p>
               <b>0.7904</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7661</p>
         </c>
         <c ca="center">
            <p>0.6524</p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>G2</p>
         </c>
         <c ca="center">
            <p>
               <b>0.8182</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7172</p>
         </c>
         <c ca="center">
            <p>0.6523</p>
         </c>
         <c ca="center">
            <p>
               <b>0.8888</b>
            </p>
         </c>
         <c ca="center">
            <p>0.6979</p>
         </c>
         <c ca="center">
            <p>0.5547</p>
         </c>
         <c ca="center">
            <p>
               <b>0.8036</b>
            </p>
         </c>
         <c ca="center">
            <p>0.6767</p>
         </c>
         <c ca="center">
            <p>0.7418</p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>M</p>
         </c>
         <c ca="center">
            <p>
               <b>0.7731</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7537</p>
         </c>
         <c ca="center">
            <p>0.6705</p>
         </c>
         <c ca="center">
            <p>
               <b>0.8873</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7365</p>
         </c>
         <c ca="center">
            <p>0.5572</p>
         </c>
         <c ca="center">
            <p>
               <b>0.8448</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7585</p>
         </c>
         <c ca="center">
            <p>0.5685</p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>G1</p>
         </c>
         <c ca="center">
            <p>
               <b>0.9123</b>
            </p>
         </c>
         <c ca="center">
            <p>0.821</p>
         </c>
         <c ca="center">
            <p>0.6611</p>
         </c>
         <c ca="center">
            <p>
               <b>0.9172</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7119</p>
         </c>
         <c ca="center">
            <p>0.5521</p>
         </c>
         <c ca="center">
            <p>
               <b>0.9032</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7524</p>
         </c>
         <c ca="center">
            <p>0.792</p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>S</p>
         </c>
         <c ca="center">
            <p>
               <b>0.8763</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7641</p>
         </c>
         <c ca="center">
            <p>0.7054</p>
         </c>
         <c ca="center">
            <p>
               <b>0.9478</b>
            </p>
         </c>
         <c ca="center">
            <p>0.6867</p>
         </c>
         <c ca="center">
            <p>0.6385</p>
         </c>
         <c ca="center">
            <p>
               <b>0.8532</b>
            </p>
         </c>
         <c ca="center">
            <p>0.7058</p>
         </c>
         <c ca="center">
            <p>0.799</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>Areas under the ROC curves (AUCs) for ranking cell cycle-related genes in the five yeast cell cycle phases, i.e., M/G1, G1, S, G2 and M.</p>
   </tblfn></tbl>
</sec>
<sec>
<st>
<p>Embryonic stem cell data</p>
</st>
<p>Understanding the molecular mechanisms that control self-renewal and differentiation in embryonic stem cells (ESCs) is central to realizing their potential in medicine and science <abbrgrp>
<abbr bid="B28">28</abbr>
<abbr bid="B29">29</abbr>
</abbrgrp>. ESCs serve as a model system for studying cell development and have considerable potential in cancer research and for improving cancer treatments. Most studies of ESC transcriptomes have relied primarily on fold changes of individual genes to identify the molecular signatures of ESCs and to elucidate the mechanisms controlling pluripotency <abbrgrp>
<abbr bid="B30">30</abbr>
<abbr bid="B31">31</abbr>
<abbr bid="B32">32</abbr>
</abbrgrp>. Here, we used the COCA algorithm to infer biologically relevant genes in ESC-critical pathways, including the Notch, JAK/STAT, TGF&#946; and WNT pathways <abbrgrp>
<abbr bid="B31">31</abbr>
</abbrgrp>.</p>
<p>The mouse embryonic stem cell data sets that we used were obtained from <abbrgrp>
<abbr bid="B33">33</abbr>
</abbrgrp>. The original research aimed to study the genetic determinants of mouse embryonic stem cell (mESC) differentiation. The transition from mESC to embryoid body (EB) was initiated by withdrawing leukemia inhibitory factor (LIF) and removing the murine embryonic feeder cells. The data that we used were measured on the R1 cell line as an 11-point time series over a period of two weeks (0 h - undifferentiated mESCs, 6 h, 12 h, 18 h, 24 h, 36 h, 48 h, 4 d, 7 d, 9 d, and 14 d), with three replicates at each time point (GEO database accession number: GSE2972). In our study, we used only the 33 samples measured on the Affymetrix MOE430A GeneChip, because the MOE430A array measures genes that are generally better characterized than those on MOE430B and has much better signal quality than MOE430B in terms of the false discovery rate of significantly changed probe sets <abbrgrp>
<abbr bid="B33">33</abbr>
</abbrgrp>.</p>
<p>In this study, 5,000 genes were randomly sampled as background genes for bootstrapping, and one hundred bootstrap iterations were carried out to estimate the variability and then to perform gene ranking. For the COCA approach, pathway-related genes were selected as knowledge genes to guide the search for the coordinative component. Once the coordinative component was found, the expression profiles of all genes were projected onto it for ranking.</p>
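<p>The bootstrap procedure described above can be outlined roughly as follows. This is a sketch under stated assumptions and not the authors' code: <b>placeholder_direction</b> is a hypothetical stand-in for the COCA optimization (which is defined elsewhere in the paper), background genes are assumed to be drawn without replacement from the non-knowledge genes, and genes are assumed to be ranked by the mean absolute projection over iterations.</p>

```python
import math
import random


def center(row):
    m = sum(row) / len(row)
    return [v - m for v in row]


def mean_profile(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def placeholder_direction(knowledge_rows, background_rows):
    """Hypothetical stand-in for the COCA fit (NOT the paper's optimization):
    unit vector along mean(knowledge) - mean(background) of centered profiles."""
    k = mean_profile([center(r) for r in knowledge_rows])
    b = mean_profile([center(r) for r in background_rows])
    d = [x - y for x, y in zip(k, b)]
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d]


def bootstrap_rank(X, knowledge_idx, n_background, n_iter,
                   fit=placeholder_direction, seed=0):
    """Repeat: sample background genes, fit a direction, project every gene.
    Returns (ranking, mean_scores, sd_scores); ranking is by mean |projection|."""
    rng = random.Random(seed)
    known = set(knowledge_idx)
    candidates = [g for g in range(len(X)) if g not in known]
    centered = [center(r) for r in X]
    totals = [0.0] * len(X)
    squares = [0.0] * len(X)
    for _ in range(n_iter):
        # Assumption: background genes are sampled without replacement
        background = rng.sample(candidates, n_background)
        u = fit([X[g] for g in knowledge_idx], [X[g] for g in background])
        for g, row in enumerate(centered):
            s = abs(sum(a * b for a, b in zip(row, u)))
            totals[g] += s
            squares[g] += s * s
    means = [t / n_iter for t in totals]
    sds = [math.sqrt(max(q / n_iter - m * m, 0.0)) for q, m in zip(squares, means)]
    ranking = sorted(range(len(X)), key=lambda g: -means[g])
    return ranking, means, sds


if __name__ == "__main__":
    # Toy data: genes 0-2 share a rising pattern, genes 3-9 are flat
    X = [[0, 1, 2, 3], [0, 2, 4, 6], [0, 3, 6, 9]]
    X += [[float(i)] * 4 for i in range(7)]
    ranking, means, sds = bootstrap_rank(X, [0, 1], n_background=4, n_iter=20)
    print(ranking[:3])
```

<p>With the settings reported above, the call would use 5,000 background genes and 100 iterations; the per-gene standard deviations give a rough view of how stable each gene's score is across background samples.</p>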
<p>For each pathway analysis, we generated a list of the top 500 probe sets ranked by COCA and conducted pathway and functional enrichment analysis using DAVID <abbrgrp>
<abbr bid="B34">34</abbr>
</abbrgrp>
<url>http://david.abcc.ncifcrf.gov/</url>. The results of the pathway enrichment analysis are listed in Table <tblr tid="T2">2</tblr> for the Notch pathway; the results for the other pathways (i.e., the JAK/STAT, TGF&#946; and WNT pathways) and the detailed gene lists can be found in the Supplemental Tables S1 [Additional file <supplr sid="S1">1</supplr>], S3 - S6 [Additional files <supplr sid="S2">2</supplr>, <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr> and <supplr sid="S5">5</supplr>]. Taking the Notch pathway as an example, Figure <figr fid="F4">4</figr> shows that COCA effectively boosts the ranking of the pathway-related gene set compared with conventional approaches such as VR and EDGE <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>. Once the coordinative direction is estimated, weakly expressed but related genes can be discovered. While it is well known that many downstream genes have large variation, COCA can boost the ranking of genes with smaller variation but larger participation values. In the pathway enrichment analysis, VR mainly prioritizes ribosome, cell adhesion and metabolic pathways (Table S7), which are more likely to lie downstream of stem cell development. The EDGE-based ranking prioritizes pathways related to cell communication, focal adhesion and ECM-receptor interaction (Table S8). In contrast, COCA-based ranking prioritizes many upstream pathways (Table <tblr tid="T2">2</tblr>), especially several signaling pathways that might drive the downstream pathways identified by VR.</p>
<p>The gene list obtained from Notch pathway-guided COCA includes a Notch receptor (NOTCH3) and three ligands (DSL1, JAG1 and JAG2) that can potentially bind to the Notch receptor (Figure <figr fid="F5">5</figr>). The list also includes APH-1, a gene encoding a multipass membrane protein that is required for Notch pathway signaling. In addition, the list includes many transcription factors that are Notch target genes, revealing a signaling cascade that modulates cell fate by further regulating downstream gene expression. For example, SOX2 is a transcription factor closely related to the Notch pathway in the development of the inner ear <abbrgrp>
<abbr bid="B35">35</abbr>
</abbrgrp> and neocortex <abbrgrp>
<abbr bid="B36">36</abbr>
</abbrgrp>.</p>
<p>While functional enrichment analysis gives a global picture in which the top COCA-ranked genes tend to show stronger functional over-representation than those ranked by VR or EDGE, we also performed Gene Set Enrichment Analysis (GSEA) <abbrgrp>
<abbr bid="B37">37</abbr>
</abbrgrp> on the ranked gene lists to examine whether the ranking significantly promotes the knowledge gene set. We used a web tool, GeneTrail <abbrgrp>
<abbr bid="B38">38</abbr>
</abbrgrp>, for the GSEA, with the false discovery rate (FDR) used to correct for multiple hypothesis testing (the FDR threshold was set to 10%). We also set the minimum gene number to 10 to avoid reporting very small gene sets. The results (Table <tblr tid="T3">3</tblr> and Table S2(a)-(c)) show that COCA ranking tends to rank signaling pathways relatively high, whereas variance-based ranking (VR) mainly boosts ribosome, metabolic and other downstream biological processes (Table <tblr tid="T4">4</tblr>). None of the signaling pathways found by the COCA approach appears in the GSEA results from the VR approach. We also noticed that the JAK/STAT pathway (GSEA FDR = 0.077) was ranked lower than the other pathways (GSEA FDR = 0.013, 9.71E-05 and 0.042 for Notch, TGF&#946; and WNT, respectively). To understand this, we examined the GSEA results from the VR approach further and found that JAK/STAT member genes were significantly enriched at the bottom of the VR ranking list (FDR = 0.0279572), suggesting that most JAK/STAT member genes show smaller expression changes (and thus a relatively weak signal). This could explain, at least in part, why the JAK/STAT pathway was ranked lower than the Notch, TGF&#946; and WNT pathways.</p>
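<p>The multiple-testing step mentioned above can be illustrated with a short sketch. It assumes the Benjamini-Hochberg procedure as the FDR adjustment and assumes that gene sets below the minimum size are dropped before the adjustment; neither detail is stated here, and the function names and example values are hypothetical.</p>

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_sets(results, fdr_threshold=0.10, min_genes=10):
    """results: list of (name, n_genes, p_value) tuples.
    Drops gene sets smaller than min_genes, adjusts the remaining p-values,
    and keeps the sets whose adjusted value does not exceed the threshold."""
    kept = [r for r in results if r[1] >= min_genes]
    q_values = benjamini_hochberg([r[2] for r in kept])
    return [(r[0], q) for r, q in zip(kept, q_values) if fdr_threshold >= q]


if __name__ == "__main__":
    # Hypothetical gene sets: (name, number of genes, raw p-value)
    example = [("set_A", 25, 0.001), ("set_B", 5, 0.0001),
               ("set_C", 40, 0.2), ("set_D", 12, 0.04)]
    for name, q in significant_sets(example):
        print(name, round(q, 4))
```

<p>Filtering by set size before the adjustment reduces the number of tests being corrected for; applying the filter afterwards would generally give more conservative adjusted values.</p>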
<suppl id="S2">
<title>
<p>Additional file 2</p>
</title>
<text>
<p>
<b>The top 500 probe sets ranked by Notch pathway-guided COCA approach.</b>
</p>
</text>
<file name="1471-2105-11-162-S2.PDF">
   <p>Click here for file</p>
</file>
</suppl>
<suppl id="S3">
<title>
<p>Additional file 3</p>
</title>
<text>
<p>
<b>The top 500 probe sets ranked by JAK/STAT pathway-guided COCA approach.</b>
</p>
</text>
<file name="1471-2105-11-162-S3.PDF">
   <p>Click here for file</p>
</file>
</suppl>
<suppl id="S4">
<title>
<p>Additional file 4</p>
</title>
<text>
<p>
<b>The top 500 probe sets ranked by TGF&#946; pathway-guided COCA approach.</b>
</p>
</text>
<file name="1471-2105-11-162-S4.PDF">
   <p>Click here for file</p>
</file>
</suppl>
<suppl id="S5">
<title>
<p>Additional file 5</p>
</title>
<text>
<p>
<b>The top 500 probe sets ranked by WNT pathway-guided COCA approach.</b>
</p>
</text>
<file name="1471-2105-11-162-S5.PDF">
   <p>Click here for file</p>
</file>
</suppl>
<fig id="F4"><title><p>Figure 4</p></title><caption><p>A boxplot of the ranking of Notch pathway probe sets by Notch pathway-guided COCA, as compared to those by variance-based ranking (VR) and EDGE-based ranking, respectively</p></caption><text>
   <p><b>A boxplot of the ranking of Notch pathway probe sets by Notch pathway-guided COCA, as compared to those by variance-based ranking (VR) and EDGE-based ranking, respectively</b>.</p>
</text><graphic file="1471-2105-11-162-4"/></fig>
<fig id="F5"><title><p>Figure 5</p></title><caption><p>The identified Notch pathway including several growth factors, transcription factors and oncogenes</p></caption><text>
   <p><b>The identified Notch pathway including several growth factors, transcription factors and oncogenes</b>. Some of the members (e.g., NOTCH3, JAG1, JAG2 and SOX2) are known to be associated with the Notch pathway while several novel members are revealed by the COCA approach, e.g., transcription factors: TCF4, TBP and PITX2; oncogenes: MYCN, FGFR1 and CCND1.</p>
</text><graphic file="1471-2105-11-162-5"/></fig>
<tbl id="T2"><title><p>Table 2</p></title><caption><p>Enriched pathways in the top 500 probe sets ranked by Notch pathway-guided COCA approach</p></caption><tblbdy cols="5">
      <r>
         <c ca="left">
            <p>
               <b>Pathway Term</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Count</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>%</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>p-value</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>FDR</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Pentose and glucuronate interconversions</p>
         </c>
         <c ca="left">
            <p>8</p>
         </c>
         <c ca="center">
            <p>1.70%</p>
         </c>
         <c ca="left">
            <p>1.64E-06</p>
         </c>
         <c ca="center">
            <p>0.00001847</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Notch signaling pathway</b>
            </p>
         </c>
         <c ca="left">
            <p>8</p>
         </c>
         <c ca="center">
            <p>1.70%</p>
         </c>
         <c ca="left">
            <p>5.65E-04</p>
         </c>
         <c ca="center">
            <p>0.00640122</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Porphyrin and chlorophyll metabolism</p>
         </c>
         <c ca="left">
            <p>7</p>
         </c>
         <c ca="center">
            <p>1.49%</p>
         </c>
         <c ca="left">
            <p>6.28E-04</p>
         </c>
         <c ca="center">
            <p>0.00719142</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>p53 signaling pathway</p>
         </c>
         <c ca="left">
            <p>9</p>
         </c>
         <c ca="center">
            <p>1.91%</p>
         </c>
         <c ca="left">
            <p>9.87E-04</p>
         </c>
         <c ca="center">
            <p>0.01107482</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Cell cycle</p>
         </c>
         <c ca="left">
            <p>11</p>
         </c>
         <c ca="center">
            <p>2.34%</p>
         </c>
         <c ca="left">
            <p>0.002362</p>
         </c>
         <c ca="center">
            <p>0.02596788</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Starch and sucrose metabolism</p>
         </c>
         <c ca="left">
            <p>8</p>
         </c>
         <c ca="center">
            <p>1.70%</p>
         </c>
         <c ca="left">
            <p>0.002805</p>
         </c>
         <c ca="center">
            <p>0.03161464</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Androgen and estrogen metabolism</p>
         </c>
         <c ca="left">
            <p>7</p>
         </c>
         <c ca="center">
            <p>1.49%</p>
         </c>
         <c ca="left">
            <p>0.002846</p>
         </c>
         <c ca="center">
            <p>0.03237976</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Polyunsaturated fatty acid biosynthesis</p>
         </c>
         <c ca="left">
            <p>4</p>
         </c>
         <c ca="center">
            <p>0.85%</p>
         </c>
         <c ca="left">
            <p>0.015829</p>
         </c>
         <c ca="center">
            <p>0.1738869</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Metabolism of xenobiotics by cytochrome P450</p>
         </c>
         <c ca="left">
            <p>7</p>
         </c>
         <c ca="center">
            <p>1.49%</p>
         </c>
         <c ca="left">
            <p>0.022415</p>
         </c>
         <c ca="center">
            <p>0.2321759</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Glycolysis/Gluconeogenesis</p>
         </c>
         <c ca="left">
            <p>6</p>
         </c>
         <c ca="center">
            <p>1.28%</p>
         </c>
         <c ca="left">
            <p>0.024584</p>
         </c>
         <c ca="center">
            <p>0.253719</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Fructose and mannose metabolism</p>
         </c>
         <c ca="left">
            <p>5</p>
         </c>
         <c ca="center">
            <p>1.06%</p>
         </c>
         <c ca="left">
            <p>0.054919</p>
         </c>
         <c ca="center">
            <p>0.4893802</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Galactose metabolism</p>
         </c>
         <c ca="left">
            <p>4</p>
         </c>
         <c ca="center">
            <p>0.85%</p>
         </c>
         <c ca="left">
            <p>0.075736</p>
         </c>
         <c ca="center">
            <p>0.6119058</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>PPAR signaling pathway</p>
         </c>
         <c ca="left">
            <p>6</p>
         </c>
         <c ca="center">
            <p>1.28%</p>
         </c>
         <c ca="left">
            <p>0.083934</p>
         </c>
         <c ca="center">
            <p>0.6456008</p>
         </c>
      </r>
   </tblbdy></tbl>
<tbl id="T3"><title><p>Table 3</p></title><caption><p>GSEA analysis results for the gene ranking list generated by Notch pathway-guided COCA approach</p></caption><tblbdy cols="2">
      <r>
         <c ca="left">
            <p>
               <b>Pathway Term</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>GSEA FDR</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Notch signaling pathway</b>
            </p>
         </c>
         <c ca="left">
            <p>0.0133129</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>DNA replication</p>
         </c>
         <c ca="left">
            <p>0.0775257</p>
         </c>
      </r>
   </tblbdy></tbl>
<tbl id="T4"><title><p>Table 4</p></title><caption><p>GSEA analysis results for the gene ranking list generated by variance-based ranking (VR)</p></caption><tblbdy cols="2">
      <r>
         <c ca="left">
            <p>
               <b>Pathway Term</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>GSEA FDR</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Ribosome</p>
         </c>
         <c ca="center">
            <p>0.000129461</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Parkinson's disease</p>
         </c>
         <c ca="center">
            <p>0.0279572</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Metabolic pathways</p>
         </c>
         <c ca="center">
            <p>0.0387599</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Oxidative phosphorylation</p>
         </c>
         <c ca="center">
            <p>0.0387599</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Homologous recombination</p>
         </c>
         <c ca="center">
            <p>0.0548052</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>DNA replication</p>
         </c>
         <c ca="center">
            <p>0.0694425</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Nucleotide excision repair</p>
         </c>
         <c ca="center">
            <p>0.0694425</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Cell cycle</p>
         </c>
         <c ca="center">
            <p>0.0852042</p>
         </c>
      </r>
   </tblbdy></tbl>
<p>Figure S4 in Additional file <supplr sid="S1">1</supplr> shows a Venn diagram of the top 500 genes of those pathways as detected by the COCA ranking approach. As illustrated, most genes are unique to a single pathway and thus pathway-specific, while other genes are common to several pathways, suggesting possible crosstalk between these pathways. For example, MYO10 and MYL9 are shared between the Notch and TGF&#946; pathways, while IGF2, APP and S100A6 are common to all the pathways examined. Many of the top-ranked genes identified by the COCA approach are transcription factors (Table S9 and Table S10). Similarly, some transcription factors are pathway-specific, while others are common to several pathways. For example, three transcription factors, JARID2, SOX2 and PITX2, are among those shared between the Notch and TGF&#946; pathways and play a critical role in controlling self-renewal and differentiation of ESCs <abbrgrp>
<abbr bid="B32">32</abbr>
<abbr bid="B39">39</abbr>
</abbrgrp>. More interestingly, many oncogenes are among the top-ranked genes in each pathway by the COCA approach (Table S9 and Table S10; see Figure <figr fid="F5">5</figr> for an example), which reaffirms the notion that stem cells are similar to cancer cells at the molecular level <abbrgrp>
<abbr bid="B32">32</abbr>
</abbrgrp>.</p>
<p>We also examined the top 20 genes specifically by looking into their annotations (Table S12(a)-(d)). Within the top 20 genes ranked by Notch pathway-guided COCA (Table S12(a)), there are several genes related to differentiation (Tdgf1, Egr1 and Lefty1), cell growth (Ddit4, Hk2, Phlda2, Egln3 and Igfbp1) and tumor/cancer development (Afp, Sfrp2, Egr1, Hk2 and Phlda2). Some of them are also related to the determination of particular organs, as demonstrated by biological studies. For example, Tdgf1 (Teratocarcinoma-derived growth factor 1, also known as Cripto-1 growth factor) could play a role in the determination of the epiblastic cells that subsequently give rise to the mesoderm <abbrgrp>
<abbr bid="B40">40</abbr>
</abbrgrp>, and it also contributes to the deregulated growth of cancer cells <abbrgrp>
<abbr bid="B41">41</abbr>
</abbrgrp>. Note that Tdgf1 was ranked No. 2 by Notch pathway-guided COCA ranking but No. 906 by variance-based ranking (VR), suggesting that COCA can efficiently boost the ranking of biologically relevant genes. Another gene, left-right determination factor 1 (Lefty1), is known to play a major role during mouse gastrulation and is transiently expressed during human embryonic stem cell differentiation <abbrgrp>
<abbr bid="B42">42</abbr>
</abbrgrp>. Lefty1 was ranked No. 9 by COCA ranking but No. 11,900 by VR, once again suggesting the effectiveness of the COCA approach.</p>
<p>Taken together, the results obtained from the COCA approach provide not only new insights into the complex system of signaling pathways, but also new clues for investigating the molecular mechanisms underlying ESC development. We believe that COCA has considerable potential for use in many other studies to help identify biologically meaningful candidate genes and improve our understanding of biological pathways.</p>
</sec>
</sec>
<sec>
<st>
<p>Discussion</p>
</st>
<p>Gene ranking is an important task in genomic data analysis, providing biologists with candidate genes of mechanistic interest for further study. However, single gene-based approaches, such as fold-change and SAM <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>, suffer from the large noise in microarray data, particularly when the signal-to-noise ratio is relatively low, which makes gene ranking unreliable. This limitation has motivated many researchers to integrate biological knowledge into data analysis for reliable gene ranking <abbrgrp>
<abbr bid="B8">8</abbr>
<abbr bid="B9">9</abbr>
</abbrgrp>. For integration, one must keep in mind that different information sources may not always be sufficiently robust, complete and/or accurate. COCA tries to address this problem by finding a coordinative component from the observations, providing a semi-supervised learning approach based on optimization, in contrast to combining knowledge and observations heuristically. Such a semi-supervised learning scheme is also a practical solution to the problem, since biological knowledge itself contains false positives and false negatives from several sources. For example, knowledge of gene function is often obtained from other biological experiments that contain noise, and the knowledge can be incomplete, too general, and frequently not condition-specific and therefore irrelevant to the biological conditions under study. Therefore, in the proposed approach, knowledge genes are used only to provide guidance, rather than forcing the algorithm to abide by biological knowledge.</p>
<p>Unsupervised methods, which do not rely on any prior knowledge, can serve as exploratory tools to reveal interesting gene patterns or potential phenotype groupings at an initial data analysis stage. However, for studies with a specific biological focus, e.g., looking for the genes related to given biological processes or pathways, semi-supervised or supervised methods are more appropriate than unsupervised methods. If we have sufficient confidence in the available knowledge, supervised learning is usually powerful enough to guide us in finding important clues. However, since biological knowledge is usually incomplete, supervised methods can be biased and misleading. That is also one of our motivations for performing semi-supervised learning, i.e., using knowledge as guidance while simultaneously looking at the characteristics of the data. Therefore, one should choose unsupervised, semi-supervised or supervised methods in different situations, according to the availability and quality of biological knowledge. It can also be a practical strategy to combine them in order to confirm the findings from different views.</p>
<p>Notice that the optimization criterion defined in Eq. (2) of the COCA approach is similar, at least in principle, to that of linear discriminant analysis (LDA) <abbrgrp>
<abbr bid="B43">43</abbr>
</abbrgrp>. In LDA (a supervised learning approach), the criterion is to maximize the ratio of between-class variance to within-class variance; the optimal linear transformation is obtained by maximizing the separability of two classes. The criterion in COCA is designed to enable semi-supervised learning that extracts the component of interest under the guidance of prior knowledge genes; the linear transformation is constructed so as to maximize the likelihood of positive knowledge masking with respect to negative knowledge masking.</p>
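<p>For reference, the two-class LDA (Fisher) criterion mentioned above is commonly written as shown below. The symbols are introduced here for illustration only and are not the notation of Eq. (2): <b>w</b> is the projection direction, <b>m</b><sub>1</sub> and <b>m</b><sub>2</sub> are the class means, and <b>S</b><sub>B</sub> and <b>S</b><sub>W</sub> are the between-class and within-class scatter matrices.</p>

```latex
J(\mathbf{w}) = \frac{\mathbf{w}^{\mathsf{T}} \mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\mathsf{T}} \mathbf{S}_W \mathbf{w}},
\qquad
\mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^{\mathsf{T}},
\qquad
\mathbf{S}_W = \sum_{c=1}^{2} \sum_{\mathbf{x} \in \mathcal{C}_c} (\mathbf{x} - \mathbf{m}_c)(\mathbf{x} - \mathbf{m}_c)^{\mathsf{T}}
```

<p>When <b>S</b><sub>W</sub> is invertible, the maximizer is proportional to <b>S</b><sub>W</sub><sup>-1</sup>(<b>m</b><sub>1</sub> - <b>m</b><sub>2</sub>). The similarity noted in the text lies in the ratio form of the objective; COCA replaces the class scatters with quantities defined by positive and negative knowledge masking.</p>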
<p>The importance of the biological guidance incorporated in the COCA approach also needs further discussion. Recently, many statistical decomposition methods have been applied to microarray data in an attempt to elucidate the underlying biological mechanisms <abbrgrp>
<abbr bid="B9">9</abbr>
<abbr bid="B13">13</abbr>
<abbr bid="B14">14</abbr>
<abbr bid="B21">21</abbr>
<abbr bid="B44">44</abbr>
</abbrgrp>. However, many of these methods lack an appropriate consideration of biological relevance. Statistical assumptions, such as uncorrelatedness for PCA and independence for ICA, may not be valid in many biological processes, pathways or networks. For example, biological processes or pathways often exhibit redundancy in their signaling and crosstalk with other signaling pathways to keep the system robust. Such redundancy and crosstalk violate the statistical assumptions of PCA and ICA, respectively. Consequently, many statistical decomposition methods are incapable of revealing the underlying biological mechanisms. Even if the statistical assumption is considered to be broadly acceptable, improper model selection in any statistical decomposition method will likely bias the results. For example, ICA with an improper model order will either miss important components or generate false components. Cross-validation is often used to select a suitable model order for prediction based on the generalization performance of the model. However, it is computationally demanding to evaluate all model orders exhaustively; in many cases, even an appropriate model order cannot guarantee the biological relevance of the corresponding results.</p>
<p>COCA has several advantages over conventional statistical decomposition methods such as PCA and ICA. COCA is guided by biological knowledge with the goal of extracting the coordinative component related to a specific biological process or pathway. COCA is also an optimization approach to maximize a coordinative participation ratio of pathway members to non-pathway members. Indeed, the ratio implicitly incorporates a negative reference to the knowledge to make the result biologically comparable. The estimated coordinative component is thus biologically relevant and condition-specific for the study. In addition, COCA avoids the model selection problem by extracting only the desired component rather than performing unnecessary decomposition to uncover all the components underneath. The bootstrapping procedure in COCA further prevents over-fitting of the algorithm when the noise level is relatively high within the data.</p>
<p>Although the exact values of the participation matrix (<b>A</b>) need to be estimated from expression observations in a given biological condition, some prior information is available, such as the predefined memberships of certain pathways. This knowledge can come from knowledge databases such as KEGG, GO and TRANSPATH, or from knockout (or knockdown) biological experiments. The merit of utilizing such prior knowledge is that it gives the study a clear biological context and a better basis for interpreting the results of data analysis. The weakness is that these external knowledge sources may be too generic to describe the particular biological situations that we encounter. This is, in fact, our motivation for proposing the COCA approach, which utilizes prior knowledge but also re-evaluates that knowledge later through participation matrix estimation.</p>
<p>It is worth pointing out that COCA differs from gene grouping methods that use knowledge to cluster knowledge-related genes together. Three key points differentiate COCA from such methods. Firstly, COCA uses knowledge genes to guide the estimation of the coordinative direction, and this estimation reflects the consistency between the knowledge and the data under a certain biological condition. Secondly, while gene grouping methods tend to stick to the originally given knowledge genes, COCA ranks the genes according to the estimated coordinative direction and hence in a condition-specific manner. Finally, gene grouping methods mainly pay attention to pattern similarity as calculated directly from the gene expression data (<b>X</b>), whereas COCA ranks the genes according to their underlying participation matrix (<b>A</b>).</p>
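<p>The distinction between ranking on <b>X</b> and ranking on <b>A</b> can be sketched as follows. This is a hedged illustration that assumes a single component profile s over the samples has already been estimated and that each gene's participation strength is taken as its least-squares coefficient on that profile. The rank-one model, the symbol s, and the function names participation_strengths and rank_genes are assumptions made here and are not taken from the method's implementation.</p>

```python
import numpy as np


def participation_strengths(X, s):
    """Least-squares coefficients a that minimize ||X - a s^T|| for fixed s.

    X is genes x samples; s is a component profile over the samples.
    Each entry of a is an (assumed) participation strength of one gene.
    """
    X = np.asarray(X, dtype=float)
    s = np.asarray(s, dtype=float)
    return X @ s / (s @ s)


def rank_genes(X, s):
    """Gene indices ordered from strongest to weakest absolute participation."""
    a = participation_strengths(X, s)
    order = np.argsort(-np.abs(a))
    return order, a
```

<p>Under these assumptions, a gene is ranked highly when its expression profile loads strongly on the estimated component, regardless of whether it belonged to the original knowledge gene set.</p>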
<p>Unlike traditional gene ranking schemes, which focus mainly on the statistical characteristics of the data alone, COCA was proposed to rank genes according to both the data and the available biological knowledge. However, if relevant biological knowledge is not available, traditional methods still play a major role in prioritizing genes for biological studies. For studies in which some confirmed knowledge is already available, COCA may serve as a more specific tool for gene ranking, providing an alternative angle from which to analyze the data.</p>
</sec>
<sec>
<st>
<p>Conclusion</p>
</st>
<p>In this paper, we have proposed a knowledge-guided method called coordinative component analysis (COCA) for reliable mechanistic gene ranking. The method uses the genes from partial biological knowledge to find coordinative components representing the underlying biological processes or pathways; microarray gene expression data are then projected onto the coordinative components to estimate the participation strengths of genes, and these strengths are used to rank the genes. COCA is mathematically formulated as an optimization problem that maximizes the coordinative contribution of member genes to a pathway or network. A bootstrapping procedure has been further developed to overcome the over-fitting problem and to provide COCA with a confidence measure for each estimated coordinative component. The proposed COCA approach has been tested on several simulated data sets and on real microarray data, showing improved performance in gene ranking compared to traditional statistical methods such as fold-change, SAM <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> and EDGE <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>. The application of the method to stem cell data has revealed several transcription factors and oncogenes associated with system development and with signaling pathways that are potentially related to cancers. In the future, we will validate the findings through biological experiments to establish their functional roles in the embryonic development of stem cells. Furthermore, we plan to test the proposed method more fully on multiple related data sets to show that COCA can provide improved ranking results with small variability across data sets and high relevance to biological pathways.</p>
</sec>
<sec>
<st>
<p>Authors' contributions</p>
</st>
<p>CW and JX formulated the problem and developed the theoretical framework of the algorithm. CW carried out the development and implementation of the algorithm. HL and MZ directed the application of the algorithm to the stem cell data set. YW, EPH and RC provided technical and biological support to the project. All authors participated in the writing of the manuscript, and have read and approved the manuscript.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>This research was supported in part by NIH Grants (NS29525-13A, EB000830, CA109872, CA096483, CA129080 and CA139246) and DoD/CDMRP Grant (BC030280). HL and MZ were supported by IRP/NIA/NIH.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>Significance analysis of microarrays applied to the ionizing radiation response</p></title><aug><au><snm>Tusher</snm><fnm>VG</fnm></au><au><snm>Tibshirani</snm><fnm>R</fnm></au><au><snm>Chu</snm><fnm>G</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2001</pubdate><volume>98</volume><issue>9</issue><fpage>5116</fpage><lpage>5121</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.091062498</pubid><pubid idtype="pmcid">33173</pubid><pubid idtype="pmpid">11309499</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Gene ranking using bootstrapped P-values</p></title><aug><au><snm>Mukherjee</snm><fnm>SN</fnm></au><au><snm>Roberts</snm><fnm>SJ</fnm></au><au><snm>Sykacek</snm><fnm>P</fnm></au><au><snm>Gurr</snm><fnm>SJ</fnm></au></aug><source>SIGKDD Explor Newsl</source><pubdate>2003</pubdate><volume>5</volume><issue>2</issue><fpage>16</fpage><lpage>22</lpage><xrefbib><pubid idtype="doi">10.1145/980972.980976</pubid></xrefbib></bibl><bibl id="B3"><title><p>The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements</p></title><aug><au><snm>Shi</snm><fnm>L</fnm></au><au><snm>Reid</snm><fnm>LH</fnm></au><au><snm>Jones</snm><fnm>WD</fnm></au><au><snm>Shippy</snm><fnm>R</fnm></au><au><snm>Warrington</snm><fnm>JA</fnm></au><au><snm>Baker</snm><fnm>SC</fnm></au><au><snm>Collins</snm><fnm>PJ</fnm></au><au><snm>de Longueville</snm><fnm>F</fnm></au><au><snm>Kawasaki</snm><fnm>ES</fnm></au><au><snm>Lee</snm><fnm>KY</fnm></au><etal/></aug><source>Nat Biotechnol</source><pubdate>2006</pubdate><volume>24</volume><issue>9</issue><fpage>1151</fpage><lpage>1161</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nbt1239</pubid><pubid idtype="pmpid" link="fulltext">16964229</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) 
data</p></title><aug><au><snm>Chen</snm><fnm>JJ</fnm></au><au><snm>Hsueh</snm><fnm>HM</fnm></au><au><snm>Delongchamp</snm><fnm>RR</fnm></au><au><snm>Lin</snm><fnm>CJ</fnm></au><au><snm>Tsai</snm><fnm>CA</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2007</pubdate><volume>8</volume><fpage>412</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-8-412</pubid><pubid idtype="pmcid">2204045</pubid><pubid idtype="pmpid">17961233</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach</p></title><aug><au><snm>Opgen-Rhein</snm><fnm>R</fnm></au><au><snm>Strimmer</snm><fnm>K</fnm></au></aug><source>Stat Appl Genet Mol Biol</source><pubdate>2007</pubdate><volume>6</volume><note>Article9</note><xrefbib><pubid idtype="pmpid" link="fulltext">17402924</pubid></xrefbib></bibl><bibl id="B6"><title><p>Significance analysis of time course microarray experiments</p></title><aug><au><snm>Storey</snm><fnm>JD</fnm></au><au><snm>Xiao</snm><fnm>W</fnm></au><au><snm>Leek</snm><fnm>JT</fnm></au><au><snm>Tompkins</snm><fnm>RG</fnm></au><au><snm>Davis</snm><fnm>RW</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2005</pubdate><volume>102</volume><issue>36</issue><fpage>12837</fpage><lpage>12842</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0504609102</pubid><pubid idtype="pmcid">1201697</pubid><pubid idtype="pmpid">16141318</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Entropy-based gene ranking without selection bias for the predictive classification of microarray data</p></title><aug><au><snm>Furlanello</snm><fnm>C</fnm></au><au><snm>Serafini</snm><fnm>M</fnm></au><au><snm>Merler</snm><fnm>S</fnm></au><au><snm>Jurman</snm><fnm>G</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2003</pubdate><volume>4</volume><fpage>54</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-4-54</pubid><pubid 
idtype="pmcid">293475</pubid><pubid idtype="pmpid">14604446</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>GeneRank: using search engine technology for the analysis of microarray experiments</p></title><aug><au><snm>Morrison</snm><fnm>JL</fnm></au><au><snm>Breitling</snm><fnm>R</fnm></au><au><snm>Higham</snm><fnm>DJ</fnm></au><au><snm>Gilbert</snm><fnm>DR</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2005</pubdate><volume>6</volume><fpage>233</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-6-233</pubid><pubid idtype="pmcid">1261158</pubid><pubid idtype="pmpid">16176585</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>CGI: a new approach for prioritizing genes by combining gene expression and protein-protein interaction data</p></title><aug><au><snm>Ma</snm><fnm>X</fnm></au><au><snm>Lee</snm><fnm>H</fnm></au><au><snm>Wang</snm><fnm>L</fnm></au><au><snm>Sun</snm><fnm>F</fnm></au></aug><source>Bioinformatics</source><pubdate>2007</pubdate><volume>23</volume><issue>2</issue><fpage>215</fpage><lpage>221</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl569</pubid><pubid idtype="pmpid" link="fulltext">17098772</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>An Introduction to the Bootstrap</p></title><aug><au><snm>Bradley Efron</snm><fnm>RJT</fnm></au></aug><publisher>New York, Chapman &amp; Hall/CRC</publisher><pubdate>1994</pubdate></bibl><bibl id="B11"><title><p>A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification</p></title><aug><au><snm>Jiang</snm><fnm>W</fnm></au><au><snm>Simon</snm><fnm>R</fnm></au></aug><source>Stat Med</source><pubdate>2007</pubdate><volume>26</volume><issue>29</issue><fpage>5320</fpage><lpage>5334</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/sim.2968</pubid><pubid idtype="pmpid" link="fulltext">17624926</pubid></pubidlist></xrefbib></bibl><bibl 
id="B12"><title><p>Linear models for microarray data analysis: hidden similarities and differences</p></title><aug><au><snm>Kerr</snm><fnm>MK</fnm></au></aug><source>J Comput Biol</source><pubdate>2003</pubdate><volume>10</volume><issue>6</issue><fpage>891</fpage><lpage>901</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/106652703322756131</pubid><pubid idtype="pmpid" link="fulltext">14980016</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>Pathway level analysis of gene expression using singular value decomposition</p></title><aug><au><snm>Tomfohr</snm><fnm>J</fnm></au><au><snm>Lu</snm><fnm>J</fnm></au><au><snm>Kepler</snm><fnm>TB</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2005</pubdate><volume>6</volume><fpage>225</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-6-225</pubid><pubid idtype="pmcid">1261155</pubid><pubid idtype="pmpid">16156896</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Nonnegative matrix factorization: an analytical and interpretive tool in computational biology</p></title><aug><au><snm>Devarajan</snm><fnm>K</fnm></au></aug><source>PLoS Comput Biol</source><pubdate>2008</pubdate><volume>4</volume><issue>7</issue><fpage>e1000029</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pcbi.1000029</pubid><pubid idtype="pmcid">2447881</pubid><pubid idtype="pmpid">18654623</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>bioNMF: a versatile tool for non-negative matrix factorization in biology</p></title><aug><au><snm>Pascual-Montano</snm><fnm>A</fnm></au><au><snm>Carmona-Saez</snm><fnm>P</fnm></au><au><snm>Chagoyen</snm><fnm>M</fnm></au><au><snm>Tirado</snm><fnm>F</fnm></au><au><snm>Carazo</snm><fnm>JM</fnm></au><au><snm>Pascual-Marqui</snm><fnm>RD</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2006</pubdate><volume>7</volume><fpage>366</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-7-366</pubid><pubid 
idtype="pmcid">1550731</pubid><pubid idtype="pmpid">16875499</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Elucidating the altered transcriptional programs in breast cancer using independent component analysis</p></title><aug><au><snm>Teschendorff</snm><fnm>AE</fnm></au><au><snm>Journee</snm><fnm>M</fnm></au><au><snm>Absil</snm><fnm>PA</fnm></au><au><snm>Sepulchre</snm><fnm>R</fnm></au><au><snm>Caldas</snm><fnm>C</fnm></au></aug><source>PLoS Comput Biol</source><pubdate>2007</pubdate><volume>3</volume><issue>8</issue><fpage>e161</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pcbi.0030161</pubid><pubid idtype="pmcid">1950343,1950343</pubid><pubid idtype="pmpid">17708679</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>Network component analysis: reconstruction of regulatory signals in biological systems</p></title><aug><au><snm>Liao</snm><fnm>JC</fnm></au><au><snm>Boscolo</snm><fnm>R</fnm></au><au><snm>Yang</snm><fnm>YL</fnm></au><au><snm>Tran</snm><fnm>LM</fnm></au><au><snm>Sabatti</snm><fnm>C</fnm></au><au><snm>Roychowdhury</snm><fnm>VP</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2003</pubdate><volume>100</volume><issue>26</issue><fpage>15522</fpage><lpage>15527</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.2136632100</pubid><pubid idtype="pmcid">307600</pubid><pubid idtype="pmpid">14673099</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>Transcriptome network component analysis with limited microarray data</p></title><aug><au><snm>Galbraith</snm><fnm>SJ</fnm></au><au><snm>Tran</snm><fnm>LM</fnm></au><au><snm>Liao</snm><fnm>JC</fnm></au></aug><source>Bioinformatics</source><pubdate>2006</pubdate><volume>22</volume><issue>15</issue><fpage>1886</fpage><lpage>1894</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl279</pubid><pubid idtype="pmpid" link="fulltext">16766556</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>Unraveling transcriptional 
regulatory programs by integrative analysis of microarray and transcription factor binding data</p></title><aug><au><snm>Li</snm><fnm>H</fnm></au><au><snm>Zhan</snm><fnm>M</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><issue>17</issue><fpage>1874</fpage><lpage>1880</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn332</pubid><pubid idtype="pmcid">2519161</pubid><pubid idtype="pmpid">18586698</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>Motif-directed network component analysis for regulatory network inference</p></title><aug><au><snm>Wang</snm><fnm>C</fnm></au><au><snm>Xuan</snm><fnm>J</fnm></au><au><snm>Chen</snm><fnm>L</fnm></au><au><snm>Zhao</snm><fnm>P</fnm></au><au><snm>Wang</snm><fnm>Y</fnm></au><au><snm>Clarke</snm><fnm>R</fnm></au><au><snm>Hoffman</snm><fnm>E</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2008</pubdate><volume>9</volume><issue>Suppl (S1)</issue><fpage>S21</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-9-S1-S21</pubid><pubid idtype="pmcid">2259422</pubid><pubid idtype="pmpid">18315853</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>Application of independent component analysis to microarrays</p></title><aug><au><snm>Lee</snm><fnm>SI</fnm></au><au><snm>Batzoglou</snm><fnm>S</fnm></au></aug><source>Genome Biol</source><pubdate>2003</pubdate><volume>4</volume><issue>11</issue><fpage>R76</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2003-4-11-r76</pubid><pubid idtype="pmcid">329130</pubid><pubid idtype="pmpid">14611662</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Extensive low-affinity transcriptional interactions in the yeast genome</p></title><aug><au><snm>Tanay</snm><fnm>A</fnm></au></aug><source>Genome Res</source><pubdate>2006</pubdate><volume>16</volume><issue>8</issue><fpage>962</fpage><lpage>972</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.5113606</pubid><pubid 
idtype="pmcid">1524868</pubid><pubid idtype="pmpid">16809671</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>Simultaneous perturbation stochastic approximation of nonsmooth functions</p></title><aug><au><snm>Bartkute</snm><fnm>V</fnm></au><au><snm>Sakalauskas</snm><fnm>L</fnm></au></aug><source>European Journal of Operational Research</source><pubdate>2007</pubdate><volume>181</volume><issue>3</issue><fpage>1174</fpage><lpage>1188</lpage><xrefbib><pubid idtype="doi">10.1016/j.ejor.2005.09.052</pubid></xrefbib></bibl><bibl id="B24"><title><p>Bagging predictors</p></title><aug><au><snm>Breiman</snm><fnm>L</fnm></au></aug><source>Machine Learning; 1996</source><pubdate>1996</pubdate><fpage>123</fpage><lpage>140</lpage></bibl><bibl id="B25"><title><p>BagBoosting for tumor classification with gene expression data</p></title><aug><au><snm>Dettling</snm><fnm>M</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><issue>18</issue><fpage>3583</fpage><lpage>3593</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bth447</pubid><pubid idtype="pmpid" link="fulltext">15466910</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><title><p>Bagging to improve the accuracy of a clustering procedure</p></title><aug><au><snm>Dudoit</snm><fnm>S</fnm></au><au><snm>Fridlyand</snm><fnm>J</fnm></au></aug><source>Bioinformatics</source><pubdate>2003</pubdate><volume>19</volume><issue>9</issue><fpage>1090</fpage><lpage>1099</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btg038</pubid><pubid idtype="pmpid" link="fulltext">12801869</pubid></pubidlist></xrefbib></bibl><bibl id="B27"><title><p>Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray 
hybridization</p></title><aug><au><snm>Spellman</snm><fnm>PT</fnm></au><au><snm>Sherlock</snm><fnm>G</fnm></au><au><snm>Zhang</snm><fnm>MQ</fnm></au><au><snm>Iyer</snm><fnm>VR</fnm></au><au><snm>Anders</snm><fnm>K</fnm></au><au><snm>Eisen</snm><fnm>MB</fnm></au><au><snm>Brown</snm><fnm>PO</fnm></au><au><snm>Botstein</snm><fnm>D</fnm></au><au><snm>Futcher</snm><fnm>B</fnm></au></aug><source>Mol Biol Cell</source><pubdate>1998</pubdate><volume>9</volume><issue>12</issue><fpage>3273</fpage><lpage>3297</lpage><xrefbib><pubidlist><pubid idtype="pmcid">25624</pubid><pubid idtype="pmpid">9843569</pubid></pubidlist></xrefbib></bibl><bibl id="B28"><title><p>Therapeutic potential of embryonic stem cells</p></title><aug><au><snm>Lerou</snm><fnm>PH</fnm></au><au><snm>Daley</snm><fnm>GQ</fnm></au></aug><source>Blood Rev</source><pubdate>2005</pubdate><volume>19</volume><issue>6</issue><fpage>321</fpage><lpage>331</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.blre.2005.01.005</pubid><pubid idtype="pmpid" link="fulltext">16275420</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>The therapeutic potential of embryonic stem cells: A focus on stem cell stability</p></title><aug><au><snm>Zeng</snm><fnm>X</fnm></au><au><snm>Rao</snm><fnm>MS</fnm></au></aug><source>Curr Opinion Mol Therap</source><pubdate>2006</pubdate><volume>8</volume><issue>4</issue><fpage>338</fpage><lpage>344</lpage></bibl><bibl id="B30"><title><p>Molecular signature of human embryonic stem cells and its comparison with the mouse</p></title><aug><au><snm>Sato</snm><fnm>N</fnm></au><au><snm>Sanjuan</snm><fnm>IM</fnm></au><au><snm>Heke</snm><fnm>M</fnm></au><au><snm>Uchida</snm><fnm>M</fnm></au><au><snm>Naef</snm><fnm>F</fnm></au><au><snm>Brivanlou</snm><fnm>AH</fnm></au></aug><source>Dev Biol</source><pubdate>2003</pubdate><volume>260</volume><issue>2</issue><fpage>404</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0012-1606(03)00256-2</pubid><pubid idtype="pmpid" 
link="fulltext">12921741</pubid></pubidlist></xrefbib></bibl><bibl id="B31"><title><p>Monitoring early differentiation events in human embryonic stem cells by massively parallel signature sequencing and expressed sequence tag scan</p></title><aug><au><snm>Miura</snm><fnm>T</fnm></au><au><snm>Luo</snm><fnm>Y</fnm></au><au><snm>Khrebtukova</snm><fnm>I</fnm></au><au><snm>Brandenberger</snm><fnm>R</fnm></au><au><snm>Zhou</snm><fnm>D</fnm></au><au><snm>Thies</snm><fnm>RS</fnm></au><au><snm>Vasicek</snm><fnm>T</fnm></au><au><snm>Young</snm><fnm>H</fnm></au><au><snm>Lebkowski</snm><fnm>J</fnm></au><au><snm>Carpenter</snm><fnm>MK</fnm></au><etal/></aug><source>Stem Cells Dev</source><pubdate>2004</pubdate><volume>13</volume><issue>6</issue><fpage>694</fpage><lpage>715</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/scd.2004.13.694</pubid><pubid idtype="pmpid" link="fulltext">15684837</pubid></pubidlist></xrefbib></bibl><bibl id="B32"><title><p>Genomic studies to explore self-renewal and differentiation properties of embryonic stem cells</p></title><aug><au><snm>Zhan</snm><fnm>M</fnm></au></aug><source>Front Biosci</source><pubdate>2008</pubdate><volume>13</volume><fpage>276</fpage><lpage>283</lpage><xrefbib><pubidlist><pubid idtype="doi">10.2741/2678</pubid><pubid idtype="pmpid" link="fulltext">17981546</pubid></pubidlist></xrefbib></bibl><bibl id="B33"><title><p>Gene function in early mouse embryonic stem cell differentiation</p></title><aug><au><snm>Hailesellasse Sene</snm><fnm>K</fnm></au><au><snm>Porter</snm><fnm>CJ</fnm></au><au><snm>Palidwor</snm><fnm>G</fnm></au><au><snm>Perez-Iratxeta</snm><fnm>C</fnm></au><au><snm>Muro</snm><fnm>EM</fnm></au><au><snm>Campbell</snm><fnm>PA</fnm></au><au><snm>Rudnicki</snm><fnm>MA</fnm></au><au><snm>Andrade-Navarro</snm><fnm>MA</fnm></au></aug><source>BMC Genomics</source><pubdate>2007</pubdate><volume>8</volume><fpage>85</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2164-8-85</pubid><pubid 
idtype="pmcid">1851713</pubid><pubid idtype="pmpid">17394647</pubid></pubidlist></xrefbib></bibl><bibl id="B34"><title><p>Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources</p></title><aug><au><snm>Huang da</snm><fnm>W</fnm></au><au><snm>Sherman</snm><fnm>BT</fnm></au><au><snm>Lempicki</snm><fnm>RA</fnm></au></aug><source>Nat Protoc</source><pubdate>2009</pubdate><volume>4</volume><issue>1</issue><fpage>44</fpage><lpage>57</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nprot.2008.211</pubid><pubid idtype="pmpid" link="fulltext">19131956</pubid></pubidlist></xrefbib></bibl><bibl id="B35"><title><p>The Notch ligand JAG1 is required for sensory progenitor development in the mammalian inner ear</p></title><aug><au><snm>Kiernan</snm><fnm>AE</fnm></au><au><snm>Xu</snm><fnm>J</fnm></au><au><snm>Gridley</snm><fnm>T</fnm></au></aug><source>PLoS Genet</source><pubdate>2006</pubdate><volume>2</volume><issue>1</issue><fpage>e4</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.0020004</pubid><pubid idtype="pmcid">1326221,1326221</pubid><pubid idtype="pmpid">16410827</pubid></pubidlist></xrefbib></bibl><bibl id="B36"><title><p>Role of Sox2 in the development of the mouse neocortex</p></title><aug><au><snm>Bani-Yaghoub</snm><fnm>M</fnm></au><au><snm>Tremblay</snm><fnm>RG</fnm></au><au><snm>Lei</snm><fnm>JX</fnm></au><au><snm>Zhang</snm><fnm>D</fnm></au><au><snm>Zurakowski</snm><fnm>B</fnm></au><au><snm>Sandhu</snm><fnm>JK</fnm></au><au><snm>Smith</snm><fnm>B</fnm></au><au><snm>Ribecco-Lutkiewicz</snm><fnm>M</fnm></au><au><snm>Kennedy</snm><fnm>J</fnm></au><au><snm>Walker</snm><fnm>PR</fnm></au><etal/></aug><source>Dev Biol</source><pubdate>2006</pubdate><volume>295</volume><issue>1</issue><fpage>52</fpage><lpage>66</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ydbio.2006.03.007</pubid><pubid idtype="pmpid" link="fulltext">16631155</pubid></pubidlist></xrefbib></bibl><bibl id="B37"><title><p>Gene set 
enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles</p></title><aug><au><snm>Subramanian</snm><fnm>A</fnm></au><au><snm>Tamayo</snm><fnm>P</fnm></au><au><snm>Mootha</snm><fnm>VK</fnm></au><au><snm>Mukherjee</snm><fnm>S</fnm></au><au><snm>Ebert</snm><fnm>BL</fnm></au><au><snm>Gillette</snm><fnm>MA</fnm></au><au><snm>Paulovich</snm><fnm>A</fnm></au><au><snm>Pomeroy</snm><fnm>SL</fnm></au><au><snm>Golub</snm><fnm>TR</fnm></au><au><snm>Lander</snm><fnm>ES</fnm></au><etal/></aug><source>Proc Natl Acad Sci USA</source><pubdate>2005</pubdate><volume>102</volume><issue>43</issue><fpage>15545</fpage><lpage>15550</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0506580102</pubid><pubid idtype="pmcid">1239896,1239896</pubid><pubid idtype="pmpid">16199517</pubid></pubidlist></xrefbib></bibl><bibl id="B38"><title><p>GeneTrail--advanced gene set enrichment analysis</p></title><aug><au><snm>Backes</snm><fnm>C</fnm></au><au><snm>Keller</snm><fnm>A</fnm></au><au><snm>Kuentzer</snm><fnm>J</fnm></au><au><snm>Kneissl</snm><fnm>B</fnm></au><au><snm>Comtesse</snm><fnm>N</fnm></au><au><snm>Elnakady</snm><fnm>YA</fnm></au><au><snm>Muller</snm><fnm>R</fnm></au><au><snm>Meese</snm><fnm>E</fnm></au><au><snm>Lenhof</snm><fnm>HP</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2007</pubdate><issue>35 Web Server</issue><fpage>W186</fpage><lpage>192</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkm323</pubid><pubid idtype="pmcid">1933132</pubid><pubid idtype="pmpid">17526521</pubid></pubidlist></xrefbib></bibl><bibl id="B39"><title><p>Cross-species transcriptional profiles establish a functional portrait of embryonic stem 
cells</p></title><aug><au><snm>Sun</snm><fnm>Y</fnm></au><au><snm>Li</snm><fnm>H</fnm></au><au><snm>Liu</snm><fnm>Y</fnm></au><au><snm>Shin</snm><fnm>S</fnm></au><au><snm>Mattson</snm><fnm>MP</fnm></au><au><snm>Rao</snm><fnm>MS</fnm></au><au><snm>Zhan</snm><fnm>M</fnm></au></aug><source>Genomics</source><pubdate>2007</pubdate><volume>89</volume><issue>1</issue><fpage>22</fpage><lpage>35</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ygeno.2006.09.010</pubid><pubid idtype="pmcid">2658876</pubid><pubid idtype="pmpid">17055697</pubid></pubidlist></xrefbib></bibl><bibl id="B40"><title><p>Characterization of the mouse Tdgf1 gene and Tdgf pseudogenes</p></title><aug><au><snm>Liguori</snm><fnm>G</fnm></au><au><snm>Tucci</snm><fnm>M</fnm></au><au><snm>Montuori</snm><fnm>N</fnm></au><au><snm>Dono</snm><fnm>R</fnm></au><au><snm>Lago</snm><fnm>CT</fnm></au><au><snm>Pacifico</snm><fnm>F</fnm></au><au><snm>Armenante</snm><fnm>F</fnm></au><au><snm>Persico</snm><fnm>MG</fnm></au></aug><source>Mamm Genome</source><pubdate>1996</pubdate><volume>7</volume><issue>5</issue><fpage>344</fpage><lpage>348</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1007/s003359900100</pubid><pubid idtype="pmpid" link="fulltext">8661720</pubid></pubidlist></xrefbib></bibl><bibl id="B41"><title><p>Cripto: a tumor growth factor and more</p></title><aug><au><snm>Adamson</snm><fnm>ED</fnm></au><au><snm>Minchiotti</snm><fnm>G</fnm></au><au><snm>Salomon</snm><fnm>DS</fnm></au></aug><source>J Cell Physiol</source><pubdate>2002</pubdate><volume>190</volume><issue>3</issue><fpage>267</fpage><lpage>278</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/jcp.10072</pubid><pubid idtype="pmpid" link="fulltext">11857442</pubid></pubidlist></xrefbib></bibl><bibl id="B42"><title><p>Molecular analysis of LEFTY-expressing cells in early human embryoid 
bodies</p></title><aug><au><snm>Dvash</snm><fnm>T</fnm></au><au><snm>Sharon</snm><fnm>N</fnm></au><au><snm>Yanuka</snm><fnm>O</fnm></au><au><snm>Benvenisty</snm><fnm>N</fnm></au></aug><source>Stem Cells</source><pubdate>2007</pubdate><volume>25</volume><issue>2</issue><fpage>465</fpage><lpage>472</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1634/stemcells.2006-0179</pubid><pubid idtype="pmpid" link="fulltext">17038673</pubid></pubidlist></xrefbib></bibl><bibl id="B43"><title><p>Least squares linear discriminant analysis</p></title><aug><au><snm>Jieping</snm><fnm>Y</fnm></au></aug><source>Proceedings of the 24th international conference on Machine learning</source><publisher>Corvalis, Oregon: ACM</publisher><pubdate>2007</pubdate></bibl><bibl id="B44"><title><p>Gene module identification from microarray data using nonnegative independent component analysis</p></title><aug><au><snm>Gong</snm><fnm>T</fnm></au><au><snm>Xuan</snm><fnm>J</fnm></au><au><snm>Wang</snm><fnm>C</fnm></au><au><snm>Li</snm><fnm>H</fnm></au><au><snm>Hoffman</snm><fnm>E</fnm></au><au><snm>Clarke</snm><fnm>R</fnm></au><au><snm>Wang</snm><fnm>Y</fnm></au></aug><source>Gene Regulation and Systems Biology</source><pubdate>2007</pubdate><volume>1</volume><fpage>349</fpage><lpage>363</lpage></bibl></refgrp>
</bm></art>