Department of Electrical and Computer Engineering, Virginia Polytechnic and State University, Arlington, VA 22203, USA

Bioinformatics Unit, RRB, National Institute on Aging, NIH, Baltimore, MD 21224, USA

Department of Electrical Engineering, The Pennsylvania State University, University Park, PA 16802, USA

Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC 20010, USA

Department of Oncology, Physiology & Biophysics and Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20007, USA

Abstract

Background

Most existing clustering methods used in genomic data analysis suffer from several limitations: heuristic or random algorithm initialization, the potential of converging to poor local optima, the lack of cluster number detection, an inability to incorporate prior/expert knowledge, black-box and non-adaptive designs, the curse of dimensionality, and the discernment of uninformative, uninteresting cluster structure associated with confounding variables.

Results

In an effort to partially address these limitations, we develop the VIsual Statistical Data Analyzer (VISDA) for cluster modeling, visualization, and discovery in genomic data. VISDA performs progressive, coarse-to-fine (divisive) hierarchical clustering and visualization, supported by hierarchical mixture modeling, supervised/unsupervised informative gene selection, supervised/unsupervised data visualization, and user/prior knowledge guidance, to discover hidden clusters within complex, high-dimensional genomic data. The hierarchical visualization and clustering scheme of VISDA uses multiple local visualization subspaces (one at each node of the hierarchy) and consequent subspace data modeling to reveal both global and local cluster structures in a "divide and conquer" scenario. Multiple projection methods, each sensitive to a distinct type of clustering tendency, are used for data visualization, which increases the likelihood that cluster structures of interest are revealed. Initialization of the full dimensional model is based on first learning models with user/prior knowledge guidance on data projected into the low-dimensional visualization spaces. Model order selection for the high dimensional data is accomplished by Bayesian theoretic criteria and user justification applied via the hierarchy of low-dimensional visualization subspaces. Based on its complementary building blocks and flexible functionality, VISDA is generally applicable for gene clustering, sample clustering, and phenotype clustering (wherein phenotype labels for samples are known), albeit with minor algorithm modifications customized to each of these tasks.

Conclusion

VISDA achieved robust and superior clustering accuracy, compared with several benchmark clustering schemes. The model order selection scheme in VISDA was shown to be effective for high dimensional genomic data clustering. On muscular dystrophy data and muscle regeneration data, VISDA identified biologically relevant co-expressed gene clusters. VISDA also captured the pathological relationships among different phenotypes revealed at the molecular level, through phenotype clustering on muscular dystrophy data and multi-category cancer data.

Background

Due to limited existing biological knowledge at the molecular level, clustering has become a popular and effective method to extract information from genomic data. Genomic data clustering may help to discover novel functional gene groups, gene regulation networks, phenotypes/sub-phenotypes, and developmental/morphological relationships among phenotypes.

While there is a rich variety of existing methods, when applied to genomic data most of them unfortunately suffer from several major limitations, which we summarize as follows. (1) Clustering methods such as KMC and mixture model fitting are sensitive to the quality of model initialization and may converge to poor local optima of the objective function, yielding inaccurate clustering outcomes, especially when applied to genomic datasets with high dimensionality and small sample size.

**caBIG™ VISDA: modeling, visualization, and discovery for cluster analysis of genomic data (supplement).** The supplement includes derivations and details of the algorithm, more discussions, and introduction of the datasets used in the experiments.


To address some of the existing methods' limitations outlined above and to design a comprehensive and flexible clustering tool effectively applicable to cluster modeling, visualization, and discovery on genomic data, we developed a hierarchical data exploration and clustering approach, the VIsual Statistical Data Analyzer (VISDA). VISDA performs progressive, divisive hierarchical clustering and visualization, supported by hierarchical mixture modeling, supervised/unsupervised informative gene selection, supervised/unsupervised data projection, and user/prior knowledge guidance, to discover hidden clusters within complex, high-dimensional genomic data. The data exploration process in VISDA starts from the top level, where the whole dataset is viewed as a cluster, with clusters then hierarchically subdivided in successive levels until all salient structure in the data is revealed. Since a single 2-D data projection, even if it is nonlinear, may be insufficient for revealing all cluster structures in multimodal, high dimensional data, the hierarchical visualization and clustering scheme of VISDA uses multiple local projection subspaces (one at each node of the hierarchy) and consequent subspace data modeling to reveal both global and local cluster structures. Consistent with the "divide and conquer" principle, each local data projection and modeling step can be accomplished with a relatively simple method/model, while the complete hierarchy maintains overall flexibility and conveys considerable clustering information.

The inclusive VISDA framework readily incorporates the advantages of various complementary data clustering and visualization algorithms to visualize the obtained clusters, which not only gives a "transparent" clustering process that can enhance the user's understanding of the data structure, but also provides an interface to incorporate human intelligence (e.g. the user's discernment of sub-cluster separability and outliers) and domain knowledge to help improve clustering accuracy and avoid finding non-meaningful or confounding cluster structure. Specifically, interactive user participation guides the coarse-to-fine cluster discovery via (1) the selection of a local visualization, from a suite of data projections each sensitive to a distinct type of data structure, that best reveals a cluster's substructure; (2) user-directed parameter initialization for the new sub-clusters that divide existing clusters; and (3) user-guided model order selection, applied in conjunction with MDL, for deciding the number of sub-clusters in the local visualization space.

Based on its complementary building blocks and flexible functionality, VISDA is suitable for multiple genomic data clustering tasks, including gene clustering, sample clustering, and phenotype clustering (wherein phenotype labels for samples are known), albeit with customized modifications for each of these tasks. Specifically, VISDA's sample clustering requires dimensionality reduction via unsupervised informative gene selection, whereas the phenotype clustering algorithm exploits the knowledge of phenotype labels in performing supervised informative gene selection, supervised data visualization, and statistical modeling that preserves the unity of samples from the same phenotype, so that known phenotypes, i.e. groups of samples with the same phenotype label, are treated as the data objects to be clustered. An important goal of phenotype clustering is to discover a Tree Of Phenotypes (TOP), i.e. a hierarchical tree structure with all phenotypes as leaves of the tree, which may reflect important biological relationships among the phenotypes.

In this paper, we show that VISDA gives stable and improved clustering accuracy compared to several benchmark clustering methods, i.e. conventional agglomerative Hierarchical Clustering (HC)

Methods

In this section, we first introduce the main steps of the VISDA algorithm, which directly describe the complete VISDA processing for the task of gene clustering. Next, we extend the algorithm to sample clustering by adding unsupervised informative gene selection as a data pre-processing step. Finally, we extend the algorithm to phenotype clustering by incorporating a cluster visualization and decomposition scheme that explicitly utilizes the phenotype category information.

VISDA algorithm

Let **t **= {**t**_{1}, **t**_{2},..., **t**_{N}|**t**_{i }∈ R^{P}, i = 1,..., N} be the set of data points to be clustered in the original P-dimensional data space. Suppose that K_{l }clusters have already been detected at level l of the hierarchy; the posterior probability that data point **t**_{i }(with image **x**_{i }in the local visualization space of cluster k at level l) belongs to cluster k is denoted z_{i, k}.

VISDA's flowchart

**VISDA's flowchart.**

Visualization of cluster

For cluster

(1) Principal Component Analysis (PCA)

(2) Principal Component Analysis – Projection Pursuit Method (PCA-PPM)

(3) Locality Preserving Projection (LPP)

(4) HC-KMC-SFNM-DCA

(5) Affinity Propagation Clustering – Discriminatory Component Analysis (APC-DCA)

In each of the five projection methods, after the projection matrix **W**_{k }for cluster

where **x**_{i }is the image of data point **t**_{i}, **μ**_{t, k }is the mean of cluster k, and the subscript '**t**' indicates that these parameters model the data in the high-dimensional original data space. Each point is displayed with an intensity proportional to its posterior probability z_{i, k }(alternatively, a threshold can be set so that only points with posterior probabilities greater than the threshold are displayed). Available prior/domain information about the data is also provided to the user via an additional user interface. For gene clustering, prior information can be gene annotations, such as gene ID and functional category. For sample clustering, prior information can be array annotations, such as the experimental condition under which the array was generated.

Each of these five projection methods preserves different yet complementary data structure associated with a distinct type of sub-cluster tendency. PCA preserves the directions of largest variation in the data. PCA-PPM moderates PCA to consider projection directions on which the projected data have flat distributions or distributions with thick tails. LPP preserves the neighborhood structure of the data. HC-KMC-SFNM-DCA and APC-DCA directly target presenting discrimination among sub-clusters via different unsupervised partition approaches: HC-KMC-SFNM partitioning is model-based and allows the user to determine the sub-cluster number, while APC partitioning is nonparametric and automatically determines the sub-cluster number. Because each projection method makes its own distinct theoretical or experimental assumption about the data structure associated with sub-clusters, and whether the underlying sub-clusters of interest are consistent with these assumptions is data/application dependent, using all of the methods simultaneously increases the likelihood that sub-clusters of interest are revealed.
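As a concrete illustration of the simplest of these projections, the following sketch computes a 2-D PCA visualization of a cluster from the eigenvectors of its sample covariance matrix. All names here are illustrative and not taken from the VISDA implementation:

```python
import numpy as np

def pca_project_2d(T):
    """Project P-dimensional points (rows of T) onto their top-2 principal axes."""
    mu = T.mean(axis=0)
    Tc = T - mu                       # mean-center the cluster
    cov = np.cov(Tc, rowvar=False)    # P x P sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :2]          # top-2 eigenvectors as projection matrix
    return Tc @ W, W                  # 2-D images and the projection matrix

rng = np.random.default_rng(0)
T = rng.normal(size=(50, 10))         # toy "cluster" of 50 points in R^10
X, W = pca_project_2d(T)
print(X.shape)                        # (50, 2)
```

The columns of `W` are orthonormal, so the same matrix can later be used to lift parameters learned in the 2-D space back toward the original space.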

After inspecting all five projections, the user is asked to select the one projection that best reveals the sub-cluster structure as the final visualization. Human interaction in choosing the best projection (and hence substructure) provides an interface to incorporate human discernment and domain knowledge in cluster visualization, which gives the potential to avoid confounding, irrelevant, and uninteresting substructures. The selection of a suitable/good projection is data/application dependent. Several guidelines based on human discernment and prior knowledge are as follows: (1) Select a projection in which the sub-clusters are well-separated and show clear sub-cluster structure. (2) Select a projection in which no sub-clusters are simply composed of several outliers. (3) Select a projection that does not oppose prior knowledge, i.e. if the user is certain about the relationship between some genes/samples under the particular experimental condition that produced the data, he/she can choose a projection that favours this relationship. In addition, when the data size is relatively large, PCA and PCA-PPM may be preferred over HC-KMC-SFNM-DCA, LPP, and APC-DCA, because the latter three projection algorithms have much higher computational complexity. More details, discussion, and empirical understanding of these projections can be found in section 2.1.1 of Additional file

Decomposition of cluster

We use the two-level hierarchical SFNM model to represent the relationship between the

where _{k, l+1 }sub-clusters exist at level _{k }is the mixing proportion for cluster _{j|k }is the mixing proportion for sub-cluster **θ**_{t, (k, j) }are the associated parameters of sub-cluster

where **x **= {**x**_{1}, **x**_{2},..., **x**_{N}|**x**_{i }∈ R^{2}, **x**' indicates these parameters model data in the visualization space, and **θ**_{x, (k, j) }are the associated parameters of sub-cluster

where _{i, (k, j) }is the posterior probability of data point **x**_{i }belonging to the **μ**_{x, (k, j) }and **Σ**_{x, (k, j) }are the mean and covariance matrix of sub-cluster

To get an accurate and biologically meaningful initialization of the model parameters, which is a key factor for obtaining a good clustering result, VISDA utilizes human initialization of sub-cluster means in the visualization space. The user pinpoints on the visualization screen where he/she thinks the sub-cluster means should be, according to his/her discernment of the sub-cluster structure and domain knowledge. This initialization gives the potential to avoid learning uninteresting or biologically irrelevant substructures. For example, if a sub-cluster has several outliers, the user will most likely initialize the sub-cluster mean on the "main body" of the sub-cluster but not on the outliers.
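The user-guided initialization and subsequent EM refinement can be sketched as follows. This is a generic EM loop for a finite normal mixture in the 2-D visualization space, seeded at user-pinpointed means; it is not the actual VISDA code, and all function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm_2d(X, init_means, n_iter=100):
    """EM for a finite normal mixture of the projected points X (N x 2),
    with sub-cluster means initialized at user-pinpointed positions."""
    N, J = len(X), len(init_means)
    pi = np.full(J, 1.0 / J)                       # mixing proportions
    mu = np.asarray(init_means, dtype=float).copy()
    Sigma = [np.cov(X, rowvar=False) for _ in range(J)]
    for _ in range(n_iter):
        # E-step: posterior probability z[i, j] that point i is in sub-cluster j
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(J)])
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, means, and covariances
        nk = z.sum(axis=0)
        pi = nk / N
        mu = (z.T @ X) / nk[:, None]
        for j in range(J):
            d = X - mu[j]
            Sigma[j] = (z[:, j, None] * d).T @ d / nk[j] + 1e-6 * np.eye(2)
    dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                            for j in range(J)])
    loglik = np.log(dens.sum(axis=1)).sum()
    return pi, mu, Sigma, z, loglik
```

Because the means start on the "main body" of each sub-cluster as pinpointed by the user, rather than at random positions, the EM iterations are less likely to drift toward outlier-dominated local optima.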

Models with different numbers of sub-clusters are initialized by the user and trained by the EM algorithm. The obtained partitions of all the models are displayed to the user as a reference for model selection. The MDL criterion is also utilized as a theoretical validation for model selection

where _{a }and _{k }are the number of freely adjustable parameters and the effective number of data points in the cluster, respectively, and L(**x**|**θ**_{x, k}, _{k}, **z**_{k}) is given in Equation (3). This modified MDL formula not only mitigates the tendency to overestimate the sub-cluster number when the data size is small, but is also asymptotically consistent with the classical MDL formula introduced in section 2.1.2 of Additional file
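For illustration, a classical two-part description length of the kind discussed here can be computed as below. VISDA's modified MDL formula (given in the supplement) differs in detail, so this sketch only conveys the general trade-off between fit and model complexity; the names are illustrative:

```python
import numpy as np

def description_length(loglik, n_params, n_eff):
    """Classical two-part MDL score: negative log-likelihood plus a
    parameter-cost penalty of (n_params / 2) * log(n_eff). Smaller is better."""
    return -loglik + 0.5 * n_params * np.log(n_eff)

def gmm_free_params(J, d=2):
    """Freely adjustable parameters of a J-component Gaussian mixture in d
    dimensions: (J - 1) mixing proportions, J*d mean coordinates, and
    J*d*(d+1)/2 covariance entries."""
    return (J - 1) + J * d + J * d * (d + 1) // 2
```

Candidate models with different sub-cluster numbers J can then be compared by their description lengths, alongside the user's visual judgment of the displayed partitions.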

Initialization and training of the full dimensional model

Each sub-cluster in the chosen model corresponds to a new cluster at the

where **μ**_{t, (k, j) }and **Σ**_{t, (k, j) }are the mean and covariance matrix for the **W**_{k }is the projection matrix for cluster _{k }inside the second summation, giving a mixture with components indexed by (_{k}_{j|k}. Thus we can use the transformed parameters as initialization for the SFNM model in the original data space and then further train this model using the EM algorithm to refine the parameters. Formulas for the SFNM model and the corresponding EM algorithm are given in section 2.1.3 of Additional file
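The exact parameter transformation used by VISDA is given in the supplement; the following sketch shows one plausible construction, under the assumption that the projection was **x** = **W**^{T}(**t** − **μ**_{t,k}) with an orthonormal **W**, where directions orthogonal to the visualization plane are initialized from the parent cluster's covariance. All names are illustrative:

```python
import numpy as np

def lift_params(mu_x, Sigma_x, W, mu_t_parent, Sigma_t_parent):
    """Map a sub-cluster's 2-D mean/covariance back to the P-dimensional
    original space through the orthonormal projection matrix W (P x 2)."""
    P = W.shape[0]
    mu_t = mu_t_parent + W @ mu_x                  # lift the 2-D mean
    proj = W @ W.T                                 # projector onto the 2-D plane
    # out-of-plane covariance borrowed from the parent cluster's statistics
    resid = (np.eye(P) - proj) @ Sigma_t_parent @ (np.eye(P) - proj)
    Sigma_t = W @ Sigma_x @ W.T + resid            # in-plane + out-of-plane parts
    return mu_t, Sigma_t
```

The lifted parameters then serve only as an initialization; the EM algorithm in the original data space subsequently refines them, as described above.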

Algorithm extension for sample clustering

The main clustering and visualization algorithm introduced above is directly applicable to gene clustering, which is a "data-sufficient" case due to the large ratio of gene number to sample number. Sample clustering is usually a "data-insufficient" case that suffers from the "curse of dimensionality", because in sample clustering the number of data objects to be clustered is much smaller than the data dimensionality. Many of the genes are actually irrelevant with respect to the phenotypes/sub-phenotypes of interest

Two variation criteria, the variance and the absolute difference between the minimum and maximum gene expression values across all the samples, can be used to identify and then remove constantly expressed genes. For each criterion, a rank of all the genes is obtained, with genes of large variation ranked at the top.

Discrimination power analysis measures each gene's individual ability both to elicit and to discriminate clusters/components. These components are generated by fitting a 1-D SFNM model to the gene's expression values using the EM algorithm. To determine the component number, we followed the iterative procedure in

Based on the variation ranks and the discrimination power rank, a list of genes with large variations and large discrimination power can be obtained by taking the intersection of the top parts of the ranks. Further details about these three gene ranking schemes can be found in section 2.2 of Additional file
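The two variation criteria and the rank intersection can be sketched as follows; the discrimination power ranking based on 1-D SFNM fitting is omitted here, and `keep` is an illustrative cut-off parameter, not one from the VISDA implementation:

```python
import numpy as np

def top_variation_genes(E, keep=500):
    """E: (genes x samples) expression matrix. Rank genes by variance and by
    max-min range across samples, then keep only genes appearing in the top
    `keep` positions of BOTH rankings (intersection of the rank tops)."""
    var_rank = np.argsort(E.var(axis=1))[::-1]               # descending variance
    rng_rank = np.argsort(E.max(axis=1) - E.min(axis=1))[::-1]  # descending range
    return np.intersect1d(var_rank[:keep], rng_rank[:keep])  # surviving gene indices
```

Genes that are essentially constant across samples fall to the bottom of both rankings and are thus removed before sample clustering proceeds.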

Algorithm extension for phenotype clustering

As an extension of the main clustering and visualization algorithm, phenotype clustering follows a similar hierarchical, interactive exploration process, shown in Figure _{l }phenotype clusters, each of which contains all the samples from one or multiple phenotypes. For phenotype cluster _{l}), if it contains only one phenotype, we do not need to decompose it; if it contains two phenotypes, we simply split the cluster into two sub-clusters, each containing the samples of one phenotype; if it contains more than two phenotypes, we do the following to visualize and decompose it. Let _{k }and ^{(q) }denote the number of phenotypes in cluster

The flowchart including the algorithm extension for phenotype clustering

**The flowchart including the algorithm extension for phenotype clustering**. The green blocks with dashed borders indicate the algorithm extensions, i.e. the modified visualization scheme and decomposition scheme.

Visualization of cluster

We first use supervised discriminative gene selection to form a locally discriminative gene subspace respective to the phenotype categories in the cluster. The locally discriminative gene subspace contains the most discriminative genes, where the discrimination power of a gene is measured by

where _{q }is the sample proportion of phenotype _{q }and _{q }are the mean and standard deviation of the gene's expression values in phenotype _{k}_{g}, where _{g }is the number of selected genes per phenotype, an input parameter of the algorithm. We use Discriminatory Component Analysis (DCA) to project the samples from the gene subspace onto a 2-D visualization space. Because an important outcome of phenotype clustering is the set of relative relationships among phenotypes, estimated directly from the relative distances between samples of different phenotypes, DCA here maximizes a Fisher criterion that treats all phenotype pairs equally, so as to preserve the original, undistorted data structure. The Fisher criterion is calculated based on the known phenotype categories. Maximization of the Fisher criterion is achieved by eigenvalue decomposition, and the projection matrix is obtained by orthogonalizing the eigenvectors associated with the two largest eigenvalues
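A minimal sketch of such a Fisher-criterion projection is given below, assuming the standard within-/between-class scatter formulation in which all phenotype pairs are weighted equally through pooled scatter matrices; VISDA's exact DCA formulation may differ, and the names are illustrative:

```python
import numpy as np

def dca_projection(X, y):
    """2-D discriminatory projection maximizing a Fisher criterion.
    X: (n_samples x n_genes) data in the selected gene subspace;
    y: known phenotype labels."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1],) * 2)   # within-phenotype scatter
    Sb = np.zeros_like(Sw)             # between-phenotype scatter
    for c in classes:
        Xc = X[y == c]
        d = Xc - Xc.mean(axis=0)
        Sw += d.T @ d
        m = (Xc.mean(axis=0) - mu)[:, None]
        Sb += len(Xc) * (m @ m.T)
    # eigenvectors of Sw^{-1} Sb (lightly regularized for numerical stability)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(len(Sw)), Sb))
    order = np.argsort(vals.real)[::-1]
    # orthogonalize the two leading discriminatory directions
    W, _ = np.linalg.qr(vecs[:, order[:2]].real)
    return (X - mu) @ W, W
```

Because the projection is driven by the known phenotype labels rather than by overall variance, samples from different phenotypes are spread apart in the 2-D view while the relative distances between phenotype groups remain interpretable.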

Decomposition of cluster

Phenotype clustering differs from sample/gene clustering in that it assigns a cluster label to each phenotype in its entirety (all samples therefrom), not to each sample/gene. Based on this difference, we use a class-pooled finite normal mixture to model the projected samples in the visualization space. Let {**x**^{(1)},..., **x**^{(q)},..., **x**^{(q) }= {^{(q)}} is the set of samples from phenotype

where cluster _{k, l+1 }sub-clusters at level _{j }and **θ**_{x, j }are the mixing proportion and parameters associated with sub-cluster

Similar to sample/gene clustering, the user is asked to initialize the sub-cluster means by pinpointing them on the visualization screen according to his/her understanding about the data structure and domain knowledge. Models with different numbers of sub-clusters are initialized by the user, and trained by the EM algorithm. The resulting partitions are shown to the user for comparison. The MDL model selection criterion is also applied for theoretical validation. Details and formulas of MDL model selection can be found in section 2.3.2 of Additional file

After visualizing and decomposing the clusters at level

A demo application of VISDA on sample clustering

To show how VISDA discovers data structure, we consider the UM microarray gene expression cancer dataset as an example and perform sample clustering

An illustration of VISDA on sample clustering

**An illustration of VISDA on sample clustering**. (a) The five different projections obtained at the top level. Red circles are brain cancer; green triangles are colon cancer; blue squares are lung cancer; and brown diamonds are ovary cancer. (b) The user's initialization of cluster means (indicated by the numbers in the small circles) and the resulting clusters (indicated by the green dashed ellipses). The left, middle, and right figures are for the models of one cluster, two clusters, and three clusters, respectively. (c) The hierarchical data structure detected by VISDA. Sub-Cluster Number (CN) and corresponding Description Length (DL) are shown under the visualization.

Figure

Results

Evaluation of VISDA

In a comparative study of clustering algorithms

Comparison of clustering performance

| | VISDA | HC | KMC | SOM (MSC) | SOM (CLL) | SFNM Fitting |
| --- | --- | --- | --- | --- | --- | --- |
| Average mean of partition accuracy | **86.29%** | 58.89% | 76.47% | 76.52% | 79.39% | 64.47% |
| Average standard deviation of partition accuracy | 4.01% | 5.03% | 3.92% | **3.85%** | 4.73% | 5.07% |

The bolded font indicates the best performance with respect to each measure.

Because the clustering results of KMC, SOM and SFNM fitting methods may depend on initialization, for each cross-validation trial we ran them 100 times with random initialization and took the best clustering result according to the associated optimization criterion. For KMC, since its algorithm tries to minimize Mean Squared Compactness (MSC), which is the average square distance from each data point to its cluster center, we selected the result with the minimum MSC. For SOM, we separately tried minimizing MSC and maximizing Classification Log-Likelihood (CLL)
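The best-of-100-runs protocol for KMC can be sketched as follows, with a plain Lloyd's k-means and MSC computed as the average squared distance from each point to its assigned center. This is illustrative code, not the benchmark implementation used in the study:

```python
import numpy as np

def kmeans_once(X, k, rng):
    """One run of Lloyd's k-means from a random initialization."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(100):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(1)
    msc = d2[np.arange(len(X)), labels].mean()   # Mean Squared Compactness
    return labels, centers, msc

def best_of_runs(X, k, n_runs=100, seed=0):
    """Repeat KMC with random initializations; keep the minimum-MSC result."""
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, k, rng) for _ in range(n_runs)),
               key=lambda result: result[2])
```

Selecting the minimum-MSC run out of many random restarts mimics how the benchmark methods were given their best chance against VISDA's single user-guided run.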

VISDA obtained the correct cluster number across the cross-validation trials with an average frequency of 97% over all the datasets. This shows that the exploration based on the hierarchical SFNM model, MDL model selection in the locally discriminative visualization space, and human-data interaction for data visualization, model initialization, and model validation is an effective model selection solution for high dimensional data. VISDA outperformed all other methods in average mean of partition accuracy, indicating that its clustering results were the most accurate among the competing methods. From the average standard deviation of partition accuracy, we can see that VISDA is also a stable performer among the competing methods. Besides partition accuracy, we also evaluated the accuracy of the recovered parametric class distributions (the first- and second-order statistics), with similar results

Identification of gene clusters from muscular dystrophy data and muscle regeneration data

On a muscular dystrophy microarray gene expression dataset (Table 2 in Additional file

Analysis results of the detected gene cluster

**Analysis results of the detected gene cluster**. (a) Top scoring gene regulation network indicated by the gene cluster. Grey colour indicates that the gene is in the detected gene cluster. Solid lines indicate direct interactions. Dashed lines indicate indirect interactions. (b) The negative log

In another study

Construction of TOPs on muscular dystrophy data and multi-category cancer data

We used VISDA to cluster phenotypes in the muscular dystrophy dataset (Table 2 in Additional file _{g }(the number of selected genes per phenotype) equal to 2 is shown in Figure

The TOP found by VISDA on the muscular dystrophy dataset

**The TOP found by VISDA on the muscular dystrophy dataset**. Rectangles contain individual phenotypes. Ellipses contain a group of phenotypes.

On the 14 class MIT microarray gene expression cancer dataset with 190 samples (Table 3 in Additional file _{g }equal to 6. In each experimental trial of the leave-one-out stability analysis, one sample was left out and we constructed a TOP based on the remaining samples. Thus a total of 190 TOPs were generated, and we took the tree with the highest frequency of occurrence as the final solution, which best reflects the underlying stable structure of the data. As a validation, we compared the most frequent TOP to the known developmental/morphological relationships among the various cancer classes, which was published in

Forty-three different TOPs occurred in the leave-one-out stability analysis. The most frequent TOP occurred 121 times; the second most frequent occurred 11 times; the third most frequent occurred 7 times; most of the other TOPs occurred only once. The most frequent TOP thus has an occurrence frequency of 121/190 ≈ 63.68%. Considering that some TOPs differ only slightly from the most frequent TOP, the underlying stable structure likely has an even higher occurrence frequency. We also applied VISDA to the whole dataset and obtained the same structure as the most frequent TOP. Figure

Comparison between the most frequent TOP and the pathological relationships among the cancer classes

**Comparison between the most frequent TOP and the pathological relationships among the cancer classes**. (a) Published developmental/morphological relationships among the cancer classes. (b) The most frequent TOP constructed by VISDA. Rectangles contain one cancer type. Ellipses contain a group of cancers.

Discussion

VISDA is a data analysis tool incorporating human intelligence and domain knowledge. When applied by experienced users and domain experts, VISDA is more likely to generate accurate/meaningful clustering and visualization results. Since different human-data interaction may lead to different clustering outcomes, to achieve optimum performance, the user needs to acquire experience in using VISDA on various kinds of data, especially on the dataset of interest. Multiple trials applying VISDA are suggested when analyzing a new dataset. By comparing VISDA to several popular clustering methods, we see that the clustering outcome of VISDA is stable, probably because human initialization of model parameters has the potential to improve the clustering stability compared to the random parameter initialization applied by some other methods. Notice that VISDA only requires the user to have common sense about cluster distributions, cluster separability, and outliers.

Besides the two kinds of non-informative genes discussed in the Methods section, "redundant" genes (genes that are highly correlated with other genes) provide only limited additional separability between sample clusters. However, this limited additional separability may in fact greatly improve the achievable partition accuracy

Various visualization techniques, such as dendrograms, heat maps, and projections, have been applied to present genomic data structures and clustering outcomes

One point worth noting is that VISDA selects a data model in the locally discriminative low-dimensional visualization space. Although visualization with dimension reduction may reveal only the main data structure and lose minor/local data structures within a cluster, these minor/local structures may become the main structure captured at subsequent levels. VISDA discovers hierarchical relationships between clusters, which allows analyzing the data at different resolutions/scales.

Larger clusters can be obtained by simply merging small clusters according to the hierarchy. The discovered hierarchical relationships among clusters may reveal important biological information, for example the developmental/morphological information revealed by TOPs. The TOP discovered by VISDA can also be used to construct a hierarchical classifier for the complex task of multiple-disease diagnosis by embedding a relatively simple classifier at each node of the TOP, which may achieve good classification performance

Despite our successful applications of VISDA to real microarray gene expression data, the reported method has remaining limitations. For example, in sample clustering, dimension reduction via unsupervised informative gene selection is highly data-dependent and often achieves only limited success; this is a very challenging task owing to the absence of prior knowledge and the potentially complex gene-gene interactions embedded within high dimensional data. Furthermore, user-data interaction may introduce a degree of subjectivity into the clustering process if not properly orchestrated, and projection-based visualization may cause unrecoverable information loss leading to a suboptimum solution, although VISDA's hierarchical framework can partially alleviate this problem. Lastly, VISDA presently assumes that each cluster follows a Gaussian distribution, largely for mathematical convenience. However, the small sample size can defeat this assumption, and composite clusters at higher levels of the hierarchy are not even theoretically normally distributed but are more generally mixture distributions.

Our previous publications

Conclusion

We designed a clustering and visualization algorithm for discovering structure in high dimensional genomic data. VISDA can discover and visualize gene clusters, sample clusters, phenotype clusters, and the hierarchical relationships between the detected clusters. VISDA visualizes data by structure-preserving projections and provides an interface for human-data interaction, which facilitates incorporation of expert domain knowledge and human intelligence to help achieve accurate and meaningful data visualization and modeling. The scalable and extensible VISDA framework can incorporate various existing clustering and visualization algorithms to increase the likelihood of revealing data structure of interest.

Our evaluation study based on microarray gene expression data showed that VISDA provided an effective model selection scheme for high dimensional data and outperformed several popular clustering methods, i.e. HC, KMC, SOM, and SFNM fitting, in terms of clustering accuracy. Applications to muscular dystrophy, muscle regeneration, and cancer data illustrated that VISDA produced biologically meaningful clustering results that can enhance users' understanding about the underlying biological mechanism and stimulate novel hypotheses for further research.

Abbreviations

VISDA: VIsual Statistical Data Analyzer; SOM: Self-Organizing Maps; KMC: K-Means Clustering; HC: conventional Hierarchical Clustering; MDL: Minimum Description Length; TOP: Tree Of Phenotypes; SFNM: Standard Finite Normal Mixture; PCA: Principal Component Analysis; PCA-PPM: Principal Component Analysis – Projection Pursuit Method; LPP: Locality Preserving Projection; DCA: Discriminatory Component Analysis; APC: Affinity Propagation Clustering; EM algorithm: Expectation Maximization algorithm; MSC: Mean Squared Compactness; CLL: Classification Log-Likelihood; IPA: Ingenuity Pathway Analysis; JDM: Juvenile DermatoMyositis.

Authors' contributions

YZ participated in designing and implementing VISDA, performing the experiment, and analyzing the experimental results. HL participated in developing caBIG™ VISDA. DJM participated in the technical design of VISDA on phenotype clustering. ZW implemented the prototype of VISDA. JX participated in the technical design of the software and comparative study. RC and EPH provided biological interpretation of the datasets and experimental results. YW participated in designing VISDA and the experiment, and provided technical supervision.

Acknowledgements

The authors thank Bai Zhang, Guoqiang Yu, and Yibin Dong for their help with software implementation and the experiments. This work was supported by the National Institutes of Health under Grants CA109872, NS29525, CA096483, EB000830 and caBIG™.