Department of Computer Science, 660 McBryde Hall, Virginia Polytechnic Institute and State University, Blacksburg VA 24061, USA

Google Inc., 1600 Amphitheatre Parkway, Mountain View CA 94043, USA

Abstract

Background

Biclustering has emerged as a powerful algorithmic tool for analyzing measurements of gene expression. A number of different methods have emerged for computing biclusters in gene expression data. Many of these algorithms may output a very large number of biclusters with varying degrees of overlap. There are no systematic methods that create a two-dimensional layout of the computed biclusters and display overlaps between them.

Results

We develop a novel algorithm for laying out biclusters in a two-dimensional matrix whose rows (respectively, columns) are rows (respectively, columns) of the original dataset. We display each bicluster as a contiguous submatrix in the layout. We allow the layout to have repeated rows and/or columns from the original matrix as required, but we seek a layout of the smallest size. We also develop a web-based search interface for the user to query the genes and samples of interest and visualise the layout of biclusters matching the queries.

Conclusion

We demonstrate the usefulness of our approach on gene expression data for two types of leukaemia and on protein-DNA binding data for two growth conditions in S. cerevisiae.

1 Background

Measurement of gene expression using DNA microarrays

A number of different methods have emerged for computing biclusters in gene expression data

Organising, manipulating, and querying the potentially large number of biclusters computed by these algorithms is a data mining task in itself – one that has not been systematically addressed. In this paper, we develop a novel algorithm for laying out biclusters in a manner that visually reveals overlaps between them. We lay out the biclusters in a two-dimensional matrix whose rows (respectively, columns) are rows (respectively, columns) of the original dataset. We display each bicluster as a contiguous submatrix in the layout. We allow the layout to have repeated rows and/or columns from the original matrix, but we seek a layout of the smallest size. In addition, we develop a web-based search interface that allows the user to query the results for genes and samples of interest and visualise the layout of the biclusters that match the search criteria.

The layout algorithm is general enough to be applied to biclusters computed in real-valued, binary, or categorical data. For instance, the combination of biclustering algorithms and our layout algorithm can be used to analyze measurements of the concentrations of other types of molecules, including proteins and metabolites. We demonstrate our approach on two types of data. First, we compute layouts for biclusters extracted from leukaemia microarray data by the xMotif biclustering algorithm

**An example of a bicluster layout for weather data in Blacksburg, VA**. Figure 1(a): a dataset in which rows represent dates and columns represent weather conditions in Blacksburg. Figure 1(b): the layout, computed by our algorithm, of the seven biclusters in this dataset.
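The layout requirement described above (every bicluster appears as a contiguous submatrix, and rows or columns may be repeated) can be checked mechanically. The Python sketch below is an illustration only: the function names `find_contiguous` and `is_valid_layout` and the toy identifiers are assumptions made for this example, not part of the software described in this paper.

```python
def find_contiguous(order, wanted):
    """Return the first index at which a window of `order` contains exactly
    the identifiers in `wanted` (in any order), or None if there is none."""
    wanted = set(wanted)
    width = len(wanted)
    for start in range(len(order) - width + 1):
        if set(order[start:start + width]) == wanted:
            return start
    return None


def is_valid_layout(row_order, col_order, biclusters):
    """A bicluster is a (rows, columns) pair of sets. The layout is valid
    when each bicluster's rows occupy a contiguous window of `row_order`
    and its columns occupy a contiguous window of `col_order`."""
    return all(
        find_contiguous(row_order, rows) is not None
        and find_contiguous(col_order, cols) is not None
        for rows, cols in biclusters
    )


# Toy input (hypothetical): three biclusters that overlap pairwise.
toy_biclusters = [
    ({"r1", "r2"}, {"c1", "c2"}),
    ({"r2", "r3"}, {"c2", "c3"}),
    ({"r1", "r3"}, {"c1", "c3"}),
]

# Without repetition, the third bicluster cannot be made contiguous.
no_repeats = is_valid_layout(["r1", "r2", "r3"], ["c1", "c2", "c3"], toy_biclusters)

# Repeating row r1 and column c1 once makes every bicluster contiguous.
with_repeats = is_valid_layout(
    ["r1", "r2", "r3", "r1"], ["c1", "c2", "c3", "c1"], toy_biclusters
)
```

In this toy case, one repeated row and one repeated column are enough, which is the kind of trade-off between repetition and layout size that the algorithm aims to keep small.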

The bicluster layout problem, which we formally define in Section 3.1, is very similar to the hypergraph superstring problem studied by Batzoglou and Istrail in the context of physical mapping of genomes. Batzoglou and Istrail prove that the hypergraph superstring problem is MAX-SNP Hard, i.e., it is computationally intractable to obtain a bicluster layout whose size is smaller than a constant times the optimal size. In this work, we present a heuristic that minimizes the size of the layout well in practice. In the special case when there is a solution involving no repeated rows or columns, the algorithm computes the layout of smallest size. Our algorithm runs in ^{2 }+ ^{2 }log

2 Related work

A binary matrix has the

**An illustration of the COP**. Figure 2(a): A matrix that has the COP with the first two columns highlighted. Figure 2(b): Swapping the first two columns of the matrix demonstrates that the matrix has the COP.
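To make the property concrete, the following Python sketch tests whether a small binary matrix has the COP by trying every column permutation. It is a brute-force illustration whose running time is exponential in the number of columns; it is not the PQ-tree method used in this paper, and the function names are assumptions made for this example.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """True if the 1s in `row` form a single contiguous run (or there are none)."""
    ones = [j for j, value in enumerate(row) if value == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def find_cop_ordering(matrix):
    """Return a column permutation under which the 1s in every row are
    consecutive, or None if no such permutation exists."""
    n_cols = len(matrix[0]) if matrix else 0
    for perm in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[j] for j in perm]) for row in matrix):
            return perm
    return None


# Has the COP: swapping the last two columns makes each row's 1s adjacent.
has_cop = find_cop_ordering([[1, 0, 1],
                             [0, 1, 1]])

# Does not have the COP: the three rows pair up the columns in a cycle,
# and a linear order of three columns has only two adjacent pairs.
no_cop = find_cop_ordering([[1, 1, 0],
                            [0, 1, 1],
                            [1, 0, 1]])
```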

Researchers have studied a number of generalizations of the COP problem; however, most of these generalizations are NP-complete or NP-Hard. For example, seeking the column ordering for a non-COP matrix that minimizes the number of gaps between the ones in each row can be reduced to the traveling salesman problem

Algorithms for constructing physical maps from hybridization data typically exploit the Lander-Waterman model

3 Algorithm

We present our approach in four stages. First, we define some useful notation. Second, we introduce the PQ-tree, a data structure that is fundamental to our approach. Third, we present our layout algorithm. Finally, we discuss its implementation and the web interface to query the computed layout.

3.1 Definitions

We denote the input matrix by

1.

2.

3. _{ij}, the element in the _{i'j'}, where

The

Given subsets

3.2 The PQ tree

Booth and Lueker _{i }to be the set of columns in _{i }be consecutive in the permutation.

A PQ tree can represent all legal permutations of _{i}, 1 ≤

**An example of a PQ-tree**. Circles represent P nodes and rectangles represent Q nodes. Figure 3(a): Initial PQ tree

To solve the COP problem, start with an empty PQ tree _{i}). To obtain an ordering that satisfies the restrictions, perform a depth-first traversal of
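As a rough illustration of the data structure, the Python sketch below models a PQ tree with the standard semantics: the children of a P node may be permuted arbitrarily, while the children of a Q node may only be reversed. It reads one ordering off the tree by visiting the leaves from left to right, and enumerates all orderings the tree represents. It does not implement the operation that inserts a restriction into the tree; the class and function names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from itertools import permutations, product


@dataclass
class Node:
    kind: str                                   # "P", "Q", or "leaf"
    children: list = field(default_factory=list)
    label: str = ""                             # only used by leaves


def leaf(label):
    return Node("leaf", label=label)


def frontier(node):
    """Leaf labels read left to right by a depth-first traversal."""
    if node.kind == "leaf":
        return [node.label]
    labels = []
    for child in node.children:
        labels.extend(frontier(child))
    return labels


def orderings(node):
    """All leaf orderings represented by the tree rooted at `node`."""
    if node.kind == "leaf":
        return [[node.label]]
    if node.kind == "P":
        child_orders = list(permutations(node.children))
    else:  # Q node: left-to-right or right-to-left only
        child_orders = [tuple(node.children), tuple(reversed(node.children))]
    results = []
    for order in child_orders:
        for combo in product(*(orderings(child) for child in order)):
            flat = [label for part in combo for label in part]
            if flat not in results:
                results.append(flat)
    return results


# A P node whose children are the leaf "a" and a Q node over "b", "c", "d".
tree = Node("P", [leaf("a"), Node("Q", [leaf("b"), leaf("c"), leaf("d")])])
one_ordering = frontier(tree)
all_orderings = sorted("".join(o) for o in orderings(tree))
```

In this example the tree represents four orderings: the Q node keeps `b`, `c`, `d` together in that order or its reverse, and the P node places `a` on either side.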

3.3 The bicluster layout algorithm

We are now ready to describe our algorithm for the bicluster layout problem. To minimize the size of

We describe the algorithm in two stages. We first transform the problem of constructing

We start by constructing a new binary matrix _{ij }is 1 if the _{ij }is 0. We can now reformulate the problem of constructing

Before describing the algorithm, we define some more notation. The leaves of each PQ tree constructed by the algorithm correspond to a subset of the columns of _{T }to denote the set of columns in a PQ tree

1. For each row _{i }and insert the restriction corresponding to row _{i}. Let

2. For every pair 1 ≤ _{i}, _{j}).

3. Compute Σ, the list of values in {_{i}, _{J}), 1 ≤

4. Repeat the following steps until Σ is empty:

(a) Remove the largest element from Σ. Let

(b) Set

(c) For each restriction

(d) Delete

(e) For each tree

(f) Insert

5. For each PQ tree _{T}.

6. Output the column layout formed by concatenating (in any order) the permutations computed in Step 5.
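Steps 2 and 3 above can be sketched in Python as follows. The sketch assumes that the similarity between two trees is the number of columns they share; this definition is an assumption made for illustration, as are the function name and the tie-breaking rule. The merging loop of Step 4 relies on PQ-tree operations and is not shown.

```python
from itertools import combinations


def sorted_similarities(column_sets):
    """column_sets[i] is the set of columns in the i-th initial tree.
    Returns (similarity, i, j) triples for all pairs i < j, with the
    largest similarity first; ties are broken by the indices."""
    sims = []
    for i, j in combinations(range(len(column_sets)), 2):
        # Assumed similarity: the number of shared columns.
        sims.append((len(column_sets[i] & column_sets[j]), i, j))
    sims.sort(key=lambda triple: (-triple[0], triple[1], triple[2]))
    return sims


# Toy input (hypothetical): three rows with their column sets.
sigma = sorted_similarities([{1, 2, 3}, {2, 3, 4}, {5}])
```

With the list sorted in this way, removing the largest element in Step 4(a) amounts to taking the first remaining triple.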

The algorithm starts by storing each row of

We now analyze the running time of the algorithm. Let ^{2}) similarity values takes ^{2 }+ ^{2 }log ^{2}) times. The running time of each iteration is proportional to the size of the new PQ tree constructed. A naive upper bound on this size is ^{2}). Finally, traversing all the PQ trees in ^{2 }+ ^{2 }log ^{2}), with ^{2}) required for Σ, the sorted list of similarities.

3.4 Implementation and web interface

We implemented the layout algorithm in C++ and tested it on a 2.8 GHz Pentium computer running the Fedora Core 3 operating system. Our software contains two executable programs. The first executable, layout, implements the layout algorithm. It takes a text file describing the biclusters as input and outputs the layout in a simple textual format that specifies the order of the rows and columns in the layout and the corners of each bicluster in the layout. The second executable, drawlayout, uses the computed layout and the original data set as input and produces an image corresponding to the layout.

If the input data contains a large number of biclusters, the layout may contain too many rows and/or columns for the user to navigate with ease. To alleviate this problem, we have also developed a simple web-based interface that allows the user to upload a file containing computed biclusters and a file containing the original data, and query the layout with the names of rows and columns. The interface invokes layout and drawlayout on the biclusters that contain the query rows/columns and highlights the matching biclusters, rows, and columns in the resulting layout. The interface allows the user to specify whether the data is real-valued or binary, whether the layout should contain only the matching biclusters, and whether the query should be a conjunction or disjunction of the search terms.
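The query step can be pictured as a filter over the biclusters. The Python sketch below selects the biclusters that contain all of the search terms (conjunction) or at least one of them (disjunction). The function name, the `mode` parameter, and the toy bicluster contents are assumptions made for this example and do not describe the actual web interface.

```python
def matching_biclusters(biclusters, query, mode="any"):
    """biclusters maps a name to a (rows, columns) pair of sets.
    mode="all" requires every query term to occur in the bicluster;
    mode="any" requires at least one term to occur."""
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    terms = set(query)
    combine = all if mode == "all" else any
    return [
        name
        for name, (rows, cols) in biclusters.items()
        if combine(term in rows or term in cols for term in terms)
    ]


# Toy input (hypothetical contents, for illustration only).
toy = {
    "b1": ({"gene_a", "gene_b"}, {"RTG3", "GLN3"}),
    "b2": ({"gene_b", "gene_c"}, {"RTG3"}),
    "b3": ({"gene_d"}, {"factor_x"}),
}
conjunction = matching_biclusters(toy, ["RTG3", "GLN3"], mode="all")
disjunction = matching_biclusters(toy, ["RTG3", "GLN3"], mode="any")
```

Only the selected biclusters would then be passed on to the layout and drawing programs.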

4 Experimental results

We present results for three types of data. We first evaluated our method on synthetic datasets. Next, we considered a binary data set encoding results of ChIP-on-chip experiments in

4.1 Synthetic data

We created synthetic datasets with different numbers of rows and columns. For each dataset, we generated biclusters by sampling subsets of rows and columns. For this experiment, we randomly generated the number of rows and columns and identifiers for the rows and columns; we did not need to generate values for the cells of the matrices. For each set of biclusters, we recorded the time required to run our layout algorithm and the number of rows and columns in the computed layout. For each layout, we estimated the

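Execution times (in seconds) for the layout algorithm on synthetic matrices. Each row corresponds to a number of biclusters; each column corresponds to a value of #rows + #columns in the dataset.

| #biclusters | 10 | 30 | 50 | 70 | 90 |
|---|---|---|---|---|---|
| 20 | 0.168 | 0.328 | 0.462 | 0.52 | 0.532 |
| 40 | 1.23 | 2.514 | 3.046 | 3.574 | 4.008 |
| 60 | 4.074 | 7.992 | 11.238 | 11.71 | 12.81 |
| 80 | 9.484 | 19.586 | 25.546 | 29.652 | 29.446 |
| 100 | 17.982 | 37.966 | 48.418 | 50.916 | 56.112 |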

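Efficiency values for the layout algorithm on synthetic matrices. Each row corresponds to a number of biclusters; each column corresponds to a value of #rows + #columns in the dataset.

| #biclusters | 10 | 30 | 50 | 70 | 90 |
|---|---|---|---|---|---|
| 20 | 0.184 | 0.842 | 1.316 | 1.254 | 1.428 |
| 40 | 0.304 | 1.16 | 1.632 | 2.04 | 2.074 |
| 60 | 0.398 | 1.496 | 2.262 | 2.26 | 2.508 |
| 80 | 0.512 | 1.65 | 2.358 | 2.726 | 2.698 |
| 100 | 0.48 | 1.808 | 2.582 | 2.686 | 2.996 |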

4.2 Transcriptional regulation in S. cerevisiae

To demonstrate the ability of our visualization algorithm to highlight differences between biclusters in similar datasets, we analyzed datasets of transcriptional regulation in two experimental conditions in

The two protein-DNA datasets we study correspond to the growth of

**Bicluster layouts**. Visualizations of the layouts computed by our algorithm. Since the layout may contain repeated rows and columns, a bicluster may appear at multiple locations in the layout. We highlight only one occurrence of each bicluster. The layout on the left displays biclusters representing combinatorial control of transcription in

To illustrate the use of our web interface, we used it to search for biclusters that included the transcription factors RTG3 and GLN3. RTG3 is a transcription factor that forms a complex with RTG1 to activate the retrograde (RTG) and target of rapamycin (TOR) pathways

Rapamycin treatment can induce the dephosphorylation and subsequent activation of GLN3 ^{-8}, based on the hypergeometric distribution), indicating that this pathway may be activated by the three transcription factors upon rapamycin treatment.
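For readers unfamiliar with p-values based on the hypergeometric distribution, the Python sketch below computes the usual enrichment tail probability: the chance of seeing at least a given number of annotated genes in a set drawn without replacement. The function name, the parameter names, and the numbers in the example are assumptions made for illustration; they are not the values or the exact test setup behind the p-value reported above.

```python
from math import comb


def hypergeom_pvalue(population, annotated, drawn, hits):
    """P(X >= hits), where X counts annotated genes among `drawn` genes
    sampled without replacement from `population` genes, of which
    `annotated` carry the annotation of interest."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    # math.comb(n, k) returns 0 when k > n, so impossible terms vanish.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 genes, 5 annotated, 5 drawn, all 5 annotated.
p_toy = hypergeom_pvalue(10, 5, 5, 5)
```

A small p-value indicates that the overlap between the bicluster's genes and the annotated set is unlikely to arise by chance.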

**Genes combinatorially controlled by GLN3 and RTG3**. A layout of nine biclusters of genes combinatorially controlled by GLN3 and RTG3 under exposure to rapamycin.

4.3 Classification of leukaemias

Golub et al.

5 Conclusion

The biomedical community has access to large quantities of publicly available gene expression datasets. Biclustering has emerged as a powerful methodology for analyzing these datasets. In this paper, we have introduced a novel algorithm for laying out biclusters in a two-dimensional matrix so as to reveal the overlaps and relationships between the biclusters. The algorithm performs efficiently in practice. We have demonstrated the applicability of the algorithm to two important problems in bioinformatics using both binary and real-valued data. An easy-to-use web interface distributed with the layout software allows the user to query and navigate layouts that are too large to study manually. Biclustering is useful not just for processing gene expression data but for any dataset that measures the relationships between two different types of data, e.g., genes and functions; microRNAs and their target mRNAs; and genes and diseases. Thus, our algorithm has the potential to be useful for a wide variety of bioinformatics applications.

Authors' contributions

TMM posed the problem to GG. GG developed and implemented the algorithm and performed the experiments with guidance from TMM. AM implemented the web interface. GG and TMM wrote the paper.