Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA

Department of Mathematics, Virginia Tech, Blacksburg, VA 24061, USA

Abstract

Background

The Structural Classification of Proteins (SCOP) database uses a large number of hidden Markov models (HMMs) to represent families and superfamilies composed of proteins that presumably share the same evolutionary origin. However, how the HMMs are related to one another has not been examined before.

Results

In this work, taking into account the processes used to build the HMMs, we propose a working hypothesis to examine the relationships between HMMs and the families and superfamilies that they represent. Specifically, we perform an all-against-all HMM comparison using the HHsearch program (similar to BLAST) and construct a network where the nodes are HMMs and the edges connect similar HMMs. We hypothesize that the HMMs in a connected component belong to the same family or superfamily more often than expected under a random network connection model. Results show a pattern consistent with this working hypothesis. Moreover, the HMM network possesses features distinctly different from the previously documented biological networks, exemplified by the exceptionally high clustering coefficient and the large number of connected components.

Conclusions

The current findings may guide the design of computational methods that reduce the degree of overlap between the HMMs representing the same superfamily, which may in turn enable more efficient large-scale sequence searches against the database of HMMs.

Background

The Structural Classification of Proteins (SCOP) database is a comprehensive protein database that organizes and classifies proteins based on their evolutionary and structural relationships

Apart from the hierarchical classification and organization of proteins, the SCOP database employs hidden Markov models (HMMs) to represent superfamilies

Because each superfamily might be represented by multiple HMMs, there may be a high degree of overlap and redundancy among the models. However, there have not been any studies examining this issue systematically. To understand how the HMMs in the SCOP database are related to one another and the degree of overlap or redundancy among HMMs from either the same or different superfamilies, we perform a detailed analysis of the HMMs in SCOP for their similarity and relationships using a network approach. Specifically, we perform an all-against-all HHsearch for the library of HMMs in the SCOP database.

HHsearch is similar to BLAST, except that instead of matching a sequence against a database of sequences, it uses a query HMM or sequence to match against a database of HMMs and identifies the HMMs significantly homologous to the query HMM or sequence

Results and Discussion

General statistics of the HMMs and their network

A general description of the HMMs used to construct the network is shown in Table

The general statistics of the HMM library

| **Class** | **Number of HMMs** | **Number of folds** | **Number of superfamilies** | **Number of families** |
|---|---|---|---|---|
| a | 1975 | 157 | 262 | 506 |
| b | 2590 | 109 | 231 | 485 |
| c | 3391 | 120 | 194 | 686 |
| d | 2932 | 223 | 328 | 683 |
| e | 199 | 34 | 34 | 51 |
| f | 145 | 29 | 44 | 50 |
| g | 697 | 49 | 70 | 112 |
| All | 11929 | 721 | 1163 | 2573 |

The entire HMM network is shown in Figure

**The HMM network**.

**Size distribution of connected components**. The CC size ranges from 2 to 590, with median 3 and mean 7.8.

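The 20 largest connected components and their densities

| **Size rank** | **Number of vertices** | **Density** |
|---|---|---|
| 1 | 590 | 0.12 |
| 2 | 349 | 0.21 |
| 3 | 277 | 0.65 |
| 4 | 155 | 0.15 |
| 5 | 141 | 0.38 |
| 6 | 121 | 0.33 |
| 7 | 120 | 0.19 |
| 8 | 106 | 0.72 |
| 9 | 99 | 0.84 |
| 10 | 90 | 0.95 |
| 11 | 86 | 0.99 |
| 12 | 85 | 0.89 |
| 13 | 81 | 0.32 |
| 14 | 80 | 0.83 |
| 15 | 74 | 0.66 |
| 16 | 73 | 0.65 |
| 17 | 72 | 0.16 |
| 18 | 70 | 1.00 |
| 19 | 69 | 0.97 |
| 20 | 66 | 0.40 |
| All | 11929 | 0.002 |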

Degree distribution

The distribution of the degrees of the HMM network is shown in Figure

**The distribution of the degrees of the HMM network**.

**Log-log degree distribution**. The log base is 2. The best fitting quadratic curve is 3.2481 - 0.176557x^{2}.

Network Density

Density, computed as the number of edges divided by the number of all possible edges (those of a fully connected graph on the same vertices), provides a quantitative measure of the connectivity of a network. The density of the entire network is low, only 0.002
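As a minimal sketch of the density definition above (an illustration, not the code used in the study), the function below divides an edge count by the number of vertex pairs in an undirected graph. The function name and the toy graph are hypothetical; only the vertex count of 11929 comes from the text.

```python
def density(num_vertices: int, num_edges: int) -> float:
    """Fraction of all possible undirected edges that are present."""
    if num_vertices < 2:
        raise ValueError("density needs at least two vertices")
    possible_edges = num_vertices * (num_vertices - 1) // 2
    return num_edges / possible_edges


# Toy graph (hypothetical): a path on 4 vertices has 3 of the 6 possible edges.
toy_density = density(4, 3)

# Number of vertex pairs among the 11929 HMMs of the library.
pairs_in_hmm_network = 11929 * 11928 // 2
```

With 71,144,556 possible pairs, a density near 0.002 corresponds to roughly 10^5 edges. The reported density is rounded, so the exact edge count cannot be recovered from it.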

**The density distribution of CCs**. CCs with size two are excluded from the distribution.

Thus, individual CCs tend to have very high connectivity, whereas the entire network is not well connected. The density of the 20 largest CCs is shown in Table ^{-16 }for CC size > 2).

Vertex centrality

Vertex centrality measures the "importance" of a vertex. Two centrality metrics, degree and betweenness, were computed for the vertices in the entire HMM network. The 20 HMMs with the highest degrees all belong to the same superfamily, b.1.1 (Immunoglobulin), and to the third largest CC, which has 277 vertices. Thus, each of these 20 HMMs is connected to almost all other HMMs in the third largest CC. The HMM d1n26a1 (SCOP ID b.1.1.4, A:1-93), which represents the N-terminal domain of the Interleukin-6 receptor alpha chain, has the highest degree, 268. Table

The 20 HMMs with largest betweenness

| **Rank** | **HMM ID** | **SCOP ID** | **Betweenness** |
|---|---|---|---|
| 1 | d1bg6a2 | c.2.1.6 | 14915.8 |
| 2 | d1o8ca2 | c.2.1.1 | 14665.7 |
| 3 | d1e5qa1 | c.2.1.3 | 14504.0 |
| 4 | d2bzga1 | c.66.1.36 | 9557.9 |
| 5 | d3bswa1 | b.81.1.8 | 9168.0 |
| 6 | d1vj0a2 | c.2.1.1 | 8211.0 |
| 7 | d1ks9a2 | c.2.1.6 | 7469.9 |
| 8 | d2bmfa2 | c.37.1.14 | 7439.8 |
| 9 | d2dt5a2 | c.2.1.12 | 7410.7 |
| 10 | d1pjca1 | c.2.1.4 | 7325.1 |
| 11 | d1gtea4 | c.4.1.1 | 7165.3 |
| 12 | d1gu7a1 | b.35.1.2 | 6768.0 |
| 13 | d1tt7a1 | b.35.1.2 | 6768.0 |
| 14 | d2f1ka2 | c.2.1.6 | 5985.2 |
| 15 | d1ebfa1 | c.2.1.3 | 5959.8 |
| 16 | d1jqba2 | c.2.1.1 | 5313.1 |
| 17 | d1gr0a1 | c.2.1.3 | 5220.0 |
| 18 | d1ye8a1 | c.37.1.11 | 5207.7 |
| 19 | d1piwa2 | c.2.1.1 | 4556.8 |
| 20 | d1hdoa_ | c.2.1.2 | 4403.8 |

Because the entire HMM network contains many CCs with no connections between them, we computed three centrality measures (degree, betweenness, and closeness) within each of the 20 largest CCs. Table

The top 2 HMMs with the highest centrality measurements for the 20 largest CCs.

| **CC** | **HMM** | **B** | **HMM** | **C** | **HMM** | **D** |
|---|---|---|---|---|---|---|
| 1 | d1bg6a2 (c.2.1.6) | 14915.8 | d1bg6a2 (c.2.1.6) | 0.51 | d1e5qa1 (c.2.1.3) | 222 |
| 1 | d1o8ca2 (c.2.1.1) | 14665.7 | d1e5qa1 (c.2.1.3) | 0.50 | d1bg6a2 (c.2.1.6) | 183 |
| 2 | d2bmfa2 (c.37.1.14) | 7439.8 | d1ye8a1 (c.37.1.11) | 0.69 | d1ye8a1 (c.37.1.11) | 219 |
| 2 | d1ye8a1 (c.37.1.11) | 5207.7 | d2i3ba1 (c.37.1.11) | 0.65 | d1bifa1 (c.37.1.7) | 206 |
| 3 | d1gsma1 (b.1.1.4) | 546.2 | d1n26a1 (b.1.1.4) | 0.97 | d1n26a1 (b.1.1.4) | 268 |
| 3 | d1l6za2 (b.1.1.4) | 514.7 | d1f2qa1 (b.1.1.4) | 0.96 | d1f2qa1 (b.1.1.4) | 265 |
| 4 | d1tqja_ (c.1.2.2) | 1931.5 | d1yxya1 (c.1.2.5) | 0.56 | d1y0ea_ (c.1.2.5) | 71 |
| 4 | d1izca_ (c.1.12.5) | 1712.9 | d1y0ea_ (c.1.2.5) | 0.56 | d1gtea2 (c.1.4.1) | 68 |
| 5 | d1wjka_ (c.47.1.1) | 1042.1 | d1a8la2 (c.47.1.2) | 0.69 | d1a8la2 (c.47.1.2) | 88 |
| 5 | d1r7ha_ (c.47.1.1) | 683.6 | d1f9ma_ (c.47.1.1) | 0.69 | d1ep7a_ (c.47.1.1) | 87 |
| 6 | d1gjwa2 (c.1.8.1) | 1318.6 | d1ecea_ (c.1.8.3) | 0.65 | d1ecea_ (c.1.8.3) | 76 |
| 6 | d1bf2a1 (b.1.18.2) | 1199.0 | d1qnra_ (c.1.8.3) | 0.61 | d1qnra_ (c.1.8.3) | 75 |
| 7 | d1jhfa1 (a.4.5.2) | 2369.2 | d2d1ha1 (a.4.5.50) | 0.53 | d1ub9a_ (a.4.5.28) | 58 |
| 7 | d1fsea_ (a.4.6.2) | 1988.3 | d1sfxa_ (a.4.5.50) | 0.52 | d2d1ha1 (a.4.5.50) | 55 |
| 8 | d1tcaa_ (c.69.1.17) | 390.0 | d1tcaa_ (c.69.1.17) | 0.93 | d1tcaa_ (c.69.1.17) | 97 |
| 8 | d1ispa_ (c.69.1.18) | 167.6 | d1b6ga_ (c.69.1.8) | 0.92 | d1b6ga_ (c.69.1.8) | 96 |
| 9 | d1cd9b1 (b.1.2.1) | 224.1 | d1bqua1 (b.1.2.1) | 0.95 | d1cd9b1 (b.1.2.1) | 95 |
| 9 | d2c4fu1 (b.1.2.1) | 193.0 | d1cd9b1 (b.1.2.1) | 0.95 | d1bqua1 (b.1.2.1) | 93 |
| 10 | d1wg4a_ (d.58.7.1) | 14.8 | d1wg4a_ (d.58.7.1) | 1.00 | d1wg4a_ (d.58.7.1) | 89 |
| 10 | d1whya_ (d.58.7.1) | 13.8 | d1fxla1 (d.58.7.1) | 0.99 | d1fxla1 (d.58.7.1) | 88 |
| 11 | d1p3wa_ (c.67.1.3) | 0.4 | d1p3wa_ (c.67.1.3) | 1.00 | d1p3wa_ (c.67.1.3) | 85 |
| 11 | d1fg7a_ (c.67.1.1) | 0.4 | d1fg7a_ (c.67.1.1) | 1.00 | d1fg7a_ (c.67.1.1) | 85 |
| 12 | d1tiza_ (a.39.1.5) | 175.3 | d1tiza_ (a.39.1.5) | 0.98 | d1tiza_ (a.39.1.5) | 82 |
| 12 | d1fi5a_ (a.39.1.5) | 68.1 | d1fi5a_ (a.39.1.5) | 0.97 | d1rroa_ (a.39.1.4) | 81 |
| 13 | d1onwa1 (b.92.1.7) | 362.1 | d1ra0a2 (c.1.9.5) | 0.66 | d1ra0a2 (c.1.9.5) | 42 |
| 13 | d2bb0a1 (b.92.1.10) | 252.0 | d1nfga2 (c.1.9.6) | 0.64 | d1i0da_ (c.1.9.3) | 41 |
| 14 | d1agja_ (b.47.1.1) | 132.4 | d1agja_ (b.47.1.1) | 0.98 | d1agja_ (b.47.1.1) | 77 |
| 14 | d1l1ja_ (b.47.1.1) | 132.4 | d1l1ja_ (b.47.1.1) | 0.98 | d1l1ja_ (b.47.1.1) | 77 |
| 15 | d1yvka1 (d.108.1.1) | 225.4 | d1wwza1 (d.108.1.1) | 0.85 | d1wwza1 (d.108.1.1) | 63 |
| 15 | d1vhsa_ (d.108.1.1) | 148.8 | d1bo4a_ (d.108.1.1) | 0.85 | d1bo4a_ (d.108.1.1) | 63 |
| 16 | d1qhqa_ (b.6.1.1) | 74.3 | d1e30a_ (b.6.1.1) | 0.94 | d1e30a_ (b.6.1.1) | 67 |
| 16 | d1e30a_ (b.6.1.1) | 65.1 | d1kcwa2 (b.6.1.3) | 0.90 | d1kcwa2 (b.6.1.3) | 64 |
| 17 | d1huxa_ (c.55.1.5) | 1183.4 | d1huxa_ (c.55.1.5) | 0.63 | d1huxa_ (c.55.1.5) | 38 |
| 17 | d2ch5a1 (c.55.1.5) | 341.4 | d2ewsa1 (c.55.1.14) | 0.54 | d2ewsa1 (c.55.1.14) | 28 |
| 18 | d1rgwa_ (b.36.1.1) | 0.0 | d1rgwa_ (b.36.1.1) | 1.00 | d1rgwa_ (b.36.1.1) | 69 |
| 18 | d1t2ma1 (b.36.1.1) | 0.0 | d1t2ma1 (b.36.1.1) | 1.00 | d1t2ma1 (b.36.1.1) | 69 |
| 19 | d1j7la_ (d.144.1.6) | 2.5 | d1j7la_ (d.144.1.6) | 1.00 | d1j7la_ (d.144.1.6) | 68 |
| 19 | d1zara2 (d.144.1.9) | 2.5 | d1zara2 (d.144.1.9) | 1.00 | d1zara2 (d.144.1.9) | 68 |
| 20 | d2fug34 (d.58.1.5) | 1050.3 | d2fug34 (d.58.1.5) | 0.63 | d2fdna_ (d.58.1.1) | 32 |
| 20 | d3c8ya2 (d.15.4.2) | 1045.0 | d3c8ya2 (d.15.4.2) | 0.59 | d7fd1a_ (d.58.1.2) | 32 |

For each row, the columns give the rank of the CC by size, followed by the HMMs (SCOP IDs in parentheses) with the largest or second largest centrality as measured by betweenness (B), closeness (C), and degree (D).

The results show that, in the entire network, the vertices with the highest degrees do not necessarily have the highest betweenness, and vice versa. Degree counts the immediate neighbors of an HMM: the more neighbors it has, the more central it is. The 20 vertices with the highest degrees are all from the third largest CC, and each is connected to about 94% of its vertices. The 20 vertices with the highest betweenness are from either the largest or the second largest CC. Since betweenness reflects how essential a vertex is to the connection of any other two vertices in the graph, in the case of HMMs, it may reflect the possibility that one HMM is the

Network diameter

The diameter of the largest CC (containing 590 vertices) is 9. The average distance between the vertices in the component is 2.94. This bears some similarity to the yeast protein interaction network
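The two quantities reported for the largest CC, the diameter and the average distance, can both be obtained from breadth-first searches started at every vertex. The sketch below is a minimal illustration under the assumption that the component is given as an adjacency dictionary; the function names and the toy path graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_mean_distance(adj):
    """Longest and mean shortest-path length; adj must be one connected component."""
    longest, total, pairs = 0, 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, total / pairs


# Toy component (hypothetical): the path a-b-c-d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
diam, mean_dist = diameter_and_mean_distance(path)
```

Running one search per vertex costs O(V(V + E)) time, which is manageable for a component of 590 vertices.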

We also measured the diameters of all the CCs to see how they change as a function of CC size. Figure

**Boxplot for the diameter of CCs as a function of CC size**. The box marks the lower and upper quartiles of the sizes of CCs with the same diameter, the dark line marks the median, the whiskers mark the boundaries beyond which points are treated as outliers, and the dots outside the whiskers denote the outliers.

The effect of e-value cutoff on the network

Because the e-value measures the degree of similarity between two HMMs, we examined how changing the e-value cutoff affects general properties of the network, such as the number and sizes of CCs. Figure ^{-18 }(the slight drop for e-value cutoffs of 10^{-19} and 10^{-20} is due to the exclusion of CCs of size 1). Similar patterns are observed when only CCs of size greater than two, three, or four are considered: in general, the number of CCs increases with more stringent e-value cutoffs. To see which CC sizes are most affected by the stringency of the e-value cutoff, we also studied the CC size distribution as a function of the e-value cutoff. Figure

**The number of CCs of size > 1 as a function of e-value cutoff**.
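To make the dependence on the cutoff concrete, the following sketch rebuilds the connected components from a list of pairwise hits at a given e-value cutoff using a union-find structure. It is an illustration only: the hit list, the HMM names, and the choice of a strict `<` comparison are assumptions, not details taken from the study.

```python
def components_at_cutoff(vertices, hits, cutoff):
    """Sizes of connected components when edges are hits with e-value below cutoff."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the root, shortening the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b, evalue in hits:
        if a != b and evalue < cutoff:
            parent[find(a)] = find(b)

    sizes = {}
    for v in vertices:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Hypothetical HMM names and e-values, for illustration only.
vertices = ["h1", "h2", "h3", "h4", "h5"]
hits = [("h1", "h2", 1e-30), ("h2", "h3", 1e-10), ("h4", "h5", 1e-5)]

loose = components_at_cutoff(vertices, hits, 1e-3)
strict = components_at_cutoff(vertices, hits, 1e-20)
num_multi_vertex_ccs = sum(1 for size in strict if size > 1)
```

In this toy input, tightening the cutoff splits the three-vertex component and leaves singletons behind, which is why a count restricted to CCs of size > 1 can drop at the most stringent cutoffs.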

**CC size distribution as a function of e-value cutoff**. For clarity, only the distributions for some e-value cutoffs from 10^{-20} to 10^{-3} are shown.

Figure

**The 20 largest connected components and e-value**. For clarity, only the curves for some e-value cutoffs from 10^{-20} to 10^{-3} are shown.

CCs and SCOP hierarchy

Within the CCs, we examined whether the HMM members are from the same family, superfamily, fold, or class. There are altogether 1178 CCs whose members have the same SCOP domain classification (conserved at all hierarchical levels), 271 CCs whose HMMs belong to the same superfamily but to different families, 24 whose members belong to the same fold, but to different superfamilies, 18 whose members belong to the same class but have different folds, and the remaining 33 whose members are from different classes.
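The grouping of CCs by SCOP level can be illustrated with a short sketch that finds the deepest level at which all members of a CC share the same classification. It assumes four-part SCOP IDs of the form class.fold.superfamily.family, as in the tables of this paper; the function name is hypothetical.

```python
LEVELS = ("class", "fold", "superfamily", "family")


def deepest_shared_level(scop_ids):
    """Deepest SCOP level shared by all IDs, or None if even the classes differ."""
    parts = [scop_id.split(".") for scop_id in scop_ids]
    shared = None
    for depth, level in enumerate(LEVELS, start=1):
        prefixes = {tuple(p[:depth]) for p in parts}
        if len(prefixes) != 1:
            break
        shared = level
    return shared


# SCOP IDs taken from the centrality table for the 20 largest CCs.
same_family = deepest_shared_level(["b.36.1.1", "b.36.1.1"])
same_superfamily = deepest_shared_level(["c.2.1.6", "c.2.1.1", "c.2.1.3"])
mixed_classes = deepest_shared_level(["c.1.8.1", "b.1.18.2"])
```

The five counts reported above sum to 1178 + 271 + 24 + 18 + 33 = 1524 CCs.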

The consistency between the prediction of HMM memberships at different hierarchical levels in the SCOP database based on the e-value cutoffs and the classification of the SCOP database was evaluated by ROC curves, shown in Figure

**The ROC curves**. The ROC curves for family, superfamily, fold, and class with different e-value cutoffs. For each curve, the data points from left to right correspond to the FPR and TPR for the e-value cutoffs from 10^{-20} to 10^{-3}.

Because fold and superfamily show similar classifications, we focused on studying the superfamilies further. In order to see how the superfamilies are represented in terms of connected components, we examined the number of HMMs representing the 1163 superfamilies to see how many CCs the HMMs are dispersed into. Table

Functional annotation of the top ten superfamilies that have either the largest number of HMM representations or CCs.

| **Superfamily ID** | **# of HMMs** | **# of CCs** | **Functional annotation** |
|---|---|---|---|
| c.37.1 | 358 | 3 | P-loop containing nucleoside triphosphate hydrolases |
| b.1.1 | 286 | 6 | Immunoglobulin |
| c.2.1 | 267 | 2 | NAD(P)-binding Rossmann-fold domains |
| a.4.5 | 150 | 20 | Winged helix DNA-binding domain |
| c.47.1 | 147 | 4 | Thioredoxin-like |
| c.1.8 | 141 | 7 | (Trans)glycosidases |
| c.66.1 | 119 | 2 | S-adenosyl-L-methionine-dependent methyltransferases |
| a.4.1 | 110 | 8 | Homeodomain-like |
| c.69.1 | 106 | 1 | alpha/beta-Hydrolases |
| b.1.2 | 98 | 2 | Fibronectin type III |

| **Superfamily ID** | **# of CCs** | **# of HMMs** | **Functional annotation** |
|---|---|---|---|
| a.4.5 | 20 | 150 | Winged helix DNA-binding domain |
| b.1.18 | 17 | 76 | E set domains |
| b.40.4 | 16 | 95 | Nucleic acid-binding proteins |
| b.29.1 | 14 | 97 | Concanavalin A-like lectins/glucanases |
| d.14.1 | 11 | 52 | Ribosomal protein S5 domain 2-like |
| g.39.1 | 10 | 83 | Glucocorticoid receptor-like (DNA-binding domain) |
| b.18.1 | 10 | 54 | Galactose-binding domain-like |
| d.3.1 | 10 | 54 | Cysteine proteinases |
| a.4.1 | 8 | 110 | Homeodomain-like |
| b.121.4 | 8 | 58 | Positive stranded ssRNA viruses |

The working hypothesis

Taking into account the processes that built the HMMs and the hierarchical classification of the HMMs in the SCOP database, we hypothesize that the network should reflect this process, i.e.,

However, to formally evaluate this hypothesis and provide statistical support, we simulated 10,000 random networks that preserve the degree distribution and the number and sizes of the connected components of the original network. The working hypothesis predicts that the connected components of such random networks show a lower degree of conservation in family and superfamily assignment. Among the 10,000 simulated random networks, the highest proportions of CCs whose members all come from the same family or the same superfamily are only 0.5% and 0.7%, respectively. Thus, in the observed network, HMMs from the same family or superfamily have a strong tendency to cluster, in agreement with our working hypothesis.
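The logic of this comparison can be sketched with a simplified null model. Note that this is a swapped-in technique, not the simulation described above: instead of rewiring edges while preserving the degree distribution, the sketch keeps the CC sizes fixed and shuffles the family labels across vertices, then records the highest proportion of single-family CCs seen over all trials. All names and the toy labels are hypothetical.

```python
import random


def homogeneous_fraction(labels, component_sizes):
    """Fraction of components whose members all carry the same label.

    Labels are assigned to components consecutively, in the order given.
    """
    if sum(component_sizes) != len(labels):
        raise ValueError("component sizes must add up to the number of labels")
    start, same = 0, 0
    for size in component_sizes:
        if len(set(labels[start:start + size])) == 1:
            same += 1
        start += size
    return same / len(component_sizes)


def null_max_fraction(labels, component_sizes, trials, seed=0):
    """Highest homogeneous fraction observed over label shuffles (simplified null)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    best = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)
        best = max(best, homogeneous_fraction(shuffled, component_sizes))
    return best


# Toy data (hypothetical): three CCs, each drawn from a single family.
labels = ["f1"] * 3 + ["f2"] * 3 + ["f3"] * 2
sizes = [3, 3, 2]
observed = homogeneous_fraction(labels, sizes)
null_best = null_max_fraction(labels, sizes, trials=1000)
```

An observed fraction far above the largest value reached under the null is the kind of evidence the paragraph above describes. A faithful reproduction would need the degree-preserving randomization instead of label shuffling.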

Comparison with other networks

It is evident that the HMM network is highly clustered. In fact, its clustering coefficient is 0.85, which, to our knowledge, is the highest among the biological networks studied so far. As shown by Newman

Conclusions

In this paper, we examined the properties of the network constructed for the HMMs in the SCOP protein structural classification database. A number of questions remain to be addressed in future research. For example, can we devise a computational method to measure the degree of redundancy or overlap between the HMMs that represent the same superfamily? This question matters given the ever-increasing number of large-scale genomic sequences, and therefore of protein sequences. If the redundancy of the HMMs of a superfamily can be measured, the logical next question is whether the redundancy of the HMM library can be reduced computationally. One possibility is to construct super-HMMs, each of which represents a collection of redundant HMMs, so that a protein sequence is scanned against a reduced set of super-HMMs rather than the entire set of overlapping and redundant HMMs. Finally, because the HMM network shows properties distinct from those of many documented networks, as discussed above, can we propose a theoretical model that better accounts for the observations in the current network? Moreover, as our HMM network is also weighted, with edges quantifying the similarity between two HMMs, future models could also incorporate weighted edges.

Methods

The SCOP library of HMMs (scop70_1.75.hhm.tar.gz) was downloaded from the website

HHsearch

To study the relationship of the HMMs, an undirected network

introduced in Freeman

The closeness centrality measures the number of steps required to access every other vertex from a given vertex, specifically, the closeness of a vertex

where _{a, i }
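As a hedged illustration of closeness centrality, the sketch below uses the normalized form (n - 1) divided by the sum of shortest-path distances from a vertex to the other n - 1 vertices of its CC. This normalization is an assumption: it is inferred from the reported closeness values, which reach 1.00 in fully connected CCs, and is not taken from the definition in the text. The function name and the toy star graph are hypothetical.

```python
from collections import deque


def closeness(adj, source):
    """(n - 1) / sum of shortest-path distances from source within its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    if total == 0:
        raise ValueError("closeness is undefined for an isolated vertex")
    return (len(dist) - 1) / total


# Toy component (hypothetical): a star with one hub and three leaves.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
hub_closeness = closeness(star, "hub")   # distances 1, 1, 1
leaf_closeness = closeness(star, "x")    # distances 1, 2, 2

# Consistency check against the tables: a vertex of degree 268 in a CC of
# 277 vertices, assuming the remaining 8 vertices are all at distance 2.
table_check = 276 / (268 + 2 * 8)
```

Under that assumption the check gives about 0.97, which agrees with the closeness reported for d1n26a1 in the third largest CC.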

The network clustering coefficient, C, also known as transitivity, measured by the ratio between the number of triangles and the number of connected triplets, was computed for the entire network. The number of connected components that are trees, where there are
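The transitivity described above can be sketched as the fraction of connected triplets that are closed, which equals three times the number of triangles divided by the number of connected triplets. The factor of three is the common convention and is assumed here, since the text states only the ratio. The function name and the toy graph are hypothetical.

```python
from itertools import combinations


def transitivity(adj):
    """Closed connected triplets divided by all connected triplets.

    adj maps each vertex to the set of its neighbors (undirected graph).
    """
    closed = 0
    triplets = 0
    for center, neighbors in adj.items():
        # Every pair of neighbors forms a triplet centered at this vertex.
        for a, b in combinations(sorted(neighbors), 2):
            triplets += 1
            if b in adj[a]:
                closed += 1
    return closed / triplets if triplets else 0.0


# Toy graph (hypothetical): triangle a-b-c with a pendant vertex d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
toy_transitivity = transitivity(toy)
```

Each triangle is counted once per corner, so the closed count here is 3 out of 5 triplets.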

To systematically study the consistency between the e-value cutoffs used to predict whether two HMMs belong to the same hierarchical level and the classification of the SCOP database, we examined Receiver Operating Characteristic (ROC) curves for the predictions made at different e-value cutoffs. An ROC curve shows how the true positive rate changes with the false positive rate for a classification. For example, at the family level, if two HMMs were classified into the same family by the SCOP database, the prediction based on a specific e-value cutoff is considered a true positive (TP) if the e-value of the pair is better (i.e., lower) than the cutoff and a false negative (FN) if it is worse (i.e., higher). If the two HMMs were not classified into the same family by the SCOP database, the prediction is considered a false positive (FP) if their e-value is better (i.e., lower) than the cutoff and a true negative (TN) if it is worse (i.e., higher). The same rules were applied to classify each pair of HMMs into the four categories (TP, FP, FN, and TN) at each of the four levels: class, fold, superfamily, and family. The true positive rate (i.e., sensitivity) was calculated as

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

and the false positive rate (i.e., 1 - specificity) as

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

An ROC curve was plotted for each of the four levels (i.e., class, fold, superfamily, and family) with e-value cutoffs ranging from 10^{-20} to 10^{-3}.
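The pair classification rules can be sketched as follows. The example is illustrative only: the e-values and labels are made up, and treating an e-value exactly equal to the cutoff as "worse" is an assumption.

```python
def roc_point(pairs, cutoff):
    """Return (FPR, TPR) for one e-value cutoff.

    pairs is an iterable of (evalue, same_group), where same_group is True
    when SCOP places both HMMs in the same group at the level of interest.
    """
    tp = fp = fn = tn = 0
    for evalue, same_group in pairs:
        predicted_same = evalue < cutoff
        if same_group and predicted_same:
            tp += 1
        elif same_group:
            fn += 1
        elif predicted_same:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr


# Hypothetical HMM pairs: (e-value, same family according to SCOP).
pairs = [
    (1e-30, True), (1e-12, True), (1e-2, True),
    (1e-15, False), (1e-1, False), (5.0, False),
]

# One point per cutoff from 10^-20 up to 10^-3.
curve = [roc_point(pairs, 10.0 ** -k) for k in range(20, 2, -1)]
```

Because a looser cutoff can only turn negative predictions into positive ones, both rates are non-decreasing as the cutoff moves from 10^{-20} to 10^{-3}.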

Authors' contributions

LZ, LTW, LSH conceived the research. LZ analyzed the data and wrote the manuscript. LZ, LTW, LSH revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors thank T Murali for discussion. The work was partially supported by USDA 2009-35205-05221 and AFRL Grant FA8650-09-2-3938 and AFOSR Grant FA9550-09-1-0153.