Phenoscape: Semantic analysis of organismal traits and genes yields insights in evolutionary biology

The study of how the observable features of organisms, i.e., their phenotypes, result from the complex interplay between genetics, development, and the environment, is central to much research in biology. The varied language used in the description of phenotypes, however, impedes the large scale and interdisciplinary analysis of phenotypes by computational methods. The Phenoscape project (www.phenoscape.org) has developed semantic annotation tools and a gene–phenotype knowledgebase, the Phenoscape KB, that uses machine reasoning to connect evolutionary phenotypes from the comparative literature to mutant phenotypes from model organisms. The semantically annotated data enables the linking of novel species phenotypes with candidate genes that may underlie them. Semantic annotation of evolutionary phenotypes further enables previously difficult or novel analyses of comparative anatomy and evolution. These include generating large, synthetic character matrices of presence/absence phenotypes based on inference, and searching for taxa and genes with similar variation profiles using semantic similarity. Phenoscape is further extending these tools to enable users to automatically generate synthetic supermatrices for diverse character types, and use the domain knowledge encoded in ontologies for evolutionary trait analysis. Curating the annotated phenotypes necessary for this research requires significant human curator effort, although semi-automated natural language processing tools promise to expedite the curation of free text. As semantic tools and methods are developed for the biodiversity sciences, new insights from the increasingly connected stores of interoperable phenotypic and genetic data are anticipated.


INTRODUCTION
There are over 20 million extant species on the planet, most of which can be described in relation to their unique and widely diverse phenotypes. Comparisons across species phenotypes, however, cannot yet readily be made using computer-assisted methods. This is because the rich legacy of comparative morphology has not yet been semantically enabled; that is, the corpus is in a free-text format that renders computation nearly impossible. This situation began to change almost two decades ago when model organism geneticists began representing the phenotypic changes resulting from experimental gene manipulations with terms from anatomy or phenotype ontologies that they developed for each model organism (e.g., Sprague et al. 2001). More recently, the opportunity to enable interoperability from the phenotypes of biodiverse species to candidate genes from model species (Mabee et al. 2007a, 2007b) motivated the Phenoscape team to develop one of the first multispecies anatomy ontologies, the Teleost Anatomy Ontology (Dahdul et al. 2010b), based initially on the Zebrafish Anatomy Ontology (Ruzicka et al. 2015).
Developing ontologies appropriate for biodiversity, including taxonomy ontologies (Midford et al. 2013) and scaling them up first to the level of teleost fishes (Dahdul et al. 2010), then to the level of vertebrates (Dahdul et al. 2012) and then to the level of metazoans (Mungall et al. 2012; Haendel et al. 2014), further enabled the automation of phenotypic comparisons across vertebrate species and discovery of candidate genes underlying evolutionarily novel phenotypes by the team (Edmunds et al. 2016). Over the past ten years a broad community of scientists invested in the development of shared community ontologies (e.g., Gkoutos et al. 2005; Haendel et al. 2008, 2014; Dahdul et al. 2014), annotation tools (Balhoff et al. 2010, 2014a; Yoder et al. 2010; Cui et al. 2016; The Gene Ontology Consortium 2017) and formats (Dahdul et al. 2010a; Vos et al. 2012) for phenotype annotation across biodiverse species (Dahdul et al. 2010a). These resources have made computational analyses possible and they have been leveraged to build a wealth of innovative applications (e.g., Deans et al. 2012; Mullins et al. 2012; Balhoff et al. 2013; Dececchi et al. 2015; Manda et al. 2015; Druzinsky et al. 2016; Jackson et al. 2018) across a variety of biodiversity-based research. The Phenoscape Knowledgebase (KB) (Figure 1) demonstrates these connections by integrating gene phenotype annotations from model organism databases with phenotype annotations from the biodiversity literature (Table 1). Compelling demonstrations of the utility of semantics for biodiversity studies are important because of the large and expensive investments in infrastructure and tool development required to curate the legacy literature and move the publication of phenotypic data into a natively semantic form. To date, only a small proportion of the biodiversity literature has been annotated semantically, and no publisher, to our knowledge, tags phenotypes with ontological terms that would support interoperability. The comparative study of organismal phenotypes, however, motivates research across diverse fields of biology, including evolution, paleontology, developmental biology, agriculture, and the veterinary and health sciences (Deans et al. 2015).
The efficiency and potential of fundamental discoveries in the biodiversity arena would be dramatically expanded by the increased use of semantics.

Relating biodiverse phenotypes to candidate genes
Identifying the genetic and developmental changes that brought forth the incredible phenotypic diversification of life is a recalcitrant problem, but one where a basic semantic approach has shown promise and where more sophisticated approaches using semantic similarity may yet be even more valuable. Semantic similarity enables comparison and analysis of semantic annotations between entities (genes, taxa) using ontologies and computational reasoners to compute scores that reflect the level of similarity (e.g., Washington et al. 2009 [...] 2015; see examples in Chapter 10). The Phenoscape team showed that ontology-driven information systems can generate thousands of testable hypotheses relating unique morphologies from non-model biodiverse species to candidate genes (Mabee et al. 2012). One of these, for example, connected the unique loss of a tongue ('basihyal element') in catfishes (Siluriformes) with several candidate genes from the zebrafish data. Edmunds et al. (2016) experimentally tested the candidates by examining their endogenous expression patterns in the channel catfish, Ictalurus punctatus, and found results consistent with the in silico hypothesis that the tongue evolved through disruption in developmental pathways at, or upstream of, brpf1.
The Phenoscape team recently extended this approach (Manda et al. 2015) by using semantic similarity to find matches between the full set of phenotypes described for a gene and the unique set of phenotypes that characterizes a clade of species, i.e., an 'evolutionary phenotype profile'. The effects from a gene knockdown range from several to hundreds of phenotypes, and the goal is to compare these in their entirety to the calculated set of phenotypes that are variable among the immediate descendants of a particular taxon. Using semantic similarity, the Phenoscape KB performs fuzzy matching between suites of phenotypes, and displays the taxonomic groups that vary in phenotypes that match most closely to the gene profile that results when the action of that gene is disrupted (e.g., knocked down). The user interface provides the statistical support for each match and allows the supporting evidence to be examined. There are some important caveats that must be considered when interpreting the results, such as the potential for some matches to result from differences in annotation coverage between genetic and evolutionary studies in the KB.
[...] particular taxon under consideration? That is, a biologist who is curious about the genetic basis of taxonomic diversity might want to find genes that have phenotypes that resemble the phenotypic variation exhibited by a particular taxon.
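The profile matching described above can be illustrated with a minimal sketch. It is not the Phenoscape KB's implementation: the toy hierarchy, the Jaccard score over ancestor sets, and the best-match-average aggregation are all assumptions chosen for brevity, and the function names are hypothetical. No statistical support is computed here.

```python
from statistics import mean

# Toy hierarchy (child -> parents). Term names are illustrative only and
# are not extracted from Uberon or any other real ontology.
PARENTS = {
    "pectoral fin": ["fin"],
    "pelvic fin": ["fin"],
    "fin": ["appendage"],
    "basihyal bone": ["bone"],
    "vomer": ["bone"],
    "bone": ["skeletal element"],
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def term_similarity(a, b):
    """Jaccard similarity of the two terms' ancestor sets (assumed metric)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)

def profile_similarity(profile_p, profile_q):
    """Best-match average: pair each term with its closest match in the
    other profile, in both directions, and average the two directions."""
    forward = mean([max(term_similarity(a, b) for b in profile_q) for a in profile_p])
    backward = mean([max(term_similarity(a, b) for a in profile_p) for b in profile_q])
    return (forward + backward) / 2

def rank_taxa(gene_profile, taxon_profiles):
    """Rank taxa by the similarity of their variation profile to a gene profile."""
    scores = {taxon: profile_similarity(gene_profile, profile)
              for taxon, profile in taxon_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical profiles: a gene whose disruption affects two structures,
# and two taxa whose descendants vary in the listed structures.
gene_profile = ["pectoral fin", "basihyal bone"]
taxon_profiles = {
    "taxon A": ["pelvic fin", "basihyal bone"],
    "taxon B": ["vomer"],
}
ranking = rank_taxa(gene_profile, taxon_profiles)
print(ranking)
```

In this toy case 'taxon A' ranks first because it shares one exact term and one sibling term with the gene profile, whereas 'taxon B' matches only through a shared parent. This is the kind of fuzzy match that exact term lookup would miss.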

Future applications of semantic similarity to phenotypes of biodiverse taxa
Questions of whether a particular combination of phenotypes in a taxon is unique, or what it might be similar to, are the types of broad questions that may be addressed in applying semantic similarity-based data mining to phenotypes across diverse taxa. Semantic similarity would retrieve taxa with similar phenotypic profiles; such similarity may have arisen because of common ancestry or independent origin (a 'homoplasy finder'). As described by Braun et al. (Chapter 10), predictive phenomics can, for example, be used to target desired phenotypes in species of interest, and together with recent gene editing capabilities, functional genomic analysis can be newly brought to bear on biodiverse species. The Phenoscape KB currently enables users to view taxa with variation similar to the phenotypic profile of a gene (and vice versa). In the future, they will also be able to query one custom set of phenotypes against another or a taxonomically selected subset, and obtain a ranked list of taxa with similar phenotypes. For example, miniature fishes in the genus Paedocypris, like many fishes that are evolutionarily reduced to an extremely small body size, exhibit the absence of bones including the interhyal, vomer, parietal, posttemporal, and supraneurals (Britz and Conway 2009). Are there other taxa that lack a highly similar set of bones? Enabling a comparison of these phenotypes across diverse taxa would allow a user to query for such matches; in this case, matches would include [...]

Relating biodiverse phenotypes across studies: presence/absence
Addressing many of the questions in the biodiversity sciences involves knowing how a specific trait or set of traits has evolved across a group of species. Although the published literature is replete with research relating species and traits, and a few repositories hold phylogenetic trees, some of which are computed products from trait data, neither the traits nor the trees can be easily synthesized across studies. The OntoTrace tool was developed by the Phenoscape team (Balhoff et al. 2014b; Dececchi et al. 2015) to enable users to automatically pull together, from phenotype annotations made to published character matrices and monographic texts (Dececchi et al. 2015, 2016), a set of presence/absence data for specific traits for a set of taxa. For example, querying the Phenoscape KB for a supermatrix of traits of fins, limbs, girdles and their parts in sarcopterygian vertebrates (lobe-finned fishes and tetrapods), Dececchi et al. (2015) [...] (Uberon in this case), that a pectoral fin is present in that species (see Dececchi et al. 2015 and Jackson et al. 2018 for further examples). In this manner, the missing data in the variable character subset of the matrix (the subset containing only characters that include both present and absent states) was reduced from 98.5% to 78.2%. Further, 76% of the variable characters were made variable through the addition of inferred states. The authors pointed out that character conflicts and provenance reports from OntoTrace would support researchers' review of large aggregated data sets, and they showed how such machine reasoning enables quantification and new visualizations of the data, allowing the identification of undersampled character space.
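The way inference can fill a presence/absence matrix is sketched below. This is not OntoTrace itself: it assumes two simple rules (a part that is present implies its whole is present; a whole that is absent implies its parts are absent), and the part-whole links, taxon names, and function names are hypothetical.

```python
# Hypothetical part_of links (part -> whole); a toy stand-in for an anatomy ontology.
PART_OF = {
    "manual digit": "manus",
    "manus": "forelimb",
    "humerus": "forelimb",
}

def wholes(structure):
    """All structures that the given structure is (transitively) part of."""
    found = []
    while structure in PART_OF:
        structure = PART_OF[structure]
        found.append(structure)
    return found

def parts(structure):
    """All (transitive) parts of the given structure."""
    found = []
    for part, whole in PART_OF.items():
        if whole == structure:
            found.append(part)
            found.extend(parts(part))
    return found

def infer_states(asserted):
    """Extend one taxon's asserted states with inferred ones.

    Returns the extended states and the set of structures for which an
    inference contradicts an existing state (candidates for curator review).
    """
    states = dict(asserted)
    conflicts = set()
    for structure, state in asserted.items():
        implied = wholes(structure) if state == "present" else parts(structure)
        for other in implied:
            if states.get(other, state) != state:
                conflicts.add(other)
            else:
                states[other] = state
    return states, conflicts

def missing_fraction(matrix, characters):
    """Fraction of taxon-by-character cells that have no state."""
    total = len(matrix) * len(characters)
    filled = sum(1 for row in matrix.values() for c in characters if c in row)
    return (total - filled) / total

characters = ["manual digit", "manus", "forelimb", "humerus"]
asserted = {
    "taxon A": {"manual digit": "present"},
    "taxon B": {"forelimb": "absent"},
}
inferred = {taxon: infer_states(row)[0] for taxon, row in asserted.items()}

print(missing_fraction(asserted, characters))  # before inference
print(missing_fraction(inferred, characters))  # after inference
```

With these toy inputs, two assertions become seven filled cells out of eight, mirroring on a small scale the reduction in missing data reported above. Returning conflicts separately keeps contradictory evidence visible rather than silently overwriting it.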

Relating biodiverse phenotypes to phylogenetic trees
Using ontologies and machine reasoning to automatically generate large, synthetic character matrices of presence/absence phenotypes (as per above) set the stage for the research of Jackson et al. (2018), who took this a step further. They developed a bioinformatic pipeline to propagate data that was asserted to higher-level taxonomic nodes, to descendant species that were missing data. Similar to Dececchi et al. (2015), they showed that such logic inference significantly extended the asserted data (missing data were reduced from 98.0% to 85.9%), but additionally they showed the value of taxonomic data propagation, which extended the data further, reducing missing data to 34.8% (Jackson et al. 2018). Using the resultant matrix along with a synthetic phylogeny from the Open Tree of Life (Hinchliff et al. 2015), they mapped the full trait data set for 12,582 species to the tree and addressed the question of how often paired fins were lost in teleost fishes and whether they were ever regained (Jackson et al. 2018).
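Taxonomic data propagation of the kind described above can be sketched as a walk up the taxonomy from each species to the nearest ancestor that carries an asserted state. The sketch is an assumption about the general idea, not the pipeline of Jackson et al. (2018); the small taxonomy, the asserted states, and the function name are illustrative.

```python
# Illustrative taxonomy (child -> parent); only a fragment, for demonstration.
PARENT_TAXON = {
    "Ictalurus punctatus": "Ictalurus",
    "Ictalurus furcatus": "Ictalurus",
    "Ictalurus": "Ictaluridae",
    "Ictaluridae": "Siluriformes",
}

# Illustrative assertions: one made at the level of the order, one at species level.
ASSERTED = {
    "Siluriformes": {"basihyal element": "absent"},
    "Ictalurus punctatus": {"pelvic fin": "present"},
}

def propagate(asserted, species, characters):
    """Build a species-by-character matrix, filling each missing cell from
    the nearest ancestor taxon that has an asserted state for that character."""
    matrix = {}
    for sp in species:
        row = {}
        for character in characters:
            taxon = sp
            while taxon is not None:
                state = asserted.get(taxon, {}).get(character)
                if state is not None:
                    row[character] = state
                    break
                taxon = PARENT_TAXON.get(taxon)  # None once the root is passed
        matrix[sp] = row  # characters without any state stay missing
    return matrix

species = ["Ictalurus punctatus", "Ictalurus furcatus"]
characters = ["basihyal element", "pelvic fin"]
matrix = propagate(ASSERTED, species, characters)
for sp, row in matrix.items():
    print(sp, row)
```

Species-level assertions take precedence over inherited ones because the walk starts at the species itself. States are never propagated sideways or upward, so 'Ictalurus furcatus' remains missing for 'pelvic fin' in this example.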
Looking ahead, if all published traits and trees were made computable using these methods, any user could automatically generate a matrix for a specified set of traits and map it on various synthetic tree topologies, which in turn would allow addressing a host of questions regarding the pattern and tempo of phenotypic evolution and associations with genomic and environmental (Thessen et al. 2015) variables.

Relating biodiverse phenotypes across studies: future work
As described above, OntoTrace generates synthetic morphological supermatrices for presence/absence characters only (Dececchi et al. 2015). Expanding this functionality to automatically synthesize characters of other qualities, such as shape, size, structure, and color, is a current challenge that the Phenoscape team is addressing. For example, whereas characters in a presence/absence matrix are by definition limited to two states per character, the number of possible states for characters in other categories is a priori unconstrained. Thus, automatically synthesizing characters that, for example, describe 'basihyal bone, shape', can result in a large number of states per character because every originally published state that semantically is some type of 'basihyal bone shape' would have to be appended as a new state to the synthesized character. In the case of this example, there may be seven distinct shape terms used in its annotation (Box 1). The ontological relationships indicate that some subsets of these states are more similar to each other than to the remaining states. By adapting current semantic similarity metrics for the purpose of character and character state aggregation, and in effect, homology assignment, these distinct shape descriptors can be consolidated into new, synthetic states (see matrix in Box 1).
Box 1. Assembling a synthetic character and its states for 'basihyal bone, shape'.
Step 1: Assemble list of 'shape' (PATO:0000052) quality terms for all characters and states from multiple publications that include the entity 'basihyal bone' (UBERON:0011618): [...]
[...] would allow a user to constrain the number of characters in a synthesized matrix by excluding those with low information content (e.g., those for high level terms from the anatomy ontology such as 'fin' vs. 'pectoral fin'). Thus, employing semantic reasoning in matrix construction will allow a user to balance the properties of a synthetic matrix between, on the one hand, containing highly specific characters (and thus increased missing data), and on the other, including lower specificity characters (and thus decreasing missing data).
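The consolidation of published shape terms into a few synthetic states, as in Box 1, can be sketched as follows. This is a deliberate simplification: instead of computing pairwise semantic similarity, it cuts a toy hierarchy one level below 'shape'. The parent links are assumptions that mirror the groupings listed in Box 1; they are not extracted from PATO, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy hierarchy (term -> parent), arranged to mirror the groupings in Box 1.
# These links are assumed for illustration and are not taken from PATO.
TOY_PARENT = {
    "blade-like": "sharp",
    "pointed": "sharp",
    "tapered": "sharp",
    "upturned": "curved",
    "curved ventral": "curved",
    "spiny": "surface feature shape",
    "folded": "surface feature shape",
    "sharp": "shape",
    "curved": "shape",
    "surface feature shape": "shape",
}

def top_level_group(term, root="shape"):
    """Climb the toy hierarchy until reaching a direct child of the root."""
    while TOY_PARENT[term] != root:
        term = TOY_PARENT[term]
    return term

def synthesize_states(published_terms):
    """Group published terms under shared ancestors and number the groups
    as synthetic states in order of first appearance."""
    groups = defaultdict(list)
    for term in published_terms:
        groups[top_level_group(term)].append(term)
    return {index: (label, members)
            for index, (label, members) in enumerate(groups.items())}

published = ["blade-like", "pointed", "tapered", "upturned",
             "curved ventral", "spiny", "folded"]
states = synthesize_states(published)
for index, (label, members) in states.items():
    print(index, label, members)
```

A similarity-based clustering would let the number of synthetic states vary with a threshold rather than with a fixed depth in the hierarchy, which is closer to the trade-off between character specificity and missing data discussed above.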
In addition to semantic tools for supermatrix construction, the Phenoscape team is developing enhanced semantics for addressing questions of trait evolution. Unlike the current tools available for analyzing molecular data, where each nucleotide site can be treated as independent of the others, evolutionary models for large morphological character matrices face significant challenges overcoming the strong conditional dependencies and correlations among morphological traits. Most existing methods ignore such dependencies and treat morphological characters as independent. By leveraging domain knowledge relevant to assessing correlations of the traits underlying the characters, Phenoscape is developing tools that enable users to incorporate evidence of the relatedness of traits in a morphological matrix into models of character evolution. These include measures of trait independence based on ontological relationships, distance (semantic similarity) of traits in the knowledge graph, and measures of genetic overlap (as derived from gene-phenotype annotations from model organism databases). Such dependencies can be directly built into the macroevolutionary model, or can be used to inform prior probabilities in Bayesian analyses when grouping traits into modules with shared evolutionary parameters or dynamics.
One of the challenges in conducting semantic similarity comparisons is the computational overhead of comparing EQ phenotypes over a large ontology space. Improvements in scalability [...] effect might be greater for more distant species comparisons. Current efforts include editing and clarifying the homology relationships in the Uberon ontology and investigating how reasoning on different models of homology affects information retrieval in the KB.
Another challenge for the broader application of semantics to biodiversity data is the significant, largely manual, effort necessary to annotate phenotypes from the published literature (Dahdul et al. 2015). Natural language processing tools are needed going forward to autoannotate the legacy literature (Arighi et al. 2013; Cui et al. 2015; in prep). Further, semantic phenotype data may in the future increasingly come directly from publications, as semi-automated methods for marking up manuscripts at the time of publication become more accurate, mature, and thus prevalent. Evaluating, and hence continuously improving, the accuracy of machine-generated annotations depends on expert-curated "gold standard" data sets. To this end, Phenoscape has developed the first gold standard dataset for biodiversity phenotypes (in prep).
Efforts to use ontologies in the process of new species descriptions are underway (Deans et al. 2012; Balhoff et al. 2013), and will contribute to achieving a vision of widely available linked species phenotype data.
As high-throughput phenotyping, typically involving image data collection, becomes more scalable, the application of semantic metadata would enable automated connections to the tools and computable datasets described herein. These digitization efforts can be new sources of phenotype information (Figure 1). Although broad domains of biology can be served if semantics are placed on digitized images and specimens, so far only a few projects are using semantics to label digitized specimens and their parts, despite promising prototypes (Maglia et al. 2007; Ramírez et al. 2007). If anatomical parts were tagged with ontology terms, then queries on basic trait distributions could be enabled (e.g., presence of pectoral fins in taxa a, b, c... [...] having a reduced information content compared to full Entity-Quality expressions, entity-only annotations have been shown to be informative for semantic similarity (Manda et al. 2016a).
Thus, new sources of phenotypic data, such as those for specimens of extinct and extant taxa associated with institutional collections, can easily be made interoperable through shared semantics (Figure 1).

CONCLUSIONS
Over the past 10 years the development of shared cross-species community ontology resources such as Uberon and PATO has enabled interoperability of phenotype and genotype data. This in turn enables a wealth of potential applications and discoveries from semantic analysis of biodiverse taxa. Scientific attention continues to move toward gaining a deeper fundamental understanding of the developmental and evolutionary relationship between genotype and phenotype. The profound scale and scope of this problem will not only require interoperable big data, both genomic and phenomic, from a biodiverse set of taxa, but also new ways of using machines to enable this understanding. The applications of semantic analysis described herein only scratch the surface of what is possible. As scientific publication moves to incorporate semantic markup of phenotype data, and semi-automated tools are improved to annotate the phenotype legacy literature, knowledge of the rich phenotypic palette of life on our planet can be exposed to machine computation with great advantage to fundamental discovery across the life sciences.

[...] interoperable resources. During the course of this work the Phenoscape project has been supported by NSF awards 1062404, 1062542, 0641025, 1661529, and the National Evolutionary Synthesis Center (NSF 0905606 and 0423641). This manuscript is based in part on work done by P.M.M. while serving at the U.S. National Science Foundation. The views expressed in this paper do not necessarily reflect those of the National Science Foundation or the United States Government.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.26988v1 | CC BY 4.0 Open Access | rec: 13 Jun 2018, publ: 13 Jun 2018

Figure 1. Flow chart of currently existing data sources and tools (solid borders and lines) in the [...]

[...] the ricefishes in the family Adrianichthyidae (Wiley and Johnson 2010), which similarly lack the interhyal, vomer, and supraneurals, and other bones such as the supracleithrum. Further, adrianichthyids may lack or possess extremely small or absent parietal bones and have structurally simple posttemporal bones, which biologists may recognize as reductive phenotypes on a continuum close to 'absent'. Methods that incorporate a framework of probabilistic reasoning for phenotype relatedness (e.g., Bauer et al. 2012) have the potential to improve precision of ontology-based queries.

Step 2: Apply semantic similarity to above list of PATO terms for basihyal bone. Because of higher similarity among terms, three states (0, 1, 2) are generated from the seven phenotypes:
Character 1: Basihyal bone: shape
Synthetic State 0: 'sharp' (PATO:0001419) (includes 'blade-like', 'pointed', 'tapered')
Synthetic State 1: 'curved' (PATO:0000406) (includes 'upturned', 'curved ventral')
Synthetic State 2: 'surface feature shape' (PATO:0001925) (includes 'spiny', 'folded')

The Phenoscape team is now developing semantic similarity-based methods to cluster phenotypes across different character categories into characters and states, thus automating matrix construction, and enabling users to optimize the matrix for a variety of metrics. This [...]