Quantifying the Specificity of Gene Ontology Terms
Gil Alterovitz (1,2,3,4), Michael Xiang (5), and Marco F. Ramoni (2,3,4)
(1) Division of Health Sciences and Technology, Harvard University and Massachusetts Institute of Technology, Boston, MA. (2) Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA. (3) Children’s Hospital Informatics Program, Boston, MA. (4) Harvard Partners Center for Genetics and Genomics, Harvard Medical School, Boston, MA. (5) Department of Biology, Massachusetts Institute of Technology, Cambridge, MA.
Harvard Medical School
New Research Building, Room 250
77 Avenue Louis Pasteur
Boston, MA 02115
Running Title: Quantifying Specificity of Gene Ontology Terms
Keywords: Gene Ontology, Probabilistic Methods, Information Theory
An ever-increasing amount of information is being analyzed with the help of hierarchical ontologies, such as the Gene Ontology (GO). While lower levels in hierarchies like GO generally increase in specificity, the information content of nodes across a single ontology level is not uniform, which may bias any analysis that assumes a direct correspondence between ontology level and node specificity. This can lead to incorrect conclusions and reduced discovery potential in gene enrichment and analysis due to inefficient selection of terms for analysis. Ontology partitions represent a new method by which to select a set of ontology nodes (e.g., GO terms) having similar specificity with the aid of information theoretic concepts. We applied this approach within the Gene Ontology and validated that our method provides sets of nodes that are closer in information content to the theoretical ideal, when compared to sets of nodes derived from the graphical structure of the ontology (p < 1x10^-5).
The complexity of biological data has necessitated the creation of hierarchical ontologies. Consequently, there is growing interest in the field of ontology research. According to cBiO, there are over 40 Open Biomedical Ontologies (OBO) that have been or are under development (www.bioontology.org). Such ontologies often have thousands of nodes that increase in number with subsequent updates. As such, the actual ontologies lend themselves to analysis in order to be used more effectively [1].
The Gene Ontology [2, 3] is one of the classic hierarchical ontologies used in genome research, comprising approximately twenty thousand terms. It is a directed acyclic graph whose nodes represent terms dealing with molecular functions, cell components, or biological processes; edges connecting nodes delineate dependence relationships. The Gene Ontology has been widely used in genome research applications ranging from, among others, predicting function from annotation patterns to predicting biological processes based on temporal gene expression [4-7].
A number of methods have been developed to analyze gene enrichment using GO, including FatiGO [8], GeneInfoViz [9], and DAVID [10]. These tools greatly contributed to the field in terms of integrating ontology-based information with genomic analysis and often use variants of the Fisher exact test to test whether differentially expressed genes can be best categorized in a given GO term. To determine the level of specificity, they allow input of the hierarchical level (within the GO directed acyclic graph) at which analysis is done. These methods typically use metrics such as shortest and/or longest paths, given n steps (where n is the level of specificity). The implicit assumptions of these methods are that GO levels correlate with specificity and that terms within the same GO level are of similar information content. Additionally, this approach presents inherent imprecision, because it must conform to discrete GO term levels, rather than the actual degree of information (which is continuous, not discrete) contained in the GO terms themselves. Figure 1 shows how levels under the “biological_process” branch (GO:0008150) are defined here.
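The difference between shortest-path and longest-path level definitions can be illustrated with a minimal sketch. The toy graph below is hypothetical (its edges are chosen for illustration and are not taken from the real GO), and counting the root as level 0 is an assumption.

```python
from functools import lru_cache

# Hypothetical toy DAG: each term maps to its list of parent terms.
PARENTS = {
    "biological_process": [],
    "cellular process": ["biological_process"],
    "metabolism": ["biological_process"],
    "macromolecule metabolism": ["metabolism"],
    "protein metabolism": ["macromolecule metabolism"],
    "cellular protein metabolism": ["cellular process", "protein metabolism"],
}


@lru_cache(maxsize=None)
def longest_path_level(term: str) -> int:
    """Level = number of edges on the longest path from the root to the term."""
    parents = PARENTS[term]
    if not parents:
        return 0  # assumption: the root sits at level 0
    return 1 + max(longest_path_level(p) for p in parents)


@lru_cache(maxsize=None)
def shortest_path_level(term: str) -> int:
    """Level = number of edges on the shortest path from the root to the term."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + min(shortest_path_level(p) for p in parents)


term = "cellular protein metabolism"
print(term, longest_path_level(term), shortest_path_level(term))
```

In this toy graph the same term receives level 4 under the longest-path definition and level 2 under the shortest-path definition, which shows how a term's "level" depends on the path metric rather than on its annotation frequency.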
One common issue in hierarchical ontologies is deciding the level of specificity to use in the analysis. In GO, gene expression analysis can be done at the level of “macromolecular metabolism,” a relatively general category, or “terpene metabolism,” a very specific category. This issue of ambiguous term specificity and ontology design has been cited previously in the literature as having hindered genomic analysis methods and their performance [11, 12]. On the one hand, analysis using GO terms that are too general may overlook significantly represented biological markers because many genes in the background genome are also annotated by the general GO terms. In contrast, the use of GO terms that are too specific for the application at hand can result in the same problem, because too few (perhaps zero) genes in the data set are annotated by the GO terms used in the analysis.
Information theory [13, 14] has shown that distributing objects (e.g. genes, proteins) evenly across a set of bins (e.g. GO terms) maximizes the information that can be gained about the system in a random observation. Here, we have developed an information-based framework for dividing an ontology into sets of nodes that have a uniform level of information. A set of such nodes, therefore, partitions the information in the GO ontology into terms having similar information content.
In order to implement our information theory approach, we required a method of calculating the amount of information represented by a node in the Gene Ontology. Intuitively, a node that annotates a large number of genes provides little information about the gene. For example, the GO node “cellular process,” which annotates approximately 40% of human genes, reveals very little about the actual biological function of a gene. On the other hand, nodes observed rarely among genes provide greater amounts of information. Thus, the GO node “carbohydrate metabolism,” which annotates fewer than 2% of human genes, provides a much clearer, more precise description of gene function. Mathematically, the information content of a GO node correlates inversely with the frequency of its annotation. More explicitly, the information content (in bits) of a GO node V_n is the “surprisal,” or self-information [15], of the node (see “Methods” section for details).
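As a minimal sketch of this quantity, the standard definition of self-information is I = -log2(p). Here p is assumed to be estimated as the fraction of genes annotated by the node; the function name and the gene counts are hypothetical, chosen only to reproduce the percentages quoted above.

```python
import math


def information_content(n_annotated: int, n_total: int) -> float:
    """Self-information (surprisal), in bits, of a node that annotates
    n_annotated out of n_total genes: I = -log2(n_annotated / n_total)."""
    if not 0 < n_annotated <= n_total:
        raise ValueError("need 0 < n_annotated <= n_total")
    return -math.log2(n_annotated / n_total)


# Hypothetical counts matching the frequencies quoted in the text.
cellular_process_bits = information_content(400, 1000)  # 40% of genes
carbohydrate_bits = information_content(20, 1000)       # 2% of genes

print(round(cellular_process_bits, 2), round(carbohydrate_bits, 2))
```

Under these assumptions, a node annotating 40% of genes carries about 1.3 bits, while a node annotating 2% of genes carries about 5.6 bits.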
Figure 2 provides the information content (in bits) of selected GO nodes in the context of the human genome (SwissProt/TrEMBL annotation). A larger number of bits indicates a higher level of information; annotation by the GO node conveys a higher amount of description and specificity. Since bit-wise information is defined by log base 2, an increase in one bit of information indicates a two-fold increase in descriptive specificity.
In this work, an information theoretical framework is used to quantify how much information is in each ontology term, using GO as a demonstrative application. Since an ontology can be represented by a graph, our goal is to select a subset of n ontology term nodes with similar information content. As n increases, the specificity of elements in these sets should increase as well. These nodes need to cover all potential genes (collectively exhaustive), yet not overlap (mutually exclusive). In this work, we first describe the method for finding sets of GO terms having similar specificity. Then, we validate this approach by comparing our method to traditional approaches that assume GO term set uniformity based on the graphical structure of the ontology. Finally, we apply this framework to cellular pathway analyses to enable visual gene enrichment.
Quantifying GO Specificity
Using an information theoretic approach, Figure 1 illustrates the GO partition nodes, in the context of the GO DAG (directed acyclic graph), chosen for GO partitions consisting of 12 nodes. Here we have chosen to restrict selection of GO partition nodes to beneath the “biological_process” node (see Figure S1 for a 4 node partition example). Figure 1 and Figure S1 indicate that although the information contents of the GO partition nodes in each figure are similar, they may come from very different “GO levels” of the GO DAG, as defined by the “longest-path” approach used by others [8-10]. We thus decided to compare the standard deviation of information content for GO nodes chosen by the GO partition method and the GO level method. In Figure 3, “biological_process” served as the root node. Figure 3.a. shows the standard deviation of information content for nodes comprising GO levels 1-5 (which are used by DAVID) and for nodes comprising GO partitions of varying size. GO levels 1-5 consist of 11, 80, 383, 878, and 1340 nodes, respectively. Figure 3.b. shows the average information content for the two methods.
The optimal information content per node for a set of n nodes is defined using an inverse relation: a
gene chosen at random would be expected to be annotated by one node from the set of n nodes. Thus, as
n grows larger, each node in the set is expected to become more specific. Put another way, each node has probability 1/n of describing a randomly-chosen gene (see Methods).
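Since each of the n nodes has probability 1/n of describing a randomly chosen gene, the optimal information content per node is -log2(1/n) = log2(n) bits. The sketch below computes this optimum and compares a set of nodes against it. The helper `summarize` and the use of the population standard deviation are assumptions made for illustration, not the paper's stated procedure.

```python
import math
from statistics import mean, pstdev


def optimal_information(n: int) -> float:
    """Optimal bits per node for a set of n nodes: -log2(1/n) = log2(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log2(n)


def summarize(node_bits):
    """Mean, spread, and gap to the optimum for one set of nodes.
    node_bits holds the information content (bits) of each node in the set."""
    optimum = optimal_information(len(node_bits))
    avg = mean(node_bits)
    return {
        "mean": avg,
        "sd": pstdev(node_bits),  # assumption: population standard deviation
        "optimal": optimum,
        "gap": avg - optimum,     # positive means more specific than optimal
    }


# Optimal bits per node for the GO level sizes reported in the text.
for size in (11, 80, 383, 878, 1340):
    print(size, round(optimal_information(size), 2))
```

A positive `gap` corresponds to a set of nodes that is, on average, more specific than the optimum for its size.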
The results for all levels are shown in Figure S2. We found that the average information content for the GO partition approach was significantly closer to the optimal information curve as compared to the traditional GO level-based approach (p < 1x10^-5, see Methods section). In addition, the GO partition-based approach resulted in significantly lower variability (p < 0.01) in the information within each set compared to that of the GO level approach across a majority (9 out of 14) of the levels.
Figure 3 indicates that the information content of GO terms for a given GO level initially rises more steeply and is at all points higher than the information content of GO partition nodes. Since GO partition nodes are selected such that a gene is expected to be annotated by one GO partition node, the higher information content of nodes at a given GO level, compared to an equal number of GO partition nodes, is significantly farther from optimal. The higher amount of information suggests that GO level nodes are too specific and detailed for the number of nodes at a GO level, and thus GO enrichment may be overlooked because the specificity of the analyzed GO terms is not appropriate. By contrast, Figure 3 shows that the information content of GO partition nodes (red curve, with standard deviation error bars) matches the optimal information content (blue curve) well. At larger numbers of GO partition nodes, the mean information content is slightly above the optimal information content; this phenomenon is likely a result of the GO partition node selection algorithm, which excludes ancestors and descendants of nodes already added. The general branching structure of the GO DAG means that ancestors are sparser than descendants. Therefore, descendants are more likely to be selected as GO partition nodes as the selection algorithm advances, resulting in an apparent preference for higher, rather than lower, information content.
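The selection algorithm itself is described in the Methods section, which is not reproduced here. The following is only a hypothetical greedy sketch that is consistent with the two properties stated above: nodes are preferred when their information content is close to the optimum log2(n), and ancestors and descendants of already selected nodes are excluded. The function names, the toy graph, and the bit values are all illustrative assumptions. Unlike the method described in the text, this sketch does not guarantee that the chosen nodes are collectively exhaustive.

```python
import math


def ancestors(term, parents):
    """All ancestors of a term in a DAG given as term -> list of parents."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def select_partition(info_bits, parents, n):
    """Greedily pick up to n nodes whose information content is closest to
    log2(n), skipping any ancestor or descendant of a node already chosen."""
    target = math.log2(n)
    anc = {term: ancestors(term, parents) for term in info_bits}
    chosen = []
    # Closest to the target first; the term name breaks ties deterministically.
    for term in sorted(info_bits, key=lambda t: (abs(info_bits[t] - target), t)):
        related = any(term in anc[c] or c in anc[term] for c in chosen)
        if not related:
            chosen.append(term)
        if len(chosen) == n:
            break
    return chosen


# Hypothetical toy ontology and information contents (bits).
parents = {"root": [], "A": ["root"], "B": ["root"],
           "A1": ["A"], "A2": ["A"], "B1": ["B"]}
info_bits = {"A": 1.0, "B": 1.2, "A1": 2.1, "A2": 1.9, "B1": 2.6}

print(select_partition(info_bits, parents, 3))
```

In the toy example, once "A2" is chosen its ancestor "A" can no longer be selected, so the remaining choices drift toward descendants. This mirrors the preference for higher information content noted above.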
Gene Enrichment Application
Finally, allowing the user to specify the number of GO nodes to select makes possible the visual partitioning of genes, as in an interaction or regulatory network. For instance, each GO partition node of a GO partition may be assigned a color, and genes are “colored” if they are annotated by a particular GO partition node. Whereas the use of GO levels quickly leads to an intractable number of GO nodes (in the hundreds or thousands, as shown in Figure 3.a. and Figure 3.b.), our method provides a way of selecting a manageable number of GO nodes, such as 10 or 20, which can be used to partition a graph, color a network, test for enrichment, or deduce the most representative and relevant GO terms at a particular level of information. In addition, the user is not able to select the number of GO nodes for visual partitioning with the use of GO levels, but is easily able to select the size of a GO partition for analytical purposes. Of course, the user can also choose to select hundreds of nodes by our information-based approach, which would offer the level of scope provided by current use of GO term levels combined with greater precision and consistency of information.
We conclude this section by giving some examples of the applications discussed above. Figure 4 (contrast with Figure S4) illustrates two cases where GO partitions have been applied to a biologically significant group of genes. In Figure 4.a., secretin-like class B G-protein coupled receptors were analyzed by a GO partition comprising five nodes derived from “biological_process” as root. An edge between a GO node and a gene indicates that the gene is annotated by that GO node. Visual enrichment of “cell communication” is immediately apparent, which is confirmed by a highly significant p-value of 1.22x10^-18. By contrast, this class of receptors is only very sparingly involved in “protein metabolism,” “establishment of localization,” “regulation of physiological process,” and “regulation of cellular process.” In Figure 4.b., the proteins involved in the proteasome pathway were similarly analyzed with a GO partition of six nodes derived from “biological_process” as root. Three of the proteins were not annotated by any of the six GO partition nodes, and thus were assigned to “other.” Again, visual enrichment is immediately observed in “cellular protein metabolism” as well as “biopolymer metabolism.” Indeed, the enrichment p-values for these GO terms are 6.41x10^-4 and 0.043, respectively. By contrast, genes involved in the proteasome pathway appear under-represented for transport, signal transduction, and metabolism of nucleotides and related molecules. In addition, the GO term “regulation of metabolism” is not connected to any of the genes, indicating that genes of the proteasome pathway mainly serve as regulatory targets and in processing roles, rather than affecting metabolic regulation themselves. Therefore, GO partitions have been used here to clarify the functional significance of various pathways and gene families in a visually striking manner.
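This section does not state which test produced the enrichment p-values above. As an illustration only, the sketch below assumes a one-sided hypergeometric test, which is the over-representation form of the Fisher exact test mentioned in the introduction. The function name and the gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """One-sided hypergeometric tail P(X >= k).

    N: genes in the background, K: background genes annotated by the GO node,
    n: genes in the group of interest, k: group genes annotated by the node.
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical example: 2 of 4 selected genes carry a term that
# annotates 5 of 20 background genes.
p = enrichment_p_value(k=2, n=4, K=5, N=20)
print(p)
```

A small p-value indicates that the group contains more genes annotated by the node than would be expected from random draws from the background.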
Moreover, a GO partition may be used to “color” a graph or network of genes by assigning each GO partition node a color. Such a graph may be based on gene regulation relationships, protein-protein interactions, or concurrency of metabolic pathways. Figure 5 illustrates the use of a seven-node GO partition to color genes involved in the bone morphogenetic pathway. An edge between two genes indicates an interaction between their protein products. The graph reveals that most of the genes with direct interactions are “colored” similarly, whereas genes not sharing an edge are less similar.
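The coloring step can be sketched as a simple lookup. The gene names, annotations, and color assignments below are hypothetical placeholders. Genes annotated by none of the partition nodes fall into “other,” as in the proteasome example.

```python
def color_genes(annotations, partition_colors, default="other"):
    """Map each gene to the colors of the GO partition nodes that annotate it.

    annotations: gene -> set of GO term names annotating that gene.
    partition_colors: GO partition node -> color.
    """
    colored = {}
    for gene, terms in annotations.items():
        hits = sorted(t for t in terms if t in partition_colors)
        colored[gene] = [partition_colors[t] for t in hits] or [default]
    return colored


# Hypothetical partition nodes, colors, and gene annotations.
partition_colors = {"cell communication": "red", "protein metabolism": "blue"}
annotations = {
    "geneA": {"cell communication", "signal transduction"},
    "geneB": {"protein metabolism"},
    "geneC": {"pigmentation"},
}

print(color_genes(annotations, partition_colors))
```

Returning a list per gene is a design choice for this sketch: it keeps the function usable even if a gene happens to be annotated by more than one partition node.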
An Information Theoretic Paradigm
We proposed that our method represents a more balanced and consistent approach to focusing on a certain level of specificity of GO annotation than the conventional method of relying on GO level. This idea was confirmed by comparing the standard deviation of information content (in bits) of GO nodes at a specific GO level with the standard deviation in information content of an equal number of GO nodes chosen by the information theory-based GO partition algorithm (Figure 3). As expected, GO partition nodes offer more consistent levels of information than GO level nodes, as shown by the lower standard deviation in bits of information across GO nodes.
Figure 3.b. demonstrates that, for GO partitions consisting of 1 to 100 nodes, the information content of GO partition nodes closely matches the optimal information content. In addition, whereas the use of “GO levels” locks the user into discrete levels that contain fixed numbers of GO nodes, our method is much more flexible in allowing the user to specify the number of GO nodes to select, which in turn allows precise selection of the desired level of information of the selected GO nodes. Thus, the user is not forced to conform to pre-determined GO levels, but instead is free to choose from the continuous range of information content represented by nodes in the Gene Ontology. In these ways, our method solves both the inconsistency and imprecision that result from relying on GO levels.
Using the information theoretic approach, significant patterns (e.g., gene enrichment) can be exposed that would be missed using the traditional GO level-based approach, as shown in the gene enrichment examples. Optimization based on GO term information reduces the need for multiple-testing corrections (which increase the adjusted p-values, resulting in potentially missed discoveries). As a result, the partition approach empowers investigators to make significant biological discoveries with fewer tests and smaller datasets.
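The effect of the number of tests can be illustrated with a Bonferroni correction. This is an assumption made for the sketch: the text does not name a specific correction, and the significance level of 0.05 is likewise assumed. The p-value and node counts are taken from the examples above.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


ALPHA = 0.05       # assumed significance level
P_VALUE = 6.41e-4  # p-value reported for "cellular protein metabolism"

# Six tests (a six-node GO partition) versus 1340 tests (all GO level 5 nodes).
for n_tests in (6, 1340):
    threshold = bonferroni_threshold(ALPHA, n_tests)
    print(n_tests, threshold, P_VALUE < threshold)
```

Under these assumptions, the same p-value remains significant when six nodes are tested but would not survive correction for the 1340 nodes of GO level 5.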
Gene Enrichment Applications
The applications of the GO partition method displayed in the results section (Figure 4 and Figure 5) show GO term partitions to be an information-theory based approach that combines consistency and precision to allow greater insight into biological systems. While the coloring of protein networks by GO annotation is not new, the choice of GO annotation with which to color may be improved. We have here introduced one such improvement over testing all GO terms for enrichment and over relying on conventional “GO levels.” The use of GO partitions enables visually striking graph coloring that has the added benefits of highly consistent information content of GO nodes used to color a graph or identify functional enrichment, as well as allowing the researcher full freedom to choose the number of GO nodes for partitioning or analysis. Without such consistency, an investigator may mistakenly conclude that a graph with an over-representation of a particular color (for a GO term) symbolizes a significant finding. For instance, Figure S3.a., S4.b., and S5 associate nodes with their GO level-based annotation. Here, very common processes (e.g., physiological process, cellular process) are shared by many of the proteins, offering little new information (since it would not be surprising if they were associated with many of the proteins). In addition, due to lack of information consistency, the GO level also includes very specific terms (e.g., pigmentation, viral life cycle) that are inappropriate alongside the very common processes, and also offer little new information (as they would be expected to be unrelated to most proteins). In contrast, GO term partitions offer a much more “information consistent” way of analyzing functional data (see Figure 4.a., Figure 4.b., and Figure 5). Here, we see more information on significant findings (e.g. biopolymer metabolism, cellular protein metabolism) compared to the GO level-based figures.
Dynamic Nature of Ontology Information
In biomedical research, ontologies often cannot be designed perfectly from the outset because not all of the terms or their annotation frequencies are known during the initial ontology design stage. Thus, encoding specificity in graphical hierarchical levels is difficult and prone to change as new terms and annotated genes are added. Our approach provides a way of capturing the actual specificity of an ontological term at any point given the current graphical structure and gene annotation. As a result, our method is designed to be robust in the face of structural changes to the ontology as well as annotation additions and revisions. Another way to deal with these changes would be to force consistency across term specificity and their hierarchical levels within the graph structure, which could be achieved by extending this paper’s framework to allow for the re-engineering of ontology graphical structures based on such
One of the new frontiers in ontologies involves incorporation of different ontologies into genomic analysis, thus adding additional complexity [16]. While the original ontologies were not defined with such integration in mind, probabilistic-based approaches such as the one described here can be extended to these domains as well. One can also use the information approaches described here to compare and combine ontologies into larger, meta-ontologies. For example, ontologies and databases such as GO [17], MIPS [18], YPD [19], and EcoCyc [20] contain overlapping information. The information approaches provide