Determining the relationship of a set of proteins from literature.

Note on definitions - I think I've mentioned before that we have our own special definition of a protein "super-family" when creating annotation, but it's probably worth spelling it out and it seems appropriate to do it here. We're using the term 'super-family' to describe a collection of proteins that can exist at any stage in a hierarchy except the lowest level. For example, in the following hierarchy -

        A          A is a parent of B and C,
       / \         B and C are children of A and siblings of each other,
      /   \        B is a parent of D and E, C is a parent of F and G,
     B     C       D and E are grandchildren of A, children of B and siblings
    / \   / \      of each other,
   D   E F   G     F and G are grandchildren of A, children of C and siblings
                   of each other.

As an illustrative example, A might represent the G protein-coupled receptors (GPCRs) as a whole (of which there are many), B might represent the dopamine receptors (which are a subset of GPCRs), and D and E might represent individual receptor subtypes (subsets of the dopamine receptors). In classical pharmacology/biology, only A would be termed a true super-family. However, in our definition, A, B and C are referred to as super-families, since they all have children. D, E, F, and G would be referred to as families.

###############################

Fingerprint type analysis using literature:

The relationship of the component sequences making up a fingerprint, also known as the fingerprint type, is one of the first things Prints annotators determine before adding information to a fingerprint, since this will affect both the format of the annotation and the type of information reported. There are several ways of determining fingerprint type from Swiss-Prot, some of which are implemented in Precis, and which have been documented in the Prints BioMinT specification document.
However, assuming that the analysis based on Swiss-Prot records has failed, or (more usually) the results are open to interpretation and require clarification, annotators usually turn to the literature. The simplest ways of determining/augmenting fingerprint type from the literature seem to lie in 'is a' or 'encodes a' relationships. For example -

> "We have cloned and sequenced a cDNA (JAK3) encoding a novel member of the
> JAK family of protein tyrosine kinases."

[translation - JAK3 is a member of the JAK family of protein tyrosine kinases]

Often protein relationships may also be indicated by the use of 'membership statements', such as X is a member of the family Y, etc. For example -

> "Janus kinases (JAK) play a crucial role in the initial steps of cytokine
> signalling. Each of the four members (JAK1, JAK2, JAK3, TYK2) of this
> non-receptor tyrosine kinase family is indispensable for the effects of
> distinct cytokines."

[translation - JAK1, JAK2, JAK3 and TYK2 are janus kinases, which are non-receptor tyrosine kinases]

It's conceivable that syntax, punctuation or brackets in conjunction with super-family titles may also indicate the composition of the super-family. For example -

> "The janus kinases (JAK) JAK1, JAK2, and TYK2 are protein tyrosine kinases,
> which play a pivotal role in the signal transduction process mediated by
> cytokines..."

[translation - JAK1, JAK2 and TYK2 are janus kinases, which are protein tyrosine kinases]

Or,

> "Four mammalian JAK family members have been identified: JAK1, JAK2, JAK3,
> and TYK2..."

[translation - JAK1, JAK2, JAK3 and TYK2 are (mammalian) members of the JAK family]

Or,

> "The Janus kinase family of proteins, with four mammalian members (JAK1,
> JAK2, JAK3 and TYK2),.."

[translation - JAK1, JAK2, JAK3 and TYK2 are (mammalian) members of the JAK family]

############################################
# Real life examples.
I've included here a set of "real life" statements which can be used to infer hierarchical relationships, and therefore fingerprint type, from the abstracts in the fprint_ir.tar collection I sent round a little while ago. Although it was tempting to deal with synonyms and function, since there's lots of overlap between those types of statements and those relating to protein hierarchies, I've limited the information to family relationships to try to keep it simple. For each statement I've included a PubMed ID, so the statement can be traced, plus a "translation" of what I think the important part of the statement is with respect to family relationships (enclosed in square brackets).

On the whole, there seemed to be enough information in the abstracts to extract at least some information on family relationships. The main problems were the domain examples. The only pointer I could find for kringle domains was the 'is found in' statement. Mining literature on notch repeats also suggests notch is a super-family rather than a domain. This is understandable, since there is indeed a notch super-family, members of which contain notch repeats. Resolving whether a fingerprint represents the notch super-family or the repeat is confusing enough manually, let alone automatically!

Also, I could find no information to place DAP3 (death-associated protein-3) into a hierarchy. This is reflected biologically - it isn't clear to which super-family DAP3 belongs. Finally, I was unable to determine a family relationship for dishevelled, which is a super-family with 3 children. However, there are only 3 dishevelled-related abstracts in the list, and it's likely that searching a wider variety of abstracts would help.

"Using an expression cloning strategy we have identified a novel membrane bound aspartic protease, BACE1 and demonstrated that it exhibits all known properties of beta-secretase."
[BACE1 is an aspartic protease]

"BACE2, a novel protease homologous to BACE1, was also identified, and the two BACE enzymes define a new family of transmembrane aspartic proteases."
[BACE1 and BACE2 are siblings and are members of the transmembrane aspartic protease family]

"The transmembrane aspartyl protease BACE has been identified as beta-secretase"
[BACE, also known as beta-secretase, is a transmembrane aspartyl protease]

"Here we report that Asp 2, a novel transmembrane aspartic protease, has the key activities expected of beta-secretase."
[Asp 2 is a transmembrane aspartic protease]

"BACE2, a novel protease homologous to BACE1, was also identified, and the two BACE enzymes define a new family of transmembrane aspartic proteases."
[BACE2 is a sibling of BACE1 and the two are transmembrane aspartic proteases]

"Beta-site amyloid precursor protein cleaving enzyme (BACE) is a novel transmembrane aspartic protease that possesses all the known characteristics of the beta-secretase involved in Alzheimer's disease."
[BACE is a transmembrane aspartic protease]

"Janus kinases (JAK) play a crucial role in the initial steps of cytokine signalling. Each of the four members (JAK1, JAK2, JAK3, TYK2) of this non-receptor tyrosine kinase family is indispensable for the effects of distinct cytokines."
[JAK1, JAK2, JAK3 and TYK2 are the 4 members of the janus kinase (JAK) family, which is a non-receptor tyrosine kinase family]

"Individual receptors associate with, or require, one or more of the three known family members including JAK1, JAK2, and tyk2."
[JAK1, JAK2 and tyk2 are siblings]

"We previously identified a novel protein tyrosine kinase gene, tyk2, by screening a human lymphoid cDNA library with a tyrosine kinase domain specific c-fms restriction fragment under low stringency hybridization conditions."
[tyk2 is a protein tyrosine kinase]

"Kringle domains are found in several plasma proteins of blood coagulation and fibrinolysis."
[Kringle domains are found in some plasma proteins]

"Members of the Notch gene family are thought to mediate inductive cell-cell interactions during development of a wide variety of vertebrates and invertebrates."
[Notch is a family (super-family), consisting of more than 1 member]

"Members of the Notch family (e.g. Notch1 and Notch3) have been recently described to play a critical role in T cell development and their constitutive activation has been related to T cell leukaemia in both animal models and human disease."
[Notch is a family (super-family), and has children called Notch1 and Notch3]

"G-protein-coupled receptors (GPCRs) form a large protein family that plays an important role in many physiological and pathophysiological processes."
[GPCRs are a large super-family]

"Here we report the molecular cloning, expression, localization, and functional characterization of a human G protein-coupled receptor that has the expected characteristics of a CysLT(2) receptor."
[CysLT(2) receptor may be a member of the G protein-coupled receptor super-family]

"The cysteinyl-leukotrienes (LT) activate another group called CysLT receptors, which are referred to as CysLT(1) and CysLT(2)."
[CysLT(1) and (2) are siblings, and children of the CysLT receptors super-family]

"Thus, the GPR40, GPR41, GPR42, and GPR43 genes, respectively, occur downstream from CD22, a gene previously localized on chromosome 19q13.1. The four putative novel human genes encode new members of the GPCR family and share little homology with GALR."
[GPR40, GPR41, GPR42, and GPR43 are members of the GPCR super-family]

"On the basis of structural information the TRP family is subdivided in three main subfamilies: the TRPC (canonical) group, the TRPV (vanilloid) group and the TRPM (melastatin) group."
[The TRP super-family has 3 children - the TRPC, TRPV and TRPM families]

"Mammalian homologues of the Drosophila transient receptor potential (TRP) channel gene encode a family of at least 20 ion channel proteins."
[Transient receptor potential (TRP) is a super-family with at least 20 children]

"TRP channel proteins constitute a large and diverse family of proteins that are expressed in many tissues and cell types."
[TRP channel proteins constitute a super-family]

"The TRP channels can be divided, on the basis of their homology, into three TRP channel (TRPC) subfamilies: short (S), long (L) and osm (O). From the evidence available to date, this subdivision can also be made according to channel function. Thus, the STRPC family, which includes Drosophila TRP and TRPL and the mammalian homologues, TRPC1-7, is a family of Ca2+-permeable cation channels that are activated subsequent to receptor-mediated stimulation of different isoforms of phospholipase C."
[The TRP channel super-family has 3 subfamilies/children: STRPC, LTRPC and OTRPC. STRPC has 9 further children - Drosophila TRP and TRPL, and TRPC1 to TRPC7]

"Mammalian transient receptor potential channels (TRPCs) form a family of Ca(2+)-permeable cation channels currently consisting of seven members, TRPC1-TRPC7."
[The mammalian TRPC super-family has at least 7 children termed TRPC1 to TRPC7]

"The PMP22/EMP/MP20 gene family includes four closely related proteins, peripheral myelin protein-22 (PMP22), epithelial membrane protein-1 (EMP-1), epithelial membrane protein-2 (EMP-2), and epithelial membrane protein-3 (EMP-3), which share amino acid identities ranging from 33 to 43%. In addition, the lens-specific membrane protein MP20 represents a more distant relative."
[The PMP22/EMP/MP20 super-family has 5 children: PMP22, EMP-1 to -3, and MP20, which is a more distantly related child of the super-family]
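The bracketed translations so far are all parent/child statements, so they can be written down as a small lookup table and checked against our definition (anything with children is a super-family, anything without is a family). What follows is only a minimal sketch: the `edges` table is transcribed by hand from two of the TRP translations, and the names `edges`, `classify` and `siblings` are made up for illustration and are not part of Precis or any Prints tool.

```python
# Parent -> children, transcribed by hand from two bracketed translations:
#   "The TRP super-family has 3 children - the TRPC, TRPV and TRPM families"
#   "The mammalian TRPC super-family has at least 7 children termed TRPC1 to TRPC7"
edges = {
    "TRP": ["TRPC", "TRPV", "TRPM"],
    "TRPC": ["TRPC%d" % i for i in range(1, 8)],
}


def classify(edges):
    """Label every node 'super-family' if it has recorded children, else 'family'."""
    nodes = set(edges)
    for children in edges.values():
        nodes.update(children)
    return {
        name: ("super-family" if edges.get(name) else "family")
        for name in sorted(nodes)
    }


def siblings(edges, name):
    """Return the other children of the first parent that lists `name`."""
    for children in edges.values():
        if name in children:
            return [child for child in children if child != name]
    return []


labels = classify(edges)
print(labels["TRP"], labels["TRPC"], labels["TRPC3"])
print(siblings(edges, "TRPV"))
```

Note that a label of 'family' here only means that no children have been recorded yet. TRPV comes out as a family simply because none of the statements entered say anything about its members, which is the same limitation the abstracts themselves impose.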
"Here we show that the mouse Frizzled-1, -2, -4 and -7 can bind to proteins of the PSD-95 family, which are implicated in the assembly and localization of multiprotein signaling complexes in the brain." [Frizzled is a super-family with at least 4 children Frizzled-1, -2, -4 and -7] "In Drosophila, two closely related serpentine receptors, Frizzled (Fz) and D-Frizzled2 (Fz2) are able to act as receptors for the secreted Wnt peptide," [The Drosophila receptors Fz and Fz2 may be siblings] "Tuberous sclerosis (TSC) is an autosomal dominant disorder caused by a mutation in either the TSC1 or TSC2 tumour suppressor gene. The disease is characterized by a broad phenotypic spectrum that can include seizures, mental retardation, renal dysfunction and dermatological abnormalities. TSC2 encodes tuberin, a putative GTPase activating protein for rap1 and rab5. The TSC1 gene was recently identified and codes for hamartin, a novel protein with no significant homology to tuberin or any other known vertebrate protein" [TSC1 and TSC2 are not siblings] "Claudins comprise a multigene family, and each member of approximately 23 kDa bears four transmembrane domains. To date, 15 members of this gene family have been identified." [Claudin is a super-family with at least 15 children] "The claudin superfamily consists of at least 18 homologous proteins in humans" [Claudin is a superfamily with at least 18 members] "Chemokine receptors have joined the ranks of other members of the G-protein -coupled receptor (GPCR) family in therapeutic potential as small-molecule chemokine receptor antagonists move from discovery to the clinic. Chemokine receptors belong to the rhodopsin family of GPCRs and, as such, are expected to be closely related in structure to other Class A members." 
[Chemokine receptors are members of the GPCR super-family (or, in more detail, members of the Class A branch of the GPCR super-family)]

"GABAA and GABAC receptors are members of a super-family of transmitter-gated ion channels that include nicotinic acetylcholine, strychnine-sensitive glycine and 5HT3 receptors."
[GABAA and GABAC receptors are members of the transmitter-gated ion channels super-family. Their siblings include nicotinic acetylcholine, strychnine-sensitive glycine and 5HT3 receptors]

Note that the preceding sentence states "Conformationally restricted analogues of GABA have been used to help identify three major GABA receptors, termed GABAA, GABAB and GABAC receptors." Since GABAB receptors aren't listed as members of the transmitter-gated ion channels super-family, but GABAA and GABAC are, we could infer that GABAB is not a sibling of GABAA and GABAC, which is true since GABAB is a metabotropic rather than ionotropic receptor.

"Fast synaptic inhibition in the brain is largely mediated by ionotropic GABA receptors, which can be subdivided into GABAA and GABAC receptors based on pharmacological and molecular criteria."
[The ionotropic GABA receptor super-family has two children - GABAA and GABAC]

"Rapid signaling across the synaptic junction is partially mediated by the ligand-gated ion channel superfamily (LGICS), which includes inhibitory glycine and GABA receptors and excitatory acetylcholine and serotonin receptors."
[The ligand-gated ion channel super-family has a number of children including inhibitory glycine and GABA receptors and excitatory acetylcholine and serotonin receptors]

"Inhibitory glycine receptors (GlyRs) are members of the nicotinic acetylcholine receptor superfamily and inhibit neuronal firing by opening Cl(-) channels following agonist binding."
[Inhibitory glycine receptors are members of the nicotinic acetylcholine receptor super-family]

Note - just to complicate matters further, glycine receptors are *not* nicotinic acetylcholine (nACh) receptors. The nACh receptor super-family is a misnomer based (if I remember correctly) on the fact that nACh receptors were the first receptors in this super-family to be properly characterised. A more accurate term would be nACh receptor-like super-family (or indeed ligand-gated ion channel super-family).

"nAChRs are pentameric transmembrane proteins into the superfamily of ligand-gated ion channels that includes the 5HT3, glycine, GABAA, and GABAC receptors."
[nAChR and its siblings, which include 5HT3, glycine, GABAA, and GABAC receptors, are children of the ligand-gated ion channel super-family]
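
As a rough illustration of how the 'membership statement' shapes discussed in this note might be picked up automatically, here is a minimal sketch using regular expressions. It only covers two of the shapes seen in the JAK examples (a bracketed list after "members", and a list after "members have been identified:"). The pattern list and the function names `split_names` and `extract_members` are my own assumptions for illustration, not anything implemented in Precis.

```python
import re

# Hypothetical patterns for two membership-statement shapes quoted in this note.
MEMBER_LIST_PATTERNS = [
    # "... members (JAK1, JAK2, JAK3, TYK2) ..."
    re.compile(r"members\s*\(([^)]*)\)"),
    # "... members have been identified: JAK1, JAK2, JAK3, and TYK2..."
    re.compile(r"members have been identified:\s*([^.]*)"),
]


def split_names(listing):
    """Split 'A, B, C and D' or 'A, B, and C' into individual names."""
    parts = re.split(r",|\band\b", listing)
    return [part.strip() for part in parts if part.strip()]


def extract_members(sentence):
    """Return the member names listed in a sentence, or [] if no pattern matches."""
    for pattern in MEMBER_LIST_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return split_names(match.group(1))
    return []


examples = [
    "Each of the four members (JAK1, JAK2, JAK3, TYK2) of this non-receptor "
    "tyrosine kinase family is indispensable for the effects of distinct cytokines.",
    "Four mammalian JAK family members have been identified: JAK1, JAK2, JAK3, "
    "and TYK2...",
    "The Janus kinase family of proteins, with four mammalian members (JAK1, "
    "JAK2, JAK3 and TYK2),..",
]
for sentence in examples:
    print(extract_members(sentence))
```

This is deliberately naive. Splitting on the word "and" would break any protein name that contains it, and statements such as the glycine receptor one above show that a matched membership statement still needs a human to decide what the named super-family really is.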