feature_cvterm vs feature_dbxref vs featureprop for feature annotations #74

Open
bradfordcondon opened this issue Nov 7, 2018 · 3 comments

Comments

@bradfordcondon
Contributor

bradfordcondon commented Nov 7, 2018

Hello,

@mpoelchau and I have been discussing how Tripal stores feature annotations loaded from GFF files. Consider a gene that has been annotated with GO terms, KEGG terms, proposed PFAM domains, and InterProScan family annotations.

My understanding of the Chado tables (which I want to emphasize is up for debate) is:

  • feature_cvterm is for annotating features in all of the cases I described above (GO, KEGG, PFAM), because some decision was made based on computational evidence to associate the feature with that annotation. The feature_cvtermprop table exists to store evidence codes, qualifiers, etc.
  • feature_dbxref is for storing references to that same record in another database. So it should only be used to link back to the feature itself on a different site. Gene families it is a part of, for example, wouldn't belong here.
  • featureprop: it's hard for me to distinguish when a term annotation is better suited as a featureprop. Props can have pubs for evidence, but there's no featurepropprop table for evidence codes. Also, the "value" field seldom makes sense when tagging with an annotation.
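
To make the three cases above concrete, here is a minimal sketch using an in-memory SQLite database. The table and column names follow Chado, but the schema is heavily simplified (real Chado has more columns and foreign keys into cvterm, dbxref, and pub), and every id, the `build_demo` helper, and the example values are made up for illustration.

```python
import sqlite3

# Simplified subset of the Chado tables discussed above. This is NOT the full
# Chado schema: ranks, constraints and most foreign keys are left out.
SCHEMA = """
CREATE TABLE feature_cvterm (
    feature_cvterm_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,
    cvterm_id INTEGER NOT NULL,
    pub_id INTEGER NOT NULL
);
CREATE TABLE feature_cvtermprop (
    feature_cvtermprop_id INTEGER PRIMARY KEY,
    feature_cvterm_id INTEGER NOT NULL
        REFERENCES feature_cvterm (feature_cvterm_id),
    type_id INTEGER NOT NULL,
    value TEXT
);
CREATE TABLE feature_dbxref (
    feature_dbxref_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,
    dbxref_id INTEGER NOT NULL
);
CREATE TABLE featureprop (
    featureprop_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,
    type_id INTEGER NOT NULL,
    value TEXT
);
"""


def build_demo():
    """Hypothetical helper: store one annotation of each kind for one gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)

    # All ids below are invented placeholders for rows in feature, cvterm,
    # pub and dbxref, which this sketch does not create.
    gene, go_term, null_pub = 1, 100, 1
    evidence_type, external_record, note_type = 200, 300, 400

    # 1. Term annotation: the feature is associated with a term ...
    cur = conn.execute(
        "INSERT INTO feature_cvterm (feature_id, cvterm_id, pub_id) "
        "VALUES (?, ?, ?)",
        (gene, go_term, null_pub),
    )
    # ... and the evidence code hangs off that association.
    conn.execute(
        "INSERT INTO feature_cvtermprop (feature_cvterm_id, type_id, value) "
        "VALUES (?, ?, ?)",
        (cur.lastrowid, evidence_type, "IEA"),
    )
    # 2. Cross reference: the same record in another database.
    conn.execute(
        "INSERT INTO feature_dbxref (feature_id, dbxref_id) VALUES (?, ?)",
        (gene, external_record),
    )
    # 3. Property: a typed value with nowhere to hang an evidence code.
    conn.execute(
        "INSERT INTO featureprop (feature_id, type_id, value) "
        "VALUES (?, ?, ?)",
        (gene, note_type, "free-text note"),
    )
    return conn
```

The asymmetry in the third bullet shows up directly: the evidence code has a home in feature_cvtermprop, while the featureprop row has only its own type and value.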

I'll add that this is the most definitive guidance I found in my search of the Chado wiki, in the sequence module manual:

Detailed annotations, such as associations to Gene Ontology (GO) terms or Cell Ontology terms, can be attached to features using the feature_cvterm linking table. This allows multiple ontology terms to be associated with each feature.
Provenance data can be attached with the feature_cvtermprop and feature_cvterm_dbxref higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using feature_cvterm. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) that are linked to the gene features.
Annotations for existing features can also go into the featureprop table using the Chado feature_property ontology (defined in chado/load/etc/feature_property.obo) and the comment or description terms as appropriate. The purpose of the feature property ontology (and the related chado/load/etc/genbank_feature_property.obo file) is to capture terms that are likely to appear in GFF or GenBank sequence files. In theory there is no overlap between these ontologies and the Sequence Ontology.

As for the GFF file holding the annotations:

The GFF spec states: Two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database. Ontology_term is reserved for associations to ontologies, such as the Gene Ontology. Dbxref is used for all other cross references. While there is no firm boundary line between these two concepts, curators tend to treat ontology associations differently and hence ontology terms have been given their own reserved attribute label.

Similarly, NCBI calls most things dbxrefs, using a much broader definition than the one I use above.

Here's the conflict. KEGG terms, for example, are not ontology terms. But when we read the GFF file, we parse Ontology_term values into feature_cvterm, Dbxref values into feature_dbxref, and everything else into featureprop. So for the annotations to go into feature_cvterm, they would need to be listed in the GFF under Ontology_term.
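
A rough sketch of that routing rule, to show where the conflict bites. This is not the Tripal loader's code: the `route_attributes` function and the `TARGET_TABLE` mapping are hypothetical, the attribute string is an invented example, and the sketch ignores structural tags such as ID and Parent, which a real loader would be expected to handle separately.

```python
from collections import defaultdict
from urllib.parse import unquote

# Hypothetical mapping from reserved GFF3 attribute tags to Chado tables.
# Any tag not listed here falls through to featureprop.
TARGET_TABLE = {
    "Ontology_term": "feature_cvterm",
    "Dbxref": "feature_dbxref",
}


def route_attributes(column9):
    """Group the values of a GFF3 attributes column by destination table."""
    routed = defaultdict(list)
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, raw = pair.partition("=")
        table = TARGET_TABLE.get(tag, "featureprop")
        # GFF3 separates multiple values with commas and percent-encodes
        # reserved characters inside a value.
        for value in raw.split(","):
            routed[table].append((tag, unquote(value)))
    return dict(routed)


# Invented example: the KEGG and InterPro accessions sit under Dbxref, so
# under this rule they land in feature_dbxref, not feature_cvterm.
example = (
    "Ontology_term=GO:0003676;"
    "Dbxref=InterPro:IPR000001,KEGG:K00001;"
    "Note=hypothetical protein"
)
routed = route_attributes(example)
```

With a rule like this, the destination table is decided entirely by which attribute the GFF author chose, not by what kind of thing the accession is.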

As Monica phrased her doubts:

With GO, I get it - a GO term refers to a formal, accessioned description of a gene function (e.g. http://amigo.geneontology.org/amigo/term/GO:0003676). A GO term does not also refer to a protein sequence - you annotate the protein sequence with the GO term. An InterPro accession is an accessioned ‘signature’ (which is a combo of HMMs, profiles, position-specific scoring matrices or regular expressions), which is annotated by curators with free-text descriptions from the literature. (And they can also be associated with a GO term). As such, I view InterPro domain accessions more as entries within a very authoritative database, rather than a controlled vocabulary. Although perhaps the domain name is enough to call it a controlled vocabulary at this point?

The consequence of these decisions is that we display featureprops, feature_cvterms, and feature_dbxrefs in different locations and in different ways to end users.

@bradfordcondon changed the title from "feature_cvterm vs feature_dbxref for feature annotations" to "feature_cvterm vs feature_dbxref vs featureprop for feature annotations" Nov 7, 2018
@childers

childers commented Nov 7, 2018

In my past experience, cvterms and dbxrefs have each been a pain point in Chado implementations. The flexibility is great, until you need to figure out which tables someone else decided to store the information in.

I'm totally on board with having some more guidance and standards, if only to make it easier for all of us to work together.

@spficklin
Contributor

spficklin commented Nov 7, 2018

@bradfordcondon I agree with your initial bullet list of what each table is meant to store. While KEGG terms and InterPro domains lack a formal OWL or OBO file (although there have been past attempts to create these, at least for KEGG as far as I remember), in my mind they serve the same purpose as an ontology, and I am inclined to store those associations with a genomic feature in the feature_cvterm table. A cvterm association is a "property" of a feature, so technically it could go into the featureprop table, but the existence of the feature_cvterm table implies to me that these types of "properties" should be handled separately. I would therefore be inclined to put GO/KEGG/InterPro annotations of a genomic feature all in the feature_cvterm table.

@scottcain
Member

I further agree with @spficklin. I would say that @bradfordcondon's first two bullet points are spot on, and while I agree that it can be somewhat difficult in some cases to tell the difference, I would posit that we generally "feel" the right answer (documenting feelings is admittedly difficult). Finally, featureprop is obviously a catch-all for "everything else". I imagine there will be cases where things that get put into featureprop should end up being moved elsewhere when a human looks at them.
