Export track data like in the Jbrowse 1 #3094

Open
Averstic opened this issue Jul 19, 2022 Discussed in #2810 · 18 comments
Labels
enhancement New feature or request high impact

Comments

@Averstic

Discussed in #2810

Originally posted by Marie-Lahaye March 15, 2022
Hi,

I was wondering if there is a way to export track data, as in JBrowse 1? I remember that we could export track data for the region that we were visualizing.

For example exporting genes in GFF3 format from a specific region:

image

Is there a similar feature in JBrowse 2?

Thanks for any answer you have for me !
Marie

@Averstic
Author

Would really be interested in this feature in JB2 as it is an essential way of sharing data with non-computational scientists.

@cmdcolin
Collaborator

thanks for adding interest in this @Averstic

what is the main feature that you generally use this for? is it the GFF export of a region?

@Averstic
Author

Yes indeed, the GFF export of a region would be of interest, along with the possibility to extract the reference sequence of the current view, in order to quickly extract sequence and annotation for import into other tools.

@Averstic
Author

Averstic commented Aug 3, 2022

@cmdcolin
Just interested, would you consider this feature as a hard feature to create?

@cmdcolin
Collaborator

cmdcolin commented Aug 9, 2022

@Averstic it could be somewhat challenging. there are two general approaches which are not necessarily mutually exclusive

  1. exporting chunks of the original source data file, which is more true to the original data but might not work well with things like REST APIs

  2. making a general data export system where any feature can be translated to some data format. This system is basically how jbrowse 1 does it, but there are odd corner cases in this that can be difficult to handle properly e.g. choosing what the appropriate file formats are for a given track, generating accurate serializations of arbitrary data, integrating with our plugin system etc. this general system can result in data conversions like vcf to gff, bam to gff, bigwig to bed or other things like that which could be good or bad depending on your point of view:)
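For illustration only, the second approach could be sketched roughly as a table of per-format serializers. This is a minimal TypeScript sketch, not JBrowse's actual code: `SimpleFeature`, `serializers`, and `exportFeatures` are hypothetical names, and the 0-based half-open coordinates on the feature are an assumption.

```typescript
// Hypothetical minimal feature shape; NOT the real JBrowse Feature interface.
interface SimpleFeature {
  refName: string
  start: number // assumed 0-based, half-open
  end: number
  type: string
  strand: 1 | -1 | 0
  name?: string
}

type Serializer = (f: SimpleFeature) => string

const strandChar = (s: 1 | -1 | 0): string =>
  s === 1 ? '+' : s === -1 ? '-' : '.'

// One serializer per output format; a plugin could add more entries.
const serializers: Record<string, Serializer> = {
  // BED keeps 0-based half-open coordinates.
  bed: f =>
    [f.refName, f.start, f.end, f.name ?? '.', 0, strandChar(f.strand)].join('\t'),
  // GFF3 is 1-based inclusive, so only the start shifts by one.
  // Attribute escaping is omitted here for brevity.
  gff3: f =>
    [
      f.refName,
      '.',
      f.type,
      f.start + 1,
      f.end,
      '.',
      strandChar(f.strand),
      '.',
      f.name ? `Name=${f.name}` : '.',
    ].join('\t'),
}

function exportFeatures(features: SimpleFeature[], format: string): string {
  const serialize = serializers[format]
  if (!serialize) {
    throw new Error(`no serializer registered for format "${format}"`)
  }
  return features.map(serialize).join('\n')
}
```

The corner cases mentioned above show up quickly in a design like this: every source type has to be squeezed into the same feature shape, which is how conversions such as vcf to gff end up being possible whether or not they are meaningful.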

@Averstic
Author

Averstic commented Sep 5, 2022

Thank you for this insight.

On option 1, it would still be necessary to recalculate some of the coordinates of the original source file, as those are genome wide coordinates, while it would be more useful to be able to extract 'local' coordinates.

On option 2. what would be the reason this feature was not carried over between the two versions? And how likely is it that this feature would make it to any future release?
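As a rough illustration of the 'local' coordinates idea under option 1, here is a minimal TypeScript sketch. The function name `toLocalInterval` is hypothetical, and it assumes 1-based inclusive coordinates throughout (as in GenBank output); features are clipped to the exported region before shifting.

```typescript
// Convert a genome-wide interval to coordinates relative to an exported region.
// All coordinates are assumed 1-based and inclusive.
// Returns undefined when the feature does not overlap the region at all.
function toLocalInterval(
  start: number,
  end: number,
  regionStart: number,
  regionEnd: number,
): [number, number] | undefined {
  const clippedStart = Math.max(start, regionStart)
  const clippedEnd = Math.min(end, regionEnd)
  if (clippedStart > clippedEnd) {
    return undefined
  }
  // The first base of the region becomes position 1.
  return [clippedStart - regionStart + 1, clippedEnd - regionStart + 1]
}

// A feature spanning the whole region 14313335..14315164 maps to 1..1830.
const local = toLocalInterval(14313335, 14315164, 14313335, 14315164)
```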

@cmdcolin
Collaborator

> On option 1, it would still be necessary to recalculate some of the coordinates of the original source file, as those are genome wide coordinates, while it would be more useful to be able to extract 'local' coordinates.

what type of workflow would want this transformation? just a note that jbrowse 1 did not do transformations like this.

> On option 2. what would be the reason this feature was not carried over between the two versions? And how likely is it that this feature would make it to any future release?

I think it is possible this could make it into a future release. it wasn't intentional to not carry it over, just a limitation of dev resources

@scottcain
Member

With regard to the work mentioned in #3439, this is the feedback from the WormBase user:

Hi, Scott. Thank you and the jbrowse2 developer so much for developing this prototype so quickly.
To test this prototype, I downloaded region I:3250911..3307532 from both jbrowse2 and gbrowse in C. elegans. The jbrowse2 GenBank output is close, but still does not work.

I make some suggestions as follows:

  1. To make SnapGene recognize this GenBank file, at least 5 spaces are required between the feature key and the coordinates. protein_coding_p1..4585 should be changed to protein_coding_p 1..4585
  2. It is better to use only GenBank feature keys, such as gene, CDS, exon, ncRNA, rather than protein_coding_p. The detailed “Standard Feature” explanation can be found at https://www.insdc.org/submitting-standards/feature-table/#7.2. In this link, Appendix II contains descriptions of all feature keys.
  3. The most important function we rely on in GenBank is the joined coordinates for the `CDS` feature key, like `CDS join(16192..16313,16362..16851,16900..17084,17142..17409,17491..17889)`. All exons in the coding region of the gene are treated as one feature, not multiple separate features. That way, we can translate this CDS directly in SnapGene and view the amino acid sequence.
  4. I found inconsistent coordinates between jbrowse2 and gbrowse in the region I:3250911..3307532 that I downloaded. In the jbrowse2 GenBank output, the coordinate of W01B11.2 is protein_coding_p 36569..42217, but in gbrowse it is CDS join(36470..36612,36774..36843,37398..37623,37675..38141,38922..39602,40358..40464,40584..40768,40817..40915,40996..41100,41492..41599,41725..41855,41899..42004,42075..42217). The start coordinates differ: 36569 for jbrowse2 and 36470 for gbrowse.
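Suggestions 1 and 3 can be sketched together as follows. This is a minimal TypeScript sketch, not the code from the save_track_data branch: `formatLocation` and `formatFeatureLine` are hypothetical helpers, and the layout assumes the fixed-column feature table (key starting at column 6, location starting at column 22). Segments are emitted in the order given.

```typescript
// Width of the key field (columns 6-21), so the location starts at column 22.
const KEY_COLUMN_WIDTH = 16

// Build a location string; several segments are wrapped in join(...).
function formatLocation(
  segments: Array<[number, number]>,
  reverse = false,
): string {
  if (segments.length === 0) {
    throw new Error('a feature needs at least one segment')
  }
  const spans = segments.map(([start, end]) => `${start}..${end}`)
  const body = spans.length > 1 ? `join(${spans.join(',')})` : spans.join('')
  return reverse ? `complement(${body})` : body
}

// Pad the feature key so that key and location never run together.
function formatFeatureLine(key: string, location: string): string {
  if (key.length >= KEY_COLUMN_WIDTH) {
    // A 16-character key would leave no space before the location.
    throw new Error(`feature key "${key}" is too long for the key column`)
  }
  return ' '.repeat(5) + key.padEnd(KEY_COLUMN_WIDTH) + location
}

// One CDS feature built from the first two exons listed in item 4.
const cdsLine = formatFeatureLine(
  'CDS',
  formatLocation([
    [36470, 36612],
    [36774, 36843],
  ]),
)
```

In this sketch a key such as protein_coding_p is rejected outright rather than printed flush against its coordinates, which would push the caller toward a standard key like CDS or mRNA.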

@scottcain
Member

Items 1 and 2 above are pretty easy, and item 4 could be differences in annotation release but might be something. Item three is a little more tricky; I assume the CDS features need to be stashed when iterating through the features and I don't really know how to do that in JB/React. I may take a crack at 1 and 2 though.

@cmdcolin
Collaborator

that is all very good feedback. (3), the CDS join, is attempted in the save_track_data branch but not sure if he received the same

see
packages/core/pluggableElementTypes/models/components/genbank.ts

https://github.com/GMOD/jbrowse-components/pull/3439/files#diff-2dd5f778cfc0e380e2b331d00f05c827002537cd874f7475f4ab4f44a28ad1cdR78-R91

welcome to try out further work on that branch

@scottcain
Member

the wormbase user updated his comment to say that he was using the wrong track which is why he didn’t get the results he expected and now all is well; his only additional comment would be that it would be nice for the downloaded file name to include the location in the name to prevent name collisions.

@scottcain
Member

New update from the WormBase user where he found what looks like a bug. From https://community.alliancegenome.org/t/genbank-format-downloading-from-jbrowse1-2/6772/11:

user:

     mRNA            complement(13673..15502)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(15426..15502,15288..15369,15060..15242,14642..14750,14435..14594,14020..14389,13673..13972))
                     /mRNA="transcript:Cnig_chr_X.g24897"

I found a bug. The CDS feature is not recognized in the GenBank output above. The error may originate from the long multi-line info qualifier in the mRNA feature.

Me:

Interesting, if you manually take out the carriage returns in the “info” does it then work? I’m trying to figure out what we need to do generally, since that info section can frequently be quite long.

User:

     mRNA            complement(13673..15502)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
                     method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
                     method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
                     method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
                     method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(15426..15502,15288..15369,15060..15242,14642..14750,14435..14594,14020..14389,13673..13972))
                     /mRNA="transcript:Cnig_chr_X.g24897"

The above format worked.
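The difference between the two versions is only the indentation of the continuation lines. A minimal TypeScript sketch of a qualifier writer that produces the working layout might look like this; `formatQualifier` is a hypothetical helper, the 21-space indent is taken from the output above, and wrapping long lines at a maximum width is not attempted.

```typescript
// Qualifier lines in the output above start after 21 spaces.
const QUALIFIER_INDENT = ' '.repeat(21)

// Write /name="value", indenting every continuation line of a
// multi-line value so it lines up with the first qualifier line.
function formatQualifier(name: string, value: string): string {
  const lines = value
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.length > 0)
  if (lines.length === 0) {
    return `${QUALIFIER_INDENT}/${name}=""`
  }
  return lines
    .map((line, i) => {
      const opening = i === 0 ? `/${name}="` : ''
      const closing = i === lines.length - 1 ? '"' : ''
      return QUALIFIER_INDENT + opening + line + closing
    })
    .join('\n')
}

const info = formatQualifier(
  'info',
  'method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain \n' +
    'method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup',
)
```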

@scottcain
Member

@cmdcolin I don't think changes to implement this ^^^ made it into the last PR (where the "method" lines are spaced over to the rest of the text); I took a look at the code diffs for that branch and didn't see the obvious place to make a change, so I'm going to have to ask you to do it too.

@scottcain
Member

To reproduce:

  1. Go to https://s3.amazonaws.com/agrjbrowse/test/save-track-data/index.html?session=share-vsTgjNX2Oi&password=5qEF3

  2. The resulting genbank output looks like:

LOCUS       CM008514.1:14313335..14315164 1830 bp        DNA       linear    UNK 20-MAR-2023
FEATURES             Location/Qualifiers
     gene            complement(1..1830)
                     /name=gene:Cnig_chr_X.g24897
                     /biotype="protein_coding"
                     /id="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     mRNA            complement(1..1830)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(1754..1830,1616..1697,1388..1570,970..1078,763..922,348..717,1..300))
                     /mRNA="transcript:Cnig_chr_X.g24897"
ORIGIN
	1 

but it needs to look like:

LOCUS       CM008514.1:14313335..14315164 1830 bp        DNA       linear    UNK 20-MAR-2023
FEATURES             Location/Qualifiers
     gene            complement(1..1830)
                     /name=gene:Cnig_chr_X.g24897
                     /biotype="protein_coding"
                     /id="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     mRNA            complement(1..1830)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
                     method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
                     method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
                     method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
                     method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(1754..1830,1616..1697,1388..1570,970..1078,763..922,348..717,1..300))
                     /mRNA="transcript:Cnig_chr_X.g24897"
ORIGIN
	1 

@cmdcolin
Collaborator

potentially the issue highlighted above points to a need to re-urlencode things before writing out to gff/genbank, but it may also be useful for the upstream wormbase-pipeline to not have newlines

@scottcain
Member

scottcain commented Mar 21, 2023

So given that the problem I cited above is really with the GFF (and will hopefully be fixed with the next WB release), do you feel comfortable pushing this into main, or do you want to add the re-encoding first? I don't have a strong opinion since it shouldn't be a problem for "well behaved" gff.

ETA:
Oh, but if the GFF that gets dumped out isn't getting URI encoded, that would be kind of a problem. I guess that should be dealt with first.
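For reference, a minimal TypeScript sketch of what re-encoding on output could look like. The names `escapeGff3Value` and `formatGff3Attributes` are hypothetical and this is not the code in the PR; it percent-encodes tab, newline, carriage return, other control characters, `%`, and the characters reserved in column 9 (`;`, `=`, `&`, `,`), and leaves spaces alone.

```typescript
// Percent-encode characters that must not appear literally in a
// GFF3 attribute tag or value.
function escapeGff3Value(value: string): string {
  return value.replace(
    /[\x00-\x1f\x7f%;=&,]/g,
    ch => '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'),
  )
}

// Serialize an attribute map into a GFF3 column 9 string.
function formatGff3Attributes(attrs: Record<string, string>): string {
  return Object.entries(attrs)
    .map(([tag, value]) => `${escapeGff3Value(tag)}=${escapeGff3Value(value)}`)
    .join(';')
}

// A multi-line info value like the one above stays on a single line.
const column9 = formatGff3Attributes({
  ID: 'transcript:Cnig_chr_X.g24897',
  info: 'description:GHMP kinase, C-terminal domain \nmethod:InterPro',
})
```

With this kind of escaping the newline inside the info value is written as %0A, so the record stays on one line regardless of what the upstream file contained after decoding.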

@cmdcolin
Collaborator

I think that I would like this PR to improve architecturally and code quality wise before merge to main. it is a good proof of concept but may help to evolve a little bit before merge. I can keep this branch updated with main so you can keep using it

@cmdcolin
Collaborator

also, if possible, keep the discussion of the particular PR on the PR page

@cmdcolin cmdcolin removed the good first issue Good for newcomers label Mar 12, 2024