Export track data like in the Jbrowse 1 #3094

Averstic · 2022-07-19T08:01:52Z

Discussed in #2810

Originally posted by Marie-Lahaye March 15, 2022
Hi,

I was wondering if there is a way to export track data, as in JBrowse 1? I remember that we could export track data for the region that we were visualizing.

For example exporting genes in GFF3 format from a specific region:

Is there a similar feature in JBrowse 2?

Thanks for any answer you have for me!
Marie

Averstic · 2022-07-19T08:04:06Z

Would really be interested in this feature in JB2 as it is an essential way of sharing data with non-computational scientists.

cmdcolin · 2022-07-20T21:54:04Z

thanks for adding interest in this @Averstic

what is the main feature that you generally use this for? is it the GFF export of a region?

Averstic · 2022-07-28T12:38:18Z

Yes indeed, the GFF export of a region would be of interest, along with the possibility to extract the reference sequence of the current view, so that sequence and annotation can be quickly extracted for import into other tools.

Averstic · 2022-08-03T14:24:40Z

@cmdcolin
Just curious: would you consider this a hard feature to implement?

cmdcolin · 2022-08-09T01:37:11Z

@Averstic it could be somewhat challenging. there are two general approaches which are not necessarily mutually exclusive

1. exporting chunks of the original source data file, which is more true to the original data but might not work well with things like REST APIs
2. making a general data export system where any feature can be translated to some data format. This system is basically how jbrowse 1 does it, but there are odd corner cases that can be difficult to handle properly, e.g. choosing the appropriate file formats for a given track, generating accurate serializations of arbitrary data, integrating with our plugin system, etc. This general system can result in data conversions like vcf to gff, bam to gff, bigwig to bed, or other things like that, which could be good or bad depending on your point of view :)
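To make the second approach concrete, here is a minimal sketch of turning a generic feature into one GFF3 line. It is only an illustration, not JBrowse code: the `SimpleFeature` shape and the `featureToGff3` name are hypothetical, input coordinates are assumed to be 0-based half-open, and attribute values are assumed to need no escaping.

```typescript
// Hypothetical minimal feature shape (an assumption for this sketch,
// not the JBrowse Feature interface).
interface SimpleFeature {
  refName: string
  source?: string
  type: string
  start: number // assumed 0-based, half-open
  end: number
  score?: number
  strand?: number // 1, -1, or 0/undefined for unstranded
  phase?: number
  attributes: Record<string, string>
}

// Serialize one feature as a tab-separated, nine-column GFF3 line.
function featureToGff3(f: SimpleFeature): string {
  const strand = f.strand === 1 ? '+' : f.strand === -1 ? '-' : '.'
  const attrs = Object.entries(f.attributes)
    .map(([key, value]) => `${key}=${value}`)
    .join(';')
  return [
    f.refName,
    f.source ?? '.',
    f.type,
    f.start + 1, // GFF3 start is 1-based inclusive
    f.end,
    f.score ?? '.',
    strand,
    f.phase ?? '.',
    attrs || '.',
  ].join('\t')
}
```

The corner cases mentioned above show up quickly even in this small sketch: a BAM read or a bigwig bin has no natural GFF3 `type`, and nested subfeatures would need `ID`/`Parent` bookkeeping that is not shown here.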

Averstic · 2022-09-05T11:49:40Z

Thank you for this insight.

On option 1, it would still be necessary to recalculate some of the coordinates of the original source file, as those are genome wide coordinates, while it would be more useful to be able to extract 'local' coordinates.
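The recalculation described here amounts to a simple offset. A minimal sketch, assuming 1-based inclusive coordinates; `toLocal` and `localizeRange` are hypothetical helper names, and clipping features to the region is a choice made for the sketch, not something specified in this thread:

```typescript
interface Range {
  start: number // 1-based, inclusive
  end: number
}

// Shift a genome-wide position so that the first base of the
// exported region becomes position 1.
function toLocal(pos: number, regionStart: number): number {
  return pos - regionStart + 1
}

// Clip a feature to the region and express it in local coordinates.
// Returns undefined when the feature lies entirely outside the region.
function localizeRange(feature: Range, region: Range): Range | undefined {
  const start = Math.max(feature.start, region.start)
  const end = Math.min(feature.end, region.end)
  if (start > end) {
    return undefined
  }
  return {
    start: toLocal(start, region.start),
    end: toLocal(end, region.start),
  }
}
```

For instance, a feature covering all of 14313335..14315164 becomes 1..1830, which matches the LOCUS and gene lines of the GenBank output quoted later in this thread.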

On option 2, what would be the reason this feature was not carried over between the two versions? And how likely is it that this feature would make it to any future release?

cmdcolin · 2022-09-14T19:17:59Z

> On option 1, it would still be necessary to recalculate some of the coordinates of the original source file, as those are genome wide coordinates, while it would be more useful to be able to extract 'local' coordinates.

what type of workflow would want this transformation? just a note that jbrowse 1 did not do transformations like this.

> On option 2, what would be the reason this feature was not carried over between the two versions? And how likely is it that this feature would make it to any future release?

I think it is possible this could make it into a future release. it wasn't intentional to not carry it over, just a limitation of dev resources

scottcain · 2023-02-11T02:45:53Z

With regard to the work mentioned in #3439, this is the feedback from the WormBase user:

Hi, Scott. Thank you and the jbrowse2 developer so much for developing this prototype so quickly.
To test this prototype, I downloaded region I:3250911..3307532 from both jbrowse2 and gbrowse in C. elegans. The jbrowse2 GenBank output is close, but still does not work.

I have some suggestions, as follows:

1. For SnapGene to recognize this GenBank file, at least 5 spaces are required between the feature key and the coordinates. protein_coding_p1..4585 should be changed to protein_coding_p     1..4585
2. It is better to use only GenBank feature keys, such as gene, CDS, exon, ncRNA, rather than protein_coding_p. The detailed "Standard Feature" explanation can be found at https://www.insdc.org/submitting-standards/feature-table/#7.2. In this link, Appendix II contains descriptions of all feature keys.
3. The most important function we rely on in GenBank is the joined coordinates for the CDS feature key, like CDS join(16192..16313,16362..16851,16900..17084,17142..17409,17491..17889). All exons in the coding region of the gene are treated as one feature, not multiple separate features. That way, we can translate this CDS directly with SnapGene and view the amino acid sequence.
4. I found inconsistent coordinates between jbrowse2 and gbrowse in the region I:3250911..3307532 that I downloaded. In the jbrowse2 GenBank format, the coordinate of W01B11.2 is protein_coding_p 36569..42217. But in gbrowse, the coordinate of W01B11.2 is CDS join(36470..36612,36774..36843,37398..37623,37675..38141,38922..39602,40358..40464,40584..40768,40817..40915,40996..41100,41492..41599,41725..41855,41899..42004,42075..42217). The start coordinate differs between the two versions: 36569 for jbrowse2 and 36470 for gbrowse.
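Suggestions 1 and 3 can be illustrated with a small formatter. This is a sketch under stated assumptions, not the code from the prototype: `formatLocation` and `formatFeatureLine` are hypothetical names, spans are 1-based inclusive, and the layout (feature key starting in column 6, at most 15 characters, location starting in column 22) follows my reading of the INSDC feature table and the spacing of the outputs quoted in this thread.

```typescript
interface Span {
  start: number // 1-based, inclusive
  end: number
}

// Build a GenBank location string; several spans become one join(...),
// so all CDS exons end up in a single feature.
function formatLocation(spans: Span[], reverse = false): string {
  if (spans.length === 0) {
    throw new Error('a feature needs at least one span')
  }
  const list = spans.map(s => `${s.start}..${s.end}`).join(',')
  const joined = spans.length > 1 ? `join(${list})` : list
  return reverse ? `complement(${joined})` : joined
}

// Lay out the first line of a feature: 5 spaces, the key padded to
// 16 characters, then the location starting in column 22.
function formatFeatureLine(key: string, location: string): string {
  if (key.length > 15) {
    // A 16-character key such as "protein_coding_p" leaves no gap
    // before the location; map it to a standard key first.
    throw new Error(`feature key too long: ${key}`)
  }
  return ' '.repeat(5) + key.padEnd(16) + location
}
```

Rejecting over-long keys rather than truncating them is a design choice for the sketch: it makes the situation in suggestion 1 fail loudly instead of producing a line with no space between key and coordinates.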

scottcain · 2023-02-13T18:26:41Z

Items 1 and 2 above are pretty easy, and item 4 could be differences in annotation release but might be something. Item three is a little more tricky; I assume the CDS features need to be stashed when iterating through the features and I don't really know how to do that in JB/React. I may take a crack at 1 and 2 though.

cmdcolin · 2023-02-13T18:30:40Z

that is all very good feedback. (3), the CDS join, is attempted in the save_track_data branch, but I'm not sure if he received that version

see
packages/core/pluggableElementTypes/models/components/genbank.ts

https://github.com/GMOD/jbrowse-components/pull/3439/files#diff-2dd5f778cfc0e380e2b331d00f05c827002537cd874f7475f4ab4f44a28ad1cdR78-R91

welcome to try out further work on that branch

scottcain · 2023-02-13T21:00:00Z

the wormbase user updated his comment to say that he was using the wrong track which is why he didn’t get the results he expected and now all is well; his only additional comment would be that it would be nice for the downloaded file name to include the location in the name to prevent name collisions.

scottcain · 2023-02-24T21:01:08Z

New update from the WormBase user where he found what looks like a bug. From https://community.alliancegenome.org/t/genbank-format-downloading-from-jbrowse1-2/6772/11:

user:

     mRNA            complement(13673..15502)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(15426..15502,15288..15369,15060..15242,14642..14750,14435..14594,14020..14389,13673..13972))
                     /mRNA="transcript:Cnig_chr_X.g24897"

I found a bug. The CDS feature is not recognized in the above GenBank output. This error may originate from the long multi-line info qualifier in the mRNA feature.

Me:

Interesting, if you manually take out the carriage returns in the “info” does it then work? I’m trying to figure out what we need to do generally, since that info section can frequently be quite long.

User:

     mRNA            complement(13673..15502)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
                     method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
                     method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
                     method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
                     method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(15426..15502,15288..15369,15060..15242,14642..14750,14435..14594,14020..14389,13673..13972))
                     /mRNA="transcript:Cnig_chr_X.g24897"

The above format worked.

scottcain · 2023-03-01T03:46:08Z

@cmdcolin I don't think changes to implement this ^^^ made it into the last PR (where the "method" lines are spaced over to the rest of the text); I took a look at the code diffs for that branch and didn't see the obvious place to make a change, so I'm going to have to ask you to do it too.

scottcain · 2023-03-20T17:02:08Z

To reproduce:

Go to https://s3.amazonaws.com/agrjbrowse/test/save-track-data/index.html?session=share-vsTgjNX2Oi&password=5qEF3
The resulting genbank output looks like:

LOCUS       CM008514.1:14313335..14315164 1830 bp        DNA       linear    UNK 20-MAR-2023
FEATURES             Location/Qualifiers
     gene            complement(1..1830)
                     /name=gene:Cnig_chr_X.g24897
                     /biotype="protein_coding"
                     /id="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     mRNA            complement(1..1830)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(1754..1830,1616..1697,1388..1570,970..1078,763..922,348..717,1..300))
                     /mRNA="transcript:Cnig_chr_X.g24897"
ORIGIN
	1

but it needs to look like:

LOCUS       CM008514.1:14313335..14315164 1830 bp        DNA       linear    UNK 20-MAR-2023
FEATURES             Location/Qualifiers
     gene            complement(1..1830)
                     /name=gene:Cnig_chr_X.g24897
                     /biotype="protein_coding"
                     /id="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     mRNA            complement(1..1830)
                     /gene="gene:Cnig_chr_X.g24897"
                     /name=transcript:Cnig_chr_X.g24897
                     /id="transcript:Cnig_chr_X.g24897"
                     /info="method:InterPro accession:IPR013750 description:GHMP kinase, C-terminal domain 
                     method:InterPro accession:IPR014721 description:Ribosomal protein S5 domain 2-type fold, subgroup 
                     method:InterPro accession:IPR015192 description:Switch protein XOL-1, N-terminal 
                     method:InterPro accession:IPR015193 description:Switch protein XOL-1, GHMP-like 
                     method:InterPro accession:IPR020568 description:Ribosomal protein S5 domain 2-type fold"
                     /jbrowse_parent="gene:Cnig_chr_X.g24897"
                     /Name="Cnig_chr_X.g24897"
     CDS             complement(join(1754..1830,1616..1697,1388..1570,970..1078,763..922,348..717,1..300))
                     /mRNA="transcript:Cnig_chr_X.g24897"
ORIGIN
	1
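The difference between the two outputs is only the indentation of the continuation lines inside /info. A minimal sketch of a writer that produces the second form, assuming a hypothetical `formatQualifier` helper and the 21-space qualifier indent seen in the output above (it does not wrap long lines or escape embedded quotes):

```typescript
// Qualifier lines start in column 22, i.e. after 21 spaces.
const QUALIFIER_INDENT = ' '.repeat(21)

// Format /name="value", indenting every continuation line of a
// multi-line value to the same column as the first line.
function formatQualifier(name: string, value: string): string {
  return `/${name}="${value}"`
    .split(/\r?\n/)
    .map(line => QUALIFIER_INDENT + line.trimStart())
    .join('\n')
}
```

Called with the five-line InterPro `info` value, this would emit each `method:InterPro ...` line at the qualifier column rather than at column 1.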

cmdcolin · 2023-03-20T19:26:18Z

potentially the issue highlighted above points to a need to re-urlencode things before writing out to gff/genbank, but it may also be useful for the upstream wormbase-pipeline to not emit newlines
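As a rough illustration of the re-encoding idea for GFF3 output: the sketch below percent-encodes the characters that, as I read the GFF3 specification, are reserved in column 9 (control characters including tab and newline, plus `%`, `;`, `=`, `&` and `,`). `encodeGff3Attribute` is a hypothetical name, not an existing JBrowse function.

```typescript
// Percent-encode characters that would break a GFF3 attribute value.
// Spaces and other printable characters are left as they are.
function encodeGff3Attribute(value: string): string {
  return value.replace(
    /[\x00-\x1f\x7f%;=&,]/g,
    ch => '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'),
  )
}
```

With this, a newline inside an `info` value would be written as `%0A`, so the attribute stays on one line of the file, and a reader can recover the original text with `decodeURIComponent`.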

scottcain · 2023-03-21T20:06:25Z

So given that the problem I cited above is really with the GFF (and will hopefully be fixed with the next WB release), do you feel comfortable pushing this into main, or do you want to add the re-encoding first? I don't have a strong opinion since it shouldn't be a problem for "well behaved" gff.

ETA:
Oh, but if the GFF that gets dumped out isn't getting URI encoded, that would be kind of a problem. I guess that should be dealt with first.

cmdcolin · 2023-03-21T22:02:22Z

I think that I would like this PR to improve architecturally and code quality wise before merge to main. it is a good proof of concept but it may help to let it evolve a little bit before merge. I can keep this branch updated with main so you can keep using it

cmdcolin · 2023-03-21T22:02:50Z

also, if possible, keep the discussion of the particular PR on the PR page

cmdcolin mentioned this issue Sep 30, 2022

Possibility to query data from plots? cmdcolin/jbrowse-plugin-gwas#6

Open

scottcain mentioned this issue Oct 20, 2022

Add a "save track data" option #3284

Closed

scottcain added the enhancement New feature or request label Oct 20, 2022

cmdcolin added the high impact label Nov 4, 2022

cmdcolin mentioned this issue Jan 5, 2023

Save track data method on base track model #3439

Draft

rbuels added the good first issue Good for newcomers label Jan 20, 2023

scottcain mentioned this issue Mar 20, 2023

GFF generation for C. nigoni WormBase/wormbase-pipeline#258

Open

cmdcolin removed the good first issue Good for newcomers label Mar 12, 2024
