Support for GISAID data #63

Open
huddlej opened this issue Jun 16, 2022 · 0 comments
Labels
documentation Improvements or additions to documentation

huddlej commented Jun 16, 2022

For users who want to use GISAID data with this workflow, the following steps work nearly as expected.

These steps assume you have downloaded:

  • all sequences in FASTA format with whitespace replaced by underscore
  • patient metadata
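If the FASTA was downloaded without the underscore option, the deflines can be normalized locally first. The following is a minimal sketch, assuming that replacing each whitespace character in a defline with one underscore matches what the download option does; `normalize_deflines` is a hypothetical helper name, not part of Augur or this workflow.

```shell
# Hypothetical helper: replace whitespace in FASTA deflines with underscores.
# Lines that do not start with ">" (the sequences) pass through unchanged.
normalize_deflines() {
  sed -e '/^>/ s/[[:space:]]*$//' \
      -e '/^>/ s/[[:space:]]/_/g'
}

# Example usage (the input file name is a placeholder):
# normalize_deflines < data/raw_download.fasta > data/gisaid_pox_2022_06_16_19.fasta
```

The first expression strips trailing whitespace (for example a stray carriage return) so that it is not turned into a trailing underscore.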
# Download sequences: data/gisaid_pox_2022_06_16_19.fasta
# Download patient metadata: data/gisaid_pox_2022_06_16_19.tsv
# Note: patient metadata lacks submitting/originating lab.

# Parse out metadata from sequence deflines.
augur parse \
  --sequences data/gisaid_pox_2022_06_16_19.fasta \
  --fields strain gisaid_epi_isl date \
  --output-sequences data/sequences.fasta \
  --output-metadata data/sequence_metadata.tsv

# Join sequence metadata with patient metadata.
csvtk --tabs join -f 1 \
  data/sequence_metadata.tsv \
  data/gisaid_pox_2022_06_16_19.tsv > data/metadata.tsv

# TODO: Need a transform for GISAID locations like the one we have for GenBank.

# Run workflow.
# TODO: This step requires users to know that the "wrangling" of metadata renames the "strain" column to "strain_original"
# so they can rename it back to "strain". Correspondingly, the user has to tell the workflow not to use "strain_original"
# as the display strain name.
nextstrain build \
  --docker \
  --image=nextstrain/base:branch-nextalign-v2 \
  --cpus 1 \
  . \
  --configfile config/config_mpxv.yaml \
  --config strain_id_field=strain_original display_strain_field=strain

Note that the biggest issue with the implementation above is that there is no transform command to convert GISAID's location field to the standard Nextstrain geographic columns (region, country, division, and location). As a result, the default Augur filter logic that groups by country and year prints a warning that it cannot find a "country" column and groups only by year. In Augur 16.0.0, this missing group-by column will produce an error, so we should consider implementing the transform for GISAID locations.
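Until a proper transform exists, something like the following could fill the gap. This is only a sketch under assumptions: it presumes the GISAID location value is slash-separated in the order region / country / division / location, and that it lives in a column named `Location`. Neither is confirmed in this issue, so check both against the downloaded patient metadata. The function name `split_gisaid_location` is hypothetical.

```shell
# Hypothetical helper: append region, country, division, and location columns
# to a TSV read from stdin, derived from a slash-separated location column.
# $1: name of the source column (assumed default: "Location").
split_gisaid_location() {
  awk -F'\t' -v OFS='\t' -v col="${1:-Location}" '
    NR == 1 {
      # Find the index of the source column in the header.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print $0, "region", "country", "division", "location"
      next
    }
    {
      n = split($idx, parts, "/")
      for (i = 1; i <= 4; i++) {
        value = (i <= n) ? parts[i] : ""      # missing levels stay empty
        gsub(/^[ \t]+|[ \t]+$/, "", value)    # trim padding around each level
        out[i] = value
      }
      print $0, out[1], out[2], out[3], out[4]
    }
  '
}

# Example usage (paths follow the commands above):
# split_gisaid_location < data/metadata.tsv > data/metadata_with_geo.tsv
```

If the source column is already named `location` in lowercase, the appended `location` column would collide with it, so the source column would need renaming first.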

Given the commands above, however, I get the following tree from the workflow:

[image: tree produced by the workflow]

The very long branches also indicate that users will need to manage their own list of strains to exclude, since strain names will not match GenBank accessions.
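One way to manage such a list is a plain text file with one strain name per line, which is also the format that `augur filter --exclude` accepts. As a rough sketch, the metadata could be pre-filtered against that file as shown here; `drop_excluded_strains` and the file names are hypothetical, and the sequences would still need the same filtering, for example through the workflow's own filter step.

```shell
# Hypothetical helper: print a metadata TSV without rows whose strain appears
# in an exclude list.
# $1: exclude list, one strain per line (blank lines and "#" comments ignored)
# $2: metadata TSV with a header row
# $3: name of the strain column (default: "strain")
drop_excluded_strains() {
  awk -F'\t' -v col="${3:-strain}" '
    FILENAME == ARGV[1] {
      # First file: collect strain names to exclude.
      if ($0 != "" && $0 !~ /^#/) excluded[$0] = 1
      next
    }
    FNR == 1 {
      # Second file, header row: locate the strain column.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print
      next
    }
    !($idx in excluded)   # keep rows whose strain is not listed
  ' "$1" "$2"
}

# Example usage (file names are placeholders):
# drop_excluded_strains config/my_exclude.txt data/metadata.tsv > data/metadata_filtered.tsv
```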

huddlej added the documentation label Jun 16, 2022