ingest: deduplicate sequences using strain names #33

joverlee521 · 2022-06-07T21:10:23Z

Context

Once we've completed #32, we can use strain names to deduplicate sequences.
This is necessary when different groups sequence the same virus or when sequences of the same virus are generated with different protocols.
(NOTE: This is separate from versioning in GenBank; we already pull in the latest version of GenBank sequences.)

Description

The duplicate sequences should probably be filtered out in a new script (e.g. ingest/bin/deduplicate-records) or potentially with the augur deduplicate command (see nextstrain/augur#919).

We probably want to keep a file with all sequences in case people want the duplicate sequences for any reason.
The deduplicated files will be the main ones used for LAPIS and/or our monkeypox builds.
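
A rough sketch of what the filtering step could look like, to make the idea concrete. This is not an existing script: the function name `deduplicate_records`, the `strain` and `accession` field names, and the "keep the first record seen" policy are all assumptions for illustration. Which record to keep for a given strain is still an open question. The duplicates are returned separately so they can be written to their own file rather than dropped.

```python
def deduplicate_records(records, strain_field="strain"):
    """Split records into (unique, duplicates) by strain name.

    Assumption: the first record seen for a strain name is the one kept.
    """
    seen = set()
    unique, duplicates = [], []
    for record in records:
        strain = record.get(strain_field)
        if not strain:
            # Records without a strain name cannot be compared, so keep them.
            unique.append(record)
        elif strain in seen:
            duplicates.append(record)
        else:
            seen.add(strain)
            unique.append(record)
    return unique, duplicates


# Hypothetical input: two records that share a strain name.
records = [
    {"accession": "MT903340", "strain": "MPXV-M5312_HM12_Rivers"},
    {"accession": "NC_063383", "strain": "MPXV-M5312_HM12_Rivers"},
]
unique, duplicates = deduplicate_records(records)
```

With this policy the result depends on input order, so a real implementation would probably want an explicit rule for choosing between duplicates.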

jameshadfield · 2022-06-15T01:37:07Z

Update: We currently have a duplicate in the hMPX build (MPXV-M5312_HM12_Rivers, from accessions MT903340 and NC_063383). It's not a huge problem because the strain isn't part of the current outbreak.

joverlee521 added the enhancement (New feature or request) label Jun 7, 2022
