--add-relatives outputs unexpected sequence ID #99

Open
glajoie1 opened this issue Apr 16, 2021 · 3 comments

@glajoie1

Hello,

I have been running SINA on a FASTA file of 16S sequences with the following command line to obtain an alignment that includes neighbour sequences, as in the online ACT implementation for small sequence sets. The reference database was downloaded from SILVA.

sina -i ~/asv_ps20.fa -r ~/SILVA_138.1_SSURef_NR99_12_06_20_opt.arb -o aligned.fasta.gz -o aligned.csv --add-relatives=15

In the output alignment file, I was expecting the 'relatives' sequences to correspond to the reference sequences identified in the align_filter_slv column of the output (e.g. JF769553.1, KJ855315.1), but I am instead getting sequence IDs that cannot be found in the SILVA reference database (e.g. GYJUndar, UncCy339). The same thing happens when I add the '--search' flag.

Is there a way to get the sequences identified in the align_filter_slv column into the alignment file along with the query sequence? (Or, if this is a formatting issue, to get information on how the names match up?)

Thank you very much for your software!

@epruesse
Owner

Those are the "ARB names". Each sequence in ARB has several metadata fields: "acc" holds the accession number and "name" holds the name that you are seeing. It's an ID generated from the sequence description ("UncCy339" will be something uncultured) such that it's unique for accession + start position (to account for genomes with multiple 16S copies).

In theory, you should be able to export the accession into the CSV using -f acc. In practice, that doesn't seem to be working. I'll mark this as a bug. Also, the accession should always be listed in the CSV, I think.

@epruesse added the bug label Apr 17, 2021
@glajoie1
Author

Ok - thank you for the information. The accession was not listed in the CSV, so I used the ARB software with the SILVA_138.1_SSURef_NR99_12_06_20_opt.arb database to generate a mapping file from the ARB names to the SILVA accession numbers and taxonomy.
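
For anyone taking the same route, here is a minimal Python sketch of how such a mapping file could be applied to the alignment afterwards. It is not part of SINA or ARB. It assumes the mapping was saved as a tab-separated table with a header row and columns called `name`, `acc` and `start` (a hypothetical layout; adjust to whatever the ARB export actually produced), and it replaces each ARB name in a FASTA header with `acc.start`. The function names and the sample values are made up for illustration.

```python
import csv
import io


def load_name_map(handle):
    """Read a tab-separated table with 'name', 'acc' and 'start' columns.

    The column names are an assumption about the export layout,
    not something SINA or ARB guarantees.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["name"]: (row["acc"], row["start"]) for row in reader}


def relabel_fasta(lines, name_map):
    """Yield FASTA lines with known ARB names replaced by 'acc.start'.

    Headers whose ID is not in the mapping (e.g. the query sequences)
    are passed through unchanged, as are all sequence lines.
    """
    for line in lines:
        if line.startswith(">"):
            seq_id, _, rest = line[1:].rstrip("\n").partition(" ")
            if seq_id in name_map:
                acc, start = name_map[seq_id]
                seq_id = f"{acc}.{start}"
            yield ">" + seq_id + (" " + rest if rest else "") + "\n"
        else:
            yield line


# Placeholder data only: 'ExmplNam' and 'ACC00001' are not real SILVA entries.
example_map = load_name_map(io.StringIO("name\tacc\tstart\nExmplNam\tACC00001\t1\n"))
example_out = list(relabel_fasta([">ExmplNam\n", "ACGU\n"], example_map))
```

For real files, pass open file handles instead of the in-memory examples (for a gzipped alignment, `gzip.open(path, "rt")` from the standard library gives a text handle).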

@epruesse
Owner

Just be aware you might get duplicates on the acc alone. In SILVA, acc + start uniquely identify an SSU/LSU sequence, with start being the position of the first base of the sequence within the sequence of its accession number.
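
A small Python sketch of that caveat, assuming the mapping has already been read into `(name, acc, start)` tuples (a hypothetical in-memory layout; the function names and sample records are placeholders, not SILVA data). The first function reports accessions shared by more than one ARB name, and the second builds an index on the composite `(acc, start)` key instead.

```python
from collections import defaultdict


def find_shared_accessions(records):
    """Return {acc: [names]} for every accession used by more than one record.

    'records' is an iterable of (name, acc, start) tuples.
    """
    by_acc = defaultdict(list)
    for name, acc, _start in records:
        by_acc[acc].append(name)
    return {acc: names for acc, names in by_acc.items() if len(names) > 1}


def index_by_acc_and_start(records):
    """Map (acc, start) -> name, raising if the composite key repeats."""
    index = {}
    for name, acc, start in records:
        key = (acc, int(start))
        if key in index:
            raise ValueError(f"duplicate acc + start: {key}")
        index[key] = name
    return index


# Placeholder records: one accession carrying two rRNA copies at different starts.
records = [
    ("ExmplNa1", "ACC00001", 100),
    ("ExmplNa2", "ACC00001", 5000),
    ("ExmplNa3", "ACC00002", 1),
]
shared = find_shared_accessions(records)
index = index_by_acc_and_start(records)
```

With these records, `shared` flags `ACC00001` as ambiguous when used on its own, while `index` still has one entry per sequence because the start positions differ.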

@epruesse added this to the 1.7.3 milestone Sep 15, 2021