Custom ARB database has no taxonomy fields #97

Open
johanneswerner opened this issue Nov 10, 2020 · 1 comment

johanneswerner commented Nov 10, 2020

This is probably not a bug and maybe also documented somewhere, but I could not find any information about it.

I built a custom ARB database of a subset of the sequences from SILVA release 132 with the following command (after uncompressing):

sina -i SILVA_132_SSURef_Nr99_tax_silva.fasta -o custom_arb_database.arb --prealigned

The resulting ARB database has no taxonomy fields.

[Screenshot: custom ARB database (custom_arb_database)]

[Screenshot: official ARB database (official_arb_database)]

I wanted to classify my sequences afterwards with the custom database, but since the field tax_slv does not exist, this results in an empty file. However, if I choose full_name as the LCA field, I get results, but not the entire taxonomic path.

This is the result for one entry with tax_slv (and the other tax_* fields) using the official ARB database:

# sina command
sina \
  --in sequences.fasta \
  --out sina.fasta \
  --threads 36 \
  --db SILVA_132_SSURef_Nr99_tax_silva.arb \
  --fs-min 2 \
  --fs-msc 0.3 \
  --fs-full-len 500 \
  --search-min-sim 0.5 \
  --search \
  --search-db SILVA_132_SSURef_Nr99_tax_silva.arb \
  --search-max-result 1 \
  --lca-fields tax_slv,tax_embl,tax_ltp \
  --lca-quorum 0.3 \
  --meta-fmt csv
name,align_cutoff_head_slv,align_cutoff_tail_slv,align_filter_slv,align_quality_slv,aligned_slv,full_name,lca_tax_embl,lca_tax_ltp,lca_tax_slv,nearest_slv,turn
TRINITY_DN279_c1_g1_i5,0,0,,54,2020-11-09 17:14:56,len=503 path=[1:0-102 4:103-274 18:275-285 19:286-327 20:328-502],Unclassified;,Unclassified;,Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Tracheophyta;Spermatophyta;Magnoliophyta;,BDFN01001194.1.11177.12965~0.559 ,turn-check disabled

And here is the same entry with the full_name field of the custom database:

# sina command
sina \
  --in sequences.fasta \
  --out sina.fasta \
  --threads 36 \
  --db custom_arb_database.arb \
  --fs-min 2 \
  --fs-msc 0.3 \
  --fs-full-len 500 \
  --search-min-sim 0.5 \
  --search \
  --search-db custom_arb_database.arb \
  --search-max-result 1 \
  --lca-fields full_name \
  --lca-quorum 0.3 \
  --meta-fmt csv
TRINITY_DN279_c1_g1_i5,0,0,,54,2020-11-09 15:48:47,len=503 path=[1:0-102 4:103-274 18:275-285 19:286-327 20:328-502],Ipomoea nil (Japanese morning glory);,BDFN01001194.1.11177.12965~0.559 ,turn-check disabled

I have no idea why the taxonomy looks so different, but what surprises me more is that there is no taxonomic path here.

So, after this long introduction, my questions are:

  1. How can I (based on the SILVA FASTA files) create an ARB database with taxonomy fields as defined in the FASTA header? and
  2. Why is there no taxonomic path when I use full_name as the LCA field with my custom ARB database?

Thank you very much for your help!

@epruesse
Owner

Hi @johanneswerner,

the option to generate an ARB file on the fly was meant to allow people unfamiliar with ARB to quickly generate a file SINA can use as a reference. The fasta file is parsed as >$ID $DESCRIPTION with $ID mapped to acc and $DESCRIPTION mapped to full_name. That the SILVA FASTA files have $DESCRIPTION == tax_slv is just happenstance, and nothing SINA would know. Allowing people to customise this is a bit beyond what SINA is meant to do.
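
To illustrate the mapping described above, here is a minimal Python sketch of splitting a header of the form >$ID $DESCRIPTION into acc and full_name. It is not SINA's actual code; the function name parse_fasta_header is hypothetical and the example header is abridged for illustration (only the ID is taken from the output posted above).

```python
def parse_fasta_header(header):
    """Split a FASTA header into the two fields described above.

    Illustration only: this mimics the ">$ID $DESCRIPTION" convention,
    it is not taken from SINA's source code.
    """
    text = header[1:] if header.startswith(">") else header
    # Everything before the first whitespace is the ID, the rest the description.
    parts = text.strip().split(maxsplit=1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return {"acc": seq_id, "full_name": description}


# Abridged, illustrative SILVA-style header (the path is shortened by hand).
header = (">BDFN01001194.1.11177.12965 "
          "Eukaryota;Archaeplastida;Chloroplastida;Ipomoea nil (Japanese morning glory)")
fields = parse_fasta_header(header)
print(fields["acc"])
print(fields["full_name"])
```

Note that the result contains only acc and full_name. Nothing in this split produces a tax_slv field, which is consistent with the custom database lacking one.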

So in answer to 1: To create a custom ARB database, use ARB. You can start from a FASTA and import any fields you might like, split/copy parts of the FASTA header as needed, even add your own "import filter" to parse your type of FASTA header correctly.
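
As a rough idea of what "splitting parts of the FASTA header" could mean, the following Python sketch separates a description into a taxonomic path and an organism name. It is only an illustration done outside ARB, not an ARB import filter. It assumes the description is a semicolon-separated path whose last element is the organism name; the function name split_silva_description is hypothetical and the example description is abridged.

```python
from typing import Tuple


def split_silva_description(description: str) -> Tuple[str, str]:
    """Split 'rank1;rank2;...;organism' into (path, organism).

    Assumption (not confirmed in this thread): the last semicolon-separated
    element is the organism name and everything before it is the path.
    """
    ranks = [r.strip() for r in description.split(";") if r.strip()]
    if not ranks:
        return "", ""
    *path, organism = ranks
    # Re-join the path with a trailing ';' after each rank, as in the tax_slv output above.
    tax_path = "".join(rank + ";" for rank in path)
    return tax_path, organism


# Abridged, illustrative description (the path is shortened by hand).
description = "Eukaryota;Archaeplastida;Chloroplastida;Ipomoea nil (Japanese morning glory)"
tax_path, organism = split_silva_description(description)
print(tax_path)
print(organism)
```

In ARB itself, the equivalent step would be done during import; the sketch only shows which parts of the header would end up in which field.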

In answer to 2: I don't know. Try with --copy-fields full_name to see what the original path was. Since it works with the SILVA database but not with your custom database, it must be the format of the field. Feel free to post a (small) example ARB database here, and I'll have a look at whether there is something that can be improved on SINA's side without impacting other use cases.
