Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with LCA quorum #93

Open
jgerken opened this issue Aug 17, 2020 · 7 comments
Open

Problems with LCA quorum #93

jgerken opened this issue Aug 17, 2020 · 7 comments
Assignees
Labels
Milestone

Comments

@jgerken
Copy link

jgerken commented Aug 17, 2020

Hi Elmar,

we stumbled across a weird problem when classifying sequences with SINA. One of our users searched for a partial sequences which as classified as Archaea;Halobacterota; despite eight of the ten nearest neighbours having a classification of Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus;.

I created a minimal working example with a SILVA dataset reduced to the ten nearest neighbours. When you execute

./sina -i aligned.query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.95 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:26:42 [log] lca_tax_slv: Archaea;Halobacterota;

I then searched for a very similar full-length sequence in SILVA an ran the following query:

./sina -i aligned.full-length-query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.5 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:26:42 [log] lca_tax_slv: Archaea;Halobacterota;

But when I change the the --lcq-quorum to 0.79 I get the expected result:

./sina -i aligned.full-length-query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.5 --lca-quorum 0.79 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:30:17 [log] lca_tax_slv: Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus;

Between the 2nd and the 3rd example, there is a floating point precision problem. I think the ratio that is compared with the value of lca_quorum needs to be rounded according to the precision of the floating point type used.

The difference between the first and the other two examples is that the first run returns one of the two sequences with a deviating classification of Archaea;Halobacterota;Methanosarcinia;Methanosarciniales;Methanosarcinaceae;Methanosarcina; as most similar neighbour whereas in the other two runs a sequence of the majority is selected. Thus, it seems that the LCA quorum
always tries to use the taxonomy of the most similar neighbour instead of trying find the most common classification that fulfils the quorum.

Could you have a look?

Thanks
Jan

minimal-example.tar.gz

@epruesse epruesse self-assigned this Aug 17, 2020
@epruesse epruesse added the bug label Aug 17, 2020
@epruesse
Copy link
Owner

I'll have a look at it. Note on the side, you can write -vvv instead of -v -v -v. I assume you were using the 1.7.0?

@jgerken
Copy link
Author

jgerken commented Aug 17, 2020

Thanks :) Yes, I used 1.7.0 but also tried 1.6.1 and 1.2.11 and the classification result was the same with all versions (did not check the nearest neighbors in details with the old versions, though).

@epruesse
Copy link
Owner

mind if I add the example to the unit tests?

@epruesse
Copy link
Owner

Simple enough. Classic int/float truncation issue. Should work now. I'm pushing 1.7.1, should be in bioconda within a day or so.

@jgerken
Copy link
Author

jgerken commented Aug 19, 2020

mind if I add the example to the unit tests?

No, go ahead.

The submitted fix only fixes the difference between the second and third execution. The problem that the first test case

./sina -i aligned.query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.95 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -vvv

returns Archaea;Halobacterota; instead of Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus; despite 8 of the 10 top hits having the latter classification is still present in 1.7.1.

@epruesse
Copy link
Owner

Darn - didn't see that part. Case of TL'DR I suppose.

@epruesse epruesse reopened this Aug 19, 2020
@epruesse
Copy link
Owner

You are correct in your observation. The algorithm does an LCA between the best match and the others. Guess that was usually so close to what it's supposed to do that no-one notices. This will take me a day or two to fix. Don't have enough time in the evenings.

@epruesse epruesse added this to the 1.8.0 milestone Sep 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants