Output HyperLogLog #60

Open
lutfia95 opened this issue Nov 19, 2020 · 3 comments

lutfia95 commented Nov 19, 2020

Hi,

I have a question about the output of HLL when I run Dashing with HyperLogLog, i.e.:
./dashing hll -k15 -p2 -S24 read.fastq reference.fasta

The output from HLL is then:
Estimated number of unique exact matches: 2925637.000000

What kind of matches does HLL count? I assumed it was the k-mer matches between the inputs (read and reference).
If HLL counted k-mer matches, the result shouldn't be 2925637, because my read is 1628 bp long and the reference is about 3000000 bp.

My goal is to count the k-mer matches between the read and the reference. Are the matches counted by HLL k-mer matches, or some other kind of match?

Best,

Ahmad

Owner

dnbaker commented Nov 19, 2020

dashing hll simply computes the cardinality of all sequences provided to it, i.e. an estimate of the number of distinct k-mers across all inputs taken together, which I don't think is what you want.

If you want to know how many unique k-mers overlapped, then you'd run dashing cmp -k15 -p2 --sizes read.fastq reference.fasta or dashing cmp --wj-exact -k15 -p2 --sizes read.fastq reference.fasta.

--sizes makes it emit the number of unique k-mers in the intersection, while --wj-exact makes it count overlapping k-mers with multiplicity (the total number), not just the unique ones. Does that help?
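To make the distinction concrete, here is a minimal sketch using exact Python sets rather than sketches. It is not Dashing's implementation: the `kmers` helper and the toy sequences are made up for illustration, and it ignores reverse complements and any canonicalization a real tool may apply. The size of the union corresponds to what a cardinality estimate over both inputs measures, and the size of the intersection to the number of unique shared k-mers.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq (hypothetical helper, no canonicalization)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Toy data: the read is an exact substring of the reference.
reference = "ACGTACGTTAGCCGATAGCTAGGCTA"
read = "TAGCCGATAGC"
k = 5

ref_kmers = kmers(reference, k)
read_kmers = kmers(read, k)

# Distinct k-mers over both inputs combined (what a cardinality estimate targets).
union_size = len(ref_kmers | read_kmers)
# Distinct k-mers present in both inputs (what an intersection size targets).
shared_size = len(ref_kmers & read_kmers)

print("unique k-mers in union:       ", union_size)
print("unique k-mers in intersection:", shared_size)
```

Because the read is contained in the reference here, the union is no larger than the reference alone, which is why a union cardinality is dominated by the reference and says little about the read.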

Author

lutfia95 commented Nov 19, 2020

That helps, thanks. I also have a follow-up question, just to be sure.
How should I interpret this output?

./dashing cmp -k31 -p2 --sizes read.fastq reference_.fasta

#Path Size (est.)
reference.fasta 2824048
read.fastq 1623
##Names reference_.fasta read.fastq
reference.fasta - 1623.46
oneread.fastq - -


2824048: the number of k-mers in my reference
1623: the number of k-mers in my read
1623.46: is this the total number of k-mers that overlap? I am not sure exactly what this number means.

If I run:

./dashing cmp --wj-exact -k31 -p2 read.fastq reference.fasta

the last output is: 0.00054983
Could you please explain what both outputs mean?
Thanks,

Owner

dnbaker commented Nov 19, 2020

Hi,
The first one means that, by its estimate, the smaller sequence is almost entirely contained in the larger sequence.
The second command line says that about 0.05% of the k-mers in the union are shared. If you were to add --sizes to it, it would emit something close to 1623. (1623 / 2824048 ~= 0.0006)

--sizes causes the number emitted to be an approximate number of k-mers, while the default is Jaccard similarity (the fraction of k-mers in the union that are shared).
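The relationship between the two kinds of output can be checked with a little arithmetic. The sketch below assumes the standard definition J = |A ∩ B| / |A ∪ B| together with inclusion-exclusion; the function names are hypothetical, and the numbers are the ones quoted in this thread.

```python
def jaccard(size_a, size_b, intersection):
    """Jaccard similarity from set sizes, using |A u B| = |A| + |B| - |A n B|."""
    union = size_a + size_b - intersection
    return intersection / union


def intersection_from_jaccard(size_a, size_b, j):
    """Invert J = I / (A + B - I) to get I = J * (A + B) / (1 + J)."""
    return j * (size_a + size_b) / (1.0 + j)


ref_size = 2824048   # estimated unique k-mers in the reference (from the thread)
read_size = 1623     # estimated unique k-mers in the read (from the thread)

# If the read were fully contained in the reference, the intersection is the read itself.
j_if_contained = jaccard(ref_size, read_size, read_size)

# Intersection size implied by the Jaccard value reported in the thread.
implied_shared = intersection_from_jaccard(ref_size, read_size, 0.00054983)

print(f"Jaccard if read is fully contained: {j_if_contained:.6f}")
print(f"Shared k-mers implied by J=0.00054983: {implied_shared:.0f}")
```

Both directions land in the same range: a Jaccard value of a few hundredths of a percent and an intersection on the order of the read's own k-mer count.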

If you want to get rid of the randomness from the sketch, you can pass --use-full-khash-sets.
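As an illustration of why sketch-based numbers such as 1623.46 are not whole counts, here is a small textbook-style HyperLogLog compared against an exact set. This is a simplified sketch under stated assumptions, not Dashing's code: the `TinyHLL` class, its `log2_registers` parameter, the BLAKE2b hash, and the synthetic items are all choices made for this example.

```python
import hashlib
import math


class TinyHLL:
    """Minimal textbook HyperLogLog (hypothetical class, for illustration only)."""

    def __init__(self, log2_registers):
        self.b = log2_registers
        self.m = 1 << log2_registers          # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")     # 64-bit hash value
        idx = h >> (64 - self.b)              # top bits select a register
        rest = h & ((1 << (64 - self.b)) - 1) # remaining bits
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.b) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)      # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:          # small-range (linear counting) correction
            return m * math.log(m / zeros)
        return raw


items = [f"kmer{i}" for i in range(20000)]    # synthetic stand-ins for k-mers

hll = TinyHLL(log2_registers=12)
exact = set()
for item in items + items:                    # duplicates must not change either count
    hll.add(item)
    exact.add(item)

print("exact distinct count:", len(exact))
print("HLL estimate:        ", round(hll.estimate(), 2))
```

The exact set always reports the true count but stores every item, whereas the sketch uses a fixed number of small registers and returns a fractional estimate with a small relative error.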
