Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RefCall overwrites variant #318

Open
drtconway opened this issue Jul 2, 2024 · 3 comments
Open

RefCall overwrites variant #318

drtconway opened this issue Jul 2, 2024 · 3 comments

Comments

@drtconway
Copy link

Hi!

Thanks for making a good tool!

We have been doing testing and benchmarking with (clinical) trio data, and have found a corner case where the behaviour of the merge between the pileup calls and the full alignment call results in a real variant being discarded.

The pileup VCF file contains the row

chr12   132836482   .   A   ACTCACAGTGACAGGCTCCCAGCAGGGCGCACGGCACTCACAGTGACAGGCTCCCAGCACGGCGCACGGCACTCACAGTGACAGGCTCCCAGCACGGCGCTCGGCC  28.66   PASS    P   GT:GQ:DP:AD:AF:PL   1/1:28:32:3,29:0.9062:56,48,0

And the full alignment has the row

chr12   132836482   .   A   .   0.00    RefCall F   GT:GQ:DP:AD:AF:PL   0/0:0:32:3:0.0938:990

The VCF produced by MergeVcf outputs the RefCall, which causes the insertion to be dropped.

It's a bit of a corner case, since the position of insertion variants in a VCF is the base before the inserted sequence, and that is ref, but in this case, we do want to report the insertion!

I don't fully understand how the positions of the RefCall variants are determined, so I can't tell if this is a freak event, or if there is a systematic process that means these collisions are likely to occur elsewhere. If you could explain how these RefCall events are generated, that would help us understand how the caller works.

Looking at the code of MergeVcf (e.g.

full_alignment_output_set.add((ctg_name, pos))
and
if (ctg_name, pos) in full_alignment_output_set:
), I think I understand how the merge code is dropping the insertion event. I can think of a few possible ways to fix the code, but I haven't thought through all the semantics in detail, to work out what the best way to fix the code would be.

Again, thanks for making a good tool, and I hope this report helps you make it even better.

Tom.

@aquaskyline
Copy link
Member

Many thanks Tom. I have some ideas how to fix it but can't tell which one does less harm to other variant candidates. Is it possible that you send me a minibam containing only the reads covering chr12:132836482? I gonna look into the details.

@drtconway
Copy link
Author

I'll have to consult - as I indicated, it's a clinical sample.

The smallest expansion would be to track RefCall entries, either by adding to the tuple in the set, or perhaps keeping a separate set. Then, if the pileup variant is an indel, it "overrides" the refcall. As noted, indels are kind of special in the sense that they are recorded at a ref position immediately before the actual variant. If the variant from the full alignment is not a RefCall then it is probably ok to drop the variant from the pileup.

If you're keen, I would suggest refactoring the code to do a "classic" merge - iterating over the two VCFs at the same time. That way the cases where things collide will appear as a natural clause in the inner loop, and you don't need memory proportional to the size of the full alignment VCF.

@drtconway
Copy link
Author

Unfortunately, I can't give you a BAM with the reads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants