Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to set the chromosome number if I use p_ctg.fa from hifiasm when running haphic pipeline #28

Closed
utpala101 opened this issue May 22, 2024 · 8 comments

Comments

@utpala101
Copy link

Thank you for your excellent work!
I wonder if I run the haphic pipeline correctly. I have a genome of 4G (a diploid, 2n=26), and I just use the p_ctg.fa file from hifiasm and HiC data. When I run the pipeline I set the nchr as 13, and I have a good result. But when I saw the issue I found that some users used combined hap*.p_ctg file to run the pipeline, and set the nchr as doubled (in my case as 26). So I wonder if I run it correctly, I just want a scaffold genome of haploid, and the HiC contact map of 13 chromosomes by the way.

Appreciate for your early reply!

@zengxiaofei
Copy link
Owner

Yes, you are correct. For your primary or haplotype-collapsed assembly, please set the nchr as 13.

@utpala101
Copy link
Author

@zengxiaofei Thank you so much for your early reply, I really appreciate your great work!

@melop
Copy link

melop commented May 28, 2024

To follow up with this question, the HIFIASM output that I have only produces gfa format files. Should I use gfa2fa tool to convert *.hap1.p_ctg.gfa and *.hap2.p_ctg.gfa into fasta format, concatenate them into one fasta file (asm.fa), then use BWA to map the HiC reads? But in this case, wouldn't most HIC reads receive a MAPQ of 0 due to multiple mapping and become filtered out by the recommended MAPQ filter of 1?

Best regards,
Rongfeng Cui

@zengxiaofei
Copy link
Owner

zengxiaofei commented May 28, 2024

Hi Rongfeng @melop,

Overall, there are two main strategies for scaffolding haplotype-resolved assemblies (hap*.p_ctg). The first is aligning all Hi-C reads to each haplotype and scaffolding them separately. Although this approach can preserve much more Hi-C reads for scaffolding, it may disregard possible chromosomal structural variations between haplotypes. For the second strategy, Hi-C reads are aligned to all haplotypes simultaneously, as you mentioned. Despite the fact that many Hi-C reads may be filtered out based on the MAPQ > 0 criterion, this method allows for the differentiation of possible chromosomal structural variations between haplotypes on a contact map.

The percentage of reads filtered out due to MAPQ > 0 is influenced by the heterozygosity level of the genome. When filtered Hi-C reads are sufficient for scaffolding, I would suggest using the second strategy. If the heterozygosity level is too low to maintain enough Hi-C reads, the first strategy may be the only choice.

Best,
Xiaofei

@melop
Copy link

melop commented May 29, 2024

Hi Xiaofei, I'm wondering if it would be possible to develop a tool to compare the agp files produced by both strategies and merge them together, so that you get a better assignment rate and can pickup any structural variation between haplotypes.

@zengxiaofei
Copy link
Owner

zengxiaofei commented May 29, 2024

Hi Xiaofei, I'm wondering if it would be possible to develop a tool to compare the agp files produced by both strategies and merge them together, so that you get a better assignment rate and can pickup any structural variation between haplotypes.

I think there is a paradox. If the Hi-C reads after filtering are sufficient to scaffold the haplotype-resolved assemblies, there is no need to use the first strategy. Conversely, If the Hi-C reads are inadequate for scaffolding, they will also fail to differentiate the SVs between haplotypes.

Additionally, Hi-C-based scaffolding may produce many errors in chromosome assignment, and ordering and orientation, especially for those complex genomes. Therefore, it is essential to carefully evaluate these potential SVs by using a contact map. Developing a "one-click" tool that can generate publication-quality pseudo-chromosomes remains a challenging task at present.

@melop
Copy link

melop commented Jun 11, 2024

Or perhaps if I rephrase my question: is there a way to identify genomic regions where the two haplotypes agree or disagree from HiC data? If so, one could partition the HiC data into haplotype-shared and haplotype-specific for scaffolding.

@zengxiaofei
Copy link
Owner

if I rephrase my question: is there a way to identify genomic regions where the two haplotypes agree or disagree from HiC data? If so, one could partition the HiC data into haplotype-shared and haplotype-specific for scaffolding.

Firstly, there could be a misunderstanding. The criterion MAPQ > 0 that we discussed above is used to filter out multiple mapping reads. That means some regions between haplotypes are identical in their sequences, Hi-C reads (often PE150 in length) cannot differentiate which position is correct (i.e. ambiguous mapping). When you are talking about haplotype-shared and -specific Hi-C reads, maybe you mean the Hi-C reads that can be used to identify SVs. They are totally different as Hi-C reads provide information for scaffolding mainly through the long-range interaction information within them. Imagine that there is a big SV between two haplotypes (e.g., an inversion in one of the haplotype), but the sequences of the two haplotypes are almost identical (except for the breakpoint of the inversion). Only a very small amount of Hi-C reads mapped to the breakpoint (± <150 bp) can be kept after MAPQ >0 filtering.

If you are talking about the Hi-C reads that can be used to identify SVs, there are two converse and contradictory tasks (a "chicken or the eggs" problem): (1) find shared and specific genomic regions between haplotypes using Hi-C reads; (2) find shared and specific Hi-C reads between haplotypes using genome sequences. Or you mean you want to find shared and specific Hi-C reads between haplotypes at the contig level? Even if this is possible, there are too many problems: (1) you need an accurate genome alignment between two haplotypes to liftover the genomic positions from one haplotype to another at the contig level, which may result in new errors; (2) how to integrate the results of haplotype-shared and -specific Hi-C reads?

If you are talking about the Hi-C reads that are filtered out or not filtered out by MAPQ>0. There is still a question: how to use the Hi-C reads with a MAPQ==0? The contigs that are not anchored to any chromosomes due to the MAPQ>0 criterion mean that Hi-C reads on them cannot accurately differentiate which chromosomes these contigs should be assigned to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants