How to set the chromosome number if I use p_ctg.fa from hifiasm when running haphic pipeline #28

utpala101 · 2024-05-22T01:37:56Z

Thank you for your excellent work!
I wonder if I run the haphic pipeline correctly. I have a genome of 4G (a diploid, 2n=26), and I just use the p_ctg.fa file from hifiasm and HiC data. When I run the pipeline I set the nchr as 13, and I have a good result. But when I saw the issue I found that some users used combined hap*.p_ctg file to run the pipeline, and set the nchr as doubled (in my case as 26). So I wonder if I run it correctly, I just want a scaffold genome of haploid, and the HiC contact map of 13 chromosomes by the way.

Appreciate for your early reply!

zengxiaofei · 2024-05-22T02:33:13Z

Yes, you are correct. For your primary or haplotype-collapsed assembly, please set the nchr as 13.

utpala101 · 2024-05-22T07:00:39Z

@zengxiaofei Thank you so much for your early reply, I really appreciate your great work!

melop · 2024-05-28T00:30:37Z

To follow up with this question, the HIFIASM output that I have only produces gfa format files. Should I use gfa2fa tool to convert *.hap1.p_ctg.gfa and *.hap2.p_ctg.gfa into fasta format, concatenate them into one fasta file (asm.fa), then use BWA to map the HiC reads? But in this case, wouldn't most HIC reads receive a MAPQ of 0 due to multiple mapping and become filtered out by the recommended MAPQ filter of 1?

Best regards,
Rongfeng Cui

zengxiaofei · 2024-05-28T09:14:23Z

Hi Rongfeng @melop,

Overall, there are two main strategies for scaffolding haplotype-resolved assemblies (hap*.p_ctg). The first is aligning all Hi-C reads to each haplotype and scaffolding them separately. Although this approach can preserve much more Hi-C reads for scaffolding, it may disregard possible chromosomal structural variations between haplotypes. For the second strategy, Hi-C reads are aligned to all haplotypes simultaneously, as you mentioned. Despite the fact that many Hi-C reads may be filtered out based on the MAPQ > 0 criterion, this method allows for the differentiation of possible chromosomal structural variations between haplotypes on a contact map.

The percentage of reads filtered out due to MAPQ > 0 is influenced by the heterozygosity level of the genome. When filtered Hi-C reads are sufficient for scaffolding, I would suggest using the second strategy. If the heterozygosity level is too low to maintain enough Hi-C reads, the first strategy may be the only choice.

Best,
Xiaofei

melop · 2024-05-29T00:31:10Z

Hi Xiaofei, I'm wondering if it would be possible to develop a tool to compare the agp files produced by both strategies and merge them together, so that you get a better assignment rate and can pickup any structural variation between haplotypes.

zengxiaofei · 2024-05-29T01:38:58Z

Hi Xiaofei, I'm wondering if it would be possible to develop a tool to compare the agp files produced by both strategies and merge them together, so that you get a better assignment rate and can pickup any structural variation between haplotypes.

I think there is a paradox. If the Hi-C reads after filtering are sufficient to scaffold the haplotype-resolved assemblies, there is no need to use the first strategy. Conversely, If the Hi-C reads are inadequate for scaffolding, they will also fail to differentiate the SVs between haplotypes.

Additionally, Hi-C-based scaffolding may produce many errors in chromosome assignment, and ordering and orientation, especially for those complex genomes. Therefore, it is essential to carefully evaluate these potential SVs by using a contact map. Developing a "one-click" tool that can generate publication-quality pseudo-chromosomes remains a challenging task at present.

melop · 2024-06-11T06:57:05Z

Or perhaps if I rephrase my question: is there a way to identify genomic regions where the two haplotypes agree or disagree from HiC data? If so, one could partition the HiC data into haplotype-shared and haplotype-specific for scaffolding.

zengxiaofei · 2024-06-26T08:15:52Z

if I rephrase my question: is there a way to identify genomic regions where the two haplotypes agree or disagree from HiC data? If so, one could partition the HiC data into haplotype-shared and haplotype-specific for scaffolding.

Firstly, there could be a misunderstanding. The criterion MAPQ > 0 that we discussed above is used to filter out multiple mapping reads. That means some regions between haplotypes are identical in their sequences, Hi-C reads (often PE150 in length) cannot differentiate which position is correct (i.e. ambiguous mapping). When you are talking about haplotype-shared and -specific Hi-C reads, maybe you mean the Hi-C reads that can be used to identify SVs. They are totally different as Hi-C reads provide information for scaffolding mainly through the long-range interaction information within them. Imagine that there is a big SV between two haplotypes (e.g., an inversion in one of the haplotype), but the sequences of the two haplotypes are almost identical (except for the breakpoint of the inversion). Only a very small amount of Hi-C reads mapped to the breakpoint (± <150 bp) can be kept after MAPQ >0 filtering.

If you are talking about the Hi-C reads that can be used to identify SVs, there are two converse and contradictory tasks (a "chicken or the eggs" problem): (1) find shared and specific genomic regions between haplotypes using Hi-C reads; (2) find shared and specific Hi-C reads between haplotypes using genome sequences. Or you mean you want to find shared and specific Hi-C reads between haplotypes at the contig level? Even if this is possible, there are too many problems: (1) you need an accurate genome alignment between two haplotypes to liftover the genomic positions from one haplotype to another at the contig level, which may result in new errors; (2) how to integrate the results of haplotype-shared and -specific Hi-C reads?

If you are talking about the Hi-C reads that are filtered out or not filtered out by MAPQ>0. There is still a question: how to use the Hi-C reads with a MAPQ==0? The contigs that are not anchored to any chromosomes due to the MAPQ>0 criterion mean that Hi-C reads on them cannot accurately differentiate which chromosomes these contigs should be assigned to.

zengxiaofei closed this as completed Jul 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to set the chromosome number if I use p_ctg.fa from hifiasm when running haphic pipeline #28

How to set the chromosome number if I use p_ctg.fa from hifiasm when running haphic pipeline #28

utpala101 commented May 22, 2024

zengxiaofei commented May 22, 2024

utpala101 commented May 22, 2024

melop commented May 28, 2024

zengxiaofei commented May 28, 2024 •

edited

Loading

melop commented May 29, 2024

zengxiaofei commented May 29, 2024 •

edited

Loading

melop commented Jun 11, 2024

zengxiaofei commented Jun 26, 2024

How to set the chromosome number if I use p_ctg.fa from hifiasm when running haphic pipeline #28

How to set the chromosome number if I use p_ctg.fa from hifiasm when running haphic pipeline #28

Comments

utpala101 commented May 22, 2024

zengxiaofei commented May 22, 2024

utpala101 commented May 22, 2024

melop commented May 28, 2024

zengxiaofei commented May 28, 2024 • edited Loading

melop commented May 29, 2024

zengxiaofei commented May 29, 2024 • edited Loading

melop commented Jun 11, 2024

zengxiaofei commented Jun 26, 2024

zengxiaofei commented May 28, 2024 •

edited

Loading

zengxiaofei commented May 29, 2024 •

edited

Loading