Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpreting methylation haplotypes #5

Open
caalo opened this issue Jan 19, 2018 · 5 comments
Open

Interpreting methylation haplotypes #5

caalo opened this issue Jan 19, 2018 · 5 comments

Comments

@caalo
Copy link

caalo commented Jan 19, 2018

Hi Dinh,

I'm running your tool to extract methylation haplotypes from bam files, and I'm currently comparing the reads displayed in IGV with the methylation haplotypes generated from your tool. I'm seeing discrepancies, and I'm wondering if you can help me interpret the results.

My bam was aligned by bsmap, and the following command was used to generate the haplotype file:
sh scripts/bam2cghap_v1.sh WGBS ng.3805/MHBS.txt /allcpg/hg19.fa.allcpgs.txt.gz test.bam test

I compared the output with IGV:

example

The screenshot is from IGV's bisulfite CG mode, in which everything in blue are bases that are not methylated at CG motifs, referring to T's in your haplotype block output. If a base is methylated, it would be colored red, referring to C's in your haplotype block output. I have added the genomic coordinates right below the reads in IGV, and below the screenshot is the output from bam2cghap_v1.sh

Something isn't matching up: we don't see any reads with C's in IGV, and the T's doesn't account for all the reads in the region in IGV. Not sure if you have use IGV as a gold standard to compare your methylation haplotypes. Could the aligner (bsmap) be contributing to this?

A second example, from your data:
./scripts/make-mappable-bins.sh BAMfiles/CCT.bam 10
sh scripts/bam2cghap_v1.sh WGBS CCT.RD10_80up.genomecov.bed allcpg/hg19.fa.allcpgs.txt.gz BAMfiles/CCT.bam CCT

screen shot 2018-01-19 at 3 34 00 pm

And here's the haplotype file in the region:

chr11:377279-377364 CCCCCCCCCCCCCC 1 377282,377288,377304,377312,377314,377316,377321,377324,377336,377340,377344,377346,377348,377356
chr11:377279-377364 CCCCCCCCCCTCCC 1 377282,377288,377304,377312,377314,377316,377321,377324,377336,377340,377344,377346,377348,377356
chr11:377279-377364 CCCCCCCCTTCCCC 1 377282,377288,377304,377312,377314,377316,377321,377324,377336,377340,377344,377346,377348,377356

Again, not all the reads in IGV account for the haplotype file, and I have checked that none of the reads here are PCR duplicates.

Would be interested to hear what you think about this.

Thanks,
Chris

@dinhdiep
Copy link
Owner

dinhdiep commented Jan 20, 2018 via email

@caalo
Copy link
Author

caalo commented Jan 26, 2018

Hi Dinh,

Thanks for your detailed reply regarding this. I first reran the CCT sample with the following commands:

sh ./scripts/bam2cghap_orig.sh allcpg/hg19.fa.allcpgs.txt.gz BAMfiles/CCT.bam CCT
touch CCT_hapinfo_list
echo CCT.cgPE.hapinfo.txt > CCT_hapinfo_list
sh ./scripts/cghap2matrix.sh CCT_hapinfo_list MHL CCT_MHB ng.3805/MHBS.txt

and the looking at CCT.cgPE.hapinfo.txt.merged.hapinfo.txt is easier because I can look at a haplotype region at a time.

Here's one example that I have questions about:

screen shot 2018-01-26 at 2 34 46 pm

Reading from the top, the first read has the same name as the third read, and the second read has the same read name as the fourth read. Here is the haplotype information:

chr6:33048642:33048705 TTT 1 33048653:33048678:33048687
chr6:33048642:33048705 CCC 1 33048653:33048678:33048687

The second haplotype is accounted by reads 2 and 4 (reading from the top), but I'm not sure how the first haplotype is accounted for. There is a lot of mismatches in the last two reads, and I'm wondering if the tool is using those mismatches as methylation information for the haplotypes.

I see patterns of mismatch in which C->T happens on reads in which G->A is allowed by bisulfite conversion, and vise versa. For example, the third read is Read 2, Positive Strand so it should have G->A mismatches, but I see C->T mismatches instead.

Therefore, I'm wondering if your rules of bisulfite mismatch is different than what I (and IGV) expect to see in pair end bislfuite treated data, because upon running my own data on the commands you suggested above and comparing to IGV the haplotype information still doesn't match. Regarding what you said about the coloring in IGV:

I have used IGV in normal mode, but C (blue) and T (red) on the forward
strand informs the methylation for the CpG positions while G (orange) and A
(green) informs the methylation for the reverse strand.

I don't think that's quite right. The allowable mismatches from bisulfite-treated pair-end sequencing should be:

positive strand, read 1: C->T
positive strand, read 2: G->A
negative strand, read 1: C->T
negative strand, read 2: G->A

Because reads from bams are always stored in the positive orientation, the allowable mismatches for negative strands need to be complemented (which is what we see in IGV):

negative strand, read 1: G->A
negative strand, read 2: C->T

Best,
Chris

@dinhdiep
Copy link
Owner

dinhdiep commented Jan 27, 2018 via email

@caalo
Copy link
Author

caalo commented Jan 30, 2018

Hi Dinh,

Is your data generated from a directional bisulfite sequencing protocol, or a non-directional bisulfite sequencing protocol? This might explain the differences we are viewing the data.

Best,
Chris

@dinhdiep
Copy link
Owner

dinhdiep commented Jan 30, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants