Interpreting methylation haplotypes #5

caalo · 2018-01-19T21:23:57Z

Hi Dinh,

I'm running your tool to extract methylation haplotypes from bam files, and I'm currently comparing the reads displayed in IGV with the methylation haplotypes generated from your tool. I'm seeing discrepancies, and I'm wondering if you can help me interpret the results.

My bam was aligned by bsmap, and the following command was used to generate the haplotype file:
sh scripts/bam2cghap_v1.sh WGBS ng.3805/MHBS.txt /allcpg/hg19.fa.allcpgs.txt.gz test.bam test

I compared the output with IGV:

The screenshot is from IGV's bisulfite CG mode, in which bases shown in blue are unmethylated at CG motifs, corresponding to T's in your haplotype block output. A methylated base would be colored red, corresponding to C's in your haplotype block output. I have added the genomic coordinates right below the reads in IGV, and below the screenshot is the output from bam2cghap_v1.sh.

Something isn't matching up: we don't see any reads with C's in IGV, and the T's don't account for all the reads in the region in IGV. I'm not sure whether you have used IGV as a gold standard to compare your methylation haplotypes against. Could the aligner (bsmap) be contributing to this?

A second example, from your data:
./scripts/make-mappable-bins.sh BAMfiles/CCT.bam 10
sh scripts/bam2cghap_v1.sh WGBS CCT.RD10_80up.genomecov.bed allcpg/hg19.fa.allcpgs.txt.gz BAMfiles/CCT.bam CCT

And here's the haplotype file in the region:

chr11:377279-377364 CCCCCCCCCCCCCC 1 377282,377288,377304,377312,377314,377316,377321,377324,377336,377340,377344,377346,377348,377356
chr11:377279-377364 CCCCCCCCCCTCCC 1 377282,377288,377304,377312,377314,377316,377321,377324,377336,377340,377344,377346,377348,377356
chr11:377279-377364 CCCCCCCCTTCCCC 1 377282,377288,377304,377312,377314,377316,377321,377324,377336,377340,377344,377346,377348,377356
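
For anyone checking lines like these against IGV by hand, here is a minimal Python sketch that splits a haplotype line into its four whitespace-separated fields and lists the CpG coordinates called as T (unmethylated). It is not part of the tool. The names `Haplotype`, `parse_hapinfo_line` and `unmethylated_sites` are hypothetical, and reading the third column as a read count is an assumption based only on the output shown in this thread.

```python
from typing import List, NamedTuple


class Haplotype(NamedTuple):
    region: str           # e.g. "chr11:377279-377364"
    pattern: str          # one character per CpG: C = methylated, T = unmethylated
    count: int            # assumed: number of reads supporting this pattern
    positions: List[int]  # CpG coordinates, one per character of `pattern`


def parse_hapinfo_line(line: str) -> Haplotype:
    """Split one haplotype line into its four whitespace-separated fields."""
    region, pattern, count, positions = line.split()
    # Positions are comma-separated in some outputs in this thread and
    # colon-separated in others, so normalise the separator first.
    coords = [int(p) for p in positions.replace(":", ",").split(",")]
    if len(coords) != len(pattern):
        raise ValueError("pattern length does not match number of positions")
    return Haplotype(region, pattern, int(count), coords)


def unmethylated_sites(hap: Haplotype) -> List[int]:
    """Return the CpG coordinates whose call is T (unmethylated)."""
    return [pos for pos, call in zip(hap.positions, hap.pattern) if call == "T"]


line = (
    "chr11:377279-377364 CCCCCCCCTTCCCC 1 "
    "377282,377288,377304,377312,377314,377316,377321,377324,"
    "377336,377340,377344,377346,377348,377356"
)
hap = parse_hapinfo_line(line)
print(unmethylated_sites(hap))  # [377336, 377340]
```

The coordinates printed this way can be looked up directly in the IGV window for the same region.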

Again, the haplotype file does not account for all the reads in IGV, and I have checked that none of the reads here are PCR duplicates.

Would be interested to hear what you think about this.

Thanks,
Chris

dinhdiep · 2018-01-20T03:53:56Z

Hi Chris,

I have not used IGV in bisulfite CG mode before, so I'm a bit confused. I have used IGV in normal mode, but C (blue) and T (red) on the forward strand inform the methylation for the CpG positions while G (orange) and A (green) inform the methylation for the reverse strand.

In the first IGV example, all of the forward reads were correctly turned into haplotypes: the first haplotype is from the second read from the top, and the second haplotype is from combining the PE reads, which are third and fifth from the top. But the last haplotypes were interpreted incorrectly. This is a known problem with the bam2cghap_v1.sh code, which expects a certain SAM flag before it treats a read as reverse strand. I suggest just using bam2cghap.sh, which is the latest version and which I know works for Bismark and BisReadMapper bam files or similar.

In the second IGV example, bam2cghap_v1.sh was run in WGBS mode, but CCT.bam is from an RRBS library. In WGBS mode bam2cghap_v1.sh will try to remove PCR duplicates based only on the starting map position of the reads; therefore, most of those reads were considered duplicates. bam2cghap_v1.sh might also be having trouble with the SAM flags here. It looks like three random reads were chosen for each unique starting map position and converted to haplotypes in the output. Let me try to run the code and figure out what happened in the IGV plot for this region.

Note that in the latest version of bam2cghap.sh, we took away the PCR duplicate removal, since duplicates are better removed with the samtools rmdup program.

Best,
Dinh
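
To illustrate why start-position deduplication removes so many RRBS reads, here is a minimal Python sketch. It is not the logic of bam2cghap_v1.sh. The function `dedup_by_start`, its `keep` parameter and the toy reads are all hypothetical. RRBS fragments begin at restriction sites, so many independent reads can share a start coordinate and get collapsed.

```python
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, int]  # (read name, chromosome, start position)


def dedup_by_start(reads: Iterable[Read], keep: int = 1) -> List[str]:
    """Keep at most `keep` reads per (chromosome, start) and drop the rest."""
    seen: Dict[Tuple[str, int], int] = {}
    kept: List[str] = []
    for name, chrom, start in reads:
        key = (chrom, start)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] <= keep:
            kept.append(name)
    return kept


# Toy example: three reads share a start position, as is common in RRBS.
reads = [
    ("r1", "chr11", 377279),
    ("r2", "chr11", 377279),
    ("r3", "chr11", 377279),
    ("r4", "chr11", 377300),
]
print(dedup_by_start(reads))          # ['r1', 'r4']
print(dedup_by_start(reads, keep=3))  # ['r1', 'r2', 'r3', 'r4']
```

With `keep=1`, two of the three reads starting at 377279 are discarded even though none of them need be a true PCR duplicate.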

caalo · 2018-01-26T20:14:38Z

Hi Dinh,

Thanks for your detailed reply regarding this. I first reran the CCT sample with the following commands:

sh ./scripts/bam2cghap_orig.sh allcpg/hg19.fa.allcpgs.txt.gz BAMfiles/CCT.bam CCT
touch CCT_hapinfo_list
echo CCT.cgPE.hapinfo.txt > CCT_hapinfo_list
sh ./scripts/cghap2matrix.sh CCT_hapinfo_list MHL CCT_MHB ng.3805/MHBS.txt

and looking at CCT.cgPE.hapinfo.txt.merged.hapinfo.txt is easier because I can look at one haplotype region at a time.

Here's one example that I have questions about:

Reading from the top, the first read has the same name as the third read, and the second read has the same read name as the fourth read. Here is the haplotype information:

chr6:33048642:33048705 TTT 1 33048653:33048678:33048687
chr6:33048642:33048705 CCC 1 33048653:33048678:33048687

The second haplotype is accounted for by reads 2 and 4 (reading from the top), but I'm not sure how the first haplotype is accounted for. There are a lot of mismatches in the last two reads, and I'm wondering if the tool is using those mismatches as methylation information for the haplotypes.

I see patterns of mismatch in which C->T happens on reads in which G->A is allowed by bisulfite conversion, and vice versa. For example, the third read is read 2 on the positive strand, so it should have G->A mismatches, but I see C->T mismatches instead.

Therefore, I'm wondering if your rules for bisulfite mismatches are different from what I (and IGV) expect to see in paired-end bisulfite-treated data, because after running my own data with the commands you suggested above and comparing to IGV, the haplotype information still doesn't match. Regarding what you said about the coloring in IGV:

I have used IGV in normal mode, but C (blue) and T (red) on the forward
strand inform the methylation for the CpG positions while G (orange) and A
(green) inform the methylation for the reverse strand.

I don't think that's quite right. The allowable mismatches from bisulfite-treated paired-end sequencing should be:

positive strand, read 1: C->T
positive strand, read 2: G->A
negative strand, read 1: C->T
negative strand, read 2: G->A

Because reads from bams are always stored in the positive orientation, the allowable mismatches for negative strands need to be complemented (which is what we see in IGV):

negative strand, read 1: G->A
negative strand, read 2: C->T
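
The as-displayed rule above can be written as a small Python sketch keyed on the SAM flag. The bit values 0x10 (read is reverse strand) and 0x80 (second read in pair) come from the SAM specification. The function name `expected_conversion` is hypothetical, and the rule encoded is only the one listed in this comment, not what the haplotype tool does.

```python
REVERSE = 0x10  # SAM flag bit: read is aligned to the reverse strand
READ2 = 0x80    # SAM flag bit: read is the second read of the pair


def expected_conversion(flag: int) -> str:
    """Return the bisulfite mismatch expected in the BAM for this SAM flag.

    Encodes the rule listed above: read 1 forward and read 2 reverse show
    C->T, while read 1 reverse and read 2 forward show G->A.
    """
    is_reverse = bool(flag & REVERSE)
    is_read2 = bool(flag & READ2)
    return "C->T" if is_reverse == is_read2 else "G->A"


# Common paired-end flag values and the mismatch this rule predicts.
for flag in (99, 147, 83, 163):
    print(flag, expected_conversion(flag))
```

Flags 99 and 147 (read 1 forward, read 2 reverse) give C->T, while 83 and 163 (read 1 reverse, read 2 forward) give G->A.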

Best,
Chris

dinhdiep · 2018-01-27T00:21:41Z

Hi Chris,

I downloaded the CCT.bam file and visualized it in IGV normal mode, and what I got matches the haplotypes.

[image: Inline image 1]

These BAM files were generated such that forward orientation means that the read was from the Crick strand and reverse orientation means that the read was from the Watson strand. Therefore, both read 1 and read 2 in PE reads would be oriented the same way since they both should have mapped to the Crick strand. Furthermore, all forward-oriented reads (from Crick) will have C->T and all reverse-oriented reads (from Watson) will have G->A, regardless of whether they are read 1 or read 2. I believe both BisReadMapper and Bismark will output BAM files in this way because we want to preserve the original strand information for each read.

~ Dinh
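
The orientation-only convention described in this reply can be sketched in Python as follows. This is a minimal illustration, not the code in bam2cghap.sh. The function `call_cpg` is hypothetical, and it assumes that reverse-oriented reads are inspected at the G of the CpG while forward-oriented reads are inspected at the C.

```python
from typing import Optional


def call_cpg(read_base: str, is_reverse: bool) -> Optional[str]:
    """Translate the base a read shows at a CpG into a haplotype character.

    Forward-oriented reads are read at the C of the CpG: C = methylated,
    T = unmethylated. Reverse-oriented reads are read at the G (assumption):
    G = methylated, A = unmethylated. Read 1 versus read 2 is ignored.
    Returns None for any other base, i.e. no usable call.
    """
    if is_reverse:
        table = {"G": "C", "A": "T"}
    else:
        table = {"C": "C", "T": "T"}
    return table.get(read_base.upper())


print(call_cpg("T", is_reverse=False))  # T
print(call_cpg("A", is_reverse=True))   # T
print(call_cpg("A", is_reverse=False))  # None
```

Under this convention a forward-oriented read 2 showing C->T is a valid unmethylated call, whereas a rule that also depends on the read number would treat it as an unexpected mismatch.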

caalo · 2018-01-30T18:51:46Z

Hi Dinh,

Is your data generated from a directional or a non-directional bisulfite sequencing protocol? This might explain the differences in how we are viewing the data.

Best,
Chris

dinhdiep · 2018-01-30T21:43:37Z

Hi Chris, The RRBS libraries are non-directional. Thanks, Dinh
