Skip to content
Shijia Zhu edited this page Feb 26, 2018 · 1 revision

SMRTER: Single Molecule Real Time sequencing Epigenetic Refinement

N6-methyladenine (m6dA) has been discovered as a novel form of DNA methylation prevalent in eukaryotes, however, methods for high resolution mapping of m6dA events are still lacking. Single-molecule real-time (SMRT) sequencing has enabled the detection of m6dA events at single-nucleotide resolution in prokaryotic genomes, but its application to detecting m6dA in eukaryotic genomes has not been rigorously explored. Herein, we identified unique characteristics of eukaryotic m6dA methylomes that fundamentally differ from those of prokaryotes. Based on these differences, we describe the first approach for mapping m6dA events using SMRT sequencing specifically designed for the study of eukaryotic genomes, and provide appropriate strategies for designing experiments and carrying out sequencing in future studies. We apply the novel approach to study m6dA in green algae, and for the first time, human. We demonstrate a general method and guideline for mapping and rigorous characterization of m6dA events in eukaryotic genomes.

Contents

  1. Basic concepts
  2. Import SMRTER dependent R packages
  3. SMRTER input data
  4. Loading test data
  5. Calling FDR for multiple test correction
  6. Unbiased discovery of m6dA motifs
  7. Fraction calling at single molecule level
  8. Single-molecule strand-specific characterization
  9. Overlapping between SMRT-seq and other N6mA-detection methods
  10. Consensus analysis gives powerful and stable signals

1. Basic concepts

SMRTER loads IPDs observed at each position in the native DNA, and compares those IPDs to the values expected for the unmodified DNA, and performed the statistical analysis from the single molecule level.

  • native DNA the genome of interest, where there are modified bases
  • IPD SMRT sequencing monitors not only the pulse fluorescence associated with each incorporated nucleotide, but also the time between the incorporation events, termed the inter-pulse duration (IPD)
  • IPD ratio the ratio between the IPDs in native DNA and that in unmodified DNA at the same locus. IPD ratio is highly correlated with the presence of modifications of the nucleotide. The expected IPD value for unmodified DNA can come from either an in-silico control or a traditional control.
  • in-silico control trained by PacBio and shipped with the SMRTportal package. It predicts the IPD using the local sequence context around the current position
  • traditional control generated by sequencing unmodified DNA with the same sequence as the test sample. The traditional control sample is usually generated by whole-genome amplification (WGA) of the original sample.

Back to above

2. Import SMRTER dependent R packages

SMRTER depends on the R packages: "h5r", "pbh5" for cmp.h5 file operating, "Biostrings" for genome sequence analysis, "foreach", "doMC" for parallel computing, "ggplot2" and "RColorBrewer" for figure plotting. The following code is to import the SMRTER dependent R packages.

library( h5r )
library( pbh5 )
library( Biostrings )
library( foreach )
library( doMC )
library( ggplot2 )
library( RColorBrewer )
library( SMRTER )

Back to above

3. SMRTER input data

SMRTER does not use the raw SMRTseq reads (e.g. bax, bas, fastq files), but directly take as input the output from SMRTportal, so as to save from the sequencing alignments and summary of basic information. Two types of files are used: cmp.h5 and modifications.csv. The cmp.h5 file is a PacBio alignment file containing alignments and IPD information, supplies the kinetic data used to perform modification detection; the modifications.csv file is a comma-separated file, containing various statistics for the results of modification detection. The detailed explanation of each column in the modifications.csv file refers to kineticsTools. In our analysis, the modifications.csv file is required to at least have the following columns:

  1. refName reference sequence ID of this observation
  2. tpl 1-based template position
  3. strand '0' is the reference strand, '1' is opposite
  4. base the cognate base at this position in the reference
  5. score Phred-transformed pvalue that a kinetic deviation exists at this position
  6. coverage count of valid IPDs for the in-silico control mode or the mean of case and control coverage for the case-control mode
  7. ipdRatio the ratio between mean IPD in native DNA and that in control (either in-silico control or WGA)
  8. modelPrediction/controlMean normalized mean IPD for the in-silico control mode/case-control mode

To make it easier for the users to test the codes, we provide an illustrative SMRTseq dataset. It can be downloaded here (1.04G). The original data respectively used 20 and 18 SMRT cells to sequence the native DNA and WGA of green algae genome. We take as illustrative example the reads mapped to the scaffold 18 (Chlamydomonas genome JGI Version 9.1). It includes the following datasets:

  1. Algae_scaffold_18.fasta the reference sequence of green algae scaffold 18
  2. nat_modifications.csv the modifications.csv file of native DNA;
  3. nat_aligned_reads.cmp.h5 the cmp.h5 file of native DNA;
  4. wga_modifications.csv the modifications.csv file of WGA;
  5. wga_aligned_reads.cmp.h5 the cmp.h5 file of WGA.

Back to above

4. Loading test data

Users can first load the modifications.csv data and explore the format:

# check the help page of readSmrtCsv() function
?readSmrtCsv

# load the modifications.csv file of native DNA
nat <- readSmrtCsv( "nat_modifications.csv" )
wga <- readSmrtCsv( "wga_modifications.csv"  )

# filter the rows for only base 'A'
natA <- subset( nat , base=='A' )
wgaA <- subset( wga , base=='A' ) 

# check the format of the modifications.csv file
head(natA)
head(wgaA)

# load the reference
scaffold_18 = readDNAStringSet( "Algae_scaffold_18.fasta" )

Back to above

5. Calling FDR for multiple test correction

The genome-wide mapping of 5-methylcytosine (5mC) via bisulfite sequencing builds on the accurate base calling of Illumina sequencing. In contrast, SMRT sequencing-based detection of DNA modifications is facilitated through statistical tests comparing the observed distribution of IPD values at each nucleotide locus with the expected IPD value of the same base in the same sequence context, but without methylation. The latter IPD value is estimated from whole genome amplified (WGA; methylation free) samples. For a given genome, millions or billions of nucleotides are tested, making false positive calls due to multiple hypotheses testing a serious concern. Bonferroni correction and false discovery rate (FDR) can be used to account for the multiple hypothesis testing, however, Bonferroni correction is conservative and rigorous, so we choose FDR calculated by comparing the global distribution of IPD ratios (or p values from student’s T-test) in native and WGA samples. For a given genome, the FDR of m6dA detection conceptually reflects the fraction of false positives among total m6dA calls and depends on two major factors: the fraction of methylated adenines across the genome; and per-strand sequencing depth, (i.e. average number of IPD values for each strand of the genome reference). Users can use the following code to call FDR by comparing the IPD ratio of As in native DNA with WGA. As shown in Fig.1, IPD ratio of 4.5 shows FDR of 0.05 (blue dash).

# call FDR by comparing the IPD ratio of As in native DNA with WGA as background
fdr <- callFDR( natA$ipdRatio , wgaA$ipdRatio , thresList=seq(1,5,0.1) )

motif calling

Fig.1. FDR calling by comparing the IPD ratios of As in native DNA with those in WGA. The figures show the IPD ratio density plots of native DNA (red) and WGA (black). The right figure is enlarged from the left one. The blue dash represents FDR=0.05 at IPD ratio of 4.5.

Since the FDR depends on the fraction of methylated adenines across the genome, we can further improve it by focusing on only methylated motifs. Followed is an example of FDR calling by comparing the V[A]TB sites in native DNA with the As in WGA (Fig.2). The green dash represents the FDR=0.01 at IPD ratio of 4. This further motivates us to propose a method to call methylation motifs for the eukaryote genome.

# filter only As in the VATB motifs from the native DNA 
natVATB <- getSmrtOfMotif( smrt=natA , motif="VATB" , methy_pos=2 , genome=scaffold_18 )
# call FDR by comparing the IPD ratio of V[A]TB in native DNA with WGA as background
fdrVATB <- callFDR( natVATB$ipdRatio , wgaA$ipdRatio , thresList=seq(1,5,0.1)  )

motif calling

Fig.2 Motif specific methylation detection leads to more reliable m6dA calls with lower FDRs.

Back to above

6. Unbiased discovery of m6dA motifs

In contrast to bacteria, m6dA motifs in eukaryotic species are typically only weakly motif-driven, i.e. the motif-specific fraction of methylated motif sites across the genome, is often very low. When a motif is only methylated at a low fraction across the genome, it becomes difficult to differentiate between the enrichment of the motif due to m6dA events and the general enrichment of the motif reflecting the intrinsic sequence composition of an eukaryotic genome or certain regions of interests. To address this challenge in SMRT sequencing-based m6dA motif enrichment analysis, we develop a motif enrichment score that is calculated as the odds ratio between the frequency of a motif among putative m6dA sites (IPD ratio>r) and the frequency of the motif among all adenine sites in the genome. The reciprocal of this enrichment score approximates the FDR of the motif sites with m6dA events. Followed is the heatmap for the enrichment score (Fig.3). This is only a testing data from algae scaffold18, so it might be not exactly the same with the whole genome.

# call motif enrichment score for 4-mer putative m6dA motifs 
motifFdr <- callFDR_eachMotif( smrt=natA , genome=scaffold_18 , kmer=4 , meth_char = 'A' , cores=10 , thres=4 ) 
# draw heatmap for the enrichment score
drawMotifHeatmap( motifFdr , outputName="natA_vs_insilico.pdf" )

motif calling

Fig.3. A rigorous motif enrichment analysis reveals that V[A]TB (V = A, C or G and B = C, G or T) is the m6dA motif of in C. reinhardtii. Each 4x4 heatmap corresponds to all sixteen 4-mer motifs, for which 2nd and 3rd bases are fixed at the center/title (e.g. [A]A). The rows and columns in the heatmaps represent the first and last bases of a 4-mer motif. Each cell in the following 4x4 heatmaps shows the motif enrichment score based on native DNA sample.

Back to above

7. Fraction calling at single molecule level

Single-molecule analysis to estimate partial m6dA methylation. In the above methods and analyses, m6dA calling relies on IPDs pooled from separate molecules for each genomic locus. This aggregated analysis works well when each m6dA locus has nearly 100% methylation across all molecules. However, epigenetic heterogeneity is often observed in both bacteria and eukaryotic species, where only a fraction of cells are methylated at each genomic locus, i.e. partial methylation. Partial methylation is characterized by a locus-specific fraction of methylation . To better estimate partial methylation and study cell-to-cell m6dA heterogeneity in eukaryotic genomes, we developed a method for single molecule-resolution analysis of SMRT sequencing data. In brief, IPDs are grouped by molecules for each genomic site and compared to the expected IPD values. The IPD ratios of methylated (m6dA) and non-methylated sites (at single-molecule level) follow two normal distributions with means of 1 and ~4 and variances that decrease as per-molecule, per-strand sequencing coverage increases. Using m6dA sites with different levels of methylation fraction, subsampled from a well-characterized E. coli m6dA methylome, we found the single-molecule level analysis provides a more accurate estimation of partial methylation than existing aggregate methods without single-molecule level analysis. Two modes are implemented by SMRTER: in-silico control and traditional control (WGA):

  1. in-silico control:
fracInsilico <- smrterSyntheticControl( path="nat_aligned_reads.cmp.h5" , sitesInfo_interest=natVATB , split_chunk_num=2000 , cores=20 , type = 'callMod' , coverage_per_molecule=3 , number_of_molecule=5 , coverage_thres=15 , score_thres=0 )
resInsilico <- do.call( rbind , fracInsilico )
quantile( subset(resInsilico,ipdRatio>4.5)$frac , seq(0,1,0.1) , na.rm=T )	

  1. traditional control:
fracTraditional <- smrterTraditionalControl( case_path="nat_aligned_reads.cmp.h5" , cont_path="wga_aligned_reads.cmp.h5" , sitesInfo_interest=natVATB , split_chunk_num=2000 , cores=20 , type = 'callMod' , coverage_per_molecule=3 , number_of_molecule=5 , case_coverage=15 , cont_coverage=10 , score_thres=0 )
resTraditional <- do.call( rbind , fracTraditional )
quantile( subset(resTraditional,ipdRatio>4.5)$frac , seq(0,1,0.1) , na.rm=T )	

Users can simply explore the parameters and output values by using ?smrterSyntheticControl and ?smrterTraditionalControl to check the help page. We also explain the parameters as follows:

  • path a character value indicating the file path of .cmp.h5 file for native DNA in the in-silico control mode
  • case_path and cont_path character values indicating the file paths of .cmp.h5 file for native and WGA in the traditional control mode
  • sitesInfo_interest a data frame of smrt data indicating the sites of interest for the single molecule analysis. It should cover the basic information for those sites, including at least "refName", "strand", "tpl", and "modelPrediction"
  • split_chunk_num an integer value indicating the number of split chunks for parallel computing
  • cores an integer value indicating the number of computer cores for parallel computing
  • type a character values indicating the types of analysis. It takes on either "callMod" or "SMipdRatio", where "callMod" represents calling methylation fraction, and "SMipdRatio" represents collecting IPD ratio of single molecule for the following analysis
  • coverage_per_molecule an integer value indicating the threshold of number of subreads covering one single site from one molecule
  • number_of_molecule an integer value indicating the threshold of number of molecules covering one single site
  • coverage_thres an integer value indicating the threshold of number of subreads covering one single site from all molecules
  • score_thres an integer value indicating the threshold of modification score to filter sites of interest

The type of "callMod" returns a list with the name of refName and strand {refName.strand}. Each item is a data frame comprising the following columns:

  • tpl an integer value indicating the genomic location
  • tMean a numeric value indicating the capped mean of normalized IPDs observed at this position
  • tErr a numeric value indicating the capped standard error of normalized IPDs observed at this position
  • coverage an integer value indicating the count of valid IPDs
  • molNum an integer value indicating the count of valid molecules
  • modelPrediction a numeric value indicating the normalized mean IPD
  • ipdRatio a numeric value indicating tMean/modelPrediction
  • score an integer value indicating the Phred-transformed pvalue of modification detection
  • ipdRatio2 a numeric value indicating the mean ipd Ratio of modified molecules at a single site. Two Gaussian mixture models are learned for unmodified and modified distributions, where the mean for the unmodified distribution is fixed as 1, and the mean for the modified is learned (i.e., ipdRatio2)
  • frac a numeric value indicating the fraction of modified molecules at a single site estimated from the single molecule level

Back to above

8. Single-molecule strand-specific characterization

A unique advantage of SMRT sequencing is the ability to examine methylation states of the two reverse complementary VATB sites at the two strands of each molecule. This allows us to further characterize m6dA events at VATB sites in terms of full-, non- or hemi-methylation states at single-molecule resolution with strand specificity. We examined the methylated VATB sites detected by SMRT sequencing (FDR <0.05; Fig. 4). Consistently, most examined molecules were fully methylated on both strands (Fig. 4; top right corners). We also find that some VATB sites were hemi-methylated (Fig. 4; top left and bottom right corners), which could be right after DNA replication forks and haven’t been fully methylated yet. Interestingly, some VATB sites were non-methylated on both strands of single molecules (Fig. 4; bottom left corners), despite these loci having high consensus m6dA methylation levels, indicating possible cell-to-cell m6dA heterogeneity.

SMipdRatio <- smrterSyntheticControl( path="nat_aligned_reads.cmp.h5" , sitesInfo_interest= subset(natVATB,ipdRatio>4.5) , split_chunk_num=2000 , cores=20 , type = 'SMipdRatio' , coverage_per_molecule=3 , number_of_molecule=5 , coverage_thres=15 , score_thres=0 )
pdf("hemiMethylation.pdf")
ipdRatio.2str <- hemiMethylation(SMipdRatio,draw=TRUE)
dev.off()	

The type of "SMipdRatio" returns a list with the name of refName and strand {refName.strand}. Each item is a data frame comprising the following columns:

  • tpl an integer value indicating the genomic location
  • ipdRatio a numeric value indicating the IPD ratio at this position of one specific molecule

motif calling

Fig.4. Single molecule, strand-specific analysis of SMRT-seq data to examine full-, non- or hemi-methylation status at m6dA sites in green algae scaffold 18. The X- and Y- axes denote the single molecule, strand-specific IPD ratio of each pair of reverse-complementary VATB sites at the two strands of each single molecule.

Back to above

9. Overlapping between SMRT-seq and other N6mA-detection methods

In addition to SMRTseq, several other methods have been developed to map m6dA in eukaryotic genomes. DNA immunoprecipitation (DIP) with anti-N6-methyladenine antibodies followed by next-generation sequencing (m6dA-DIP-Seq) has identified genomic regions enriched for m6dA events in several species. The combination of m6dA-DIP-Seq and exonuclease digestion (m6dA-CLIP-exo-seq) provides increased resolution. However, due to the nature of antibody-based methods, both m6dA-DIP-seq and m6dA-CLIP-exo-seq, lack the ability to identify m6dA events at single nucleotide resolution. Furthermore, the immunoprecipitation process loses information necessary to study cell-to-cell epigenetic heterogeneity (i.e. partial methylation) in the cell population of interest. Thus, these antibody-based approaches are limited with respect to their ability to elucidate the characteristics of m6dA events at high resolution. A complementary approach was developed using m6dA-sensitive or m6dA-dependent restriction enzymes (REs), both of which provide single-nucleotide resolution and enable estimates of partial methylation at each nucleotide position. However, these restriction enzyme-based methods have a fundamental limitation in that they can only examine a limited set of motif sites (specific to the restriction enzymes used) and therefore provide a largely incomplete view of m6dAmethylome in any given organism. In spite of limitation, these techniques can provide independent and complementary information for SMRT-seq to more confidently call m6dA. We therefore show the codes as follows, taking m6dA-DIP-seq as an independent method to cross-validate the SMRT-seq-detected m6dA. The testing dataset for green algae chromosome_1 can be found here. It includes the following datasets:

  1. Algae_chromosome_1.fasta the reference sequence of green algae chromosome_1
  2. gene_annot_chromosome_1.txt the gene annotation
  3. nat_modifications_chromosome_1.csv the SMRT-seq modifications.csv file
  4. m6dA_DIPseq_chromosome_1.txt the m6dA-DIP-seq peak bed file
# load data of green algae chromosome_1
# reference of chromosome_1
chromosome_1 <- readDNAStringSet("Algae_chromosome_1.fasta") 

# gene annotation
gene <- read.table( 'gene_annot_chromosome_1.txt' , sep='\t') 

# m6dA-DIP-seq peak bed file
dip <- read.table('m6dA_DIPseq_chromosome_1.txt',sep='\t') 

# SMRT-seq modification csv file
natChr1VATB  <- readSmrtCsv( 'nat_modifications_chromosome_1.csv' ) 

# filter IPD>4.5 (FDR<0.05) as putative m6dA 
methVATB <- subset( natChr1VATB , ipdRatio>4.5 ) 

Fig. 5 shows the overlap between SMRT-seq and m6dA-DIP-seq. The overlap increases at more stringent cutoffs in SMRT-seq on per strand coverage and T-test p values (score).

# overlap between SMRT-seq and m6dA-DIP-seq
overlap = overlapWithBed( smrt=methVATB , bed=dip , chr='V1' , start='V2' , end='V3' , genome_size=nchar(chromosome_1) , coverage_thres=c(0,10,11,20,21,30) , score_thres=c(0,20,21,40,41,100) , meth_base='A' , tags = NULL , cores=10 )
subset(overlap,total_smrt>10 & !is.na(pval))  	

Notably, the SMRT-seq modifications.csv file was used here to overlap with m6dA-DIP-seq, since m6dA in green algae is largely fully methylated. However, for other eukaryotes genome, with largely partial m6dA, the SMRT-seq modification modifications.gff file is strongly recommended, because it can give more confident m6dA calling.

Overlap

Fig.5. Overlap between N6mA events (V[A]TB) detected by SMRT-seq (single nucleotide calls) and N6mA-DIP-seq (peaks).

Back to above

10. Consensus analysis gives powerful and stable signals

Notably, compared to prokaryotes, it is still much more difficult to call single nucleotide resolution m6dA in eukaryotes. The Adenine in eukaryotes is largely partially methylated, and the IPD ratio of that Adenine could be very low due to a very small single base methylation fraction, therefore raising challenge for cutoff selection; moreover, the coverage required to achieve a certain level of FDR estimated is specifically for single nucleotide resolution m6dA calling. However, for other types of epigenomic analysis such as motif discovery and consensus analysis across multiple genomic sites (e.g. transcription start sites etc.), the requirement on the methylation fraction and sequencing depth can be lower depending on specific m6dA patterns in different organisms. As follows, we give the codes to investigate the consensus pattern of IPD ratio distribution around green algae TSS, where the average IPD ratio (normalized by motif frequency) for V[A]TB motifs are plotted around 1,253 TSS’s. Here, a periodic pattern of IPD ratio was observed (Fig. 6), and by further investigation, this pattern is shown to be inversely correlated with the nucleosome positioning.

# consensus analysis along gene TSS
consensus = getConsensus( smrt=natChr1VATB , window=2000 , annot=gene , chr='V1' , strand='V7' , start='V4' , criterion="ipdRatio" , smooth=10 , cores=10 )
pdf('consensus.pdf',h=5,w=10)
plot(c(-2000:2000),consensus,type='l',main='Consensus analysis of V[A]TB at gene TSS',xlab='Distance to TSS (bp)',ylab='Normalized IPD ratio of V[A]TB')
dev.off()

consensus

Fig.6. The consensus showed that V[A]TB motifs have a periodic pattern of IPD ratio distribution around TSS’s.

Back to above