Generalized method for the identification of DMRs from low coverage data

dinhdiep/cgDMR-miner

cgDMR-miner

Efficient method for the identification of differentially methylated CpG regions (DMRs) on whole genome bisulfite sequencing datasets with multiple groups

Download and installation:

Download cgDMR-miner from GitHub: git clone http://github.com/dinhdiep/cgDMR-miner.git

The cgDMR-miner Perl program is located in the "src" directory.

Example usage:

After downloading cgDMR-miner, you will see an "Example" directory. You may run a preliminary test using the examples in this folder.

To run with the Jensen-Shannon divergence variability scores, navigate into the "Example" folder and run the following from the command line:

../src/cgDMR-miner.pl -i mf_list -o bsseq_jsd -m jsd -c cpg_pos_table -s yes -p 0.01

The required inputs are the "mf_list" file and the "cpg_pos_table" file. See the example files for the input format. Sample names and chromosome names must not contain any spaces.
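The input rules above can be checked before a long run. The following is a minimal sketch in Python, not part of cgDMR-miner; the function name `check_mf_list` is hypothetical, and skipping blank lines is an assumption of this sketch rather than documented behavior of the program.

```python
def check_mf_list(lines):
    """Return a list of (line_number, message) problems found in mf_list records."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # assumption of this sketch: blank lines are skipped
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(
                (number, f"expected 3 tab-separated columns, found {len(fields)}")
            )
            continue
        sample, _path, chrom = fields
        # Sample names and chromosome names must not contain spaces.
        if any(ch.isspace() for ch in sample):
            problems.append((number, "sample name contains whitespace"))
        if any(ch.isspace() for ch in chrom):
            problems.append((number, "chromosome name contains whitespace"))
    return problems
```

To check a file, pass the open file handle, for example `check_mf_list(open("mf_list"))`; an empty list means no problems were found.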

Outputs

These output files are generated per chromosome.

  1. chromName.bed - chromosome, start, end positions for all the segmented regions
  2. MethylMatrix.chromName.levels.txt - weighted average methylation level matrix for all samples at DMRs
  3. chromName.HMM4.p_info - results table for the goodness of fit test over all possible regions
  4. chromName.HMM4.sites.txt - results table from 4 states HMM, one record per CpG position
  5. MethylSummary.chromName - results table for methylation variability computation, one record per CpG position (or one record per window when the sliding window is used)
  6. chromName.matrix.gz - compressed file for the smoothed methylation level matrix

Requirements:

  1. This program uses Rscript from R.
  2. This program uses Perl.

Note on memory requirements and parallelization

Memory requirements vary with the size of the individual chromosomes being processed and with the number of samples. For very large sample sizes, we suggest splitting large chromosomes into chunks and providing the chunks as separate segments. Use the -c option to point to a separate CpG position bed file for each chunk, and make sure that the mf_list file has the same number of records for each chromosomal segment.

For faster processing, different chromosomes can be processed in parallel via multiple different instances of cgDMR-miner with different input file lists. Large chromosomes can further be split into chunks for parallelization.
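One way to launch several instances at once is sketched below in Python. This is only an illustration under stated assumptions: the per-chromosome list names such as `mf_list.chr1` and `cpg_pos_table.chr1` are hypothetical, the option values are copied from the example command above, and each instance is given its own output directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(mf_list, out_dir, cpg_table, script="../src/cgDMR-miner.pl"):
    """Build one cgDMR-miner command line, mirroring the example usage."""
    return [script, "-i", mf_list, "-o", out_dir, "-m", "jsd",
            "-c", cpg_table, "-s", "yes", "-p", "0.01"]


def run_all(commands, max_workers=2):
    """Run each command as its own process; return the exit codes in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = pool.map(lambda cmd: subprocess.run(cmd).returncode, commands)
        return list(codes)


# Hypothetical usage, one instance per chromosome:
# commands = [
#     build_command(f"mf_list.{c}", f"out_{c}", f"cpg_pos_table.{c}")
#     for c in ("chr1", "chr2")
# ]
# exit_codes = run_all(commands, max_workers=2)
```

Keep `max_workers` small enough that the combined memory use of the concurrent instances fits on the machine.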

Example CpG bed list file for two segments of chromosome 1. The two files cpg.pos.chr1_part1.bed and cpg.pos.chr1_part2.bed were generated by splitting the CpG positions on chromosome 1 into two bed files, making sure that the split occurs at a position where there is a large gap between CpGs. To process chromosome 1 in parallel, split the input file lists into two sets, one for part1 and another for part2, and call a separate instance of cgDMR-miner on each.

chr1_part1	cpg.pos.chr1_part1.bed
chr1_part2	cpg.pos.chr1_part2.bed
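Choosing the split point at a large gap between CpGs can be sketched as follows. This Python helper is hypothetical and is not how the example files were necessarily produced; it assumes the positions are already sorted and searches for the widest gap within a window around the middle of the chromosome (the `window` fraction is an arbitrary choice).

```python
def split_at_largest_gap(positions, window=0.25):
    """Split sorted CpG positions into two chunks at the largest gap near the middle.

    `window` is the fraction of sites on either side of the midpoint to search.
    """
    n = len(positions)
    if n < 2:
        raise ValueError("need at least two positions to split")
    mid = n // 2
    span = max(1, int(n * window))
    lo = max(1, mid - span)
    hi = min(n - 1, mid + span)
    # Index i means "split between positions[i - 1] and positions[i]".
    best = max(range(lo, hi + 1), key=lambda i: positions[i] - positions[i - 1])
    return positions[:best], positions[best:]
```

Each returned chunk can then be written to its own bed file and listed under its own segment name, as in the example above.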

Example mf_list records for two samples at chromosome 1. Note that different chunks may point to the same methylation frequency file.

sample1	sample1.chr1.methylFreq	chr1_part1
sample1	sample1.chr1.methylFreq	chr1_part2
sample2	sample2.chr1.methylFreq	chr1_part1
sample2	sample2.chr1.methylFreq	chr1_part2
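Records like those above can be generated rather than typed by hand. The sketch below is a hypothetical Python helper, not part of cgDMR-miner; it writes one tab-separated record per sample per chunk, reusing the same methylation frequency file for every chunk of a chromosome.

```python
def chunked_mf_list(samples, chunks):
    """Build tab-separated mf_list lines: one record per sample per chunk.

    samples: dict mapping sample id -> path to its methylation frequency file
    chunks:  list of chunk ids, matching column 1 of the CpG bed list file
    """
    lines = []
    for sample, path in samples.items():
        for chunk in chunks:
            lines.append("\t".join([sample, path, chunk]))
    return lines


records = chunked_mf_list(
    {"sample1": "sample1.chr1.methylFreq", "sample2": "sample2.chr1.methylFreq"},
    ["chr1_part1", "chr1_part2"],
)
print("\n".join(records))
```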

Dependencies:

  1. R package 'bsseq' must be installed.
  2. R package 'pryr' must be installed.
  3. R package 'mhsmm' must be installed.
  4. Perl package Getopt::Std
  5. Perl package File::Path
  6. Perl package File::Basename
  7. Perl package Cwd
  8. Perl package Statistics::LSNoHistory
  9. Perl package Statistics::Basic
  10. Perl package Math::Random

Description of input options:

Input	Explanation of values
-i	A tab-separated file with three columns and one row per sample per chromosome. The columns are: 1. sample id, 2. path to methylation frequency file, 3. chromosome id corresponding to the methylation frequency file. Required.
-o	Name for the output directory. Required.
-m	Methylation segmentation mode; for chi-square use 'chsq' and for Jensen-Shannon divergence use 'jsd'. Default is 'jsd'. Note that the 'chsq' option may not work on datasets with very low methylation variability.
-d	An integer giving the minimum total depth required in each sample for DMR summarization. Default is 10.
-p	A floating-point number giving the p-value cutoff for generating the weighted average methylation level matrix. Default is 0.01.
-n	Number of CpGs to include in the sliding window. Default is 5. Note: the sliding window does not always generate results.
-s	Whether to perform smoothing with 'bsseq' instead of pooling adjacent CpGs. Value is 'yes' or 'no'. Default is 'yes'.
-c	A tab-separated file with two columns: (1) chromosome name, (2) path to the CpG positions bed file. For each chromosome, a bed file containing the CpG positions to be considered must be provided. If the bed file for a chromosome is missing, that chromosome will be ignored. Note that chromosome names should not contain any spaces. Required for smoothing.
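To illustrate how a minimum total depth such as the -d option could interact with a weighted average methylation level, here is a small Python sketch. It assumes that "weighted" means weighting each CpG by its read depth (pooling methylated and total reads over the region); this is an assumption for illustration and may differ from what cgDMR-miner actually computes. The function name `region_level` is hypothetical.

```python
def region_level(sites, min_depth=10):
    """Depth-weighted methylation level of one region in one sample.

    sites: list of (methylated_reads, total_reads) tuples, one per CpG.
    Returns None when the total depth over the region is below min_depth.
    """
    total = sum(depth for _, depth in sites)
    if total < min_depth:
        return None  # not enough coverage to summarize this region
    methylated = sum(m for m, _ in sites)
    # Pooling reads is the same as averaging per-CpG levels weighted by depth.
    return methylated / total
```

Under this assumption a low-coverage sample yields a missing value for the region rather than a noisy estimate.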

Examples

Quantifying methylation variabilities

Methylation variabilities are quantified for artificial examples of ten samples (s1 to s10) across five CpG sites (#1-5). In (a), the acceptable metrics are shown in bold, as they exhibit a decreasing trend from #1 to #3. The Jensen-Shannon distance metric is best able to distinguish a 0.1 difference from a 1.0 difference: it shows a 25-fold difference between #1 and #3, the largest of all the metrics. In (b), acceptable metrics should give the same value for examples #4 and #5.
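For readers unfamiliar with the metric, the sketch below computes an equal-weight generalized Jensen-Shannon divergence across samples at a single CpG in Python. It treats each sample's methylation level as a two-outcome (methylated/unmethylated) distribution. This is a textbook formulation shown for illustration; it is an assumption that it resembles the score used by cgDMR-miner, and it is not intended to reproduce the values in the figure.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a two-outcome distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def jsd_across_samples(levels):
    """Entropy of the average distribution minus the average per-sample entropy."""
    mean_level = sum(levels) / len(levels)
    mean_entropy = sum(binary_entropy(m) for m in levels) / len(levels)
    return binary_entropy(mean_level) - mean_entropy
```

The score is zero when all samples share the same methylation level and reaches 1 bit when half the samples are fully methylated and half are fully unmethylated.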
