ADASTra pipeline release Susan 12.04.2021

A pipeline for processing ChIP-seq read alignments in BAM format to find allele-specific TF binding (ASB) events. It consists of 5 main parts:

A. SNP calling

This part uses GATK and PICARD tools for variant calling. The result is a vcf file with SNV calls in GATK vcf format.

B. Peak annotation and filtering

Homozygous SNVs, SNVs with fewer than 5 reads on each allele, and SNVs not present in the dbSNP common collection are filtered out of the vcf files obtained in the previous step. The remaining variants are annotated with ChIP-seq peaks from 4 different peak callers (if available in bed format).
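
The filtering rule above can be sketched in Python. This is a minimal illustration, not the pipeline's own code: the `Snv` record, its field names, and the `known_ids` set standing in for the dbSNP common collection are all assumptions made for the example. It also assumes the read-count rule means that both alleles need at least 5 reads for the SNV to be kept.

```python
from typing import NamedTuple

MIN_READS_PER_ALLELE = 5  # threshold stated in the text


class Snv(NamedTuple):
    """Hypothetical SNV record; field names are illustrative only."""
    rs_id: str      # dbSNP identifier, or "." if the variant has none
    genotype: str   # e.g. "0/1" (heterozygous) or "1/1" (homozygous)
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternative allele


def is_heterozygous(genotype):
    # Accept both unphased ("0/1") and phased ("0|1") genotype separators.
    alleles = genotype.replace("|", "/").split("/")
    return len(set(alleles)) > 1


def keep_snv(snv, known_ids):
    """Return True if the SNV passes all three filters."""
    return (
        is_heterozygous(snv.genotype)
        and snv.ref_reads >= MIN_READS_PER_ALLELE
        and snv.alt_reads >= MIN_READS_PER_ALLELE
        and snv.rs_id in known_ids
    )
```

For example, `keep_snv(Snv("rs1", "0/1", 10, 7), {"rs1"})` passes, while the same record with genotype `"1/1"` is rejected as homozygous.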

C. BAD calling

Background Allelic Dosage (BAD) estimation and full-genome BAD maps construction. See BABACHI.

D. Negative Binomial Mixture fit

Fitting read count distributions separately for reference and alternative alleles and each BAD with Negative Binomial Mixtures.
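
To make the term "negative binomial mixture" concrete, here is a minimal sketch of a two-component mixture probability mass function. The parameter names (`r`, `p1`, `p2`, `w`), the choice of exactly two components, and the shared `r` are assumptions for illustration; the pipeline's actual parameterization and fitting procedure are not described in this section.

```python
from math import exp, lgamma, log


def nb_pmf(k, r, p):
    """P(X = k) for a negative binomial: k failures before the r-th success.

    Computed in log space so that large read counts do not overflow.
    Requires r > 0 and 0 < p < 1.
    """
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1.0 - p))


def nb_mixture_pmf(k, r, p1, p2, w):
    """Weighted sum of two negative binomial components sharing r.

    w is the weight of the first component, 0 <= w <= 1.
    """
    return w * nb_pmf(k, r, p1) + (1.0 - w) * nb_pmf(k, r, p2)
```

Because each component is a proper distribution and the weights sum to one, the mixture values summed over all `k` also add up to one, which is a quick sanity check for any fitted parameters.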

E. Statistical evaluation of ASB

Performing one-tailed tests, aggregating the resulting P-values at the TF and cell type level using the Mudholkar-George method, and FDR-correcting the aggregated P-values. Evaluating ASB Effect Size.
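
The FDR-correction step can be illustrated with the Benjamini-Hochberg procedure. The text does not say which FDR method the pipeline uses, so Benjamini-Hochberg is an assumption here, and `bh_adjust` is a hypothetical helper written in plain Python rather than the pipeline's own code.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from largest to smallest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        candidate = pvalues[idx] * m / rank
        # Enforce monotonicity: an adjusted value never exceeds
        # the adjusted value of any larger p-value.
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted
```

In practice the required packages already cover both steps: `scipy.stats.combine_pvalues` accepts `method="mudholkar_george"`, and `statsmodels.stats.multitest.multipletests` accepts `method="fdr_bh"`. Whether the pipeline calls these exact functions is not stated here.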

Execution and installation

  1. Clone this repository to your machine or server
git clone https://github.com/autosome-ru/ADASTRA-pipeline/
  2. Fill in the paths to the required files (listed below) in the CONFIG.cfg file.
  3. Run python3 construct_parameters_python.py, then install the adastra package with pip3 install ./
  4. Execute pipline_start.sh <n_tr> <stage>
    n_tr is the maximum number of jobs,
    stage is a flag corresponding to the part of the pipeline you wish to start with (listed in order):
  • --create-reference create normalized genome and index
  • --snp-call GATK SNP calling
  • --peak-annotation peak annotation and filtering
  • --bad-call BAD estimation
  • --nb-fit fit negative binomial distributions
  • --pvalue-count evaluate statistical significance
  • --aggregate-pvalues perform cell-type and TF-level aggregation of p-values
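
For scripted runs, the invocation above can be assembled and validated before it is launched. This is a minimal sketch, assuming the script is started with `bash` from the repository root; `build_command` and `run_pipeline` are hypothetical helpers, not part of the pipeline.

```python
import subprocess

# Stage flags in pipeline order, as listed above.
STAGES = [
    "--create-reference",
    "--snp-call",
    "--peak-annotation",
    "--bad-call",
    "--nb-fit",
    "--pvalue-count",
    "--aggregate-pvalues",
]


def build_command(n_tr, stage):
    """Build the argument list for pipline_start.sh, validating inputs first."""
    if not isinstance(n_tr, int) or n_tr < 1:
        raise ValueError("n_tr must be a positive integer")
    if stage not in STAGES:
        raise ValueError("unknown stage: " + repr(stage))
    return ["bash", "pipline_start.sh", str(n_tr), stage]


def run_pipeline(n_tr, stage):
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(build_command(n_tr, stage), check=True)
```

Calling `run_pipeline(4, "--snp-call")` would then start the pipeline from SNP calling with at most 4 jobs, while a misspelled flag fails immediately instead of reaching the shell script.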

Required software

General

  1. Java SE 8
  2. Python >= 3.6
  3. GATK >= 4.0.12.0
  4. PICARD
  5. GNU Parallel

Python packages

numpy>=1.19.0
pandas>=1.1.0
scipy>=1.5.1
statsmodels>=0.11.1

Required files

To run the pipeline successfully, fill in the path for each of the following entries in the CONFIG.cfg file.

Directories

  • alignments_path = "/home/user/Alignments/" The directory with .bam files of experiment and control alignments. It should contain one subdirectory per experiment, named after the experiment, with the corresponding .bam files inside.
  • results_path = "/home/user/DATA/" A directory to save final ASB calls into.
  • intervals_path = "/home/user/interval/" A directory with peak calling data. It should contain a subdirectory for every caller (e.g. MACS), each holding zipped bed-like files with peak calls. File names are arbitrary but must end with .interval.zip, and peaks from different callers for the same experiment must have the same name.
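
The naming rule for peak files can be checked mechanically. The sketch below assumes the layout described above (one subdirectory per caller, peak files ending in `.interval.zip`); `peak_names_by_caller` and `missing_peaks` are hypothetical helpers, not part of the pipeline.

```python
from pathlib import Path

SUFFIX = ".interval.zip"


def peak_names_by_caller(intervals_path):
    """Map each caller subdirectory to the set of peak names it contains."""
    names = {}
    for caller_dir in sorted(Path(intervals_path).iterdir()):
        if caller_dir.is_dir():
            names[caller_dir.name] = {
                f.name[: -len(SUFFIX)]  # strip the .interval.zip ending
                for f in caller_dir.iterdir()
                if f.name.endswith(SUFFIX)
            }
    return names


def missing_peaks(intervals_path):
    """For each caller, list peak names that other callers have but it lacks."""
    names = peak_names_by_caller(intervals_path)
    all_names = set().union(*names.values()) if names else set()
    return {caller: sorted(all_names - found) for caller, found in names.items()}
```

A non-empty list is not necessarily an error, since peaks may simply be unavailable for some caller; it only flags names worth a second look, such as a file that was renamed for one caller but not the others.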

Files

  • master_list_path = "/home/user/PARAMETERS/Master-lines.tsv"
    A .tsv file with the following required columns (columns with other names are ignored), each row corresponding to a single experiment:
    '#EXP' - Unique experiment identifier. Must correspond to the folder in alignments_path with the bam file.
    TF_UNIPROT_ID - TF uniprot name, e.g. Q9GZV8 (or arbitrary TF identifier).
    CELLS - Name or identifier of cell type. Used in BAD maps grouping.
    READS - Used
    ALIGNS - name of corresponding .bam file without extension ('.bam').
    PEAKS - name of corresponding peak call files (without .interval.zip) or 'None'
    GEO - GSE of the study or 'None'
    ENCODE - encode id of the experiment or 'None'
    WG_ENCODE - wgEncode id of the experiment or 'None'
    READS_ALIGNED - Number of the reads aligned (or '' if no info available)

  • genome_path = "/home/user/REFERENCE/genome.fasta" Path to the reference genome file.

  • dbsnp_vcf_path = "/home/user/REFERENCE/dbsnp_common.vcf.gz" Path to the dbSNP common collection (gzipped)

  • repeats_path = "/home/user/repeats" Path to repeat annotation .bed file
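
Because every stage depends on the master list columns described above, it may help to check the file header before a long run. This is a minimal sketch using only the standard library; `REQUIRED_COLUMNS` is transcribed from the column list above, and `missing_columns` is a hypothetical helper, not part of the pipeline.

```python
import csv

REQUIRED_COLUMNS = [
    "#EXP", "TF_UNIPROT_ID", "CELLS", "READS", "ALIGNS",
    "PEAKS", "GEO", "ENCODE", "WG_ENCODE", "READS_ALIGNED",
]


def missing_columns(tsv_file):
    """Return required column names absent from the header of an open TSV file."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in REQUIRED_COLUMNS if name not in header]
```

The function takes an already opened text file, so it can be used as `missing_columns(open(path, newline=""))` on the file that master_list_path points to; an empty result means all required columns are present.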