Skip to content

Chromosome Overlap: code accompanying "Refining the genetic risk of breast cancer with rare haplotypes and pattern-mining" by Letsou et al. DOI: 10.26508/lsa.202302183

License

Notifications You must be signed in to change notification settings

wletsou/ChromosomeOverlap

Repository files navigation

Chromosome Overlap

This document outlines how to perform Chromosome Overlap in an HPC environment (LSF).  The goal is to obtain a list of closed haplotype patterns from (filtered meta-)chromosomes from a population consisting of cases and controls.

Initiation

Your input haplotype file haplotype_estimates.cases.${NAME}.txt should be of the following form:

sid rs0000001_A rs0000002_G rs0000003_C ...
0001 0 0 0 ...
0001 0 1 0 ...
0002 1 0 0 ...
0002 1 0 0 ...

Each row is chromosome from subject sid in the cases population; another file should exist for the controls.  Alternate alleles of SNP rsid_Alt are indicated with a 1 and reference alleles with a 0.  Such files are prepared from the scripts in scripts-haplotype_extraction.

Find all unique chromosomes in haplotype_estimates.cases.${NAME}.txt using

awk 'NR==1{char=length(NF-1)} NR>1{printf "1\t"; for (i=2;i<=NF;i++) {printf "%s%0"char"d_%s",(i>2?",":""),i-1,$i}; printf "\n"}' haplotype_estimates.cases.${NAME}.txt | awk 'BEGIN{OFS="\t"} {seen[$2]+=$1} END{for (i in seen) {print seen[i],i} }' > Pattern_combined.Iteration000.${NAME}_${PATTERN}.txt

where ${NAME} is a location label indicating the chromosome and position of the haplotypes, and ${PATTERN} is a suffix indicating the type of overlap (e.g., 2,j indicates pairwise overlap of and 2,3,j triple-wise).  These names will be used for combining the results of different jobs.  The output Pattern_combined.Iteration000.${NAME}_${PATTERN}.txt will be a two-column table of counts of meta-chromosomes:

2 01_1,02_0,04_0,...
1 01_1,02_1,03_0,...
... ...

Each meta-chromosome is a list of SNPs (indexed by their column numbers in your haplotype file) and their corresponding alleles (0 or 1).

Perform the initial overlap with at least 50 overlaps per job by running:

bsub -P ChromosomeOverlap -J ChromosomeOverlap_iteration_sub_parallel.v3.Iteration000 -oo ChromosomeOverlap_iteration_sub_parallel.v3.Iteration000.out -eo ChromosomeOverlap_iteration_sub_parallel.v3.Iteration000.err -R "rusage[mem=512]" "sh ChromosomeOverlap_iteration_sub_parallel.v3.sh \"${NAME}\" 2 \"${Pattern}\" 50 0"

The overlaps are partioned to nodes using index2combo2.R, which assigns a unique multiindex i_1,i_2,...,i_sigma corresponding to the Ith overlap of $SIGMA chromosomes in a σ-dimensional triangular array:

RScript index2combo2.R I=starting_index n=total_number_of_chromosomes sigma=$SIGMA

The value 50 indicates that the total number of overlaps per job should be 50, which value doubles until the total number of jobs required to complete all pairwise overlaps is below a threshold MAX_JOBS; 2 indicates that the overlaps should consist of two chromosomes each; and 0 indicates that no overlaps have been completed yet (corresponding to Iteration000).  The output of this step is a file Pattern_combined.Iteration001.${NAME}_${PATTERN}.txt, containing counts of meta-chromosomes and their corresponding SNP-alleles, and another file Closed_patterns_stats.${NAME}_${PATTERN}.txt, which keeps track of all the closed patterns found so far and the iteration at which they were discovered.  The program automatically halts after completing iteration 1 to allow the user to apply p-value-based filtering.

Filtering

Get the segregation of each pattern discovered at iteration 1 in Closed_patterns_stats.${NAME}_${PATTERN}.txt between cases and controls using:

bsub -P ChromosomeOverlap -J ChromosomeOverlap_haplotype_count_sub.v5.Iteration001 -oo ChromosomeOverlap_haplotype_count_sub.v5.Iteration001.out -eo ChromosomeOverlap_haplotype_count_sub.v5.Iteration001.err -R "rusage[mem=256]" "sh ChromosomeOverlap_haplotype_count_sub.v5.sh haplotype_estimates.cases.{NAME}.txt,haplotype_estimates.controls.${NAME}.txt Closed_patterns_stats.${NAME}_${PATTERN}.txt 50 \"1\" \"Iteration001.${NAME}\""

where again 50 is the initial number of overlaps assigned to each job and 1 instructs the program to only take patterns in Closed_patterns_stats.${NAME}_${PATTERN}.txt discovered at iteration 1.  The counts are then fed to ChromosomeOverlap_fisher_exact.R for computation of the p-values.  The result is a file fisher_exact.Iteration001.${NAME}.patterns_0000001-total_number_of_patterns.txt with OR in the fourth field and p-value in the fifth.  Filtering can then be applied to update the Pattern_combined file using:

awk 'NR==FNR{seen[$1]; next} ($2 in seen){print $0}' <(awk '(NR>1 && $5+0<1e-9 && $4+0>1){print $1}' fisher_exact.Iteration001.${NAME}.patterns_0000001-total_number_of_patterns.txt) Pattern_combined.Iteration001.chr11.69231642-69431642_2,j.txt > Pattern_combined.Iteration001.${NAME}_${PATTERN}.tmp
mv Pattern_combined.Iteration001.${NAME}_${PATTERN}.txt Pattern_combined_old.Iteration001.${NAME}_${PATTERN}.txt
mv Pattern_combined.Iteration001.${NAME}_${PATTERN}.tmp Pattern_combined.Iteration001.${NAME}_${PATTERN}.txt

which selects risk-increasing patterns with p-values less than 1e-9.  The unfiltered list is saved as Pattern_combined_old.Iteration001.${NAME}_${PATTERN}.txt.  To try a different threshold, first run:

mv Pattern_combined_old.${NAME}_${PATTERN}.txt Pattern_combined.${NAME}_${PATTERN}.txt:

Iteration

Once the filtered patterns are selected, run:

bsub -P ChromosomeOverlap -J ChromosomeOverlap_iteration_sub_parallel.v3.Iteration001.${NAME} -oo ChromosomeOverlap_iteration_sub_parallel.v3.${NAME}.out -eo ChromosomeOverlap_iteration_sub_parallel.v3.Iteration001.${NAME}.err -R "rusage[mem=512]" "sh ChromosomeOverlap_iteration_sub_parallel.v3.sh \"${NAME}\" 2 \"${PATTERN}\" 50 1"

to initiate the overlaps starting from iteration 1.  All new patterns discovered at each iteration will be added to Closed_patterns_stats.${NAME}_${PATTERN}.txt.  The process ceases when no new patterns are discovered, typically after five iterations have been completed.  The patterns can be assessed for association with disease using Fisher's exact test:

bsub -P ChromosomeOverlap -J ChromosomeOverlap_haplotype_count_sub.v5.Iteration001-005 -oo ChromosomeOverlap_haplotype_count_sub.v5.Iteration001-005.out -eo ChromosomeOverlap_haplotype_count_sub.v5.Iteration001=005.err -R "rusage[mem=256]" "sh ChromosomeOverlap_haplotype_count_sub.v5.sh haplotype_estimates.cases.{NAME}.txt,haplotype_estimates.controls.${NAME}.txt Closed_patterns_stats.${NAME}_${PATTERN}.txt 50 \"\" \"Iteration001-005.${NAME}\""

This code finds the counts of all patterns in Closed_patterns_stats.${NAME}_${PATTERN}.txt, starting from iteration 0.  But if only new patterns are desired, one can create a subset file

awk 'BEGIN{OFS="\t"} NR==FNR{seen[$2]; next} ($3 in seen || $1>1){print $0}' Pattern_combined.Iteration001.${NAME}_${PATTERN}.txt Closed_patterns_stats.${NAME}_${PATTERN}.txt > Closed_patterns_stats.Iteration001-005.${NAME}_${PATTERN}.txt

and run ChromosomeOverlap_haplotype_count_sub.v5.sh on it.

About

Chromosome Overlap: code accompanying "Refining the genetic risk of breast cancer with rare haplotypes and pattern-mining" by Letsou et al. DOI: 10.26508/lsa.202302183

Topics

Resources

License

Stars

Watchers

Forks

Packages