SLURM Cluster

francosimonetti edited this page May 13, 2021 · 9 revisions

The following instructions assume that you have a working version of Tejaas installed in your $HOME directory. See the Installation instructions.

TL;DR: Here are the two files you will need (a bash script file and a template SLURM file) to run Tejaas on a SLURM cluster.

Run Tejaas on SLURM explained

We will create a small pipeline for running Tejaas genome-wide, parallelizing each chromosome in chunks to run on a SLURM array job. For that, we will need the following:

  1. Create a file named run_slurm_array.sh and define the paths to your important files, i.e., the genotype, expression and GENCODE files.
#!/bin/bash

GENOFILE="/path/to/data/CHR1.vcf.gz"
EXPRFILE="/path/to/data/expression.txt"
GENEINFO="/path/to/data/gencode.v26.annotation.gtf.gz"

# Some parameters for Tejaas
TEJAAS_BIN=/path/to/tejaas/bin/tejaas
TJMETHOD=jpa-rr
NULLMODL=perm
SNPTHRES=0.00000005
GENTHRES=0.01
SBETA=0.1

Keep in mind that Tejaas will not work on covariate-corrected and/or PEER-corrected expression. Correcting expression data by linear regression makes some singular values of the expression matrix equal to zero, which breaks the model assumptions. Tejaas instead incorporates a KNN correction for finding trans-eQTLs. If you would like to use your linearly covariate-corrected expression data, Tejaas can use it in a separate step for finding the trans-eQTL target genes. You can supply it as a separate file:

EXPR_CORR="/path/to/data/expression_linear_corrected.txt"
  2. Here we assume you have one vcf.gz file per chromosome. If you have one big VCF, omit this step. Create a file with the number of SNPs per chromosome like this. You only need to do this once.
for i in {1..22}; do
    # Count non-header lines only; '^#' avoids dropping records that contain '#' elsewhere
    echo "$i $(zcat /path/to/data/CHR${i}.vcf.gz | grep -vc '^#')" >> ntot_per_chromosome.txt
done

We will use it later to calculate the chunks on which we will run Tejaas.
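
As a quick sanity check of the counting step, here is a minimal sketch on a toy gzipped VCF. The file contents are made up for illustration, and `gzip -dc` is used in place of `zcat` as an assumption for portability. It counts records with a pattern anchored at the start of the line, so that a `#` inside a data record does not cause that record to be skipped.

```shell
#!/bin/bash
# Toy example: count variant records in a small gzipped VCF (hypothetical data).
tmpdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t1000\trs1\tA\tG\t.\tPASS\t.\n'
    printf '1\t2000\trs2\tC\tT\t.\tPASS\tNOTE=a#b\n'
} | gzip -c > "${tmpdir}/CHR1.vcf.gz"

# '^#' matches header lines only; -v inverts the match and -c counts the rest.
nsnps=$(gzip -dc "${tmpdir}/CHR1.vcf.gz" | grep -vc '^#')

# Same "chromosome count" layout as ntot_per_chromosome.txt
echo "1 ${nsnps}"
rm -r "${tmpdir}"
```

The second record contains a `#` in its INFO column and is still counted, which would not be the case with an unanchored `grep -v "#"`.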

  3. Define how many SNPs you want to run in each job. We use 40000 because a chunk of that size takes about 15-20 minutes on 8 cores with 450 samples. For each chromosome, calculate how many jobs you will need. We use CHRM=1 for now; this can later be wrapped in a for loop over chromosomes.
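CHRM=1
NMAX=40000
CHRM_NTOT_FILE="ntot_per_chromosome.txt"  # the file created in the previous step
NTOT=$( sed "${CHRM}q;d" "${CHRM_NTOT_FILE}" | awk '{print $2}' )
# Index of the last array task (tasks are numbered from 0); avoids an empty last chunk
NJOBS=$(( (NTOT - 1) / NMAX ))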
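
The per-chromosome calculation can be extended to every line of the counts file. The following is only a sketch: the counts are hypothetical, the variable names (`counts_file`, `nchunks`, `last_index`, `summary`) are introduced here for illustration, and it assumes that array indices start at 0, so the last index is one less than the number of chunks.

```shell
#!/bin/bash
# Toy counts file in the same "chromosome count" layout (hypothetical numbers).
counts_file=$(mktemp)
printf '1 85000\n2 80000\n3 39999\n' > "${counts_file}"

NMAX=40000
summary=""
while read -r chrm ntot; do
    # Ceiling division: number of chunks needed to cover ntot SNPs.
    nchunks=$(( (ntot + NMAX - 1) / NMAX ))
    # Array indices start at 0, so the last index is nchunks - 1.
    last_index=$(( nchunks - 1 ))
    echo "chr${chrm}: ${ntot} SNPs, ${nchunks} tasks (--array=0-${last_index})"
    summary="${summary}${chrm}:${last_index} "
done < "${counts_file}"
rm "${counts_file}"
```

Note that a chromosome whose SNP count is an exact multiple of NMAX (80000 here) gets two tasks, not three, so no task ends up with an empty SNP range.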
  4. Create a separate template file for the SLURM batch job and save it as tejaas_array.slurm. This way, you only need to replace a few placeholder strings with your own values and submit the batch job with sbatch. Here is a basic template:

Define parameters for the slurm queue

#!/bin/sh
#SBATCH -p medium  #<--- replace with your partition name
#SBATCH -t 0-6:00:00
#SBATCH -n 8
#SBATCH -N 1
#SBATCH -J _JOB_NAME
#SBATCH -o _JOB_NAME.out
#SBATCH -e _JOB_NAME.err

# Remember to load all the libraries and modules here such as intel compilers and intel MKL and MPI libraries for Tejaas to work

Calculate the SNP chunks to use

# Here we pass the information about how many SNPs per job and how many SNPs in total. This is how each job inside the job array will know which chunk of SNPs to use
NMAX=_NUM_MAX_
NTOT=_NUM_TOT_
startsnp=$(( NMAX * SLURM_ARRAY_TASK_ID + 1 ))
endsnp=$(( NMAX * (SLURM_ARRAY_TASK_ID + 1) ))
if [ $endsnp -gt $NTOT ]; then
    endsnp=${NTOT}
fi
INCSTRNG="${startsnp}:${endsnp}"

The variable SLURM_ARRAY_TASK_ID is an environment variable set by SLURM that holds the index of the current task within the job array. Here it takes integer values from 0 to NJOBS (inclusive).
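
To see which SNP range each task receives, the chunk arithmetic can be simulated outside SLURM by looping over task indices by hand. This is a sketch with deliberately small, hypothetical values of NMAX and NTOT; `task_id` stands in for SLURM_ARRAY_TASK_ID, and the last index is assumed to be computed as (NTOT - 1) / NMAX.

```shell
#!/bin/bash
# Hypothetical small values so the ranges are easy to read.
NMAX=4
NTOT=10
last_index=$(( (NTOT - 1) / NMAX ))

ranges=""
for task_id in $(seq 0 "${last_index}"); do
    # Same arithmetic as in the template, with task_id in place of SLURM_ARRAY_TASK_ID
    startsnp=$(( NMAX * task_id + 1 ))
    endsnp=$(( NMAX * (task_id + 1) ))
    if [ "${endsnp}" -gt "${NTOT}" ]; then
        endsnp=${NTOT}
    fi
    echo "task ${task_id}: ${startsnp}:${endsnp}"
    ranges="${ranges}${startsnp}:${endsnp} "
done
```

With these values the three tasks cover 1:4, 5:8 and 9:10, so the ranges are contiguous and the last one is clipped to NTOT.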

Input parameters for Tejaas and run it with mpi

# Other Tejaas parameters
RUN_PATH=_TJS_BINR
GENOFILE=_GT_FILE_
EXPRFILE=_EXPR_FL_
GENEINFO=_GEN_POSF
TJMETHOD=_TJ_METHD
NULLMODL=_NULL_MDL
OUTPRFIX=_OUT_PRFX
SNPTHRES=_SNP_CUT_
GENTHRES=_GEN_CUT_
SBETA=_SIG_BETA
CHROM=_CHRM_NUM
EXPRFILE_CORR=_EXPRCORR_

mpirun -n 8 ${RUN_PATH} --vcf          ${GENOFILE} \
                        --gx           ${EXPRFILE} \
                        --gtf          ${GENEINFO} \
                        --chrom        ${CHROM}    \
                        --method       ${TJMETHOD} \
                        --null         ${NULLMODL} \
                        --outprefix    ${OUTPRFIX} \
                        --psnpthres    ${SNPTHRES} \
                        --pgenethres   ${GENTHRES} \
                        --include-SNPs ${INCSTRNG} \
                        --prior-sigma  ${SBETA} \
                        --gxcorr       ${EXPRFILE_CORR} \
                        --knn 30 \
                        --cismask

  5. Now, let's go back to run_slurm_array.sh and complete our submission script. Define where the output should go and substitute all the values into the template file.
JOBPREFIX="tejaas_chr${CHRM}"
OUTDIR="/path/to/data/output"
OUTPREFIX="${OUTDIR}/chunk"

# create the job submission file
sed "s|_JOB_NAME|${JOBPREFIX}|g;
     s|_TJS_BINR|${TEJAAS_BIN}|g;
     s|_GT_FILE_|${GENOFILE}|g;
     s|_EXPR_FL_|${EXPRFILE}|g;
     s|_GEN_POSF|${GENEINFO}|g;
     s|_TJ_METHD|${TJMETHOD}|g;
     s|_NULL_MDL|${NULLMODL}|g;
     s|_OUT_PRFX|${OUTPREFIX}|g;
     s|_NUM_TOT_|${NTOT}|g;
     s|_NUM_MAX_|${NMAX}|g;
     s|_SNP_CUT_|${SNPTHRES}|g;
     s|_GEN_CUT_|${GENTHRES}|g;
     s|_SIG_BETA|${SBETA}|g;
     s|_CHRM_NUM|${CHRM}|g;
     s|_EXPRCORR_|${EXPR_CORR}|g;
     " tejaas_array.slurm > "${JOBPREFIX}.slurm"

sbatch --array=0-${NJOBS}%20 "${JOBPREFIX}.slurm"

This last line sends your job array to the SLURM queue. Note the %20 suffix, which limits the number of array tasks running simultaneously to 20.
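
Before submitting, it can be worth checking that no placeholder was left unreplaced in the generated file, since a forgotten sed rule or an empty variable would otherwise only surface once the job runs. The sketch below uses a three-line stand-in for the template and deliberately skips one substitution; the file names and the `leftover` variable are hypothetical, and the list of placeholders to check is assumed to be maintained by hand.

```shell
#!/bin/bash
# Minimal stand-in for tejaas_array.slurm with three placeholders.
workdir=$(mktemp -d)
cat > "${workdir}/template.slurm" <<'EOF'
#SBATCH -J _JOB_NAME
NMAX=_NUM_MAX_
NTOT=_NUM_TOT_
EOF

# Deliberately leave _NUM_TOT_ unreplaced to show the check firing.
sed "s|_JOB_NAME|tejaas_chr1|g;
     s|_NUM_MAX_|40000|g" "${workdir}/template.slurm" > "${workdir}/job.slurm"

# Look for each known placeholder as a fixed string in the generated file.
leftover=""
for ph in _JOB_NAME _NUM_MAX_ _NUM_TOT_; do
    if grep -qF -- "${ph}" "${workdir}/job.slurm"; then
        leftover="${leftover}${ph} "
    fi
done

if [ -n "${leftover}" ]; then
    echo "Unreplaced placeholders: ${leftover}"
fi
rm -r "${workdir}"
```

In a real submission script, a non-empty result could be used to stop before calling sbatch.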

  6. Finally, let's give execution permissions to our script and run it.
chmod +x run_slurm_array.sh
./run_slurm_array.sh