Configuration

Sander W. van der Laan edited this page Apr 17, 2023 · 5 revisions

Below, you can find the full list of customizable parameters included in the configuration file (pgstoolkit.conf).

Note that before running the toolkit, you will also need to change the SLURM settings at the top of the pgstoolkit.sh file. Also, make sure to remove the '/' (forward-slash) at the end of any directory variable.
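The trailing-slash rule can be applied mechanically before a value is written to the configuration file. The following is a minimal sketch, assuming the configuration file uses shell-style NAME="value" assignments (as the PRSICE_SETTINGS example on this page suggests). The helper name strip_trailing_slash and the example path are hypothetical and not part of the toolkit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PGSToolKit): remove one trailing '/'
# from a directory value, leaving the root directory "/" untouched.
strip_trailing_slash() {
    local dir="$1"
    if [ "$dir" = "/" ]; then
        printf '%s\n' "$dir"
    else
        # ${dir%/} drops a single trailing slash if one is present.
        printf '%s\n' "${dir%/}"
    fi
}

PROJECT_DIR="/hpc/projects/example/"    # placeholder path (assumption)
PROJECT_DIR="$(strip_trailing_slash "$PROJECT_DIR")"
echo "$PROJECT_DIR"
```

A value that has no trailing slash is returned unchanged, so the helper is safe to apply to every directory variable.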

General settings

R = required, O = optional

Parameter Description
PRSMETHOD Indicate which method to use [PLINK/RAPIDOPGS/PRSCS/PRSICE/NONE]. Pick NONE if you only wish to perform quality control.
PROJECTNAME Name of the project.
PROJECT_DIR Path to where the main analysis directory resides.
OUTPUT_DIRNAME Name of the output directory within the PROJECT_DIR directory.
SUBPROJECT_DIR_NAME Name of the (sub)project; this will be used to create subfolders within the output directory (OUTPUT_DIRNAME).
MAIN_WORKDIR_NAME Name of the working directory within the main analysis directory, used for temporary files.
LOG_DIRNAME Name of the subdirectory of the PROJECT_DIR directory used for storing log files.
QC Indicate whether quality control should be applied according to the MAF and INFO parameters. [YES/NO]
MAF Minimum minor allele frequency to keep variants, e.g. "0.005".
INFO Minimum imputation quality score to keep variants, e.g. "0.3".
KEEP_TEMP_FILES Keep the files temporarily generated by the toolkit at the end of the job. [TRUE/FALSE]
SAVE_CONFIG Save a copy of this configuration file along with the results. [TRUE/FALSE]
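As an illustration, the general settings could be filled in as below. This is a sketch under stated assumptions: the NAME="value" syntax is inferred from the PRSICE_SETTINGS example on this page, all paths and names are placeholders, and check_general_settings is a hypothetical helper rather than something the toolkit provides. The INFO range of 0 to 1 is also an assumption.

```shell
#!/usr/bin/env bash
# Illustrative general settings; every value is a placeholder.
PRSMETHOD="PRSCS"
PROJECTNAME="example_project"
PROJECT_DIR="/hpc/projects/example"     # no trailing slash
OUTPUT_DIRNAME="output"
SUBPROJECT_DIR_NAME="example_subproject"
MAIN_WORKDIR_NAME="workdir"
LOG_DIRNAME="logs"
QC="YES"
MAF="0.005"
INFO="0.3"
KEEP_TEMP_FILES="FALSE"
SAVE_CONFIG="TRUE"

# Hypothetical sanity check (not part of PGSToolKit).
check_general_settings() {
    case "$PRSMETHOD" in
        PLINK|RAPIDOPGS|PRSCS|PRSICE|NONE) ;;
        *) echo "PRSMETHOD must be PLINK, RAPIDOPGS, PRSCS, PRSICE or NONE" >&2; return 1 ;;
    esac
    case "$QC" in
        YES|NO) ;;
        *) echo "QC must be YES or NO" >&2; return 1 ;;
    esac
    # A minor allele frequency lies in 0..0.5; INFO is assumed to lie in 0..1.
    if ! awk -v maf="$MAF" -v info="$INFO" \
        'BEGIN { exit !(maf + 0 >= 0 && maf + 0 <= 0.5 && info + 0 >= 0 && info + 0 <= 1) }'; then
        echo "MAF or INFO is outside the expected range" >&2
        return 1
    fi
}

check_general_settings && echo "General settings look consistent"
```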

Input settings

Parameter Description RapidoPGS PRS-CS PRSice PLINK
BASEDATA Path to the file containing the base data. R R R R
BF_BUILD Build of the base file, e.g. "hg19" or "hg38". R
BF_ID_COL Name of the SNP ID column in the base file. R R R R
BF_CHR_COL Name of the chromosome column in the base file. R R
BF_POS_COL Name of the position column in the base file. R R
BF_EFFECT_COL Name of the effect allele column in the base file. R R R R
BF_NON_EFFECT_COL Name of the non-effect allele column in the base file. R R R
BF_STAT Type of measure in the BF_STAT_COL, either "beta" or "or". * R R
BF_STAT_COL Name of the beta/OR/effect size column in the base file. R R R R
BF_FRQ_COL Name of the effect allele frequency column in the base file. R/O**
BF_SE_COL Name of the column of the standard error of the beta/OR value. R
BF_PVALUE_COL Name of the column containing the P-values of the association test. R R R
BF_SBJ_COL Name of the column containing the sample size for each variant. R/O***
BF_SAMPLE_SIZE Sample size of the GWAS R/O*** R
BF_TARGET_TYPE "cc" for a case-control trait, "quant" for a quantitative trait. R
LDDATA Path to the linkage disequilibrium reference data. PRS-CS and PRSice require a different format. R**** O*****
VALIDATIONDATA Path to the directory containing the validation data, e.g. /hpc/data/_ae_originals. R R R R
VALIDATIONPREFIX Prefix of the validation files in BGEN format v1.2, excluding the chr-number and extension, e.g. aegs_combo_1kGp3GoNL5_RAW_chr. R R R R
VAL_REF_POS Position of the reference allele in the BGEN files relative to the alternative allele, ref-first, ref-last or ref-unknown. R R R
SAMPLE_FILE Path to the sample file. A description of the sample file format can be found here. R R R R
PRSICE_PHENOTYPE Phenotype that PRSice uses to find the best-fitting set of polygenic scores. This phenotype must be present in the sample file. R
PRSICE_PHENOTYPE_BINARY [TRUE/FALSE] indicating whether PRSICE_PHENOTYPE contains a binary phenotype. R
STATS_FILE Path to the stats file. O O O O
STATS_ID_COL Name of the stats file column containing the SNP IDs. These IDs must match the IDs that occur in the base file. O O O O
STATS_MAF_COL Name of the stats file column containing the minor allele frequency. O O O O
STATS_INFO_COL Name of the stats file column containing the imputation score. O O O O
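Because the BF_*_COL and STATS_*_COL parameters must name columns that actually exist, it can help to compare them with the file header before submitting a job. The sketch below assumes an uncompressed, whitespace- or tab-delimited file with a header line. The helper has_columns, the demo file, and the column names in it are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PGSToolKit): succeed only if every
# named column occurs in the first line of the given file.
has_columns() {
    local file="$1"; shift
    local header col
    # Put each header field on its own line (fields split on spaces/tabs).
    header="$(head -n 1 "$file" | tr -s ' \t' '\n')"
    for col in "$@"; do
        if ! printf '%s\n' "$header" | grep -qxF -- "$col"; then
            echo "Missing column: $col" >&2
            return 1
        fi
    done
}

# Small demo base file with made-up column names and values.
demo="$(mktemp)"
trap 'rm -f "$demo"' EXIT
printf 'SNP\tCHR\tBP\tA1\tA2\tBETA\tSE\tP\n' > "$demo"
printf 'rs1\t1\t1000\tA\tG\t0.02\t0.01\t0.04\n' >> "$demo"

BF_ID_COL="SNP"
BF_EFFECT_COL="A1"
BF_STAT_COL="BETA"
BF_PVALUE_COL="P"

has_columns "$demo" "$BF_ID_COL" "$BF_EFFECT_COL" "$BF_STAT_COL" "$BF_PVALUE_COL" \
    && echo "All configured columns were found"
```

The same helper could be pointed at the stats file with the STATS_*_COL values. A compressed file would first need to be decompressed to a stream.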

Performance settings

Parameter Description RapidoPGS PRS-CS PRSice PLINK
RUNTIME_QC Maximal duration of the quality control sub-job. O O O O
RUNTIME_PLINKSCORE Maximal duration of the PLINK score sub-job. R R R
RUNTIME_PLINKSUM Maximal duration of the PLINK sum sub-job. R R R
RUNTIME_RAPIDO Maximal duration of the RapidoPGS sub-job. R
RUNTIME_PRSICE Maximal duration of the PRSice sub-job. R
RUNTIME_PRSCS Maximal duration of the PRS-CS sub-job. R
RUNTIME_PRSCS_format Maximal duration of the PRS-CS format sub-job. R
MEMORY_QC Maximal amount of RAM used for the quality control sub-job. O O O O
MEMORY_PLINKSCORE Maximal amount of RAM used for the PLINK score sub-job. R R R
MEMORY_PLINKSUM Maximal amount of RAM used for the PLINK sum sub-job. R R R
MEMORY_RAPIDO Maximal amount of RAM used for the RapidoPGS sub-job. R
MEMORY_PRSICE Maximal amount of RAM used for the PRSice sub-job. R
MEMORY_PRSCS Maximal amount of RAM used for the PRS-CS sub-job. R
MEMORY_PRSCS_format Maximal amount of RAM used for the PRS-CS format sub-job. R
PRSICE_CPUS Maximum number of CPUs used for the PRSice sub-job. R
PRSCS_CPUS Maximum number of CPUs used for the PRS-CS sub-job. R

PRSice settings

PRSice calculates the PRS for all individuals in the target population for a given phenotype. For more on PRSice parameters go here.

Parameter Description Required
PRSICE_EXTRACT File containing SNPs to be included in the analysis. PRSice returns an error if it encounters duplicate SNPs; in that case it writes the non-duplicate SNPs to a file in the working directory. Set this parameter to the path of that generated file to avoid the error. O
PRSICE_EXCLUDE File containing SNPs to be excluded from the analysis. O
PRSICE_CLUMP_KB Distance for clumping in kb, the default is "250". R
PRSICE_CLUMP_P P-value threshold used for clumping, default is "1". R
PRSICE_CLUMP_R2 r2 threshold for clumping, default is "0.1". R
PRSICE_PERM Number of permutations to perform, default is "10000". R
PRSICE_THREADS Number of threads to use, e.g. "20". If set to "max", the number of threads is derived from the number of dedicated CPUs (PRSICE_CPUS). R
PRSICE_SETTINGS Some (not all) additional settings for PRSice, e.g. PRSICE_SETTINGS="--no-clump --print-snp --extract PRSice.valid --score sum --missing center". O
  • --score [option] - Method to calculate the polygenic score [avg/std/con-std/sum], e.g. "--score sum".
    • avg - Take the average effect size (default).
    • std - Standardize the effect size.
    • con-std - Standardize the effect size using mean and SD derived from control samples.
    • sum - Direct summation of the effect size.
  • --missing [option] - Way to handle missing genotypes.
    • mean_impute - Missing genotypes contribute an amount proportional to imputed frequency (default).
    • set_zero - Throw out missing observations.
    • center - Shift all scores to mean zero.
  • --no-clump - Skip clumping if you have already filtered the data; for most GWAS results you do want to use clumping.
  • --print-snp - Print a list of the SNPs used in the final model.
  • --seed [int] - Seed used for permutation, e.g. "91149214"; useful when you want to generate identical results from the same input.
  • --no-default - Remove all default options, including the default behaviour of PRSice of searching for MAF and info columns in the base file and using those to filter SNPs.
  • --allow-inter - Allow the generation of a large intermediate file to speed up the calculation.
  • --fastscore - Only calculate the PRS for threshold(s) set within the bar level.
  • --all-score - Gives the PRS for each individual at all thresholds specified by --bar-levels, instead of only the PRS for each individual that explains most of the variation in the phenotype.
  • --bar-levels - Level(s) of barchart to be plotted, e.g. "0.05,0.1".
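The "max" behaviour of PRSICE_THREADS described above can be sketched as follows. How the toolkit actually passes these values to PRSice is not documented on this page, so the mapping to the PRSice-2 options --thread, --clump-kb, --clump-p, --clump-r2 and --perm is an assumption. The helper resolve_threads and all values are illustrative.

```shell
#!/usr/bin/env bash
# Illustrative PRSice settings; every value is a placeholder.
PRSICE_CPUS="8"
PRSICE_THREADS="max"
PRSICE_CLUMP_KB="250"
PRSICE_CLUMP_P="1"
PRSICE_CLUMP_R2="0.1"
PRSICE_PERM="10000"
PRSICE_SETTINGS="--print-snp --score sum --missing center"

# Hypothetical helper: "max" means "use all CPUs dedicated to the sub-job".
resolve_threads() {
    local threads="$1" cpus="$2"
    if [ "$threads" = "max" ]; then
        printf '%s\n' "$cpus"
    else
        printf '%s\n' "$threads"
    fi
}

threads="$(resolve_threads "$PRSICE_THREADS" "$PRSICE_CPUS")"

# Assumed mapping of configuration values to PRSice-2 command-line options.
prsice_args="--thread $threads --clump-kb $PRSICE_CLUMP_KB --clump-p $PRSICE_CLUMP_P --clump-r2 $PRSICE_CLUMP_R2 --perm $PRSICE_PERM $PRSICE_SETTINGS"
echo "$prsice_args"
```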

PLINK settings

Below are the parameters for the PLINK allelic scoring function. This function is also used by RapidoPGS and PRS-CS, as those methods only compute effect sizes. Note that within this toolkit, PLINK is set to calculate the sum of the allele scores instead of the default average allele score. The reason is that per-chromosome averages cannot simply be added up to obtain a genome-wide score, whereas per-chromosome sums can. More on PLINK --score parameters here.
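The reasoning behind using sums can be made concrete: a per-chromosome average divides by a chromosome-specific allele count, so averages from different chromosomes are on different scales, while sums can be added directly. The sketch below adds per-chromosome score sums for each sample. The two-column file layout ("ID SCORE_SUM" with a header line), the file names, and the helper sum_scores are hypothetical and do not reflect the toolkit's actual intermediate files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add up per-chromosome score sums for each sample.
# Each input file has a header line followed by "sample_id score_sum" rows.
sum_scores() {
    awk 'FNR == 1 { next }                 # skip the header of every file
         { total[$1] += $2 }               # accumulate per sample
         END { for (id in total) printf "%s\t%.4f\n", id, total[id] }' "$@" | sort
}

# Two made-up per-chromosome files.
tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT
printf 'ID SCORE_SUM\nS1 0.10\nS2 -0.20\n' > "$tmpdir/chr1.txt"
printf 'ID SCORE_SUM\nS1 0.05\nS2 0.30\n'  > "$tmpdir/chr2.txt"

sum_scores "$tmpdir"/chr*.txt
```

For these made-up inputs the genome-wide sums are 0.1500 for S1 and 0.1000 for S2.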

Parameter Description Required
PLINK_SETTINGS Optional settings of PLINK, e.g. "center no-mean-imputation se zs". O
  • Dosage modifiers - Modify the default allelic dosages (you can only pick one).
    • center - Translates all dosages to mean zero.
    • variance-standardize - Transforms each variant's dosage vector to have mean zero, variance 1 (default, i.e. apply an additive model).
    • dominant - Causes dosages greater than 1 to be treated as 1.
    • recessive - Uses max(dosage-1,0) on diploid chromosomes.
  • no-mean-imputation - Throw out missing observations (default is to apply mean-imputation - similar to PRSice2 [https://choishingwan.github.io/PRSice/command_detail/]).
  • se - Causes the input coefficients to be treated as independent standard errors.
  • ignore-dup-ids - Don't throw an error when variant IDs occur multiple times in the input file (still prints warning).
  • list-variants[-zs] - Write the variant IDs used for scoring to 'plink2.sscore.vars[.zst]'.
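Since only one dosage modifier may be picked, a quick check on PLINK_SETTINGS can catch conflicting values early. This is a minimal sketch: count_dosage_modifiers is a hypothetical helper and the example value is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PGSToolKit): count how many of the
# mutually exclusive dosage modifiers appear in a settings string.
count_dosage_modifiers() {
    local n=0 word
    # Intentionally unquoted: split the settings string on whitespace.
    for word in $1; do
        case "$word" in
            center|variance-standardize|dominant|recessive) n=$((n + 1)) ;;
        esac
    done
    printf '%s\n' "$n"
}

PLINK_SETTINGS="center no-mean-imputation"    # placeholder value

if [ "$(count_dosage_modifiers "$PLINK_SETTINGS")" -gt 1 ]; then
    echo "PLINK_SETTINGS contains more than one dosage modifier" >&2
else
    echo "PLINK_SETTINGS dosage modifiers are consistent"
fi
```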

RapidoPGS settings

More on RapidoPGS parameters here.

Parameter Description Required
RP_filt_threshold Scalar indicating the ppi threshold (if filt_threshold < 1) or the number of top SNPs by absolute weights (if filt_threshold >= 1). R
RP_recalc Logical [TRUE/FALSE] indicating if weights should be recalculated after thresholding, only relevant if filt_threshold is defined. R
RP_ppi Scalar representing the prior probability, default is "1e-04". R
RP_prior The prior specifies that BETA at causal SNPs follows a centred normal distribution with standard deviation sd.prior. Sensible and widely used defaults are 0.2 for case-control traits and 0.15 * var(trait) for quantitative traits (selected if trait == "quant"). R
RP_REF Path to the reference file that the SNPs should be filtered and aligned to. This file should have 5 columns (CHR, BP, SNPID, REF and ALT) and should be in the same build as the summary statistics. O
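The two meanings of RP_filt_threshold (a ppi threshold below 1, a number of top SNPs at 1 or above) can be sketched as follows. The settings shown are placeholders in the assumed NAME="value" syntax, and describe_filt_threshold is a hypothetical helper that only reports which interpretation applies.

```shell
#!/usr/bin/env bash
# Illustrative RapidoPGS settings; every value is a placeholder.
RP_filt_threshold="0.01"
RP_recalc="TRUE"
RP_ppi="1e-04"
RP_prior="0.2"     # the default mentioned above for case-control traits

# Hypothetical helper: report how a filt_threshold value is interpreted.
describe_filt_threshold() {
    awk -v t="$1" 'BEGIN {
        if (t + 0 < 1) { print "ppi threshold" }
        else           { print "number of top SNPs" }
    }'
}

echo "RP_filt_threshold=$RP_filt_threshold is read as: $(describe_filt_threshold "$RP_filt_threshold")"
```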

PRS-CS settings

Look here for a more detailed description of PRS-CS parameters.

Parameter Description Required
PRSCS_THREADS Maximum number of threads. When empty, PRS-CS uses as many threads as there are CPUs dedicated to the job. O
BIM_FILE_AVAILABLE [YES/NO] indicating whether a .bim file is already available; if so, BIM_FILE_PATH must also be specified. A .bim file (https://www.cog-genomics.org/plink/2.0/formats#bim) is specific to the validation dataset and can be retrieved from the temporary files of a previous run. R
BIM_FILE_PATH Path to the .bim file. O
PRSCS_SETTINGS Optional settings of PRS-CS (e.g. PRSCS_SETTINGS="--a 1 --b 0.5 --chrom 1,3,5"). O
  • --a [value] - Parameter a in the gamma-gamma prior, default is 1.
  • --b [value] - Parameter b in the gamma-gamma prior, default is 0.5.
  • --phi [value] - Global shrinkage parameter, e.g. "1e-6", if not specified phi will be learned from the data.
  • --n_iter [int] - Total number of MCMC iterations, default is 1000.
  • --n_burnin [int] - Number of burnin iterations, default is 500.
  • --thin [int] - Thinning factor of the Markov chain, default is 5.
  • --chrom [list] - The chromosome(s) on which the model is fitted, separated by comma, e.g. 1,3,5, default is iterating through chromosomes 1-22.
  • --beta_std [bool] - [True/False] whether to return standardized posterior SNP effect sizes, default is False.
  • --seed [int] - Non-negative integer which seeds the random number generator.
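The dependency between BIM_FILE_AVAILABLE and BIM_FILE_PATH can be checked before a PRS-CS run. The following is a sketch only: check_bim_settings is a hypothetical helper, the path is a placeholder, and the check does not test whether the file exists.

```shell
#!/usr/bin/env bash
# Illustrative PRS-CS settings; every value is a placeholder.
PRSCS_THREADS=""
BIM_FILE_AVAILABLE="YES"
BIM_FILE_PATH="/hpc/projects/example/validation.bim"
PRSCS_SETTINGS="--a 1 --b 0.5 --chrom 1,3,5"

# Hypothetical helper (not part of PGSToolKit): BIM_FILE_PATH is only
# needed, and then mandatory, when BIM_FILE_AVAILABLE is YES.
check_bim_settings() {
    case "$BIM_FILE_AVAILABLE" in
        YES)
            if [ -z "$BIM_FILE_PATH" ]; then
                echo "BIM_FILE_AVAILABLE is YES but BIM_FILE_PATH is empty" >&2
                return 1
            fi
            ;;
        NO) ;;
        *)
            echo "BIM_FILE_AVAILABLE must be YES or NO" >&2
            return 1
            ;;
    esac
}

check_bim_settings && echo "BIM settings look consistent"
```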