Input format

Sander W. van der Laan edited this page Apr 17, 2023 · 5 revisions

Base file

For RapidoPGS, PRS-CS and PRSice2, provide the GWAS summary statistics containing the genotype-phenotype associations as the base file. For PLINK2, provide the file containing the posterior variant effect sizes (or weights) as the base file. Preferably, the base file should be .gz compressed, but plain text should also work. The file must be tab-delimited with a single header line. The table below lists the configuration file parameters for each method. R and O indicate that a parameter is 'required' or 'optional', respectively.

Parameter Description RapidoPGS PRS-CS PRSice PLINK
BASEDATA Path to the file containing the base data R R R R
BF_BUILD Build of the base file, e.g. "hg19" or "hg38" R
BF_ID_COL Name of the SNP ID column in the base file R R R R
BF_CHR_COL Name of the chromosome column in the base file R R
BF_POS_COL Name of the position column in the base file R R
BF_EFFECT_COL Name of the effect allele column in the base file R R R R
BF_NON_EFFECT_COL Name of the non-effect allele column in the base file R R R
BF_STAT Type of measure in the BF_STAT_COL, either "beta" or "or" * R R
BF_STAT_COL Name of the beta/OR/effect size column in the base file R R R R
BF_FRQ_COL Name of the effect allele frequency column in the base file R/O**
BF_SE_COL Name of the column of the standard error of the beta/OR value R
BF_PVALUE_COL Name of the column containing the P-values of the association test R R R
BF_SBJ_COL Name of the column containing the sample size for each variant R/O***
BF_SAMPLE_SIZE Sample size of the GWAS R/O*** R
BF_TARGET_TYPE "cc" for a case-control trait, "quant" for a quantitative trait R

* RapidoPGS might or might not support odds ratios.
** Required for RapidoPGS for quantitative traits only.
*** For quantitative traits using RapidoPGS, provide either BF_SBJ_COL or BF_SAMPLE_SIZE.
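Before starting a run, it can help to confirm that the base file header actually contains the columns named in the configuration. The following is a minimal sketch, not part of the pipeline itself: the function names and the example column names (`SNP`, `A1`, `BETA`) are hypothetical, and the only facts taken from this page are that the file is tab-delimited, has a single header line, and may be .gz compressed.

```python
import gzip


def parse_header(line):
    """Split a single tab-delimited header line into column names."""
    return line.rstrip("\r\n").split("\t")


def read_header(path):
    """Read the header of a plain-text or .gz compressed base file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_header(handle.readline())


def missing_columns(header, configured):
    """Return the configured column names that are absent from the header."""
    return [name for name in configured if name not in header]


# Hypothetical example: the values you would set for BF_ID_COL,
# BF_EFFECT_COL and BF_STAT_COL in the configuration file.
example_header = parse_header("SNP\tCHR\tBP\tA1\tA2\tBETA\tSE\tP\n")
print(missing_columns(example_header, ["SNP", "A1", "BETA"]))
```

For a real file, `missing_columns(read_header(path), [...])` should return an empty list when every configured column name is present.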

Target data

The target data contains the genotypes of individuals within a population. We currently only support target data in BGEN format v1.2. BGEN v1.1 and v1.3+ have not been tested, but might work. The following configuration file parameters related to the target data are required for all methods:

  • VALIDATIONDATA : path to the directory containing the validation data, e.g. /hpc/data/_ae_originals.
  • VALIDATIONPREFIX : prefix of the validation data excluding the chr-number and extension, e.g. aegs_combo_1kGp3GoNL5_RAW_chr.
  • VAL_REF_POS : position of the reference allele in the .BGEN files relative to the alternative allele: ref-first, ref-last or ref-unknown.

Note: The format of the IDs in BF_ID_COL must match the format of the SNP IDs in the target data (a.k.a. 'validation' data).
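To illustrate how VALIDATIONDATA and VALIDATIONPREFIX relate to the files on disk, the sketch below joins them with a chromosome number and an extension. The exact naming pattern (`<prefix><chr>.bgen`) is an assumption inferred from the description "excluding the chr-number and extension", and the helper name `bgen_path` is hypothetical; check it against your own file names.

```python
from pathlib import PurePosixPath


def bgen_path(validation_data, validation_prefix, chromosome, extension=".bgen"):
    """Build the assumed per-chromosome path: <dir>/<prefix><chr><extension>."""
    return PurePosixPath(validation_data) / f"{validation_prefix}{chromosome}{extension}"


# Uses the example values given for VALIDATIONDATA and VALIDATIONPREFIX above.
for chromosome in (1, 2, 22):
    print(bgen_path("/hpc/data/_ae_originals", "aegs_combo_1kGp3GoNL5_RAW_chr", chromosome))
```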

Sample file

Please provide a sample file in the SNPTEST sample file format. The identifiers in the ID column must match the identifiers of the target population. Phenotypes and covariates are not used in the polygenic score computations by PRS-CS, RapidoPGS and PLINK2. PRSice2, however, requires a single phenotype, which it uses to find the best-fitting set of polygenic scores across multiple P-value thresholds. This phenotype is supplied using the PRSICE_PHENOTYPE and PRSICE_PHENOTYPE_BINARY parameters. For PRSice2, the samples must also occur in the same order as in the BGEN files; otherwise PRSice2 will return an error.

  • SAMPLE_FILE : path to the sample file.
  • PRSICE_PHENOTYPE : phenotype that PRSice2 uses to find the best-fitting set of polygenic scores. This phenotype must be present in the sample file.
  • PRSICE_PHENOTYPE_BINARY : [TRUE/FALSE] indicating whether PRSICE_PHENOTYPE contains a binary phenotype.
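The two PRSice2 requirements above (the phenotype column exists, and the sample order matches the genotype data) can be checked with a short script. This is a sketch under stated assumptions: it relies on the SNPTEST sample layout of a column-name line followed by a column-type line, it takes the first column as the sample identifier, and it assumes you have already obtained the sample IDs in BGEN order by some other means. All function names and example values are hypothetical.

```python
def parse_sample_file(lines):
    """Parse SNPTEST-style sample lines into (column names, column types, records)."""
    rows = [line.split() for line in lines if line.strip()]
    return rows[0], rows[1], rows[2:]


def first_order_mismatch(sample_ids, genotype_ids):
    """Return the index of the first position where the two ID lists differ, or None."""
    for index, (left, right) in enumerate(zip(sample_ids, genotype_ids)):
        if left != right:
            return index
    if len(sample_ids) != len(genotype_ids):
        # One list is a strict prefix of the other.
        return min(len(sample_ids), len(genotype_ids))
    return None


# Hypothetical sample file content with a binary phenotype column "CAD".
example_lines = [
    "ID_1 ID_2 missing CAD",
    "0 0 0 B",
    "s1 s1 0 1",
    "s2 s2 0 0",
    "s3 s3 0 1",
]
columns, types, records = parse_sample_file(example_lines)
sample_ids = [record[0] for record in records]  # assumption: first column is the ID

print("CAD" in columns)                                       # phenotype present?
print(first_order_mismatch(sample_ids, ["s1", "s2", "s3"]))   # same order
print(first_order_mismatch(sample_ids, ["s1", "s3", "s2"]))   # order differs
```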

LD reference

Two methods, PRS-CS and PRSice2, can use an external linkage disequilibrium (LD) reference panel to improve the LD estimation.

  • LDDATA : path to the LD reference panel. Note that PRS-CS and PRSice2 expect a different format.

PRS-CS

PRS-CS requires an external LD panel. The developer recommends using one of the panels supplied on the PRS-CS GitHub page. These panels were constructed using either 1000 Genomes Project phase 3 samples or UK Biobank data. For PRS-CS, please supply the path to the folder containing the extracted map and .hdf5 files, e.g. /data/ldblk_1kg_eur.

PRSice2

Reference data is optional and must be in either .bed or BGEN format. If no reference data is provided, PRSice2 will use the target genotypes for LD estimation. For PRSice2, please supply the path and prefix of the reference files, e.g. /data/ld_ref/1000Gp3v5.20130502.EUR.chr.

Stats file

This file is optional and can be used to perform quality control. If the QC parameter is active, base file variants that do not meet the imputation score and minor allele frequency thresholds are removed. Such a file can, for example, be generated using SNPTEST. The file must be whitespace-delimited and .gz compressed, with a single header line.

  • STATS_FILE : path to the stats file
  • STATS_ID_COL : name of the stats file column containing the SNP IDs. These IDs must match the IDs that occur in the base file.
  • STATS_MAF_COL : name of the stats file column containing the minor allele frequency.
  • STATS_INFO_COL : name of the stats file column containing the imputation score.
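The kind of filtering described above can be sketched as follows. This is an illustration only, not the pipeline's own implementation: the function name, the example column names, the threshold values (0.01 and 0.3) and the use of "greater than or equal to" comparisons are all assumptions. It collects the IDs of variants that pass both thresholds, which could then be used to subset the base file.

```python
import gzip


def passing_variant_ids(lines, id_col, maf_col, info_col, min_maf, min_info):
    """Return the IDs of variants whose MAF and imputation score meet both thresholds."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)  # the single header line
    i_id, i_maf, i_info = (header.index(name) for name in (id_col, maf_col, info_col))
    keep = set()
    for fields in rows:
        if float(fields[i_maf]) >= min_maf and float(fields[i_info]) >= min_info:
            keep.add(fields[i_id])
    return keep


# Hypothetical stats file content; for a real file, pass gzip.open(path, "rt") as `lines`.
example_stats = [
    "rsid maf info",
    "rs1 0.10 0.95",
    "rs2 0.001 0.99",
    "rs3 0.20 0.20",
]
print(passing_variant_ids(example_stats, "rsid", "maf", "info", min_maf=0.01, min_info=0.3))
```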