eSCAN: Scan Regulatory Regions for Aggregate Association Testing using Whole Genome Sequencing Data
eSCAN (or “Scan the Enhancers”, with “enhancers” as a shorthand for any potential regulatory regions in the genome) is an R package for performing genome-wide assessment of rare variants residing in enhancer regions that are significantly associated with a phenotype, combining the advantages of dynamic window selection with the advantages of increasing genomic annotation information, including chromatin accessibility, histone markers, and 3D chromatin conformation. eSCAN defines test windows across the genome by genomic annotation either specified by users or provided by eSCAN package such that each window marks putative regulatory region(s). eSCAN then searches the defined windows using fastSKAT and detects significant regions by an empirical/analytic threshold which adjusts for multiple testing of all the searching windows across the genome, of different sizes, including some overlapping windows.
install.packages("devtools")
devtools::install_github("yingxi-kaylee/eSCAN")
Please see the eSCAN user manual for detailed usage of eSCAN package. See the following section for a working example.
In this section, we will use a toy example to show how to use eSCAN package. Genotype data in this example is from the SCANG package, simulated using COSI, leveraging its calibrated model to closely resemble the LD patterns in African Americans.
library(eSCAN)
data(geno) # genotype data frame
data(pheno) # phenotype and covariates data frame
data(enhancer) # enhancer data frame, optional
The pheno
data contains one column for phenotype and two for
covariates which were generated from the standard normal distribution
and Bernoulli distribution (p=0.5), respectively. The phenotype values
were then generated using the two covariates and error term, along with
causal variants randomly selected from two signal regions (the signal
regions were also randomly selected with length 3 kb). The enhancer data
was simulated as follows, generated by eGenerator()
function in eSCAN
package using parameter maxgap=300
.
head(enhancer)
#> Index Start_pos End_pos Start_index End_index Length
#> 1 1 54 2769 1 48 48
#> 3 2 4510 8272 63 128 66
#> 4 3 8655 13837 129 212 84
#> 5 4 14152 19509 213 302 90
#> 6 5 19943 22963 303 359 57
#> 7 6 23452 27073 360 425 66
The first column is index of the enhancers, the next two columns are start and end positions of the enhancers. The next two are start and end indices of the enhancers (indices in the genotype matrix sorted by genomic positions). The last column is the number of rare variants residing in the enhancers.
First, fit a generalized linear model under the null hypothesis. The
parameter x
is a data frame containing phenotype and covariates, each
row indicates an individual and each column indicates a
phenotype/covariate; the parameter outcome
is used to specify the name
of phenotype column in the data frame x
, and the parameter covars
is
to specify the names of covariates. Then the parameter fam
is either
“gaussian” for a continuous phenotype or “binomial” for a dichotomous
phenotype.
nullmod <- fitNull(x = pheno, outcome = "phenotype", covars=c("X1", "X2"), fam="gaussian")
Then use the null model to run the main function eSCAN()
. The
parameter times
indicates the number of Monte Carlo simulations
(default=1000
). For this toy example we set 10 so that you should be
able to get results within one minute.
res_eSCAN <- eSCAN(genotype = geno, nullmod = nullmod, new_enloc = enhancer, times=10,
alpha=0.05, analy=FALSE)
The function returns a list containing 3 elements: res
, res0
and
thres
. res
is a matrix of results for significant regions detected
by eSCAN. The first column is the p-value of the detected region(s), and
the next two columns are start and end position of the detected
region(s).
res_eSCAN$res
#> pvalue Start_pos End_pos
#> 25 0.0012677567 229802 234983
#> 26 0.0000276419 235306 258955
#> 28 0.0028600775 264096 274969
#> 30 0.0007865199 281338 284343
res0
is a matrix to summarize all the regions included in the
analysis.
res_eSCAN$res0[1:10,]
#> pvalue Start_pos End_pos
#> 1 0.09543621 54 2769
#> 3 0.10793791 4510 8272
#> 4 0.35983971 8655 13837
#> 5 0.85590415 14152 19509
#> 6 0.73614514 19943 22963
#> 7 0.68473027 23452 27073
#> 8 0.23106533 27487 40347
#> 9 0.09198459 40675 67183
#> 11 0.76369065 69689 76915
#> 12 0.21920216 77228 92139
thres
is the threshold of eSCAN to control the family-wise/genome-wide
error rate at alpha
level.
res_eSCAN$thres
#> 5%
#> 0.004935235
The current version is 0.1.1 (June 21, 2020).
This software is licensed under GPLv3.
Yingxi Yang, Yuchen Yang, Le Huang, Jai G. Broome, Adolfo Correa, Alexander Reiner, Laura M. Raffield, and Yun Li (2021). “eSCAN: Scan Regulatory Regions for Aggregate Association Testing Using Whole Genome Sequencing Data.” bioRxiv, 2020.11.30.405266.
Zilin Li, Xihao Li, Yaowu Liu, Jincheng Shen, Han Chen, Hufeng Zhou, Alanna C. Morrison, Eric Boerwinkle, and Xihong Lin (2019) “Dynamic Scan Procedure for Detecting Rare-Variant Association Regions in Whole-Genome Sequencing Studies”. The American Journal of Human Genetics, 104(5), 802-814.
Gogarten SM, Sofer T, Chen H, Yu C, Brody JA, Thornton TA, Rice KM, Conomos MP (2019). “Genetic association testing using the GENESIS R/Bioconductor package.” Bioinformatics. doi:10.1093/bioinformatics/btz567.