Kevin Ryan 2022-09-07 17:48:16
- Introduction
- Preparation
- Read in data
- Exploratory data analysis
- PCA + clinical correlations all studies
- PCA + clinical correlations without In-House data
- PCA
- clinical correlations without In-House data or study EGAD00001006144
- Batch correction with Combat-Seq and limma
- Surrogate variable analysis
- Differential expression analysis
- Chemoresistance gene signature
- Deconvolution using CIBERSORTx
- References
Cancer-associated fibroblasts (CAFs) are a heterogeneous cell type found in the tumour microenvironment. They have a wide array of functions, and tend to be immunosuppressive and cancer-promoting. There have been many attempts to characterise subpopulations of CAFs, with much transcriptomic analysis being carried out in the Mechta-Grigoriou lab in Institut Curie. They have identified 4 ‘subpopulations’ which can be separated based on the expression of different markers:
- S1: FAPHigh, CD29Med-High, αSMAHigh, PDPNHigh, PDGFRβHigh
- S2: FAPNeg, CD29Low, αSMANeg-Low, PDPNLow, PDGFRβLow
- S3: FAPNeg-Low, CD29Med, αSMANeg-Low, PDPNLow, PDGFRβLow-Med
- S4: FAPLow-Med, CD29High, αSMAHigh, PDPNLow, PDGFRβMed
(Pelon et al. 2020)
FACS gating strategies can be used to isolate these subpopulations. The Mechta-Grigoriou group have done this and have generated bulk RNA-sequencing data for the S1, S3 and S4 subpopulations. They generated scRNA-sequencing data for the S1 subpopulation. This data was deposited on the European Genome Phenome Archive, and was accessed via a Data Transfer Agreement.
The following summarises the data obtained:
Subpopulation | Total samples | Studies (Samples) | Notes |
---|---|---|---|
S1 | 28 |
|
|
S2 | 0 | N/A | N/A |
S3 | 14 |
|
|
S4 | 15 |
|
|
With the juxta-tumour data, tumour and juxta-tumour samples came from the same patient. However, the metadata gives no indication of these pairings. We could possibly use Optitype (Szolek et al. 2014) to determine HLA allele and match the tumour and juxta-tumour samples.
We also have scRNA-seq data for S1, labelled with 8 subpopulations of S1 CAFs. It may be possible to use CIBERSORT (Newman et al. 2015) and BayesPrism (Chu et al. 2022) to deconvolve the bulk S1 RNA-sequencing data to further confirm the presence of these subpopulations.
It is likely that sorting the cells using FACS alters the
transcriptional properties of the cells compared to if they are
separated using spreading approaches, as is seen in study
EGAD00001006144
and described in (Kieffer et al. 2020). This is
something that we will have to keep in mind.
The data was processed using nf-core/rnaseq version 3.8.1
using the
default parameters. STAR/Salmon were used for alignment/quantification.
We would expect our tumour-associated normal to be most like the S3 subpopulation (usually accumulate in juxta-tumours). The S2 subpopulation has been found to accumulate more in luminal A breast cancer, whereas the S4 subpopulation tends to be present in Her2+ breast cancers. Unfortunately, data is not available for the S2 subpopulation and 11 of the 12 cancers encountered in our samples are Luminal A.
Combining RNA-sequencing datasets from different studies can be very
challenging. We can expect batch effects to be present, so it might not
be possible to determine whether differences we observe are due to
actual biological effects or technical artifacts. In addition, a recent
study suggests that DESeq2 and edgeR (the most popular differential
expression tools) experience large rates of false positives when used
with large sample sizes (Li et al. 2022). However, this assertion has
been refuted, and it has been suggested that the Li 2022 study did not
apply appropriate batch correction and quality control (Twitter
thread
from Mike Love and associated code on
GitHub).
One of the datasets (EGAD00001006144
) was produced using stranded
RNA-seq, whereas the other datasets were unstranded. This can lead to a
lack of comparability of the datasets (Zhao, Ye, and Stanton 2020). It
may be necessary to drop this dataset from the analysis. All samples
were prepared by poly(A) selection (use of oligo-dT).
Columns will be: Sample, Study, Subpopulation, Tumor_Juxtatumor
Here we will be combining data from 5 studies. To begin with, we will only include the metadata available for all studies (except for our unknown CAF Subpopulation label). Breast cancer subtype is only available for certain studies and so is not included at this stage.
There are also: ovarian cancer samples, EPCAM+ cells (an epithelial marker) and samples from lymph nodes. For the time being, I will not consider them.
Samples were processed with nf-core/rnaseq version 3.8.1
.
Salmon was used in alignment mode so there is no salmon index and no
checksum to import the metadata. Therefore, the parameters recommended
in the tximeta
vignette
were used to summarise transcript counts to the gene level, using a
tx2gene file constructed using generate_tx2gene_table.R
.
# salmon was used in alignment mode so there is no salmon index, therefore there is no checksum to import the metadata
# txOut = FALSE means to summarise to gene level (i.e. don't give out transcripts, give out gene level)
se <- tximeta(coldata, skipMeta=TRUE, txOut=FALSE, tx2gene=tx2gene)
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
## removing duplicated transcript rows from tx2gene
## transcripts missing from tx2gene: 18466
## summarizing abundance
## summarizing counts
## summarizing length
A DESeqDataSet was created, and only genes with a count of 10 or more in 7 samples were kept for further analysis. This was done for the combined dataset both with and without the in-house data. A variance stabilising transformation (vst) was also carried out to allow the data to be used in downstream processes requiring homoskedastic data, e.g. PCA. vst was carried out instead of regularised logarithm (rlog) as rlog takes a long time to run when there is a large sample size.
dds <- DESeqDataSet(se, design = ~1)
## using counts and average transcript lengths from tximeta
# returns a vector of whether the total count of each gene is >= 10 (True or false)
keep <- rowSums(counts(dds)) >= 10
# only keep rows (genes) for which keep is TRUE
dds <- dds[keep,]
# at least X samples with a count of 10 or more, where X can be chosen as the sample size of the smallest group of samples
X <- 7
keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep,]
vsd <- vst(dds, blind = TRUE)
## using 'avgTxLength' from assays(dds), correcting for library size
# do no design for the time being, Subpopulation + Batch gives error - model matrix not full rank
# this function stores input values, intermediate calculations and results of DE analysis - makes counts non-negative integers
dds_no_inhouse <- DESeqDataSet(se_no_inhouse, design = ~1)
## using counts and average transcript lengths from tximeta
# returns a vector of whether the total count of each gene is >= 10 (True or false)
keep <- rowSums(counts(dds_no_inhouse)) >= 10
# only keep rows (genes) for which keep is TRUE
dds_no_inhouse <- dds_no_inhouse[keep,]
# at least X samples with a count of 10 or more, where X can be chosen as the sample size of the smallest group of samples
X <- 7
keep <- rowSums(counts(dds_no_inhouse) >= 10) >= X
dds_no_inhouse <- dds_no_inhouse[keep,]
For consistency between the DESeq2 plotPCA
function (which by default
takes the 500 most variable genes) and the PCATools pca
function, all
genes were used when carrying out PCA.
plotPCA(vsd, intgroup = c("Subpopulation"), ntop = nrow(vsd))
plotPCA(vsd, intgroup = c("Study"), ntop = nrow(vsd))
peigencor
ppairs
plotPCA(vsd_no_inhouse, intgroup = c("Subpopulation"), ntop = nrow(vsd_no_inhouse))
plotPCA(vsd_no_inhouse, intgroup = c("Study"), ntop = nrow(vsd_no_inhouse))
metadata_pca_no_inhouse <- colData(vsd_no_inhouse)[,1:5]
metadata_pca_no_inhouse <- as.data.frame(metadata_pca_no_inhouse[!names(metadata_pca_no_inhouse) %in% c("names")])
p_no_inhouse <- pca(vsd_mat_no_inhouse, metadata = metadata_pca_no_inhouse)
peigencor_no_inhouse <- eigencorplot(p_no_inhouse,
components = getComponents(p_no_inhouse, 1:10),
metavars = colnames(metadata_pca_no_inhouse),
col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
cexCorval = 0.7,
colCorval = 'black',
fontCorval = 2,
posLab = 'bottomleft',
rotLabX = 45,
posColKey = 'top',
cexLabColKey = 1.5,
scale = TRUE,
corFUN = 'pearson',
#corUSE = 'pairwise.complete.obs',
corUSE = 'pairwise.complete.obs',
corMultipleTestCorrection = 'none',
main = 'PCs clinical correlations without In-House',
colFrame = 'white',
plotRsquared = TRUE)
## Warning in eigencorplot(p_no_inhouse, components = getComponents(p_no_inhouse, :
## Study is not numeric - please check the source data as non-numeric variables
## will be coerced to numeric
## Warning in eigencorplot(p_no_inhouse, components = getComponents(p_no_inhouse, :
## Subpopulation is not numeric - please check the source data as non-numeric
## variables will be coerced to numeric
## Warning in eigencorplot(p_no_inhouse, components = getComponents(p_no_inhouse, :
## Tumor_JuxtaTumor is not numeric - please check the source data as non-numeric
## variables will be coerced to numeric
## Warning in eigencorplot(p_no_inhouse, components = getComponents(p_no_inhouse, :
## Strandedness is not numeric - please check the source data as non-numeric
## variables will be coerced to numeric
peigencor_no_inhouse
Here we can see strandedness (i.e. the study EGAD00001006144
) is
significantly correlated with PCs 1-3. These PCs explain 28%, 8% and 4%
of the variance of the entire dataset respectively).
ppairs
Given the plot above, we can see that most of the variability of PC1 can
be attributed to Study, indicating batch effects. Methods exist to
“remove” batch effects, but they can not be run on our data. All of our
S3 samples come from study EGAD00001004810
and so when one tries to
make a model matrix including both Study and Subpopulation, the model
matrix is of less than full rank and there is an error. Let’s look at
the clinical correlations plot if we remove our outlier study
EGAD00001006144
as well as our in-house data.
plotPCA(vsd_no_inhouse_no_6144, intgroup = c("Subpopulation"), ntop = nrow(vsd_no_inhouse_no_6144))
plotPCA(vsd_no_inhouse_no_6144, intgroup = c("Study"), ntop = nrow(vsd_no_inhouse_no_6144))
peigencor_reduced
ppairs_no_6144
ppairs_no_6144_colour_subpop
Combat-Seq models the distribution for each gene using a negative binomial regression model. This model contains mean and dispersion parameter. The model allows us to preserve changes in counts due to biological condition after adjustment. However, in our case, it is not possible to guarantee the preservation of changes due to biological condition (CAF subpopulation) while removing batch effects. Here we carry out batch correction using Combat-Seq, and will look at the data after adjustment to see if biological condition is preserved after adjustment. We should note here that the S3 subpopulation does not separate well from the other subpopulations before batch correction, even though the S3 data comes from a different “study”.
counts_matrix <- assay(dds)
batch_all_samples <- metadata$Study
adjusted_all_samples <- ComBat_seq(counts = counts_matrix, batch = batch_all_samples)
## Found 5 batches
## Using null model in ComBat-seq.
## Adjusting for 0 covariate(s) or covariate level(s)
## Estimating dispersions
## Fitting the GLM model
## Shrinkage off - using GLM estimates for parameters
## Adjusting the data
dds_batch_corrected <- DESeqDataSetFromMatrix(adjusted_all_samples, colData = metadata, design = ~1)
## converting counts to integer mode
# not sure about the blind = FALSE here
vsd_batch_corrected <- vst(dds_batch_corrected, blind = FALSE)
plotPCA(vsd_batch_corrected, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Study"), ntop = nrow(vsd_batch_corrected))
peigencor_batch_corrected
###
Combat-Seq with Tumour_Juxtatumor as covariate
adjusted_all_samples <- ComBat_seq(counts = counts_matrix, batch = batch_all_samples, group = metadata$Tumor_JuxtaTumor)
## Found 5 batches
## Using full model in ComBat-seq.
## Adjusting for 1 covariate(s) or covariate level(s)
## Estimating dispersions
## Fitting the GLM model
## Shrinkage off - using GLM estimates for parameters
## Adjusting the data
dds_batch_corrected <- DESeqDataSetFromMatrix(adjusted_all_samples, colData = metadata, design = ~1)
## converting counts to integer mode
# not sure about the blind = FALSE here
vsd_batch_corrected <- vst(dds_batch_corrected, blind = FALSE)
plotPCA(vsd_batch_corrected, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Study"), ntop = nrow(vsd_batch_corrected))
plotPCA(vsd_batch_corrected, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_batch_corrected))
peigencor_batch_corrected
Limma’s removeBatchEffect function requires a design matrix as input,
this is the “treatment condition” we wish to preserve. It is usually the
design matrix with all experimental factors other than batch effects.
The ideal scenario would be to include Subpopulation in this matrix.
However, this treats Unknown
as its own Subpopulation, and so will
preserve differences between the InHouse samples and the other samples.
This is contrary to what we want when assigning our samples to a
cluster.
vsd_not_blind <- vst(dds, blind = FALSE)
## using 'avgTxLength' from assays(dds), correcting for library size
mat <- assay(vsd_not_blind)
# create model matrix, full model matrix with Tumor_JuxtaTumor and Subpopulation
mm <- model.matrix(~Tumor_JuxtaTumor+Subpopulation, colData(vsd_not_blind))
mat <- limma::removeBatchEffect(mat, batch = vsd_not_blind$Study, design = mm)
## Coefficients not estimable: batch2 batch4
## Warning: Partial NA coefficients for 17763 probe(s)
vsd_batch_corrected_limma_all_covariates <- vsd_not_blind
assay(vsd_batch_corrected_limma_all_covariates) <- mat
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected_limma_all_covariates))
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Study"),ntop = nrow(vsd_batch_corrected_limma_all_covariates) )
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_batch_corrected_limma_all_covariates))
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Subpopulation"), ntop = 500)
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Study"),ntop = 500 )
plotPCA(vsd_batch_corrected_limma_all_covariates, intgroup = c("Tumor_JuxtaTumor"), ntop =500)
We can see that this removes batch effects and preserves differences between the subpopulations. However, it treats our Unknown samples as their own subpopulation.This is a case where the number of highest variable genes chosen makes a big difference.
peigencor_batch_corrected
# create model matrix, full model matrix with Tumor_JuxtaTumor and Subpopulation
mm <- model.matrix(~1, colData(vsd_not_blind))
mat <- limma::removeBatchEffect(mat, batch = vsd_not_blind$Study, design = mm)
vsd_batch_corrected_limma_no_covariates <- vsd_not_blind
assay(vsd_batch_corrected_limma_no_covariates) <- mat
plotPCA(vsd_batch_corrected_limma_no_covariates, intgroup = c("Subpopulation"), ntop = nrow(vsd_batch_corrected_limma_no_covariates))
plotPCA(vsd_batch_corrected_limma_no_covariates, intgroup = c("Study"),ntop = nrow(vsd_batch_corrected_limma_no_covariates) )
plotPCA(vsd_batch_corrected_limma_no_covariates, intgroup = c("Tumor_JuxtaTumor"), ntop = nrow(vsd_batch_corrected_limma_no_covariates))
peigencor_batch_corrected
dds_no_inhouse_no_6144 <- DESeq(dds_no_inhouse_no_6144)
## Warning in DESeq(dds_no_inhouse_no_6144): the design is ~ 1 (just an intercept).
## is this intended?
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 4760 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
dat <- counts(dds_no_inhouse_no_6144, normalized = TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx, ]
# we have Subpopulation and tumour-juxtatomour in the model matrix
mod <- model.matrix(~ Subpopulation + Tumor_JuxtaTumor, colData(dds_no_inhouse_no_6144))
mod0 <- model.matrix(~ 1, colData(dds_no_inhouse_no_6144))
svseq_no6144 <- svaseq(dat, mod, mod0)
## Number of significant surrogate variables is: 18
## Iteration (out of 5 ):1 2 3 4 5
sv_names <- paste("SV", seq(1,svseq_no6144$n.sv), sep = "")
for (i in 1:length(sv_names)){
colData(dds_no_inhouse_no_6144)[,sv_names[i]] <- svseq_no6144$sv[,i]
}
colData(dds_no_inhouse_no_6144)[,"Subpopulation"] <- as.factor(colData(dds_no_inhouse_no_6144)$Subpopulation)
colData(dds_no_inhouse_no_6144)[,"Tumor_JuxtaTumor"] <- as.factor(colData(dds_no_inhouse_no_6144)$Tumor_JuxtaTumor)
# add surrogate variables to design DDS object
design(dds_no_inhouse_no_6144) <- ~ Tumor_JuxtaTumor + SV1 + SV2 + SV3 + SV4 + SV5 + SV6 + SV7 + SV8 + SV9 + SV10 + SV11 + SV12 + SV13 + SV14 + SV15 + SV16 + SV17 + SV18 + Subpopulation
#dds_no_inhouse_no_6144_deseq <- DESeq(ddssva_copy)
#write_rds(dds_no_inhouse_no_6144_deseq, file = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/dds_noinhouse_no_6144_deseq_hgnc_hasntbeenrun.Rds")
dds_no_inhouse_no_6144_deseq <- readRDS("dds_noinhouse_no6144_hgnc_25082022.Rds")
The aim here is to extract lists of genes that are upregulated in each subpopulation compared to the other 2 subpopulations. To do this, steps from the following tutorial were followed.
# define model matrix
mod_mat <- model.matrix(design(dds_no_inhouse_no_6144_deseq), colData(dds_no_inhouse_no_6144))
# e.g. for each sample that is S1, get the mean of the coefficients all the components of the formula
S1 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation == "S1",])
S3 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation == "S3",])
S4 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation == "S4",])
not_S1 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation %in% c("S3", "S4"),])
not_S3 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation %in% c("S1", "S4"),])
not_S4 <- colMeans(mod_mat[dds_no_inhouse_no_6144_deseq$Subpopulation %in% c("S1", "S3"),])
res_not_S1 <- results(dds_no_inhouse_no_6144_deseq, contrast = S1 - not_S1, filterFun = ihw, alpha = 0.05, lfcThreshold = 2, altHypothesis = "greater")
res_not_S3 <- results(dds_no_inhouse_no_6144_deseq, contrast = S3 - not_S3, filterFun = ihw, alpha = 0.05, lfcThreshold = 2, altHypothesis = "greater")
res_not_S4 <- results(dds_no_inhouse_no_6144_deseq, contrast = S4 - not_S4, filterFun = ihw, alpha = 0.05, lfcThreshold = 2, altHypothesis = "greater")
res_not_S1_shrink <- lfcShrink(dds_no_inhouse_no_6144_deseq, contrast = S1 - not_S1, res = res_not_S1, type = "ashr")
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
res_not_S3_shrink <- lfcShrink(dds_no_inhouse_no_6144_deseq, contrast = S3 - not_S3, res = res_not_S3, type = "ashr")
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
res_not_S4_shrink <- lfcShrink(dds_no_inhouse_no_6144_deseq, contrast = S4 - not_S4, res = res_not_S4, type = "ashr")
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
res_not_S1_shrink_extract <- get_upregulated(res_not_S1_shrink)
res_not_S3_shrink_extract <- get_upregulated(res_not_S3_shrink)
res_not_S4_shrink_extract <- get_upregulated(res_not_S4_shrink)
dfs_to_filter <- list(res_not_S1_shrink_extract, res_not_S3_shrink_extract, res_not_S4_shrink_extract)
dfs_filtered <- filter_dfs_antijoin_rownames(dfs_to_filter)
## NULL
## NULL
## NULL
names(dfs_filtered) <- c("S1", "S3", "S4")
s1_filtered <- dfs_filtered[[1]]
s1_filtered_annotations <- annotate_de_genes(s1_filtered, filter_by = "hgnc_symbol")
## Warning: Ensembl will soon enforce the use of https.
## Ensure the 'host' argument includes "https://"
s3_filtered <- dfs_filtered[[2]]
s3_filtered_annotations <- annotate_de_genes(s3_filtered, filter_by = "hgnc_symbol")
## Warning: Ensembl will soon enforce the use of https.
## Ensure the 'host' argument includes "https://"
s4_filtered <- dfs_filtered[[3]]
s4_filtered_annotations <- annotate_de_genes(s4_filtered, filter_by = "hgnc_symbol")
## Warning: Ensembl will soon enforce the use of https.
## Ensure the 'host' argument includes "https://"
s1_gs <- s1_filtered_annotations$Gene
s3_gs <- s3_filtered_annotations$Gene
s4_gs <- s4_filtered_annotations$Gene
genesets <- list(s1_gs, s3_gs, s4_gs)
names(genesets) <- c("S1", "S3", "S4")
idx <- which(dds$Study == "InHouse")
dds_inhouse <- dds[,idx]
inhouse_metadata <- read.csv("/home/kevin/Documents/PhD/rna_seq_bc/metadata/reformat_samples_extra_info.csv")
caf_es <- gsva(expr = dds_inhouse,
gset.idx.list = genesets,
method = "gsva",
kcdf = "Poisson",
mx.diff=FALSE)
## Warning in .filterFeatures(expr, method): 247 genes with constant expression
## values throuhgout the samples.
## Warning in .filterFeatures(expr, method): Since argument method!="ssgsea", genes
## with constant expression values are discarded.
## Estimating GSVA scores for 3 gene sets.
## Estimating ECDFs with Poisson kernels
## | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
caf_es$Patient <- inhouse_metadata$Patient
caf_es$Subtype <- inhouse_metadata$Subtype
caf_es$Grade <- inhouse_metadata$Grade
caf_es$Histology <- inhouse_metadata$Histology
subpopulationOrder <- c("tumor", "juxtatumor")
sampleOrderBySubpopulation <- sort(match(caf_es$Tumor_JuxtaTumor, subpopulationOrder),
index.return=TRUE)$ix
subpopulationXtable <- table(caf_es$Tumor_JuxtaTumor)
subpopulationColorLegend <- c(tumor="red", juxtatumor="green")
geneSetOrder <- c("S4", "S3", "S1")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]
#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_subpopulation_caf_tan.png")
heatmap(assay(caf_es)[geneSetOrder, sampleOrderBySubpopulation], Rowv=NA,
Colv=NA, scale="row", margins=c(3,5), col=hmcol,
ColSideColors=rep(subpopulationColorLegend[subpopulationOrder],
times=subpopulationXtable[subpopulationOrder]),
labCol="",
caf_es$Tumor_JuxtaTumor[sampleOrderBySubpopulation],
labRow=paste(toupper(substring(geneSetLabels, 1,1)),
substring(geneSetLabels, 2), sep=""),
cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.285,1.1, "CAF", col="red", cex=1.2)
text(0.55,1.1, "TAN", col="green", cex=1.2)
#text(0.17,1.13, "CAF", col="red", cex=1.2)
#text(0.65,1.13, "TAN", col="green", cex=1.2)
mtext("CAF subpopulation signature", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
subtypeOrder <- c("LuminalA", "TNBC")
sampleOrderBySubtype <- sort(match(caf_es$Subtype, subtypeOrder),
index.return=TRUE)$ix
subtypeXtable <- table(caf_es$Subtype)
subtypeColorLegend <- c(LuminalA="red", TNBC="green")
geneSetOrder <- c("S4", "S3", "S1")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]
#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_cancer_subtype.png")
heatmap(assay(caf_es)[geneSetOrder, sampleOrderBySubtype], Rowv=NA,
Colv=NA, scale="row", margins=c(3,5), col=hmcol,
ColSideColors=rep(subtypeColorLegend[subtypeOrder],
times=subtypeXtable[subtypeOrder]),
labCol="",
#labCol = caf_es$Patient[sampleOrderBySubtype],
caf_es$Subtype[sampleOrderBySubtype],
labRow=paste(toupper(substring(geneSetLabels, 1,1)),
substring(geneSetLabels, 2), sep=""),
cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.4,1.1, "LuminalA", col="red", cex=1.2)
text(0.65,1.1, "TNBC", col="green", cex=1.2)
#text(0.35,1.13, "LuminalA", col="red", cex=1.2)
#text(0.85,1.13, "TNBC", col="green", cex=1.2)
mtext("CAF subpopulation signature", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
gradeOrder <- c("Grade_2", "Grade_3")
sampleOrderByGrade <- sort(match(caf_es$Grade, gradeOrder),
index.return=TRUE)$ix
gradeXtable <- table(caf_es$Grade)
gradeColorLegend <- c(Grade_2="red", Grade_3="green")
geneSetOrder <- c("S4", "S3", "S1")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]
#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_cancer_grade.png")
heatmap(assay(caf_es)[geneSetOrder, sampleOrderByGrade], Rowv=NA,
Colv=NA, scale="row", margins=c(3,5), col=hmcol,
ColSideColors=rep(gradeColorLegend[gradeOrder],
times=gradeXtable[gradeOrder]),
labCol="",
#labCol = caf_es$Patient[sampleOrderByGrade],
caf_es$Subtype[sampleOrderBySubtype],
labRow=paste(toupper(substring(geneSetLabels, 1,1)),
substring(geneSetLabels, 2), sep=""),
cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.34,1.1, "Grade 2", col="red", cex=1.2)
text(0.6,1.1, "Grade 3", col="green", cex=1.2)
#text(0.25,1.13, "Grade 2", col="red", cex=1.2)
#text(0.74,1.13, "Grade 3", col="green", cex=1.2)
mtext("CAF subpopulation signature", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
caf_idx <- which(colData(caf_es)$Tumor_JuxtaTumor == "tumor")
tan_idx <- which(colData(caf_es)$Tumor_JuxtaTumor == "juxtatumor")
df_plot_S1 <- cbind.data.frame(S1_ES = assay(caf_es)[1,], Tumor_JuxtaTumor = colData(caf_es)$Tumor_JuxtaTumor)
gsva_plot_s1 <- ggplot(df_plot_S1, aes(x = Tumor_JuxtaTumor, y = S1_ES)) +
geom_point(size = 2, # reduce point size to minimize overplotting
position = position_jitter(
width = 0.1, # amount of jitter in horizontal direction
height = 0 # amount of jitter in vertical direction (0 = none)
)
) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.title.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
ylab("S1 GSVA ES") #, axis.title.x = element_blank())
df_plot_S3 <- cbind.data.frame(S3_ES = assay(caf_es)[2,], Tumor_JuxtaTumor = colData(caf_es)$Tumor_JuxtaTumor)
gsva_plot_s3 <- ggplot(df_plot_S3, aes(x = Tumor_JuxtaTumor, y = S3_ES)) +
geom_point(size = 2, # reduce point size to minimize overplotting
position = position_jitter(
width = 0.1, # amount of jitter in horizontal direction
height = 0 # amount of jitter in vertical direction (0 = none)
)
) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.title.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
ylab("S3 GSVA ES") #, axis.title.x = element_blank())
df_plot_S4 <- cbind.data.frame(S4_ES = assay(caf_es)[3,], Tumor_JuxtaTumor = colData(caf_es)$Tumor_JuxtaTumor)
gsva_plot_s4 <- ggplot(df_plot_S4, aes(x = Tumor_JuxtaTumor, y = S4_ES)) +
geom_point(size = 2, # reduce point size to minimize overplotting
position = position_jitter(
width = 0.1, # amount of jitter in horizontal direction
height = 0 # amount of jitter in vertical direction (0 = none)
)
) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.title.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
ylab("S4 GSVA ES") #, axis.title.x = element_blank())
distribution_es_s1 <- df_plot_S1 %>%
ggplot(mapping = aes(S1_ES)) +
geom_density() +
facet_wrap(~Tumor_JuxtaTumor) +
theme_linedraw()
distribution_es_s3 <- df_plot_S3 %>%
ggplot(mapping = aes(S3_ES)) +
geom_density() +
facet_wrap(~Tumor_JuxtaTumor) +
theme_linedraw()
distribution_es_s4 <- df_plot_S4 %>%
ggplot(mapping = aes(S4_ES)) +
geom_density() +
facet_wrap(~Tumor_JuxtaTumor) +
theme_linedraw()
plot_grid(top_row, bottom_row, nrow = 2,
rel_heights = c(1.1, 0.9))
df_plot_S1$S1_ES %>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.88376, p-value = 0.009899
df_plot_S3$S3_ES %>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.82457, p-value = 0.000764
df_plot_S4$S4_ES %>%
shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.8989, p-value = 0.0204
p < 0.05, data is not normally distributed. T-test may be unsuitable for comparing distributions. Carry out Wilcoxon rank-sum test (Mann-Whitney U test).
wilcox.test(x = df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "tumor"],
y = df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "juxtatumor"],
paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "tumor"] and df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "juxtatumor"]
## V = 12, p-value = 0.03418
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(x = df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "tumor"],
y = df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "juxtatumor"],
paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "tumor"] and df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "juxtatumor"]
## V = 10, p-value = 0.021
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(x = df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "tumor"],
y = df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "juxtatumor"],
paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "tumor"] and df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "juxtatumor"]
## V = 55, p-value = 0.2334
## alternative hypothesis: true location shift is not equal to 0
median(df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "tumor"])
## [1] -0.1687633
median(df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "juxtatumor"])
## [1] 0.211911
median(df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "tumor"])
## [1] -0.04473953
median(df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "juxtatumor"])
## [1] 0.420649
median(df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "tumor"])
## [1] 0.2539725
median(df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "juxtatumor"])
## [1] -0.1836231
We can see from the above that there is a difference between the enrichment scores for the gene signatures for the S1 and S3 subpopulaions between CAF and TAN. We can suggest that the TAN (juxta-tumour) samples are enriched for our S1 and S3 gene signatures compared to the genes outside the gene signatures, and that this enrichment is greater in the TAN samples than in the CAF samples.
results_combined <- cbind.data.frame(colData(caf_es)$names, colData(caf_es)$Patient, df_plot_S1$S1_ES, df_plot_S3$S3_ES, df_plot_S4$S4_ES, df_plot_S1$Tumor_JuxtaTumor, colData(caf_es)$Subtype, colData(caf_es)$Grade, colData(caf_es)$Histology)
colnames(results_combined) <- c("Sample", "Patient", "S1_ES", "S3_ES", "S4_ES", "Tumor_JuxtaTumor", "Subtype", "Grade", "Histology")
results_combined
## Sample Patient S1_ES S3_ES S4_ES Tumor_JuxtaTumor Subtype
## 1 4033 1 -0.2111870 0.2282682 0.4511291 tumor LuminalA
## 2 4034 1 0.1667547 0.2602349 -0.2104131 juxtatumor LuminalA
## 3 4027 2 -0.3911261 -0.5744436 0.5483698 tumor TNBC
## 4 4028 2 -0.3006402 -0.4322374 -0.4128925 juxtatumor TNBC
## 5 4112 3 0.3408958 -0.3317912 0.2026432 tumor LuminalA
## 6 4113 3 0.3192418 0.5494366 -0.1568331 juxtatumor LuminalA
## 7 4116 4 -0.3791867 -0.3831630 -0.6103717 tumor LuminalA
## 8 4117 4 -0.2337296 -0.2289328 -0.3081449 juxtatumor LuminalA
## 9 4214 5 0.2261749 0.5747017 0.5845368 tumor LuminalA
## 10 4215 5 0.2202032 0.3547532 0.2426352 juxtatumor LuminalA
## 11 4315 6 0.3229567 0.5416024 0.5350491 tumor LuminalA
## 12 4316 6 0.3736564 0.6030234 0.3862160 juxtatumor LuminalA
## 13 4340 7 -0.2435243 -0.6432975 -0.4944882 tumor LuminalA
## 14 4341 7 0.2036188 0.2515679 -0.4312309 juxtatumor LuminalA
## 15 4344 8 0.1594359 0.4920567 -0.2643276 tumor LuminalA
## 16 4345 8 -0.2293861 0.4467967 -0.3578373 juxtatumor LuminalA
## 17 3532 9 -0.2008142 0.3100232 -0.3087708 tumor LuminalA
## 18 3533 9 0.2411840 0.3945014 0.1800844 juxtatumor LuminalA
## 19 3536 10 -0.1367123 0.5282220 0.3053018 tumor LuminalA
## 20 3537 10 0.2672011 0.5898932 -0.2313765 juxtatumor LuminalA
## 21 4299 11 0.1797552 -0.3177473 -0.2203716 tumor LuminalA
## 22 4300 11 0.4438534 0.5556149 0.2794237 juxtatumor LuminalA
## 23 4722 12 -0.4150383 -0.4818981 0.5708118 tumor LuminalA
## 24 4723 12 -0.1783413 0.5212709 0.4818895 juxtatumor LuminalA
## Grade Histology
## 1 Grade_2 Lobular
## 2 Grade_2 Lobular
## 3 Grade_3 Ductal
## 4 Grade_3 Ductal
## 5 Grade_3 Ductal
## 6 Grade_3 Ductal
## 7 Grade_2 Lobular
## 8 Grade_2 Lobular
## 9 Grade_2 Lobular
## 10 Grade_2 Lobular
## 11 Grade_2 Ductal
## 12 Grade_2 Ductal
## 13 Grade_2 Lobular
## 14 Grade_2 Lobular
## 15 Grade_2 Ductal
## 16 Grade_2 Ductal
## 17 Grade_2 Ductal
## 18 Grade_2 Ductal
## 19 Grade_3 Lobular
## 20 Grade_3 Lobular
## 21 Grade_3 Ductal
## 22 Grade_3 Ductal
## 23 Grade_2 Lobular
## 24 Grade_2 Lobular
(Su et al. 2018) carried out differential expression analysis between chemoresistant and chemosensitive CAFs from breast cancer. It is possible to try and identify the resulting signature in our samples. We will use the list of upregulated genes in chemoresistant samples as our marker of chemoresistance and the genes downregulated in the chemoresistant cells as our marker of chemosensitivity.
The HGNCHelper package can be used to fix outdated gene symbols.
assay(chemo_es)
## 4033 4034 4027 4028 4112 4113
## Resist. 0.2511948 0.1835208 -0.2275412 -0.2921821 -0.1549386 0.2922223
## Sensit. -0.2884295 -0.2617528 -0.3135180 -0.4803626 0.3979869 0.5291087
## 4116 4117 4214 4215 4315 4316
## Resist. -0.3305799 -0.3323113 0.1485889 -0.2558840 0.2260738 0.2115296
## Sensit. -0.6315582 -0.3175666 0.4718418 0.3953624 0.5808599 0.5089124
## 4340 4341 4344 4345 3532 3533
## Resist. -0.2499012 0.2918001 -0.3521381 -0.2333033 -0.3312907 0.3284542
## Sensit. -0.3752032 -0.2284608 -0.3360215 0.2964242 -0.3923255 0.3729971
## 3536 3537 4299 4300 4722 4723
## Resist. 0.2863740 0.2172434 0.4309208 0.2723313 -0.2989202 -0.2909116
## Sensit. -0.2635225 0.3420735 -0.2614191 0.4182203 -0.2995230 -0.2494366
#rownames(assay(chemo_es)) <- c("Resistance", "Sensitive")
subpopulationOrder <- c("tumor", "juxtatumor")
sampleOrderBySubpopulation <- sort(match(chemo_es$Tumor_JuxtaTumor, subpopulationOrder),
index.return=TRUE)$ix
subpopulationXtable <- table(chemo_es$Tumor_JuxtaTumor)
subpopulationColorLegend <- c(tumor="red", juxtatumor="green")
geneSetOrder <- c("Resist.", "Sensit.")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]
#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_chemoresistance_caf_tan.png")
heatmap(assay(chemo_es)[geneSetOrder, sampleOrderBySubpopulation], Rowv=NA,
Colv=NA, scale="row", margins=c(3,5), col=hmcol,
ColSideColors=rep(subpopulationColorLegend[subpopulationOrder],
times=subpopulationXtable[subpopulationOrder]),
labCol="",
chemo_es$Tumor_JuxtaTumor[sampleOrderBySubpopulation],
labRow=paste(toupper(substring(geneSetLabels, 1,1)),
substring(geneSetLabels, 2), sep=""),
cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.285,1.1, "CAF", col="red", cex=1.2)
text(0.55,1.1, "TAN", col="green", cex=1.2)
#text(0.16,1.13, "CAF", col="red", cex=1.2)
#text(0.65,1.13, "TAN", col="green", cex=1.2)
mtext("Gene sets", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
subtypeOrder <- c("LuminalA", "TNBC")
sampleOrderBySubtype <- sort(match(chemo_es$Subtype, subtypeOrder),
index.return=TRUE)$ix
subtypeXtable <- table(chemo_es$Subtype)
subtypeColorLegend <- c(LuminalA="red", TNBC="green")
geneSetOrder <- c("Resist.", "Sensit.")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]
#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_chemoresistance_cancer_subtype.png")
heatmap(assay(chemo_es)[geneSetOrder, sampleOrderBySubtype], Rowv=NA,
Colv=NA, scale="row", margins=c(3,5), col=hmcol,
ColSideColors=rep(subtypeColorLegend[subtypeOrder],
times=subtypeXtable[subtypeOrder]),
labCol="",
#labCol = caf_es$Patient[sampleOrderBySubtype],
chemo_es$Subtype[sampleOrderBySubtype],
labRow=paste(toupper(substring(geneSetLabels, 1,1)),
substring(geneSetLabels, 2), sep=""),
cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.4,1.1, "LuminalA", col="red", cex=1.2)
text(0.65,1.1, "TNBC", col="green", cex=1.2)
#text(0.35,1.13, "LuminalA", col="red", cex=1.2)
#text(0.85,1.13, "TNBC", col="green", cex=1.2)
mtext("Gene sets", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
gradeOrder <- c("Grade_2", "Grade_3")
sampleOrderByGrade <- sort(match(chemo_es$Grade, gradeOrder),
index.return=TRUE)$ix
gradeXtable <- table(chemo_es$Grade)
gradeColorLegend <- c(Grade_2="red", Grade_3="green")
geneSetOrder <- c("Resist.", "Sensit.")
geneSetLabels <- geneSetOrder
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]
#png(filename = "/home/kevin/Documents/PhD/subtypes/caf-subtype-analysis/caf_rnaseq_combined_analysis_files/figure-gfm/gsva_chemoresistance_cancer_grade.png")
heatmap(assay(chemo_es)[geneSetOrder, sampleOrderByGrade], Rowv=NA,
Colv=NA, scale="row", margins=c(3,5), col=hmcol,
ColSideColors=rep(gradeColorLegend[gradeOrder],
times=gradeXtable[gradeOrder]),
labCol="",
#labCol = caf_es$Patient[sampleOrderByGrade],
chemo_es$Subtype[sampleOrderBySubtype],
labRow=paste(toupper(substring(geneSetLabels, 1,1)),
substring(geneSetLabels, 2), sep=""),
cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.34,1.1, "Grade 2", col="red", cex=1.2)
text(0.6,1.1, "Grade 3", col="green", cex=1.2)
#text(0.25,1.13, "Grade 2", col="red", cex=1.2)
#text(0.75,1.13, "Grade 3", col="green", cex=1.2)
mtext("Gene sets", side=4, line=0, cex=1.5)
mtext("Samples", side=1, line=4, cex=1.5, at = 0.42)
#dev.off()
chemoresistance_ratio <- function(df){
out <- data.frame(Patient = df$Patient,)
}
as_tibble(t(assay(chemo_es))) %>% mutate(Patient = colData(chemo_es)$Patient, Chemoresistance_ratio = abs(Resist.)/abs(Sensit.)) %>% print(n = 24)
## # A tibble: 24 × 4
## Resist. Sensit. Patient Chemoresistance_ratio
## <dbl> <dbl> <int> <dbl>
## 1 0.251 -0.288 1 0.871
## 2 0.184 -0.262 1 0.701
## 3 -0.228 -0.314 2 0.726
## 4 -0.292 -0.480 2 0.608
## 5 -0.155 0.398 3 0.389
## 6 0.292 0.529 3 0.552
## 7 -0.331 -0.632 4 0.523
## 8 -0.332 -0.318 4 1.05
## 9 0.149 0.472 5 0.315
## 10 -0.256 0.395 5 0.647
## 11 0.226 0.581 6 0.389
## 12 0.212 0.509 6 0.416
## 13 -0.250 -0.375 7 0.666
## 14 0.292 -0.228 7 1.28
## 15 -0.352 -0.336 8 1.05
## 16 -0.233 0.296 8 0.787
## 17 -0.331 -0.392 9 0.844
## 18 0.328 0.373 9 0.881
## 19 0.286 -0.264 10 1.09
## 20 0.217 0.342 10 0.635
## 21 0.431 -0.261 11 1.65
## 22 0.272 0.418 11 0.651
## 23 -0.299 -0.300 12 0.998
## 24 -0.291 -0.249 12 1.17
wilcox.test(x = df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "tumor"],
y = df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "juxtatumor"],
paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "tumor"] and df_plot_S1$S1_ES[df_plot_S1$Tumor_JuxtaTumor == "juxtatumor"]
## V = 12, p-value = 0.03418
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(x = df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "tumor"],
y = df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "juxtatumor"],
paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "tumor"] and df_plot_S3$S3_ES[df_plot_S3$Tumor_JuxtaTumor == "juxtatumor"]
## V = 10, p-value = 0.021
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(x = df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "tumor"],
y = df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "juxtatumor"],
paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "tumor"] and df_plot_S4$S4_ES[df_plot_S4$Tumor_JuxtaTumor == "juxtatumor"]
## V = 55, p-value = 0.2334
## alternative hypothesis: true location shift is not equal to 0
CIBERSORTx (Newman et al. 2019) is the most commonly used tool for cell-type deconvolution. It is a machine learning method based on support vector regression, which carries out batch correction, correcting for between-platform differences in expression data. In this case, the signature matrix was made using all of the available CAF subpopulation data (40 S1 samples, 25 S3 samples, 24 S4 samples). It is possible to infer the proportions of the different subpopulations as well as a subpopulation-specific gene expression matrix for each sample. Either scRNA-seq, bulk RNA-seq or microarray data can be used as the reference.
- Create signature matrix for CIBERSORTx
- TPM normalise data,probably optional
- Run CIBERSORTx to figure out proportions of S1, S3 and S4 in In-house samples
The files for CIBERSORT (mixture file, reference data and phenotype
classes file), were prepared using the cibersortx_prepare_files.R
script.
CIBERSORTx
was run using the following command:
docker run -v /home/kevin/Documents/PhD/cibersort/caf_subpopulation/infiles:/src/data -v /home/kevin/Documents/PhD/cibersort/caf_subpopulation/outfiles:/src/outdir cibersortx/fractions --username [email protected] --token b7f03b943ade9b4146dc2126b4ac9d19 --single_cell FALSE --refsample caf_subtypes_tpm_for_sig_matrix.txt --mixture caf_tpm_mixture.txt --rmbatchBmode TRUE --outdir /home/kevin/Documents/PhD/cibersort/caf_subpopulation/outfiles --phenoclasses /home/kevin/Documents/PhD/cibersort/caf_subpopulation/infiles/phenoclasses_caf.txt
cibersort_results <- read.table("/home/kevin/Documents/PhD/cibersort/caf_subpopulation/outfiles/CIBERSORTx_Adjusted.txt", header = T)
cibersort_results
## Mixture S1 S3 S4 P.value Correlation RMSE
## 1 4033 0.7910443 0 0.2089557492 0 0.6082765 0.8157847
## 2 4034 1.0000000 0 0.0000000000 0 0.6377965 0.8262784
## 3 4027 0.5530809 0 0.4469191109 0 0.7390362 0.6734956
## 4 4028 0.9540262 0 0.0459738305 0 0.5771597 0.8772853
## 5 4112 0.9145585 0 0.0854415380 0 0.6851607 0.7521922
## 6 4113 1.0000000 0 0.0000000000 0 0.7377976 0.7054262
## 7 4116 0.9202598 0 0.0797402206 0 0.6086764 0.8364088
## 8 4117 0.9749568 0 0.0250432452 0 0.6490463 0.8064734
## 9 4214 0.8948236 0 0.1051763746 0 0.7248746 0.7025445
## 10 4215 0.9586793 0 0.0413207259 0 0.7114689 0.7295204
## 11 4315 0.8479862 0 0.1520138235 0 0.7125195 0.7118186
## 12 4316 0.9996778 0 0.0003222288 0 0.7190471 0.7295308
## 13 4340 0.8708912 0 0.1291087778 0 0.6756717 0.7559901
## 14 4341 0.9423294 0 0.0576705911 0 0.6866700 0.7557415
## 15 4344 0.9305382 0 0.0694618285 0 0.6622681 0.7811234
## 16 4345 0.9217481 0 0.0782519211 0 0.6139309 0.8312952
## 17 3532 0.8895852 0 0.1104147845 0 0.5699035 0.8684824
## 18 3533 0.9919602 0 0.0080398226 0 0.6016636 0.8632189
## 19 3536 0.7865679 0 0.2134320528 0 0.5379829 0.8807957
## 20 3537 0.9395404 0 0.0604595827 0 0.5060439 0.9421126
## 21 4299 0.8210417 0 0.1789582923 0 0.7106510 0.7115838
## 22 4300 0.9567319 0 0.0432681184 0 0.7036329 0.7386330
## 23 4722 0.3774401 0 0.6225598787 0 0.8107501 0.5852533
## 24 4723 0.8577810 0 0.1422189828 0 0.6788132 0.7508034
The online version of CIBERSORTx was also used, using 500 permutations.
The study EGAD00001006144
was excluded from the signature matrix due
to its probable outlier status.
cibersort_results_online <- read.csv("/home/kevin/Documents/PhD/cibersort/caf_subpopulation/outfiles/online/2022-09-01-CIBERSORTx_Job6_Results_online_no6144_hgnc.csv", header = T)
cibersort_results_online
## Mixture S1 S3 S4 P.value Correlation RMSE
## 1 4033 0.9200710 0 0.07992904 0 0.6351293 1.1460059
## 2 4034 0.9239457 0 0.07605427 0 0.6657794 1.1107332
## 3 4027 0.8184646 0 0.18153543 0 0.5447931 1.1453912
## 4 4028 0.9313242 0 0.06867582 0 0.5780322 1.2291753
## 5 4112 0.9161738 0 0.08382623 0 0.6779809 1.0858067
## 6 4113 0.9474235 0 0.05257651 0 0.7795798 0.9763537
## 7 4116 0.9168721 0 0.08312792 0 0.5980791 1.1885625
## 8 4117 0.9124145 0 0.08758548 0 0.6446746 1.1252229
## 9 4214 0.9070862 0 0.09291375 0 0.6558631 1.1048196
## 10 4215 0.9332171 0 0.06678290 0 0.6664009 1.1204241
## 11 4315 0.9384081 0 0.06159186 0 0.6343094 1.1678461
## 12 4316 0.9313687 0 0.06863127 0 0.7470213 1.0065390
## 13 4340 0.9208098 0 0.07919018 0 0.6219298 1.1634929
## 14 4341 0.9116361 0 0.08836394 0 0.6582423 1.1067898
## 15 4344 0.9120902 0 0.08790977 0 0.6589903 1.1063207
## 16 4345 0.8916855 0 0.10831455 0 0.6391071 1.1093961
## 17 3532 0.9221877 0 0.07781229 0 0.5934655 1.2001776
## 18 3533 0.9401834 0 0.05981664 0 0.6295261 1.1759367
## 19 3536 0.8795555 0 0.12044448 0 0.5611606 1.1913771
## 20 3537 0.8907880 0 0.10921199 0 0.5571366 1.2084328
## 21 4299 0.8913230 0 0.10867704 0 0.6886893 1.0438535
## 22 4300 0.9472909 0 0.05270908 0 0.6613853 1.1431199
## 23 4722 0.6837608 0 0.31623923 0 0.5446944 1.0181402
## 24 4723 0.9100635 0 0.08993653 0 0.5722794 1.2120169
cibersort_grid
Using the deconvolution approach with CIBERSORTx, there seems to be none of the S3 subpopulation present. It is possible that the S3 subpopulation does not have a very well defined transcriptional profile.
Results look strange with P-value of 9999. CIBERSORTx was run using the Docker image and using the GUI, and different results were obtained. When the Docker image was used and the number of permutations was changed from 0 to 100, the p-value changed to 0.000. The proportions look different between the two methods, with the GUI predicting about 0.75 S1 with the rest being S4 for most samples.
Chu, Tinyi, Zhong Wang, Dana Pe’er, and Charles G. Danko. 2022. “Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology.” Nature Cancer 2022 3:4 3 (4): 505–17. https://doi.org/10.1038/s43018-022-00356-3.
Kieffer, Yann, Hocine R. Hocine, Géraldine Gentric, Floriane Pelon, Charles Bernard, Brigitte Bourachot, Sonia Lameiras, et al. 2020. “Single-cell analysis reveals fibroblast clusters linked to immunotherapy resistance in cancer.” Cancer Discovery 10 (9): 1330–51. https://doi.org/10.1158/2159-8290.CD-19-1384/333435/AM/SINGLE-CELL-ANALYSIS-REVEALS-FIBROBLAST-CLUSTERS.
Li, Yumei, Xinzhou Ge, Fanglue Peng, Wei Li, and Jingyi Jessica Li. 2022. “Exaggerated false positives by popular differential expression methods when analyzing human population samples.” Genome Biology 23 (1): 1–13. https://doi.org/10.1186/S13059-022-02648-4/FIGURES/2.
Newman, Aaron M., Chih Long Liu, Michael R. Green, Andrew J. Gentles, Weiguo Feng, Yue Xu, Chuong D. Hoang, Maximilian Diehn, and Ash A. Alizadeh. 2015. “Robust enumeration of cell subsets from tissue expression profiles.” Nature Methods 2015 12:5 12 (5): 453–57. https://doi.org/10.1038/nmeth.3337.
Newman, Aaron M., Chloé B. Steen, Chih Long Liu, Andrew J. Gentles, Aadel A. Chaudhuri, Florian Scherer, Michael S. Khodadoust, et al. 2019. “Determining cell type abundance and expression from bulk tissues with digital cytometry.” Nature Biotechnology 2019 37:7 37 (7): 773–82. https://doi.org/10.1038/s41587-019-0114-2.
Pelon, Floriane, Brigitte Bourachot, Yann Kieffer, Ilaria Magagna, Fanny Mermet-Meillon, Isabelle Bonnet, Ana Costa, et al. 2020. “Cancer-associated fibroblast heterogeneity in axillary lymph nodes drives metastases in breast cancer through complementary mechanisms.” Nature Communications 2020 11:1 11 (1): 1–20. https://doi.org/10.1038/s41467-019-14134-w.
Su, Shicheng, Jianing Chen, Herui Yao, Jiang Liu, Shubin Yu, Liyan Lao, Minghui Wang, et al. 2018. “CD10 + GPR77 + Cancer-Associated Fibroblasts Promote Cancer Formation and Chemoresistance by Sustaining Cancer Stemness.” Cell 172 (4): 841–856.e16. https://doi.org/10.1016/J.CELL.2018.01.009.
Szolek, András, Benjamin Schubert, Christopher Mohr, Marc Sturm, Magdalena Feldhahn, and Oliver Kohlbacher. 2014. “OptiType: precision HLA typing from next-generation sequencing data.” Bioinformatics 30 (23): 3310–16. https://doi.org/10.1093/BIOINFORMATICS/BTU548.
Zhao, Shanrong, Zhan Ye, and Robert Stanton. 2020. “Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols.” RNA 26 (8): 903. https://doi.org/10.1261/RNA.074922.120.