Effect of over- and underexpression on itselves #4

Open
holgerman opened this issue Nov 27, 2017 · 5 comments

Comments

@holgerman

holgerman commented Nov 27, 2017

Hi Daniel,

thank you very much for sharing this work. As a computational biologist, I find these data very useful for checking hypotheses derived from another dataset against wet-lab data, great!

I had a look at the datasets you kindly provided in https://github.com/dhimmel/lincs/tree/gh-pages/data/consensi and checked the effect of overexpression/underexpression of a gene as perturbagen on itself:

About a third of the genes showed nominally significant underexpression (z-score <= -1.96) when the gene was itself the target of the repressing perturbagen. For overexpression, about 10 percent of genes showed overexpression when they were themselves the overexpressed perturbagen.
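The self-effect check described above could be sketched roughly as follows. This is a minimal illustration, not the code used for the numbers in this post: the long-format `(perturbagen, gene, z)` records, the function name `self_effect_fraction`, and the toy values are all assumptions, and the real consensi files may need reshaping first.

```python
# Hypothetical long-format records: (perturbagen gene, measured gene, z-score).
# The layout of the real consensi files may differ; adapt the loading step.
def self_effect_fraction(records, direction, threshold=1.96):
    """Fraction of perturbagens whose own target gene passes the z threshold.

    direction = -1 for knockdown (expect z <= -threshold),
    direction = +1 for overexpression (expect z >= +threshold).
    """
    # Keep only the pairs where the measured gene is the perturbed gene.
    self_z = [z for pert, gene, z in records if pert == gene]
    if not self_z:
        return float("nan")
    hits = sum(1 for z in self_z if direction * z >= threshold)
    return hits / len(self_z)


# Toy knockdown data, made up for illustration only.
knockdown = [
    ("TP53", "TP53", -3.1),
    ("TP53", "MYC", 0.4),
    ("MYC", "MYC", -0.8),
    ("EGFR", "EGFR", -2.2),
    ("EGFR", "TP53", 1.0),
]

print(self_effect_fraction(knockdown, direction=-1))
```

On the toy data, two of the three knocked-down genes pass the threshold on themselves. The same function with `direction=+1` would be applied to the overexpression signatures.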

My first question is: While this is truly a clear enrichment in the right direction, is this rather low efficiency of a gene as perturbagen on itself expected?

My second question is: Would you suggest filtering for genes that, as perturbagen, show an effect on themselves, as a quality control step?

To illustrate this issue, here is a histogram of z-scores showing the effect of a perturbagen on itself vs. its effect on other genes:
[Image: s309_1_distribution_zscores_over_under_itselves_effect]

Thanks and best, Holger

@dhimmel
Owner

dhimmel commented Nov 27, 2017

@holgerman thanks for playing with the data and your interest.

here is a histogram of z-scores showing the effect of a perturbagen on itself vs. its effect on other genes

To make sure I understand, these plots show the z-score distributions for perturbation target genes (effect_itselves == True) and all non-target genes (effect_itselves == False)? So while there are dysregulated non-target genes that have absolute z-scores > 5, the tails of this distribution are small enough that they aren't visible above in red?

We have noticed similar issues before, where many perturbations don't significantly dysregulate their target in the expected direction. In this discussion, we attempt to diagnose the issue. In particular, we found that measured target genes tended to be dysregulated in the expected direction, while imputed target genes did not. Therefore, we conclude the issue is likely primarily due to poor imputation quality in the original LINCS data (even within the BING "best inferred gene set" genes).

In the Project Rephetio manuscript, we summarize:

The consensus signatures for genetic perturbations allowed us to assess various characteristics of the L1000 dataset. First, we looked at whether genetic interference dysregulated its target gene in the expected direction (Himmelstein, 2016c). Looking at measured z-scores for target genes, we found that the knockdown perturbations were highly reliable, while the overexpression perturbations were only moderately reliable with 36% of overexpression perturbations downregulating their target. However, imputed z-scores for target genes barely exceeded chance at responding in the expected direction to interference. Hence, we concluded that the imputation quality of LINCS L1000 is poor. However, when restricting to significantly dysregulated targets, 22 out of 29 imputed genes responded in the expected direction. This provides some evidence that the directional fidelity of imputation is higher for significantly dysregulated genes. Finally, we found that the transcriptional signatures of knocking down and overexpressing the same gene were positively correlated 65% of the time, suggesting the presence of a general stress response (Himmelstein et al., 2016o).
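As a quick sanity check on the "22 out of 29" figure quoted above, one can ask how likely such a count would be if each imputed target responded in the expected direction with probability 0.5. This is only an illustrative calculation under that coin-flip assumption, not an analysis taken from the manuscript; the helper name `binomial_upper_tail` is made up for this sketch.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 22 of 29 significantly dysregulated imputed targets in the expected direction,
# tested against a null of 50% (assumption: directions are independent coin flips).
p_value = binomial_upper_tail(22, 29)
print(f"one-sided p = {p_value:.4f}")
```

Under this null the one-sided tail probability is roughly 0.004, which is in line with the quoted statement that directional fidelity is higher for significantly dysregulated genes.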

@holgerman
Author

holgerman commented Nov 28, 2017

@dhimmel thank you for pointing me in the right direction! And congrats on your truly collaborative eLife paper!

To make sure I understand, these plots show the z-score distributions for perturbation target genes (effect_itselves == True) and all non-target genes (effect_itselves == False)? So while there are dysregulated non-target genes that have absolute z-scores > 5, the tails of this distribution are small enough that they aren't visible above in red?

Yes, this is right, I just contrasted every non-target pair with the targeted pairs. With vertical faceting, R's ggplot fixes the x-axis scale to the largest range across panels. Overexpression z-values of the red non-target pairs ranged from -19.074 to 38.136, and knockdown z-values of non-target pairs ranged from -58.775 to 45.133, with tails that are not visible in the graph above.

We have noticed similar issues before, where many perturbations don't significantly dysregulate their target in the expected direction. In this discussion, we attempt to diagnose the issue. In particular, we found that measured target genes tended to be dysregulated in the expected direction, while imputed target genes did not. Therefore, we conclude the issue is likely primarily due to poor imputation quality in the original LINCS data (even within the BING "best inferred gene set" genes).

Thanks, this insight was very helpful, as was the discussion in Himmelstein et al., 2016o! Since you discussed that a general stress response might be the reason, I had a closer look at the contrasts used for calculating the z-scores in the LINCS data. My reasoning was that the general stress response must be weaker in the controls for it to show up in the z-scores.

From this GitHub issue I understood that your modzs.gctx object corresponds to LINCS level 4 data. GSE70138 says that there are two versions of z-scores available in level 4:

Level 4 (Z-SCORES) - signatures with differentially expressed genes computed by robust z-scores for each profile relative to control (PC relative to plate population as control; VC relative to vehicle control).

In this discussion you wrote

The z-scores compare a gene's expression level in cells given the perturbation to cells without the perturbation (controls). I believe the controls account for the non-specific disturbances caused by delivering the molecular payload, but will confirm.

Does this mean you used VC controls in your method? As vehicles, the pert_info file lists DMSO, PBS, and water under pert_type ctl_vehicle.
As far as I understood from this documentation of a LINCS contest, PC control means using all other perturbagens on the same plate as controls:

Q: Are there control treatments in the dataset?
A: Yes - DMSO is the control for compound treatments. Empty vector and other forms of non-gene-coding inserts (e.g LacZ, GFP, etc) are controls for genetic perturbagens.

Q: How are differential signatures computed?
A: We take the difference between a treatment of interest and all other perturbagens on the same 384-well assay plate (this is referred to as a population control). In other forms of analyses we compare the treatment to a control such as DMSO or an empty vector.  In our experience, use of a population control is a more rigorous form of signature generation because it is less sensitive to variations arising from inert perturbagens (which are seldom truly inert).
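To make the difference between the two control definitions concrete, here is a minimal sketch of a robust z-score computed once against vehicle wells (VC) and once against the rest of the plate (PC). The plate layout, the well labels, and the expression values are invented for illustration, and the median/MAD form with the usual 1.4826 scaling constant is an assumption about how the robust z-score is defined, not a reproduction of the LINCS pipeline.

```python
from statistics import median


def robust_z(value, reference):
    """Robust z-score of value against reference wells: (x - median) / (1.4826 * MAD)."""
    center = median(reference)
    mad = median([abs(x - center) for x in reference])
    return (value - center) / (1.4826 * mad)


# Toy expression values of one gene on one plate (made up).
# All treated wells are shifted upwards, mimicking a non-specific stress response.
plate = [
    ("ctl_vehicle", 8.0),
    ("ctl_vehicle", 8.2),
    ("ctl_vehicle", 7.9),
    ("trt", 9.5),
    ("trt", 9.0),
    ("trt", 9.4),
]
well_of_interest = 11.0

vehicle_wells = [x for label, x in plate if label == "ctl_vehicle"]
population_wells = [x for _, x in plate]  # every other well on the plate

z_vc = robust_z(well_of_interest, vehicle_wells)
z_pc = robust_z(well_of_interest, population_wells)
print(f"VC z-score: {z_vc:.2f}, PC z-score: {z_pc:.2f}")
```

In this toy plate the population-control z-score is much smaller than the vehicle-control one, because the shift shared by all treated wells is absorbed into the reference. That is the sense in which a population control could dampen a general stress response.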

Interestingly, the Ma'ayan lab suggested using the PC in this YouTube lecture at 18' 30''. However, the Ma'ayan lab's own independent calculation of Level 5, a third version of z-scores based on the characteristic direction method described in their paper, sounds (if I understood correctly) as if they used VC:

A CD unit vector was calculated for each experiment replicate in comparison with all the control replicates on the same plate.

I will ask them about it.

And, finally, do you think that using a different definition of the level 4 z-scores might help, at least a bit, with the problem of low quality of imputed L1000 genes?

Thanks again for your time!

@dhimmel
Owner

dhimmel commented Nov 28, 2017

I understood that your modzs.gctx object corresponds to LINCS level 4 data

That sounds right, although I don't remember using the "level 4" terminology. When we were accessing the data through LINCS Cloud, I don't believe the GEO upload existed.

Frankly I'm not sure whether the modzs.gctx file that we downloaded from the now defunct LINCS Cloud used a vehicle or population control when calculating differential expression z-scores. I archived modzs.gctx on figshare where I noted:

I originally retrieved modzs.gctx from the following path on the L1000 C3 Cloud (c3.lincscloud.org): /xchip/cogs/data/build/a2y13q1/modzs.gctx.

Perhaps you can get in touch with someone from the LINCS L1000 team and inquire about what control was used for this file. @tnat1031 (Ted Natoli) was very helpful during the online L1000 office hours. Perhaps he will know which control was used. I don't remember there being an option in the past, so it's possible that only vehicle control existed when we did our analyses and population control is newer?

Anyways, @holgerman I do think a population control could be preferable. Removing the general stress response would be valuable. I'm not sure how this would relate to the imputation quality issue. Since I believe genes are imputed prior to the control stage, it's possible the quality would not change.

If it turns out that more robust controls or differential expression data is now available, a pull request to update this repo would be of interest.

@tnat1031

tnat1031 commented Dec 1, 2017

@dhimmel Yes, the old modzs.gctx file you obtained from lincscloud was indeed generated using the population-based z-scoring procedure, aka ZSPC.

@holgerman Thanks for your interest in the data. I agree with @dhimmel that when considering the effect of a perturbagen on the specific gene it is designed to target, the directly measured (aka landmark) genes will be much more reliable. It has been challenging using current imputation approaches to reliably predict extreme modulations of non-measured transcripts, but this is an area we're actively exploring.

Could you share more about the type of research you're doing and the questions you want to address with this data? Also, you may find our documentation and other resources at clue.io helpful.

Thanks a lot,
Ted

@holgerman
Author

holgerman commented Dec 5, 2017

Dear @tnat1031, thank you very much for providing this information!
Here are the details:
I am interested in comparing the effect of an eQTL with that of a perturbagen in LINCS. As an example of such an eQTL, a certain genetic variant might have a typically strong effect on the expression level of a neighboring local gene. Frequently, the genetic variant also has an effect on the expression of distant genes. All these effects are typically observed as associations between the genetic variant and the gene of interest.
Of course, association is not causation, but causation is of more interest. This is where LINCS data might be helpful. If the effect on a distant gene is strongly mediated by the local gene and is not a direct effect of the genetic variant, I would expect to see a similar effect on these distant genes in LINCS when the local gene is challenged by a perturbagen resulting in its overexpression / silencing.
Do you see certain limitations for this application of LINCS data?
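One simple way to frame the comparison described above is to check, for the distant genes, how often the eQTL effect and the LINCS z-score under perturbation of the local gene point in the same direction. The sketch below is only an illustration of that idea: the function name `sign_concordance`, the gene labels, and all numbers are hypothetical, and a real analysis would also need to account for cell type, imputed versus measured genes, and significance filtering.

```python
def sign_concordance(eqtl_effects, lincs_z, expected_sign=1):
    """Share of genes present in both inputs whose eQTL effect and LINCS z-score agree in sign.

    expected_sign = +1 if the perturbation moves the local gene in the same
    direction as the eQTL effect allele, -1 if it moves it the opposite way.
    """
    shared = [g for g in eqtl_effects if g in lincs_z]
    if not shared:
        return float("nan")
    agree = sum(
        1 for g in shared if eqtl_effects[g] * lincs_z[g] * expected_sign > 0
    )
    return agree / len(shared)


# Hypothetical trans-eQTL effect sizes on distant genes (made up).
eqtl_effects = {"GENE_A": 0.30, "GENE_B": -0.12, "GENE_C": 0.05, "GENE_D": -0.20}
# Hypothetical LINCS z-scores for the same genes when the local gene is overexpressed.
lincs_z = {"GENE_A": 2.4, "GENE_B": -1.1, "GENE_C": -0.3, "GENE_E": 3.0}

print(sign_concordance(eqtl_effects, lincs_z))
```

Here three genes are shared between the two inputs and two of them agree in sign. A concordance clearly above one half across many distant genes would be consistent with mediation through the local gene, though it would not prove it.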

@dhimmel I also got information from the Ma'ayan lab regarding their controls. Indeed, they did not use population controls but VC (vehicle controls) for their characteristic direction method. In the video they meant that the Broad preferred PC for its z-score method; this was not related to their own work. They have not yet compared their method's performance between VC and PC on this dataset, and they used VC because they regarded it as the typical design for a single microarray experiment.
