Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Less than 80% of the genes [...] warning with cisTarget step + Pruning gets rid of many regulons #534

Open
VincentGardeux opened this issue Mar 21, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@VincentGardeux
Copy link

VincentGardeux commented Mar 21, 2024

Hi,

I'm setting up pySCENIC for generating regulons from two Fly brain scRNAseq datasets we generated. The goal is to reproduce to some extent what was done in the aging Drosophila brain atlas (Davie et al, 2018). I especially want to reproduce the separation in Figure 4 based on (dati, pros, Imp, scro) regulons.

I'm running pySCENIC v0.12.1 in JupyterHub through Docker, with the latest motif and feather files (v10): mc_v10_clust
Everything seems to run ok using this (slightly tuned) tutorial.

I'm writing this issue because I think that there is an unwanted behavior in the prune2df function. And also as a guide for future users, since I struggled quite a lot to make this work.

First, I would encourage people to use the pySCENIC version (instead of the R, or the nextflow pipeline, whose maintenance is discontinued or doesn't work with the latest databases).

When running the tutorial, I got many warnings with the prune2df function. But the run still finished, and I ended up with 46 regulons... which did not seem like a lot. Especially because in the (Davie et al, 2018) paper the authors appear to have 163 regulons (Table S4). The main issue is that I didn't get the (dati, pros, scro) important regulons that I was looking for.

  • Problem: When running the cisTarget part (specifically, the prune2df function), I quickly get tons of Warnings: Less than 80% of the genes in ... could be mapped to ... which was already reported (but never really solved) in other issues: pyscenic warning: less than 80% of the genes in xxx could be mapped to ... #466 [BUG] pyscenic.transform warning/error #515 [BUG]pyscenic ctx --no_pruning #506 Empty file output from pyscenic ctx using original/modified databases #325 Empty regulons after cisTarget pruning step[results] #177 and pull requests: Changeable cutoff for number of genes found in DB for module2df #387

  • Problem: Kinda related to problem 1, but the number of regulons I get is very small as compared to what I would expect

  • Solution 1: I found out, that using the rho_mask_dropouts=True option in the step before (modules_from_adjacencies), it increases the number of regulons to 81, but still without my looked-for regulons (and still with all the warnings).

  • Solution 2: So I've set up to remove the warnings in the prune2df function. For this, I thought that the problem was coming from the *feather files, where genes were not overlapping with the genes I have in my expression matrix. So, I've extracted the genes used in the feather file, transformed them to Flybase/Ensembl IDs (btw using gene symbols instead of IDs is not a good idea I think), and updated the gene names in my expression matrix. So now, almost all of the feather genes are in my expression matrix. Then I've run again the pipeline, but was very surprised that it did not solve the warning issues... which did not make sense to me? It still worked to some extent, since got 99 regulons, but the increase was very marginal. And I did not understand why it was not solving the warning issues.

  • Solution 3: [Bug?] So I've thought that maybe the issue was coming from the fact that my expression matrix was containing extra genes that were NOT in the .feather files. So I've added this piece of code at the beginning of my script, to restrict my expression matrix to ONLY genes present in the .feather file:

import pandas as pd
# ex_matrix is my expression matrix
ranking_feather = pd.read_feather("dm6_v10_clust.genes_vs_motifs.rankings.feather")
overlap_values = ex_matrix.index[pd.Series(ex_matrix.index).isin(ranking_feather.columns)].unique()
ex_matrix = ex_matrix.loc[overlap_values, :]

This solves completely the warning issues in the pruning step, and I end up with 108 regulons. But this is not an expected behavior of the method, is it? I mean I understand the issues if some genes of the feather database are missing from the expression matrix, but it looks ok to have extra genes in your expression matrix? So why would this cause a warning? And limits the number of regulons generated?

  • Solution 4: Just to complete my post. Even with all these updates, I could not get the regulons I was looking for. I was finally able to get the regulons I wanted (total of 156) by setting the auc_threshold=0.01 parameter in the prune2df function (instead of 0.05 by default). I'm not sure though what is the real impact of this, as I could not find clear explanation of what this parameter is doing. Another way is to completely deactivate the pruning/filtering by using the filter_for_annotation=False argument in the prune2df function.

Hope this can help other users. And to the authors/maintainers please give me your opinion on point 3, as I really thought this behavior was weird/unexpected?

@VincentGardeux VincentGardeux added the bug Something isn't working label Mar 21, 2024
@caochch
Copy link

caochch commented Apr 8, 2024

Any progresses? What's the expected number of modules or regulons in the ctx step (the row of output files)?

@Flu09
Copy link

Flu09 commented Jun 24, 2024

any news?

@rrydbirk
Copy link

Can confirm that following solution 2 and 3 improves no. regulons significantly! Also, completely omitted warnings for me.

@colin893
Copy link

colin893 commented Jul 3, 2024

Hi. I have an issue while running pyscenic on Zebrafish dataset. I downloaded the Stanaka motifs database and processed my pyscenic analysis. Just to add, I successfully ran pyscenic on both mouse and human datasets.

First I got an error due to the wrong version format of the file that I converted into v2 without problem. Then I generated the adj.csv file that threw no error and is filled. The problem arises with the command pyscenic ctx - I got the usual warnings "Less than 80% of the genes in Regulon for XXXX could be mapped to v2_zf1.genes_vs_motifs.rankings. Skipping this module." but at the end, my reg.csv file is empty.

So I imported the genes from the feather file of Stanaka, and intersected these with the genes from my expression matrix, just to ensure that there is a significant overlap between gene names. For respectively 23000 and 20000 genes in both lists, I got +- 15000 genes in common. Given this, in my understanding, I would expect some regulons to survive the pruning step?
So I am wondering if it's not because of something else?

Thank you in advance for your time,

@VincentGardeux
Copy link
Author

Hi @colin893,

I'm still following the updates on this issue, but so far did not see any answer from the devs/authors. :(

For your issue, did you try all the solutions I suggested in my post? In my case they substantially increased the number of regulons.

If you still loose all your regulons at the pruning stage, you can probably remove the pruning completely using filter_for_annotation=False
I have to admit that I tend to do that now, because my "preferred regulons" are always filtered out for reasons I don't understand.

Cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants