changing clustering threshold in bindetect #256

Open
sleepryansleep opened this issue Mar 13, 2024 · 3 comments

Comments

@sleepryansleep

Hi, I was wondering whether there is a way to change the clustering distance threshold that BINDetect uses to form motif clusters from its default (which I believe is 0.5). If not, I believe it would be a useful feature to have.

Additionally, it may be useful to be able to import the clustering schema from one dataset into another analysis, i.e., to generate motif clusters in BINDetect from one dataset and then apply those same clusters to a BINDetect analysis of a distinct dataset. While we'd expect the clusters to be more or less the same, they might vary slightly depending on the regions selected, in particular for TFs whose distance is close to the threshold.

Thank you!

ryan

@hschult hschult mentioned this issue Mar 14, 2024
@hschult
Collaborator

hschult commented Mar 15, 2024

Hi Ryan,

Thank you for using TOBIAS! Yes, you are right: the default BINDetect clustering threshold is 0.5, and until now there was no way to change it. I have added a parameter to address this (--cluster-threshold). It is already in the dev branch and will be part of the next release. To try it, install our development version:

pip install git+https://github.com/loosolab/TOBIAS@dev
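As a rough illustration of what such a threshold controls, here is a minimal, self-contained sketch of threshold-based agglomerative clustering on a toy distance matrix. It is not the TOBIAS implementation: the average-linkage rule, the function name `cluster_by_threshold`, and all distances are assumptions made up for the example.

```python
from itertools import combinations


def cluster_by_threshold(names, dist, threshold=0.5):
    """Naive average-linkage agglomerative clustering (illustrative only).

    names: list of TF names.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Clusters keep merging while the closest pair is nearer than `threshold`.
    """
    clusters = [[n] for n in names]

    def avg_dist(c1, c2):
        # Average pairwise distance between the members of two clusters
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters
        (i, j), d = min(
            (((i, j), avg_dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d >= threshold:
            break  # nothing left that is closer than the threshold
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Toy distances (made up): three POU factors and one unrelated TF
names = ["POU2F1", "POU3F2", "POU5F1", "SOX2"]
pairs = {
    ("POU2F1", "POU3F2"): 0.2,
    ("POU2F1", "POU5F1"): 0.4,
    ("POU3F2", "POU5F1"): 0.5,
    ("POU2F1", "SOX2"): 0.9,
    ("POU3F2", "SOX2"): 0.9,
    ("POU5F1", "SOX2"): 0.9,
}
dist = {frozenset(k): v for k, v in pairs.items()}

print(cluster_by_threshold(names, dist, threshold=0.5))
# [['POU2F1', 'POU3F2', 'POU5F1'], ['SOX2']]
print(cluster_by_threshold(names, dist, threshold=0.4))
# [['POU2F1', 'POU3F2'], ['POU5F1'], ['SOX2']]
```

In this toy case, lowering the threshold from 0.5 to 0.4 splits POU5F1 off into its own cluster, which is the kind of borderline behaviour described in the original question.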

For your second question, I'm not sure what you mean by "import the clustering schema". However, if you want to compare the clustering between two separate BINDetect runs, you could compare the assigned groups as given in the bindetect_results.txt files. A different approach would be to use the --output-peaks parameter. It lets you define a subset of the initial peaks and restrict the analysis to that subset, so you could limit your analysis to shared peaks.
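Comparing the assigned groups of two runs could look roughly like the sketch below. It assumes the results files are tab-separated and contain `motif_id` and `cluster` columns; please check the header of your own bindetect_results.txt, as those column names are an assumption here. The helper names and the inline toy tables are hypothetical.

```python
import csv
import io


def read_groups(handle, id_col="motif_id", cluster_col="cluster"):
    """Return the set of motif groups found in a tab-separated results table.

    Each group is a frozenset of motif ids. Cluster labels are dropped on
    purpose, since they need not match between independent runs.
    """
    groups = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        groups.setdefault(row[cluster_col], set()).add(row[id_col])
    return {frozenset(members) for members in groups.values()}


def compare_groups(groups_a, groups_b):
    """Return (groups shared by both runs, groups only in A, groups only in B)."""
    return groups_a & groups_b, groups_a - groups_b, groups_b - groups_a


# Toy stand-ins for two results files (values made up)
run_a = "motif_id\tcluster\nPOU2F1\tC_1\nPOU3F2\tC_1\nPOU5F1\tC_2\nSOX2\tC_3\n"
run_b = "motif_id\tcluster\nPOU2F1\tC_1\nPOU3F2\tC_2\nPOU5F1\tC_2\nSOX2\tC_3\n"

groups_a = read_groups(io.StringIO(run_a))
groups_b = read_groups(io.StringIO(run_b))
shared, only_a, only_b = compare_groups(groups_a, groups_b)

print("shared:", sorted(sorted(g) for g in shared))
print("only in A:", sorted(sorted(g) for g in only_a))
print("only in B:", sorted(sorted(g) for g in only_b))
```

For real data, the `io.StringIO(...)` objects would be replaced by open file handles for the two results files.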

I hope this answers your question.

Hendrik

@sleepryansleep
Author

sleepryansleep commented Mar 15, 2024

Hi, thanks for the response and it's nice that the first part of the question will be a feature in upcoming versions.

What I meant by the second part of my question is this: let's say I performed a BINDetect analysis of 2Cell compared to 4Cell embryos, and the clustering produces two POU clusters, POU_A and POU_B. Let's say I also want to show BINDetect data for 8Cell vs ICM. The peaks would be different in this dataset, and thus the clustering would be slightly different. Perhaps this dataset produces three POU clusters: POU_C, POU_D, and POU_E.

If I plan to have a panel showing how clusters behave in 2Cell vs 4Cell or 8Cell vs ICM, I would prefer that each panel show the exact same clustering, such that the POU clusters in one panel consist of the exact same individual TFs as in the next panel. So my question/comment is that it would be nice to be able to tell the 8Cell vs ICM analysis to use the exact same clusters as the 2Cell vs 4Cell analysis. (More likely, I would run a separate analysis in which I combine all the peaks, allow clusters to form, and then repeat the 2Cell vs 4Cell and 8Cell vs ICM analyses using the clusters from the combined analysis.) I hope this makes sense. Thank you.

ryan

@hschult
Collaborator

hschult commented Mar 18, 2024

Yes, I understand now. You want to create clusters that are independent of the underlying condition so they can be used to compare multiple analyses. While combining the peaks and using them to form clusters should work, I would recommend a different approach, as clusters created by BINDetect will always depend on the underlying peaks. Here is an excerpt from the Supplementary Information of our paper that describes how the clustering is done:

hierarchical clustering of the TFBS-distance matrix is shown and all TFs with distances less than 0.5 (overlap of over 50% of base pairs) are colored as separate clusters.

So the clustering is done by comparing the base pair overlap of transcription factor binding sites (TFBS), which is why the clustering is peak-dependent. For a clustering that is independent of the data, I would recommend a TF-motif-based clustering, in other words grouping TFs based on their binding motifs. Depending on the TF motifs you are using, you could create the clustering yourself. This can be done using the MotifList class of TOBIAS. Alternatively, you could use pre-computed clusters, such as those available in the JASPAR database (here).
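Once you have such a data-independent grouping, applying it to several analyses is mostly bookkeeping. The sketch below is only an illustration under assumptions: the TF-to-cluster mapping, the per-TF scores, and the helper `group_by_fixed_clusters` are all made up, and it does not use any TOBIAS API. In practice the mapping would come from your motif-based clustering or from pre-computed clusters, and the scores from the respective BINDetect results.

```python
from collections import defaultdict


def group_by_fixed_clusters(scores, tf_to_cluster):
    """Group per-TF scores by a predefined TF -> cluster mapping.

    TFs missing from the mapping are kept as singleton clusters
    named after the TF itself.
    """
    grouped = defaultdict(dict)
    for tf, score in scores.items():
        grouped[tf_to_cluster.get(tf, tf)][tf] = score
    return dict(grouped)


# Hypothetical fixed mapping, e.g. derived from motif similarity
tf_to_cluster = {"POU2F1": "POU_A", "POU3F2": "POU_A", "POU5F1": "POU_B"}

# Made-up per-TF scores from two independent comparisons
scores_2cell_vs_4cell = {"POU2F1": 0.12, "POU3F2": 0.08, "POU5F1": -0.30, "SOX2": 0.05}
scores_8cell_vs_icm = {"POU2F1": -0.02, "POU3F2": 0.01, "POU5F1": 0.25, "SOX2": 0.40}

panel_1 = group_by_fixed_clusters(scores_2cell_vs_4cell, tf_to_cluster)
panel_2 = group_by_fixed_clusters(scores_8cell_vs_icm, tf_to_cluster)

# Both panels now have identical clusters with identical members
for cluster in sorted(panel_1):
    print(cluster, sorted(panel_1[cluster]), sorted(panel_2[cluster]))
```

Because the mapping is fixed up front, POU_A and POU_B contain the same TFs in both panels regardless of which peaks went into each analysis.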

I hope that helps. Let me know if something is unclear.
Hendrik
