Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent region numbers #10

Open
gtzheng opened this issue Feb 15, 2024 · 4 comments
Open

Inconsistent region numbers #10

gtzheng opened this issue Feb 15, 2024 · 4 comments
Labels
question Further information is requested

Comments

@gtzheng
Copy link

gtzheng commented Feb 15, 2024

Here is the code I used for tokenization:

from geniml.tokenization import ITTokenizer # or any other tokenizer
from geniml.io import RegionSet
import numpy as np

bed_file = '/project/shefflab/data/encode/ENCFF001SMI.bed.gz'
universe_file = '/project/shefflab/data/screen/cCREs.bed'

rs = RegionSet(bed_file)
tokenizer = ITTokenizer(universe_file)

print("number of regions:", len(rs))

all_tokens = tokenizer.tokenize(rs)
print("number of regions after tokenization:", len(all_tokens))

sel_indexes = np.random.permutation(len(rs))[0:200]
sel_rs = RegionSet([rs[i] for i in sel_indexes])
print("number of selected regions:", len(sel_rs))
tokens = tokenizer.tokenize(sel_rs)
print("number of selected regions after tokenization:", len(tokens))

The numbers of regions after tokenization seem inflated.

number of regions: 1045378
number of regions after tokenization: 1055185
number of selected regions: 200
number of selected regions after tokenization: 201
@gtzheng gtzheng added the bug Something isn't working label Feb 15, 2024
@nleroy917
Copy link
Member

nleroy917 commented Feb 15, 2024

Hey @gtzheng
Thanks for opening this up. I guess I have two thoughts:

  1. There definitely could be a bug or quirk and I'd love to investigate the tokenizers more to ensure they are working properly,
  2. One reason you might experience an inflated amount of regions is because when tokenizing... I collect all regions it overlaps... if that makes sense. Here is a graphic:
image

One region in the query ends up overlapping with two universe regions so you will tokenize a single region and get back two tokens.

One thing I like to do as a sanity check is to use bedtools overlap to make sure the tokenizer is hitting things properly.

Let me know if you have any other thoughts!

@gtzheng
Copy link
Author

gtzheng commented Feb 15, 2024

I see. Thanks for the rely!
It does make sense to see inflated numbers if all overlapped regions are collected. Would it also be possible to return the most overlapped regions, so the output has the same number of regions as the input?

@nleroy917
Copy link
Member

Yeah, one option is just to return the first region it overlaps. I like the idea of returning the single one. It overlaps with the most, however.

In terms of signal, biological meaning however - I'm not sure which is better...

@gtzheng
Copy link
Author

gtzheng commented Feb 15, 2024

Perhaps make it an option for users to choose.

Would it be possible to get how much two regions overlap? If so, we can use that information for soft tokenization.

@gtzheng gtzheng added question Further information is requested and removed bug Something isn't working labels Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants