Inconsistent region numbers #10

gtzheng · 2024-02-15T15:13:54Z

Here is the code I used for tokenization:

from geniml.tokenization import ITTokenizer # or any other tokenizer
from geniml.io import RegionSet
import numpy as np

bed_file = '/project/shefflab/data/encode/ENCFF001SMI.bed.gz'
universe_file = '/project/shefflab/data/screen/cCREs.bed'

rs = RegionSet(bed_file)
tokenizer = ITTokenizer(universe_file)

print("number of regions:", len(rs))

all_tokens = tokenizer.tokenize(rs)
print("number of regions after tokenization:", len(all_tokens))

sel_indexes = np.random.permutation(len(rs))[0:200]
sel_rs = RegionSet([rs[i] for i in sel_indexes])
print("number of selected regions:", len(sel_rs))
tokens = tokenizer.tokenize(sel_rs)
print("number of selected regions after tokenization:", len(tokens))

The numbers of regions after tokenization seem inflated.

number of regions: 1045378
number of regions after tokenization: 1055185
number of selected regions: 200
number of selected regions after tokenization: 201

nleroy917 · 2024-02-15T18:52:00Z

Hey @gtzheng
Thanks for opening this up. I guess I have two thoughts:

There definitely could be a bug or quirk and I'd love to investigate the tokenizers more to ensure they are working properly,
One reason you might experience an inflated amount of regions is because when tokenizing... I collect all regions it overlaps... if that makes sense. Here is a graphic:

One region in the query ends up overlapping with two universe regions so you will tokenize a single region and get back two tokens.

One thing I like to do as a sanity check is to use bedtools overlap to make sure the tokenizer is hitting things properly.

Let me know if you have any other thoughts!

gtzheng · 2024-02-15T19:14:30Z

I see. Thanks for the rely!
It does make sense to see inflated numbers if all overlapped regions are collected. Would it also be possible to return the most overlapped regions, so the output has the same number of regions as the input?

nleroy917 · 2024-02-15T19:15:59Z

Yeah, one option is just to return the first region it overlaps. I like the idea of returning the single one. It overlaps with the most, however.

In terms of signal, biological meaning however - I'm not sure which is better...

gtzheng · 2024-02-15T19:29:00Z

Perhaps make it an option for users to choose.

Would it be possible to get how much two regions overlap? If so, we can use that information for soft tokenization.

gtzheng added the bug Something isn't working label Feb 15, 2024

gtzheng added question Further information is requested and removed bug Something isn't working labels Feb 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent region numbers #10

Inconsistent region numbers #10

gtzheng commented Feb 15, 2024

nleroy917 commented Feb 15, 2024 •

edited

Loading

gtzheng commented Feb 15, 2024

nleroy917 commented Feb 15, 2024

gtzheng commented Feb 15, 2024

Inconsistent region numbers #10

Inconsistent region numbers #10

Comments

gtzheng commented Feb 15, 2024

nleroy917 commented Feb 15, 2024 • edited Loading

gtzheng commented Feb 15, 2024

nleroy917 commented Feb 15, 2024

gtzheng commented Feb 15, 2024

nleroy917 commented Feb 15, 2024 •

edited

Loading