PanicException during tokenization #20
The main issue is that the Rust tokenizer panics instead of raising a catchable Python exception. Therefore, this issue can be resolved if I just add a few safeguards around my code to check a few things before tokenization and raise appropriate exceptions otherwise from within Rust. This can be easily done using pyo3.
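A minimal, self-contained sketch of why this matters for callers (FakePanic and tokenize are hypothetical stand-ins, not geniml APIs; the assumption, consistent with the extra except BaseException below, is that pyo3's PanicException derives from BaseException):

# An exception deriving from BaseException slips past a plain
# `except Exception`, so one bad file aborts the whole run.
class FakePanic(BaseException):
    """Stand-in for pyo3_runtime.PanicException (assumed BaseException)."""

def tokenize(path):
    if path.endswith(".bad"):
        raise FakePanic("called `Result::unwrap()` on an `Err` value")
    return ["chr1:0-100"]

for path in ["a.bed", "b.bad", "c.bed"]:
    try:
        print(tokenize(path))
    except Exception as e:  # never sees FakePanic
        print(f"recorded failure for {path}: {e}")
# The loop dies at b.bad with an unhandled FakePanic. If the Rust side
# raises ordinary exceptions instead, the except branch records the
# failure and the loop continues.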
However, now that I am thinking about it... there were a lot of changes to the tokenizers code, so I wonder if this will be resolved anyway when I finally get the new release out.
I added an extra try/except:

try:
    regions = RegionSet(file)
    tokens = tokenizer.tokenize(regions)
    tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
    write_tokens_to_gtok(tokens_file, tokens)
except Exception as e:
    with open(failed_files, "a") as f:
        f.write(f"{file}\t{e}\n")
except BaseException as be:
    with open(failed_files, "a") as f:
        f.write(f"{file}\t{be}\n")

In the output file that records failed BED files and their exceptions, these are the major exception categories, with file examples:
Early end?
Empty CSV
CSV parse error
Garbled files?
TypeError
1?
Yeah, that's fine; just be aware that catching BaseException can mess with things. If it helps in the short term that's fine, but hopefully the upcoming changes can help out here.
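One concrete way a blanket except BaseException can mess with things, as a self-contained sketch (the loop and process() here are hypothetical, not the pretokenization code): it also traps KeyboardInterrupt and SystemExit, so Ctrl-C no longer stops a long run.

def process(path):
    raise KeyboardInterrupt  # pretend the user hit Ctrl-C mid-file

for path in ["a.bed", "b.bed"]:
    try:
        process(path)
    except BaseException as be:  # swallows KeyboardInterrupt too
        print(f"{path} logged as 'failed': {be!r}")
# Both files get logged as failures and the loop keeps going, instead
# of the interrupt stopping the job.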
Issue is that the regions are stored as:
Issue with these is that the regions are stored with commas in the start and end coordinates:
These are resolved too.
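If a Python-side workaround is ever needed for that category, a possible pre-clean step (hypothetical helper, not a geniml API; the layout is assumed to be standard BED with start/end in columns 2 and 3) is to strip the commas before handing the file to the tokenizer:

import csv

def strip_coordinate_commas(in_path, out_path):
    # Rewrite a BED file whose start/end columns contain commas
    # (e.g. "1,234,567") so they parse as plain integers.
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            if len(row) >= 3:
                row[1] = row[1].replace(",", "")  # start
                row[2] = row[2].replace(",", "")  # end
            writer.writerow(row)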
These contain header lines that mess with parsing. Example:
Also solved.
yeah seems empty to me...
Yeah these seem empty. Solved
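For the header and empty-file categories, a possible Python-side pre-check (hypothetical helpers, plain-text files only; the later comments suggest the new Rust-side errors make this optional) is to skip UCSC-style track/browser/comment lines and flag files with no data rows before tokenizing:

HEADER_PREFIXES = ("track", "browser", "#")

def data_lines(path):
    # Yield only lines that look like region rows.
    with open(path) as fh:
        for line in fh:
            stripped = line.strip()
            if stripped and not stripped.startswith(HEADER_PREFIXES):
                yield stripped

def is_effectively_empty(path):
    # True if the file has no data rows at all.
    return next(data_lines(path), None) is None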
I think the conclusion was to not catch these exceptions, so I wouldn't do this :).
Are you talking about Claude's code?

try:
    regions = RegionSet(file)
    tokens = tokenizer.tokenize(regions)
    tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
    write_tokens_to_gtok(tokens_file, tokens)
except Exception as e:
    with open(failed_files, "a") as f:
        f.write(f"{file}\t{e}\n")
except BaseException as be:
    with open(failed_files, "a") as f:
        f.write(f"{file}\t{be}\n")

Because on the Rust side I am bubbling up the correct exceptions that should be catchable.
For example... using the new tokenizers yields this:

>>> from geniml.tokenization.main import TreeTokenizer
>>> t = TreeTokenizer.from_pretrained("databio/r2v-luecken2021-hg38-v2")
>>>
>>> try:
... t("GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.g")
... except Exception as e:
... print(e)
...
The file GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.g does not exist.
>>> try:
... t("GSM4310198_H69_5TGF_beta_3_SMAD3_peaks.bed.gz")
... except Exception as e:
... print(e)
...
BED file line does not have at least 3 fields: track name="..."

Which seems to indicate that we caught it correctly instead of panicking and throwing an uncatchable PanicException.
@ClaudeHu can you confirm if this is solved or not?
While running this code (based on pretokenization code):
This error occurred: