
PanicException during tokenization #20

Open
ClaudeHu opened this issue May 21, 2024 · 9 comments
Labels
bug Something isn't working enhancement New feature or request likely solved

Comments

@ClaudeHu
Member

While running this code (based on pretokenization code):

import os
import sys
from pathlib import Path

from rich.progress import track

from genimtools.utils import write_tokens_to_gtok
from geniml.io import RegionSet
from geniml.tokenization import ITTokenizer

sys.path.append(os.path.abspath("../utils"))
from file_utils import load_dict


def main():
    """
    based on https://github.com/databio/scripts/blob/master/model-training/region2vec-encode/pretokenize.py
    """

    data_path = os.path.expandvars("$GEO_BED_FOLDER")
    metadata_path = os.path.expandvars("../data/metadata/GEO_external")
    tokens_dir = os.path.expandvars("$GEO_DATASET/tokens")
    universe_path = os.path.expandvars("$GENIML_DATASET/encode/universe.bed")
    failed_files = os.path.expandvars("$GEO_DATASET/failed_files.txt")

    # init tokenizer
    tokenizer = ITTokenizer(universe_path)

    # metadata of GEO hg38 BED
    series_dict = load_dict(os.path.join(metadata_path, "series.json"))
    sample_dict = load_dict(os.path.join(metadata_path, "sample.json"))

    # make metadata df

    if not os.path.exists(tokens_dir):
        os.makedirs(tokens_dir)

    files = []

    for gse in series_dict:
        samples = series_dict[gse]
        for gsm in samples:
            files.extend([f"{data_path}/{gse}/{file}" for file in sample_dict[gsm]])

    for file in track(files, total=len(files), description="Tokenizing"):
        try:
            regions = RegionSet(file)
            tokens = tokenizer.tokenize(regions)
            tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
            write_tokens_to_gtok(tokens_file, tokens)
        except Exception as e:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{e}\n")


if __name__ == "__main__":
    main()

This error occurred:

thread '<unnamed>' panicked at src/tokenizers/tree_tokenizer.rs:117:74:
called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/sfs/qumulo/qhome/zh4nh/training/text2bed_encode_geo/data_preprocessing/external_test_set_pretokenize.py", line 58, in <module>
    main()
  File "/sfs/qumulo/qhome/zh4nh/training/text2bed_encode_geo/data_preprocessing/external_test_set_pretokenize.py", line 49, in main
    tokens = tokenizer.tokenize(regions)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zh4nh/.conda/envs/my-env/lib/python3.11/site-packages/geniml/tokenization/main.py", line 152, in tokenize
    result = self._tokenizer.tokenize(list(query))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
@nleroy917
Member

The main issue is that the try-except doesn't catch the error as one would expect... See this stack overflow for more detailed info. The TL;DR is

Your except doesn't work because, as pyo3 documents, PanicException derives from BaseException (like SystemExit or KeyboardInterrupt), since it's not necessarily safe to continue (not all Rust code is panic safe, so pyo3 does not assume a Rust-level panic is innocuous).

Therefore, this issue can be resolved if I add a few safeguards around my code to check a few things before tokenization and raise appropriate exceptions from within Rust otherwise. This can be done easily with pyo3.
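The exception-hierarchy point can be sketched in plain Python. FakePanicException below is a hypothetical stand-in for pyo3_runtime.PanicException, which derives directly from BaseException:

```python
# Sketch: why `except Exception` does not catch pyo3's PanicException.
# FakePanicException is a hypothetical stand-in; the real class derives
# directly from BaseException (like SystemExit or KeyboardInterrupt).

class FakePanicException(BaseException):
    """Stand-in for a Rust panic surfaced through pyo3."""

def tokenize_stub():
    raise FakePanicException("called `Result::unwrap()` on an `Err` value")

caught_by = None
try:
    tokenize_stub()
except Exception:          # never fires: FakePanicException is not an Exception
    caught_by = "Exception"
except BaseException:      # this handler is the one that actually catches it
    caught_by = "BaseException"

print(caught_by)  # -> BaseException
```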

@nleroy917
Member

However, now that I am thinking about it... there were a lot of changes to the tokenizers code, so I wonder if this will be resolved anyway when I finally get the new release out.

@ClaudeHu
Member Author

I added an extra except statement to skip PanicException:

        try:
            regions = RegionSet(file)
            tokens = tokenizer.tokenize(regions)
            tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
            write_tokens_to_gtok(tokens_file, tokens)
        except Exception as e:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{e}\n")
        except BaseException as be:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{be}\n")

In the output file that records failed BED files and their exceptions, these are the major exception types, with example files:

Early end?

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612215_Neutrophil_US_CTCF_ChIPseq_peaks.bed.bz2	Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612211_Neutrophil_US_RAD21_ChIPseq_peaks.bed.bz2	Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612220_Neutrophil_Ecoli_CTCF_ChIPseq_peaks.bed.bz2	Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612223_Neutrophil_US_H3K4me1_ChIPseq_peaks.bed.bz2	Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612229_Neutrophil_PMA_H3K4me3_ChIPseq_peaks.bed.bz2	Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612227_Neutrophil_US_H3K4me3_ChIPseq_peaks.bed.bz2	Compressed file ended before the end-of-stream marker was reached
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126755/GSM3612218_Neutrophil_PMA_CTCF_ChIPseq_peaks.bed.bz2	Compressed file ended before the end-of-stream marker was reached

Empty CSV

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171074/GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.gz	Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171074/GSM5218295_PDXHCI005_Dec_pooled_input_peaks.narrowPeak.gz	Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171070/GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.gz	Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE171070/GSM5218295_PDXHCI005_Dec_pooled_input_peaks.narrowPeak.gz	Empty CSV file
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE155686/GSM4710499_KA61.FCHNKLLBBXX_L8_R1_IGAATTCGT-TAATCTTA.PE_macs2_peaks.bed.gz	Empty CSV file

CSV parse error

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310198_H69_5TGF_beta_3_SMAD3_peaks.bed.gz	CSV parse error: Expected 1 columns, got 5: chr1	778244	778741	peak_1	94.30978
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310186_TGF_beta_2_K27ac_peaks.bed.gz	CSV parse error: Expected 1 columns, got 5: chr1	10087	10345	peak_1	6.74834
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310211_CPTH6_Vehicle_3_peaks.bed.gz	CSV parse error: Expected 1 columns, got 5: chr1	10090	10234	peak_1	11.27605
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE145253/GSM4310208_CPTH6_TGF_beta_2_peaks.bed.gz	CSV parse error: Expected 1 columns, got 5: chr1	10073	10418	peak_1	18.25188
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796503_S18-EBNA2Dox10ug-H3K27ac-Rep1.bed.gz	CSV parse error: Expected 1 columns, got 13: #PeakID	chr	start	end	strand	Normalized Tag Count	region size	findPeaks Score	Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796502_S18-EBNA2Con-H3K27ac-Rep2.bed.gz	CSV parse error: Expected 1 columns, got 13: #PeakID	chr	start	end	strand	Normalized Tag Count	region size	findPeaks Score	Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796504_S18-EBNA2Dox10ug-H3K27ac-Rep2.bed.gz	CSV parse error: Expected 1 columns, got 13: #PeakID	chr	start	end	strand	Normalized Tag Count	region size	findPeaks Score	Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796501_S18-EBNA2Con-H3K27ac-Rep1.bed.gz	CSV parse error: Expected 1 columns, got 13: #PeakID	chr	start	end	strand	Normalized Tag Count	region size	findPeaks Score	Total Tags	Control ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796499_S18-Dox10ug-H3K27ac-Rep1.bed.gz	CSV parse error: Expected 1 columns, got 13: #PeakID	chr	start	end	strand	Normalized Tag Count	region size	findPeaks Score	Total Tags (normal ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158288/GSM4796500_S18-Dox10ug-H3K27ac-Rep2.bed.gz	CSV parse error: Expected 1 columns, got 13: #PeakID	chr	start	end	strand	Normalized Tag Count	region size	findPeaks Score	Total Tags	Control ...

Garbled files?

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202204_1155.replicated.broadPeak.gz	CSV parse error: Expected 1 columns, got 5: ����70�f��<���y)�]�-[_>�����k��Ѭ�S�Q0�{C�Q���e���춓���-r�|���5>�9
                                                                                               ��5|�9��v���?���	v ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202209_1224.replicated.broadPeak.gz	CSV parse error: Expected 1 columns, got 2: �;�=`���?�
                                       /�������7�r����vZ9�U����������G�{�r����S~�̻ڟ����?%���o��a��7Z�_f�5��?y���� ...
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE126573/GSM6202210_1310.replicated.broadPeak.gz	CSV parse error: Expected 2 columns, got 1: >]��:s|��r���5�1���6�)ټ�Y��t��i���}|3���|w+3��I��?O=I.�[|��Ϟ���jo[M)�mڿ�U�#0�/�F��d�(�� ...

TypeError

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104056_FL_UN_Subtel_10qMulti.bed.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104003_iPS_cR35_+3p_Subtel_10p+18p.bed.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104090_Fibroblast_pG_Subtel_5p.bed.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4103995_iPS_cR35_+44p_Subtel_10p+18p.bed.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104117_cG13-treat_Subtel_5p_OICR.bed.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138265/GSM4104011_iPS_cR35_+23p_Subtel_10qMulti.bed.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'str' object cannot be interpreted as an integer"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158348/GSM4798203_k562_cnr_erh_copy2.narrowPeak.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158348/GSM4798202_k562_cnr_erh_copy1.narrowPeak.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798204_k562_cnr_wbp11_copy1.narrowPeak.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798205_k562_cnr_wbp11_copy2.narrowPeak.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798203_k562_cnr_erh_copy2.narrowPeak.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE158350/GSM4798202_k562_cnr_erh_copy1.narrowPeak.gz	called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("'int' object cannot be converted to 'PyString'"), traceback: None }

1?

/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119516_SF11612_snATAC_peaks.bed.gz	1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119519_SF11215_peaks.bed.gz	1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119515_SF11949_snATAC_peaks.bed.gz	1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119514_SF12017_snATAC_peaks.bed.gz	1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119517_SF11979_snATAC_peaks.bed.gz	1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119520_SF11331_peaks.bed.gz	1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119513_SF11964_snATAC_peaks.bed.gz	1
/project/shefflab/brickyard/datasets_downloaded/all_geo_beds/bed_22_05_27/data/data/GSE138794/GSM4119518_SF11956_snATAC_peaks.bed.gz	1

@nleroy917
Member

nleroy917 commented May 22, 2024

I added an extra except statement to skip PanicException:

Yeah, that's fine; just be aware that it can mess with things. If it helps in the short term, that's fine, but hopefully the upcoming changes will help out here.

@nleroy917
Member

nleroy917 commented May 22, 2024

#?

The issue is that the regions are stored as: chr1:100-200. I now make sure to check that there are three fields after splitting on \t; otherwise we bail and raise an exception.

TypeError

The issue with these is that the regions are stored with commas in the start and end coordinates: chr10 133,785,955 133,786,173 🤦🏻‍♂️ This one also gets fixed by the new bail.
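A rough Python sketch of those checks (parse_bed_line is a hypothetical helper, not the actual Rust implementation): split on \t, raise a normal exception unless there are at least three fields, and tolerate thousands-separator commas in the coordinates:

```python
def parse_bed_line(line: str):
    """Hypothetical sketch of the new Rust-side validation: require at
    least three tab-separated fields and accept comma-separated
    coordinates like '133,785,955'. Raises ValueError instead of
    panicking, so callers can use a plain `except Exception`."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(
            f"BED file line does not have at least 3 fields: {line!r}"
        )
    chrom = fields[0]
    try:
        start = int(fields[1].replace(",", ""))
        end = int(fields[2].replace(",", ""))
    except ValueError:
        raise ValueError(f"Non-numeric start/end in line: {line!r}")
    return chrom, start, end

print(parse_bed_line("chr10\t133,785,955\t133,786,173"))
# -> ('chr10', 133785955, 133786173)
```

A line like "chr1:100-200" splits into a single field and now fails with a catchable ValueError rather than a panic.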

Garbled files?

These are resolved too.

CSV parse error

These contain headers in them that mess with things. Example: track name="H69_5T_-_SMAD3_-_20.FCH7Y73BBXY_L8_R1_ITTGGAGGT.PE_macs2_peaks.bed" description="H69_5T_-_SMAD3_-_20.FCH7Y73BBXY_L8_R1_ITTGGAGGT.PE_macs2_peaks.bed"

Also solved
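One way to sidestep those headers on the Python side (a hypothetical pre-filter sketch, not what the tokenizer does internally) is to drop track/browser/comment lines before parsing:

```python
def iter_bed_records(lines):
    """Skip UCSC-style header lines ('track ...', 'browser ...', '#...')
    that are legal in BED files but break a strict column parser."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(("track", "browser", "#")):
            continue  # skip header/comment lines
        yield stripped

sample = [
    'track name="H69_5T_-_SMAD3_..." description="..."',
    "#PeakID\tchr\tstart\tend",
    "chr1\t778244\t778741\tpeak_1\t94.30978",
]
print(list(iter_bed_records(sample)))
# -> ['chr1\t778244\t778741\tpeak_1\t94.30978']
```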

Empty

Yeah, these seem empty to me... 0B is what I see when running du -sh

Early end?

Yeah these seem empty. Solved

@nleroy917 nleroy917 added bug Something isn't working enhancement New feature or request likely solved labels May 22, 2024
@nsheff
Member

nsheff commented May 22, 2024

I think the conclusion was to not catch these exceptions, so I wouldn't do this :).

@nleroy917
Member

Are you talking about Claude's code?

        try:
            regions = RegionSet(file)
            tokens = tokenizer.tokenize(regions)
            tokens_file = os.path.join(tokens_dir, f"{Path(file).stem}.gtok")
            write_tokens_to_gtok(tokens_file, tokens)
        except Exception as e:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{e}\n")
        except BaseException as be:
            with open(failed_files, "a") as f:
                f.write(f"{file}\t{be}\n")

Because on the Rust side I am now bubbling up the correct exceptions, which should be catchable.

@nleroy917
Member

For example... using the new tokenizers yields this:

>>> from geniml.tokenization.main import TreeTokenizer
>>> t = TreeTokenizer.from_pretrained("databio/r2v-luecken2021-hg38-v2")
>>> 
>>> try:
...     t("GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.g")
... except Exception as e:
...     print(e)
... 
The file GSM5218291_PDXHCI005_Veh_pooled_input_peaks.narrowPeak.g does not exist.
>>> try:
...     t("GSM4310198_H69_5TGF_beta_3_SMAD3_peaks.bed.gz")
... except Exception as e:
...     print(e)
... 
BED file line does not have at least 3 fields: track name="..."

This seems to indicate that we caught it correctly instead of panicking and raising an uncatchable exception.

@nleroy917
Member

@ClaudeHu can you confirm if this is solved or not?

3 participants