Add manual correction of tokenization in BIO format #9

shigapov · 2022-08-12T15:13:37Z

An idea for enhancement: adding a new tab "Tokenization" or "Tokens". In this tab a user could upload a file in IOB2/BIO format and manually correct tokenization and tags. For example, a user could split or merge the tokens and modify the corresponding tags.
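To make the split and merge operations concrete, here is a minimal sketch in Python. It is only an illustration, not MedTator code: the `(token, tag)` row layout, the function names `merge_tokens` and `split_token`, and the tag-handling rules (a merged token takes the first non-`O` tag in the span; the right half of a split `B-X` token becomes `I-X`) are all assumptions.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # hypothetical layout: (token, BIO tag)


def merge_tokens(rows: List[Row], start: int, end: int, joiner: str = "") -> List[Row]:
    """Merge rows[start..end] (inclusive) into a single token.

    Assumption: the merged token takes the first non-"O" tag in the span,
    or "O" if every tag in the span is "O".
    """
    if not (0 <= start < end < len(rows)):
        raise ValueError("need 0 <= start < end < len(rows)")
    span = rows[start:end + 1]
    token = joiner.join(tok for tok, _ in span)
    tag = next((t for _, t in span if t != "O"), "O")
    return rows[:start] + [(token, tag)] + rows[end + 1:]


def split_token(rows: List[Row], index: int, offset: int) -> List[Row]:
    """Split rows[index] at character position `offset`.

    Assumption: a "B-X" token splits into "B-X" + "I-X"; any other tag
    is copied unchanged to both halves.
    """
    token, tag = rows[index]
    if not (0 < offset < len(token)):
        raise ValueError("offset must fall strictly inside the token")
    right_tag = "I-" + tag[2:] if tag.startswith("B-") else tag
    left = (token[:offset], tag)
    right = (token[offset:], right_tag)
    return rows[:index] + [left, right] + rows[index + 1:]


rows = [("COVID", "B-DIS"), ("-", "I-DIS"), ("19", "I-DIS"), ("test", "O")]
merged = merge_tokens(rows, 0, 2)
print(merged)
```

A user would still need to review the resulting tags, since the right tag after a merge or split depends on the annotation guidelines.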

Motivation: the current tokenizer that MedTator uses for BIO export has relatively low accuracy in many special cases, and in those cases the BIO files cannot currently be used for training. Even if the tokenizer is replaced by another one (as mentioned in #7), it is unlikely to perform well for all languages and use cases. Being able to manually fix the tokens and tags in BIO format would therefore be very helpful for building a gold standard corpus.

hehuan2112 · 2022-08-18T22:36:14Z

Thank you so much for your suggestion! I think it's a very nice feature for improving the dataset quality!

My understanding is that your idea is about revising the content of IOB2/BIO format files, such as splitting, merging, or deleting tokens and tags, wherever there are tokenization or tagging errors. If so, I think the most challenging part is locating the errors in a long IOB-format file. As you described, the errors can be caused by exceptional tokenization cases, and we don't know in advance where in the file they are, so finding them takes manual work.
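As a rough illustration of what automatic locating could look like, the sketch below flags lines that are structurally suspicious. It is not an existing MedTator feature: the file layout (one whitespace-separated `token tag` pair per line, blank lines between sentences) and the function name `find_suspicious_lines` are assumptions. It only catches malformed lines and invalid tag transitions, not tokenization errors that happen to be well-formed.

```python
from typing import List, Tuple


def find_suspicious_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line number, reason) pairs for lines worth a manual look.

    Assumed layout: "token tag" per line, blank line = sentence boundary.
    """
    issues: List[Tuple[int, str]] = []
    prev_tag = "O"
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            prev_tag = "O"  # a sentence boundary resets the tag context
            continue
        parts = stripped.split()
        if len(parts) != 2:
            issues.append((lineno, "expected exactly 'token tag'"))
            prev_tag = "O"
            continue
        _, tag = parts
        if tag.startswith("I-"):
            label = tag[2:]
            # In IOB2, I-X must follow B-X or I-X of the same label.
            if prev_tag not in ("B-" + label, "I-" + label):
                issues.append((lineno, "I- tag without matching B-/I- before it"))
        elif tag != "O" and not tag.startswith("B-"):
            issues.append((lineno, "unrecognized tag"))
        prev_tag = tag
    return issues


sample = "Aspirin B-DRUG\n, O\ndaily I-DRUG\n\nbroken line here\n"
issues = find_suspicious_lines(sample)
for lineno, reason in issues:
    print(lineno, reason)
```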

Edits such as splitting, merging, or deleting tokens can be done easily in any text editor (VSCode, Sublime Text, etc.).
I agree that a dedicated tool could make both searching and editing easier. Could you share some cases where you had to fix issues in IOB2/BIO format files? For example, how did you locate the errors, and what changes did you make to the files? Then I think we can look at how to improve this process.

I found one tool, neat, which may have similar functions for IOB2/BIO file editing. But I'm not sure if there is any other tool for reference. If you have any thoughts, please feel free to discuss :)

shigapov · 2022-08-19T05:51:20Z

Indeed, neat looks like a good tool for tokenization correction. I'll test it.

With respect to locating errors and changes to the file: I tested MedTator on a text with many special characters and then converted it to BIO format. The special characters were treated as separate tokens, so in this case I need to merge many tokens, for example.
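One way such merge candidates could be found automatically is sketched below, under assumptions: the original text is still available next to the exported tokens, and the function name `merge_candidates` is hypothetical. It aligns each token to the source text and reports adjacent token pairs that had no whitespace between them and where at least one of the two is not purely alphanumeric.

```python
from typing import List, Tuple


def merge_candidates(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, i + 1) of tokens that may need merging.

    A pair is reported when the two tokens touch in the source text
    (no whitespace between them) and at least one is not alphanumeric.
    """
    candidates: List[Tuple[int, int]] = []
    pos = 0
    prev_end = None
    for i, tok in enumerate(tokens):
        start = text.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found in text after offset {pos}")
        touching = prev_end is not None and start == prev_end
        if touching and not (tokens[i - 1].isalnum() and tok.isalnum()):
            candidates.append((i - 1, i))
        prev_end = start + len(tok)
        pos = prev_end
    return candidates


text = "Dose: 5mg/kg (approx.)"
tokens = ["Dose", ":", "5mg", "/", "kg", "(", "approx", ".", ")"]
pairs = merge_candidates(text, tokens)
print(pairs)
```

The result is only a list of places to review. Some pairs, such as a word followed by a colon, are probably fine as separate tokens.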
