Add manual correction of tokenization in BIO format #9

Open
shigapov opened this issue Aug 12, 2022 · 2 comments

Comments

@shigapov

An idea for enhancement: adding a new tab "Tokenization" or "Tokens". In this tab a user could upload a file in IOB2/BIO format and manually correct tokenization and tags. For example, a user could split or merge the tokens and modify the corresponding tags.

Motivation: the current tokenizer in MedTator for BIO export has relatively low accuracy in many special cases, and in those cases the BIO files cannot currently be used for training. Even if the tokenizer is replaced by another one (as mentioned in #7), it is unlikely to perform well for all languages and use cases. Being able to manually fix the tokens and tags in BIO format would therefore be very helpful for building a gold standard corpus.

@hehuan2112
Collaborator

Thank you so much for your suggestion! I think it's a very nice feature for improving the dataset quality!

My understanding is that your idea is about revising the content of IOB2/BIO files, for example splitting, merging, or deleting tokens and tags, when there are errors in tokenization or tagging. If so, I think the most challenging part is locating the errors in a long IOB-format file. As you described, the errors are caused by exceptional tokenization cases, and we do not know where they occur in the file, so they have to be found and corrected manually.
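One way to narrow down where to look is a small script that flags lines worth a manual check. The following is only a sketch under stated assumptions, not something MedTator provides: it assumes one token per line with the tag in the last whitespace-separated column and blank lines between sentences, and the function name `find_suspicious_lines` and both heuristics are hypothetical.

```python
def find_suspicious_lines(bio_text):
    """Return (line_number, reason) pairs for lines worth a manual look.

    Assumed layout (not confirmed for MedTator exports): one token per line,
    tag in the last whitespace-separated column, blank line between sentences.
    """
    findings = []
    prev_tag = "O"
    for lineno, line in enumerate(bio_text.splitlines(), start=1):
        if not line.strip():
            prev_tag = "O"  # a blank line ends the sentence
            continue
        parts = line.split()
        token, tag = parts[0], parts[-1]
        # Heuristic 1: a token without letters or digits inside an entity
        # often means a special character was split off as its own token.
        if tag != "O" and not any(ch.isalnum() for ch in token):
            findings.append((lineno, "entity token has no letters or digits"))
        # Heuristic 2: in IOB2, I-X must follow B-X or I-X of the same type.
        if tag.startswith("I-") and prev_tag[2:] != tag[2:]:
            findings.append((lineno, "I- tag does not continue an entity"))
        prev_tag = tag
    return findings


# Hypothetical sample, not taken from a real export.
sample = "Aspirin\tB-DRUG\n5\tB-DOSE\n%\tI-DOSE\nmg\tI-DRUG\n\n.\tO\n"
for lineno, reason in find_suspicious_lines(sample):
    print(lineno, reason)
```

The reported line numbers can then be opened directly in a text editor; the heuristics will miss some errors and flag some correct lines, so they only reduce the search, they do not replace review.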

Edits such as splitting, merging, or deleting tokens can be done easily in any text editor (VSCode, Sublime Text, etc.).
I agree that a tool could make both searching and editing easier. Could you share some cases of fixing issues in IOB2/BIO format files? For example, how do you locate the errors, and what changes do you make to the files? Then I think we can look at how to improve this process.

I found one tool, neat, which may have similar functions for IOB2/BIO file editing. But I'm not sure if there is any other tool for reference. If you have any thoughts, please feel free to discuss :)

@shigapov
Author

Indeed, neat looks like a good tool for tokenization correction. I'll test it.

With respect to locating errors and changes to the file: I tested MedTator on a text with many special characters and then converted it to BIO format. The special characters were treated as separate tokens, so in this case I need to merge many tokens, for example.
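A merge of this kind could be sketched as follows. This is an illustration under assumptions, not MedTator functionality: the rows are `(token, tag)` pairs already parsed from a BIO file, the helper name `merge_tokens` is hypothetical, the pieces are assumed to have been adjacent in the original text (so they are joined without a separator by default), and the example tokens are invented.

```python
def merge_tokens(rows, start, end, joiner=""):
    """Merge rows[start..end] (inclusive) into a single (token, tag) row.

    The merged row takes the first non-"O" tag in the span, or "O" if the
    whole span is outside any entity. Tags of later rows are left untouched.
    """
    if not 0 <= start <= end < len(rows):
        raise IndexError("merge span is out of range")
    span = rows[start:end + 1]
    token = joiner.join(tok for tok, _ in span)
    tag = next((t for _, t in span if t != "O"), "O")
    return rows[:start] + [(token, tag)] + rows[end + 1:]


# Invented example: a hyphenated name that was split into three tokens.
rows = [("IL", "B-GENE"), ("-", "I-GENE"), ("6", "I-GENE"), ("levels", "O")]
print(merge_tokens(rows, 0, 2))
# [('IL-6', 'B-GENE'), ('levels', 'O')]
```

Returning a new list instead of editing in place keeps the original rows available, which makes it easy to compare before and after when many merges are applied.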
