Why use unidecode, and why only on non-html content? #73
Comments
The latin2shaw script is part of a broader suite of scripts I use locally. They are mostly hacked together for my own use and as part of learning Python. I have a separate script for cleaning up HTML before passing it to latin2shaw. My scripts aren't really worth uploading, but I've included the code below for how I clean HTML files. The script is called from a flask application, using the format
I see, so you do use unidecode on both. Since you don't mention any other reasons for using unidecode, and I haven't encountered any apart from the one I mentioned, I'm going to assume it's the only thing unidecode is needed for. In that case I think this simple regex on the latin text solves the problem better:
This preserves non-ASCII characters and removes the need for unidecode, smartypants, and the Beautiful Soup parse, saving us a bunch of code and two dependencies (not three, because I think we should keep Beautiful Soup for another reason: simplifying the HTML processing). Once the packaging PR gets merged, I'll post a PR for this with relevant test cases.
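As an illustration of the kind of substitution this comment describes, here is a minimal sketch. It is not the regex proposed in the comment: the pattern and the helper name `normalize_apostrophes` are assumptions made for this example. It rewrites U+2019 as U+0027 only when it sits between two word characters and leaves all other text unchanged.

```python
import re

# Hypothetical pattern (an assumption, not the regex from the comment):
# match U+2019 only when it has a word character on both sides,
# as in "don’t" or "it’s".
_INNER_APOSTROPHE = re.compile(r"(?<=\w)\u2019(?=\w)")


def normalize_apostrophes(text: str) -> str:
    """Replace word-internal U+2019 with the ASCII apostrophe U+0027.

    Everything else, including accented letters, emoji, and real
    closing quotation marks, is left as it is.
    """
    return _INNER_APOSTROPHE.sub("'", text)


if __name__ == "__main__":
    print(normalize_apostrophes("don\u2019t"))
    print(normalize_apostrophes("My name is Ingebjørg 😇"))
```

A pattern this narrow does not touch a trailing U+2019, so plural possessives such as "dogs’" and closing single quotes stay as they are. Whether that is the right trade-off depends on how the tokenizer treats those cases.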
I was trying to track down a discrepancy between latin2shaw's handling of non-HTML and HTML content: "don’t" is transliterated incorrectly as "𐑛𐑵n’t" in an HTML document, but correctly as "𐑛𐑴𐑯𐑑" on its own. I realised this is because spaCy doesn't recognise words containing U+2019 "Right single quotation mark" as contractions, but it does handle the ASCII apostrophe U+0027 that unidecode converts it to. +1 to using unidecode!
To get latin2shaw to handle these correctly in an html document, I tried feeding the text on the html branch through unidecode too, but ended up with some unintended consequences... Let's take another example: "My name is Ingebjørg 😇". Without unidecode, this comes out nicely as "𐑥𐑲 𐑯𐑱𐑥 𐑦𐑟 Ingebjørg✢ 😇", but with unidecode "𐑥𐑲 𐑯𐑱𐑥 𐑦𐑟 Ingebjorg✢". Oh no... -1 to using unidecode.
I would like to come up with something that solves the problems unidecode does, without introducing its downsides, but I suspect I may be lacking the information to do this well, so I am here asking for more info. Hence:
What was the original reason to introduce unidecode? Are there issues it solves other than this right single quotation and apostrophe issue I stumbled on?
Is there a reason it isn't used in the html branch of latin2shaw? It seems to me that any issues it addresses ought to be addressed in both branches.