kRSUnicode bug #315

Closed
tony opened this issue Feb 26, 2024 · 7 comments
@tony
Member

tony commented Feb 26, 2024

That was a really fast response :D

This is actually my bad; the latest unihan_etl already has a fix for this in place, and I mistakenly thought I had updated.

The issue is a typo in the kRSUnicode field for 亀: https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E4%BA%80. It has two apostrophes, which does not follow the syntax specified in the standard. unihan_etl has already updated its parsing to allow the second apostrophe.

I did have to update my code for some unihan_etl changes, but nothing crazy.

See also: #233 (comment)

tony added the bug label Feb 26, 2024
@tony
Member Author

tony commented Feb 26, 2024

This is fixed by the latest unihan_etl release (we're around 0.33.1 now)

tony closed this as completed Feb 26, 2024
@garfieldnate

@tony Just a heads-up that you'll likely need to handle 3 apostrophes here when Unicode 16.0 comes out. Details on why here: https://www.unicode.org/review/pri483/feedback.html#ID20240328172102

@tony
Member Author

tony commented Apr 1, 2024

@garfieldnate Thank you for catching that! Adding #318 for supporting this.

I will keep an eye on https://www.unicode.org/reports/tr38/proposed.html as well (I suppose that's when we know for sure it'll be 3 apostrophes?)

@garfieldnate

I believe so, yes. I didn't actually catch this change myself; Ken Lunde wrote me to let me know, which was very kind :D

@garfieldnate

garfieldnate commented Apr 30, 2024

@tony Ken Lunde tells me that the change has been accepted by the group, and he has updated the docs here: https://www.unicode.org/reports/tr38/proposed.html#N101E4. The 16.0 beta will be out with this change on 5/21.

@tony
Member Author

tony commented May 6, 2024

@garfieldnate Thank you

Aside:

  • Is there any way to access the proposed data / Unihan.zip file now? Or does that have to wait until the beta is released on 5/21?
    • If not, are there any examples of what the new lines look like, so I can add them to our tests?

@garfieldnate

No problem!

The note from Ken Lunde:

When the Unicode Version 16.0 Beta review begins on 05/20, it will include an updated Unihan database. For this particular syntax change, only the following two characters (in Extension H) will be affected:

U+31DE5 kRSUnicode 117.6 212'''.0
U+31E22 kRSUnicode 118.11 212'''.6

When Extension J is added, which will probably be in Version 17.0 (2025), four additional characters will include three apostrophes after the radical number (212).
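
For anyone who wants to turn the two sample lines above into a test case, here is a rough sketch. It is not unihan_etl's actual parser: the regular expression, `parse_rs_unicode`, and the dictionary keys are hypothetical names chosen for illustration, and the pattern only assumes what this thread describes (a radical number, up to three apostrophes, a dot, then the residual stroke count).

```python
import re

# Assumed shape of a single kRSUnicode value: radical number, 0-3 apostrophes,
# a dot, then an optionally negative residual stroke count. This pattern is an
# illustration based on the thread, not a copy of the normative syntax in UAX #38.
RS_PATTERN = re.compile(
    r"^(?P<radical>[1-9][0-9]{0,2})(?P<marks>'{0,3})\.(?P<strokes>-?[0-9]{1,2})$"
)


def parse_rs_unicode(field: str) -> list[dict]:
    """Parse a whitespace-separated kRSUnicode field (hypothetical helper)."""
    entries = []
    for value in field.split():
        match = RS_PATTERN.match(value)
        if match is None:
            raise ValueError(f"unrecognized kRSUnicode value: {value!r}")
        entries.append(
            {
                "radical": int(match.group("radical")),
                "apostrophes": len(match.group("marks")),
                "strokes": int(match.group("strokes")),
            }
        )
    return entries


# The two sample lines quoted in the note above.
sample_lines = [
    "U+31DE5 kRSUnicode 117.6 212'''.0",
    "U+31E22 kRSUnicode 118.11 212'''.6",
]

parsed = {}
for line in sample_lines:
    # Split into code point, field name, and the rest of the line (the values).
    codepoint, field_name, value = line.split(maxsplit=2)
    parsed[codepoint] = parse_rs_unicode(value)
```

Keeping the apostrophe count as an integer, rather than a boolean "simplified" flag, means that a further change to the allowed number of apostrophes would only require adjusting the pattern.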
