Deserialization fails on invalid unicode code point #2259

Open
cmanallen opened this issue Jan 29, 2024 · 0 comments

Version python-rapidjson==1.14.
To reproduce: import rapidjson; rapidjson.loads('"\ud83c"')
Error message: UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 1: surrogates not allowed

\ud83c is a lone high surrogate, not a valid Unicode scalar value, so it cannot be encoded as UTF-8. Deserialization currently fails on it. This is unusual compared to other JSON parsers, which return the unpaired surrogate as-is instead of raising.

Consider Python's built-in json module, which returns the following for a valid code point and for an invalid one:

>>> json.loads('"\u266a"')
'♪'
>>> json.loads('"\ud83c"')
'\ud83c'

By contrast, rapidjson handles the first input but raises on the second:

>>> rapidjson.loads('"\u266a"')
'♪'
>>> rapidjson.loads('"\ud83c"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 1: surrogates not allowed
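
A minimal sketch, using only the standard library, of what the error message seems to point at. The Python literal in the reproduction already contains the surrogate character itself (Python resolves the escape before any parser sees it), and "position 1" matches its index in that string. That the input str is encoded to strict UTF-8 before parsing is an assumption inferred from the error text, not something confirmed from the rapidjson source.

```python
import json

# The Python literal '"\ud83c"' is already a 3-character str:
# quote, U+D83C, quote. Python resolves the escape, not the JSON parser.
doc = '"\ud83c"'
assert len(doc) == 3 and doc[1] == "\ud83c"

# A lone surrogate cannot be encoded as strict UTF-8. This raises the same
# error text as in the report, pointing at index 1 of the input string.
try:
    doc.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# The standard library parser works on str directly and passes the
# surrogate through unchanged.
parsed = json.loads(doc)

# Feeding the JSON escape sequence itself (raw string, 8 characters)
# yields the same one-character result.
parsed_from_escape = json.loads(r'"\ud83c"')
```

Both json.loads calls produce a str holding the single unpaired surrogate, so the two parsers differ in how they treat it, not in whether the input contains one.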