Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unescape JSON Unicode sequences #1175

Open
3 tasks done
diamondsw opened this issue Jun 19, 2024 · 1 comment · May be fixed by #1087
Open
3 tasks done

Unescape JSON Unicode sequences #1175

diamondsw opened this issue Jun 19, 2024 · 1 comment · May be fixed by #1087
Labels
enhancement New feature or request triage

Comments

@diamondsw
Copy link

What type of request is this?

New tool idea

Clear and concise description of the feature you are proposing

Convert between unicode symbols (💯) and JSON escape sequences (\uD83D\uDCAF).

Is their example of this tool in the wild?

https://magictool.ai/tool/unicode-decoder-encoder/

Additional context

Honestly not sure if this should be a separate tool, added to "Text to Unicode", be made part of a JSON tool, etc, since it touches so many concepts (Unicode, JSON, escaping, encoding, etc).

Validations

  • Check the feature is not already implemented in the project.
  • Check that there isn't already an issue that request the same feature to avoid creating a duplicate.
  • Check that the feature can be implemented in a client side only app (IT-Tools is client side only, no server).
@lionel-rowe
Copy link

lionel-rowe commented Jun 20, 2024

My PR #1087 will fix this if merged — it adds more conversion options to the text-to-unicode tool. One of those options is UTF-16 Code Units, of which 💯\uD83D\uDCAF is an example.

{
text: 'a 💩 b',
skipAscii: true,
results: {
// eslint-disable-next-line unicorn/escape-case
fullUnicode: String.raw`a \u{1f4a9} b`,
// eslint-disable-next-line unicorn/escape-case
utf16: String.raw`a \ud83d\udca9 b`,
hexEntities: String.raw`a 💩 b`,
decimalEntities: String.raw`a 💩 b`,
},
},

Note that it isn't accurate to call this "JSON Unicode", as the best* way of encoding 💯 in JSON is as a literal 💯 Unicode character. For example, JavaScript's JSON.stringify just uses the literal characters where possible, and JSON.parse accepts them with no problem:

JSON.stringify({ '💯': '💯' })
// result: '{"💯":"💯"}'

JSON.parse('{"💯":"💯"}')
// result: { '💯': '💯' }

Meanwhile, encoding it as a surrogate pair may yield correct results when parsed:

JSON.parse(String.raw`{"\uD83D\uDCAF":"\uD83D\uDCAF"}`)
// result: { '💯': '💯' }

But the JSON spec doesn't require that parsers convert the surrogate pair to its corresponding code point, so it's not guaranteed that even fully spec-compliant parsers will "do the right thing" here.

*For most use cases. If you can't guarantee that the consumer will use UTF-8 as required by RFC 8259, but you can guarantee that the consumer will handle surrogate pairs correctly, then it'd make sense to use the surrogate pair. But that seems like a pretty niche situation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants