API request: String Homoglyph.toASCII(String) #6

Open
am11 opened this issue Sep 9, 2019 · 3 comments

@am11

am11 commented Sep 9, 2019

Please provide a toASCII API that maps each character to its closest ASCII look-alike, where one exists, and returns the resulting string. For example, the following should hold:

Homoglyph homoglyph = HomoglyphBuilder.build();
assertEquals("The quick brown fox jumps over the lazy dog", 
    homoglyph.toASCII("Τһе ԛυіϲκ Ьгоѡɴ ғох јυⅿрѕ оⅴег τһе ⅼаzу ԁоɡ"));

This is useful when we want to run complex regex rules against an (approximate) ASCII representation of the text. Building the equivalent of a complex regex tree with the Homoglyph.search() API is not convenient, at least in certain cases.
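
To make the use case concrete, here is a minimal sketch of "fold to ASCII, then run an ordinary regex". It does not use the library: the class AsciiFoldSketch, its toAscii method, and the five-entry table are hypothetical and exist only for illustration. A real implementation would presumably draw on the library's full homoglyph data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of the library's API.
final class AsciiFoldSketch {
    // Tiny hand-written table (assumption): code point -> ASCII look-alike.
    private static final Map<Integer, Character> TABLE = new HashMap<>();
    static {
        TABLE.put(0x0458, 'j'); // CYRILLIC SMALL LETTER JE
        TABLE.put(0x03C5, 'u'); // GREEK SMALL LETTER UPSILON
        TABLE.put(0x217F, 'm'); // SMALL ROMAN NUMERAL ONE THOUSAND
        TABLE.put(0x0440, 'p'); // CYRILLIC SMALL LETTER ER
        TABLE.put(0x0455, 's'); // CYRILLIC SMALL LETTER DZE
    }

    // Replaces every code point found in TABLE; everything else is kept as-is.
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            Character ascii = TABLE.get(cp);
            if (ascii != null) {
                out.append(ascii.charValue());
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // An ordinary ASCII regex can then be applied to the folded text.
    static boolean matches(Pattern pattern, String text) {
        return pattern.matcher(toAscii(text)).find();
    }
}
```

With this sketch, `AsciiFoldSketch.matches(Pattern.compile("\\bjumps\\b"), "\u0458\u03C5\u217F\u0440\u0455 over")` finds a match, while the same pattern applied to the unfolded input does not.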

@codygray

This would be extremely useful to me. I was looking for a canonicalize function that would do essentially the same thing. I think that, perhaps, canonicalize is a better name than toASCII, since the "base" characters may not be strictly ASCII.

@codebox
Owner

codebox commented Sep 11, 2022

This change is probably more complicated than it first appears. I don't think a single 'canonical' set of characters could be defined that would make sense for everyone. It would vary depending on the language of the user and on the expected content of the text (for example, should the digit '1' be replaced with the letter 'l' or left as it is?). I think the library would have to allow the user to specify what they consider canonical. In addition, it isn't obvious what the correct behaviour should be if letters within the canonical set are homoglyphs of each other. For example, if we just say that the 26 letters of the English alphabet are canonical, do we change the digit 1 to lower-case 'L' or to capital 'I'?

I welcome any suggestions regarding a good way to handle this.
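
One possible reading of "let the user specify what is canonical" is sketched below. The caller supplies the canonical characters in priority order, and the first canonical character found in a homoglyph group wins, which settles the 'l' versus 'I' question by user choice. The class CanonicalizerSketch, its constructor parameters, and the group strings are all hypothetical; nothing here is the library's API or a proposal the maintainer has endorsed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the library's API.
final class CanonicalizerSketch {
    // Non-canonical code point -> chosen canonical code point.
    private final Map<Integer, Integer> replacements = new HashMap<>();

    // homoglyphGroups: each string lists code points that look alike (assumed input).
    // canonicalByPriority: the user's canonical characters, most preferred first.
    CanonicalizerSketch(List<String> homoglyphGroups, String canonicalByPriority) {
        for (String group : homoglyphGroups) {
            // Pick the highest-priority canonical character present in this group.
            int target = -1;
            for (int candidate : canonicalByPriority.codePoints().toArray()) {
                if (group.indexOf(candidate) >= 0) {
                    target = candidate;
                    break;
                }
            }
            if (target < 0) {
                continue; // no canonical member: leave the whole group untouched
            }
            for (int member : group.codePoints().toArray()) {
                // Characters that are themselves canonical are never rewritten.
                if (canonicalByPriority.indexOf(member) < 0) {
                    replacements.putIfAbsent(member, target);
                }
            }
        }
    }

    String canonicalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(replacements.getOrDefault(cp, cp)));
        return out.toString();
    }
}
```

Under these assumptions, a group "1lI" with lower-case letters listed first turns "he1p" into "help", while listing 'I' first turns it into "heIp". A caller who omits digits from the canonical set gets digits rewritten; a caller who includes them keeps them.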

@gdude2002

gdude2002 commented Mar 14, 2023

Running into this - we'd find it very useful to be able to regex-match including homoglyphs, and normalisation is definitely the only way to handle this.

My suggestion would be to prioritise normalising to letters - the main use-case for a library like this is automated chat moderation; it's unlikely for numbers to be useful matches for problematic content (in my opinion).

Of course, this doesn't solve the latter part of your question - I think the only real solution there is to support generating permutations instead; then they can all be tested.
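
The permutation idea can be sketched as follows: every ambiguous character expands to each of its candidate replacements, and the pattern is tested against every resulting string. The class PermutationSketch, its methods, and the candidate map are hypothetical and not part of the library. Note that the number of variants grows exponentially with the number of ambiguous characters, so a real implementation would likely need a cap on input length or variant count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of the library's API.
final class PermutationSketch {
    // candidates: ambiguous code point -> string of possible replacements (assumed input).
    // Code points without an entry are kept unchanged.
    static List<String> expand(String text, Map<Integer, String> candidates) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (int cp : text.codePoints().toArray()) {
            String self = new String(Character.toChars(cp));
            int[] options = candidates.getOrDefault(cp, self).codePoints().toArray();
            // Extend every prefix built so far with every option for this position.
            List<String> next = new ArrayList<>(results.size() * options.length);
            for (String prefix : results) {
                for (int option : options) {
                    next.add(prefix + new String(Character.toChars(option)));
                }
            }
            results = next;
        }
        return results;
    }

    // True if the pattern is found in at least one variant of the text.
    static boolean anyVariantMatches(String text, Map<Integer, String> candidates,
                                     Pattern pattern) {
        for (String variant : expand(text, candidates)) {
            if (pattern.matcher(variant).find()) {
                return true;
            }
        }
        return false;
    }
}
```

For example, with the digit '1' mapped to the candidates "lI", the input "11" expands to "ll", "lI", "Il" and "II", and "he11o" matches the pattern `hello` through its "hello" variant.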
