
Why use unidecode, and why only on non-html content? #73

Open
intarga opened this issue Jun 13, 2024 · 2 comments

intarga commented Jun 13, 2024

I was trying to track down a discrepancy between latin2shaw's handling of non-html and html content: "don’t" was transliterated incorrectly as "𐑛𐑵n’t" in an html document, but correctly as "𐑛𐑴𐑯𐑑" on its own. I realised this is because spaCy doesn't understand words with U+2019 RIGHT SINGLE QUOTATION MARK as contractions, but does handle the ASCII apostrophe U+0027 that unidecode converts it to. +1 to using unidecode!
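
For illustration, unidecode's handling of the apostrophe is easy to see in isolation (a minimal check, using nothing beyond the unidecode package itself):

from unidecode import unidecode

# U+2019 RIGHT SINGLE QUOTATION MARK is folded to the ASCII apostrophe U+0027
print(unidecode("don’t"))  # don't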

To get latin2shaw to handle these correctly in an html document, I tried feeding the text on the html branch through unidecode too, but ended up with some unintended consequences... Let's take another example: "My name is Ingebjørg 😇". Without unidecode, this comes out nicely as "𐑥𐑲 𐑯𐑱𐑥 𐑦𐑟 Ingebjørg✢ 😇", but with unidecode "𐑥𐑲 𐑯𐑱𐑥 𐑦𐑟 Ingebjorg✢". Oh no... -1 to using unidecode.
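
The lossy side is just as easy to reproduce with the same package:

from unidecode import unidecode

# ø is flattened to its nearest ASCII approximation, and the emoji is dropped entirely
print(unidecode("My name is Ingebjørg 😇"))  # 'My name is Ingebjorg ' (trailing space, no emoji)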

I would like to come up with something that solves the problems unidecode solves, without introducing its downsides, but I suspect I may be lacking the information to do this well, so I am here asking for more info. Hence:

  1. What was the original reason to introduce unidecode? Are there issues it solves other than the right-single-quotation-mark issue I stumbled on?

  2. Is there a reason it isn't used in the html branch of latin2shaw? It seems to me that any issues it addresses ought to be addressed in both branches.

@Shavian-info
Owner

The latin2shaw script is part of a broader suite of scripts I use locally. They are mostly hacked together for my own use and as part of learning Python. I have a separate script for cleaning up HTML before passing it to latin2shaw. My scripts aren't really worth uploading, but I've included below the code I use to clean HTML files. The script is called from a Flask application, using the format http://127.0.0.1:5000/url2shaw?url=<insert URL here> (a rough sketch of such a route follows the code).

import requests
from urllib.parse import urlparse
import re
from unidecode import unidecode
from bs4 import BeautifulSoup


def clean_html(url):
    # Fetch the page; on failure, return the error message as the page body.
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(e)
        return f"{e}"

    response.encoding = 'utf-8'
    text = unidecode(response.text)

    # Keep the directory part of the path (everything up to the last "/"),
    # used below for resolving relative src attributes.
    parsed_url = urlparse(url)
    stripped_path_elements = re.split("(/)", parsed_url.path)[0:-1]
    stripped_path = ''.join(stripped_path_elements)

    soup = BeautifulSoup(text, features="html.parser")
    text = soup.prettify()

    # Inject a <base> tag and the transliterator's stylesheet into <head>.
    text = text.replace(
        '<head>',
        f'<head>\n<base href="{parsed_url.scheme}://{parsed_url.netloc}">\n'
        f'<link href="static/url2shaw.css" rel="stylesheet" />',
    )

    # Rewrite relative src attributes (anything not starting with "/" or
    # "http") into absolute URLs on the original host.
    def replace_src(match):
        reftype = match.group(1)
        initial_char = match.group(2)
        return f'{reftype}="{parsed_url.scheme}://{parsed_url.netloc}/{stripped_path}{initial_char}'

    pattern = r'(src)="([^/])(?!ttp)'
    text = re.sub(pattern, replace_src, text)

    # Route links back through the local url2shaw endpoint: first absolute
    # http(s) links, then root-relative ones.
    text = re.sub(r'<a([^>]+)http([^>]+)>',
                  r'<a \1http://127.0.0.1:5000/url2shaw?url=http\2>',
                  text)
    text = re.sub(r'<a([^>]+)href="/([^>]+)>',
                  r'<a\1href="http://127.0.0.1:5000/url2shaw?url=' + parsed_url.scheme + r'://' + parsed_url.netloc + r'/\2>',
                  text)

    return text
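
For reference, a minimal sketch of the kind of Flask route that could wire this up; the handler below is an assumption for illustration, not the actual application:

from flask import Flask, request

app = Flask(__name__)


@app.route("/url2shaw")
def url2shaw():
    # Hypothetical wrapper: fetch and clean the page, assuming clean_html
    # from the script above is in scope.
    url = request.args.get("url", "")
    cleaned = clean_html(url)
    # ... the cleaned HTML would then be passed through latin2shaw ...
    return cleaned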

@intarga
Author

intarga commented Jun 19, 2024

I see, so you do use unidecode on both.

Since you don't mention any other reasons for using unidecode, and I haven't encountered any apart from the one I mentioned, I'm going to assume that's the only thing unidecode is needed for. In that case, I think this simple regex on the latin text solves the problem better:

text_part = re.sub(r"\b’\b", "'", text_part)

This preserves non-ASCII characters, and removes the need for unidecode, smartypants, and the BeautifulSoup parse, saving us a bunch of code and two dependencies (not three, because I think we should keep BeautifulSoup for another reason: simplifying the html processing).
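
For instance, on a made-up snippet:

import re

text_part = "don’t tell Ingebjørg’s friends 😇"
print(re.sub(r"\b’\b", "'", text_part))
# don't tell Ingebjørg's friends 😇  (ø and the emoji survive untouched)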

Once the packaging PR gets merged, I'll post a PR for this with relevant test cases.
