Exception when clean=True in search_for_connected_sentences #91

Open
balazik opened this issue Feb 16, 2021 · 1 comment

balazik commented Feb 16, 2021

Describe the bug
Segmenter raises the exception "bad escape (end of pattern) at position" when it is initialized with clean=True and encounters text like "etc.Png,Jpg,.\" (a word/token that contains a backslash, here as its last character).

The exception is raised in:
module: cleaner.py
class: Cleaner
method: search_for_connected_sentences
line:

txt = re.sub(re.escape(word), new_word, txt)

To Reproduce
Steps to reproduce the behavior:

import pysbd

# This is a simplified example; the original text contained names, so I changed it to image formats.
# The word is an abbreviation with a dot, followed by an uppercase letter, and it ends with a backslash.
sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\"
sentences = sentencer.segment(txt)

Expected behavior
The text should be segmented as is, without throwing an exception.
A workaround to see the output is to escape the backslash in the input:

sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\\\"
sentences = sentencer.segment(txt)

Expected output:

['etc.', 'Png,Jpg,.', '\\']

Possible solution
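Replace txt = re.sub(re.escape(word), new_word, txt)
with txt = txt.replace(word, new_word)
re.escape only protects the pattern argument; re.sub still interprets backslashes in the replacement string new_word, and a trailing backslash there is an incomplete escape. A plain str.replace avoids the pitfalls of regular expressions (like escaping) and is generally faster.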
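
The suggested change can be sketched with the standard library alone, without pysbd. The values of word, new_word, and txt below are hypothetical stand-ins: new_word is assumed to be word with a space inserted after the first period, which is a guess at what the cleaner rule produces and not taken from the pysbd source. The sketch also shows a second option, passing a callable to re.sub so the replacement is used verbatim, in case keeping the regex call is preferred.

```python
import re

# Hypothetical stand-ins for the values seen inside search_for_connected_sentences.
word = "etc.Png,Jpg,.\\"        # ends with a single backslash
new_word = "etc. Png,Jpg,.\\"   # assumption: the rule inserts a space after "etc."
txt = "See etc.Png,Jpg,.\\"

# Option 1: literal substring replacement, no regex parsing at all.
fixed_replace = txt.replace(word, new_word)

# Option 2: keep re.sub, but pass a callable so the returned string is
# inserted as-is and never parsed as a replacement template.
fixed_callable = re.sub(re.escape(word), lambda _match: new_word, txt)

print(fixed_replace)
print(fixed_replace == fixed_callable)
```

Both options treat the backslash as an ordinary character. Option 1 is the smaller change; option 2 may be useful if the pattern ever needs to be a real regular expression.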

Additional context
We originally parse small text files (in Slovak) without any special treatment to build a large sentence-segmented corpus. The example was crafted specifically to reproduce the behavior with the English parser. I know that this backslash combination is rare in English, but it does occur in Slovak articles when you process vast amounts of text.


kevmurray commented May 6, 2021

Additional Case:

Also ran into this in Spanish text with the string 1.C\ ... I assume it is the same problem:

re.error: bad escape (end of pattern) at position 4
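
A minimal reproduction of this case with only the re module might look like the sketch below. The new_word value is an assumption (the original token with a space inserted after the period), not something taken from pysbd; the point is only that a replacement string ending in a backslash makes re.sub fail even though the pattern itself is escaped.

```python
import re

# Hypothetical stand-ins for the values inside search_for_connected_sentences.
word = "1.C\\"        # the four characters: 1 . C \
new_word = "1. C\\"   # assumption: same token with a space after the period
txt = "1.C\\"

try:
    # The pattern is escaped, but new_word is still parsed as a
    # replacement template, where a trailing backslash is invalid.
    re.sub(re.escape(word), new_word, txt)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```

If the assumed new_word is right, its backslash sits at index 4, which would be consistent with the position reported in the error above and would point at the replacement string, not the pattern, as the part that fails to parse.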
