Infinite loop? #79

Open
matthen opened this issue Oct 23, 2020 · 2 comments
Labels: bug, edge-cases (update rules to account for the edge cases)

Comments

matthen (Contributor) commented Oct 23, 2020

The snippet below seems to hang forever:

import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)
text = "..[111 111 111 111 111 111 111 111 111 111]"
segmenter.segment(text)

When I interrupt it, I get this traceback:

Traceback (most recent call last):
  File "check.py", line 5, in <module>
    segmenter.segment(text)
  File ".../python3.7/site-packages/pysbd/segmenter.py", line 87, in segment
    postprocessed_sents = self.processor(text).process()
  File ".../python3.7/site-packages/pysbd/processor.py", line 37, in process
    self.replace_periods_before_numeric_references()
  File ".../python3.7/site-packages/pysbd/processor.py", line 141, in replace_periods_before_numeric_references
    r"∯\2\r\7", self.text)
  File ".../python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
KeyboardInterrupt

This is pysbd version 0.3.3 on Python 3.7.7.

Could it be entering an infinite loop?

(I found this bug by applying pysbd to Wikipedia. On this article, https://en.wikipedia.org/wiki/Clojure, it tripped up on "...[484 216 622 139 651 592 379 228 242 355]".)

nipunsadvilkar added the edge-cases label (update rules to account for the edge cases) on Feb 11, 2021
nipunsadvilkar (Owner) commented

It's due to catastrophic backtracking in NUMBERED_REFERENCE_REGEX. I need to dig into the details.
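
For readers unfamiliar with the term, here is a minimal sketch of catastrophic backtracking. It does not use pysbd's actual NUMBERED_REFERENCE_REGEX; the two patterns below are stand-ins chosen for illustration. A quantifier nested inside another quantifier lets the engine split a run of digits in exponentially many ways before it can report that no match exists.

```python
import re
import time

# Illustrative patterns only; neither is taken from pysbd.
NESTED = re.compile(r"(\d+)+\]")  # nested quantifiers: exponential on failure
FLAT = re.compile(r"\d+\]")       # accepts the same strings, linear on failure


def time_failed_match(pattern, text):
    """Return (match result, elapsed seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


timings = {}
for n in (10, 14, 18):
    text = "1" * n  # digits with no closing bracket, so the match must fail
    result, elapsed = time_failed_match(NESTED, text)
    timings[n] = elapsed
    print(f"n={n:2d} matched={result is not None} seconds={elapsed:.5f}")

# The flat pattern rejects a much longer input almost immediately.
result, elapsed = time_failed_match(FLAT, "1" * 5000)
print(f"flat pattern, n=5000: matched={result is not None} seconds={elapsed:.5f}")
```

Each extra digit roughly doubles the time the nested pattern needs to fail, so an input of a few dozen characters can look like an infinite loop even though the search would eventually finish.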

ajar19 commented Sep 4, 2022

Hi @nipunsadvilkar, we faced the same issue with another text:
text = ......[289852000000260698,289852000000260744

Any update on this, please?
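
Until the regex itself is fixed, one possible stopgap is to put a time limit around the call. The sketch below is not part of pysbd; `time_limit` and `segment_or_none` are hypothetical helper names. It assumes a Unix system and that the call runs in the main thread, because it relies on `SIGALRM`. It also relies on the regex engine honoring pending signals while it runs, which the `KeyboardInterrupt` in the traceback above suggests it does.

```python
import re
import signal
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the enclosed block runs longer than `seconds`.

    Assumption: Unix only, main thread only (uses SIGALRM).
    """
    def _on_alarm(signum, frame):
        raise TimeoutError(f"gave up after {seconds} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)   # restore the old handler


def segment_or_none(segment, text, seconds=5.0):
    """Call segment(text); return None if it exceeds the time limit."""
    try:
        with time_limit(seconds):
            return segment(text)
    except TimeoutError:
        return None


def slow_stand_in(text):
    """Stand-in for a hanging call: an illustrative backtracking-prone regex."""
    return re.match(r"(\d+)+\]", text)


print(segment_or_none(slow_stand_in, "1" * 60, seconds=0.2))
```

With pysbd installed, the call would presumably look like `segment_or_none(segmenter.segment, text)`. Inputs that time out can then be logged and skipped, or split with a simpler fallback.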
