Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular expression NER #118

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from
Open

Conversation

francolq
Copy link
Contributor

try:
m = next(i)
start, end = m.span()
# FIXME: do not count from the beggining
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do with this FIXME note? Is it already fixed and you just forgot to remove the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my bad. This is not a FIXME, it is at most a TODO, because a small optimization can be done here. I think the comment can be safely removed.

# preprocess the regular expression
regexp = re.sub(r'\s', '', regexp)
# replace < and > only if not double (<< or >>):
# FIXME: avoid matching \< and \>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question re FIXME here.
Can't be solved before merging into develop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost same answer here. My bad to call this a FIXME. It would be an enhancement to allow escaping '<' and '>'. The comment can be removed.


def __init__(self, tokens):
# replace < and > inside tokens with \< and \>
_raw = '><'.join(w.replace('<', '\<').replace('>', '\>') for w in tokens)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no completely sure, but would it be a problem if there is a token with \< (or \>) inside it?

def run_ner(self, doc):
entities = []
tokens = doc.tokens
searcher = TokenSearcher(tokens)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TokenSearcher is implemented as a class but it is stateless in a practical sense and it is used more like a function than a class.

import re
import codecs

from nltk.text import TokenSearcher as NLTKTokenSearcher
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import looks unused since (apparently) no method from NLTKTokenSearcher is used.

token_start = self._raw[:start].count('><')
token_end = self._raw[:end].count('><')
yield MatchObject(m, token_start, token_end)
except:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This try...except is dangerous because it silently hides any error that can happen inside the loop.

Why not to use for m in i: instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants