
Some spans are being skipped by spacy-huggingface-pipelines, resulting in poor anonymisation #1262

Open
aayushisanghi opened this issue Jan 23, 2024 · 20 comments


Hi! I've been working on a transformer-based Presidio pipeline, and I noticed it was performing rather poorly. Upon inspecting the logs, I found this particular warning:
UserWarning: Skipping annotation, {'entity_group': 'PASSWORD', 'score': 0.25415105, 'word': '##bmh78', 'start': 157, 'end': 1623} is overlapping or can't be aligned for doc 'Standardized tests will be...'

The root cause of this issue is this line in the spacy-huggingface-pipelines package. I know this isn't directly Presidio related, but is there any configuration change I can make in Presidio, to prevent these spans from being skipped? Or is there another issue I'm not seeing here?

I followed the online tutorial, and am using a publicly available dataset to test different models. I'm not sure why this could be happening.

Any help to debug this will be super helpful, thanks!
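(For anyone else hitting this: the skip happens when the model's character offsets for a subword span, like '##bmh78' above, don't line up with spaCy's token boundaries. A minimal, self-contained sketch of the idea, with hypothetical token offsets and a toy `align_span` helper; the real logic lives in spaCy's `Doc.char_span` and spacy-huggingface-pipelines, not in this function:)

```python
# Illustrative sketch of strict vs. expand span alignment.
# (token_start_char, token_end_char) for a toy tokenization of
# the text "password: xbmh78"
tokens = [(0, 9), (10, 16)]  # "password:", "xbmh78"

def align_span(start, end, tokens, mode="strict"):
    """Return the (first, last+1) token indices covering [start, end), or None.

    'strict' requires the span to match token boundaries exactly;
    'expand' widens the span to the nearest enclosing tokens.
    """
    covering = [i for i, (s, e) in enumerate(tokens) if e > start and s < end]
    if not covering:
        return None
    first, last = covering[0], covering[-1]
    if mode == "strict" and (tokens[first][0] != start or tokens[last][1] != end):
        # A subword span like "##bmh78" (chars 11-16) starts mid-token,
        # so strict alignment fails and the annotation is dropped.
        return None
    return (first, last + 1)

print(align_span(11, 16, tokens, mode="strict"))  # None -> skipped
print(align_span(11, 16, tokens, mode="expand"))  # (1, 2) -> kept
```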


VMD7 commented Jan 24, 2024

Hi @aayushisanghi
Could you please recreate the scenario you mentioned and share an example with code?


omri374 commented Jan 28, 2024

@aayushisanghi, as @VMD7 mentioned, a reproducible example would definitely help. Thanks!


omri374 commented Feb 11, 2024

@aayushisanghi, we'd be very interested to know more about this issue, especially as the result is poor anonymization. Any feedback would be valuable.

@thomas-moulin

thomas-moulin commented Feb 13, 2024

Hello! I am having the same issue.
I am using a specific transformer model to detect PII in French text.
With this approach it seems able to detect the PII, but for some reason it only outputs it in a warning log.

(screenshot of the warning log, 2024-02-13)

Here is my code:

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForTokenClassification
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Pre-download the model so Presidio can load it locally
transformers_model = "Jean-Baptiste/camembert-ner-with-dates"
snapshot_download(repo_id=transformers_model)
AutoTokenizer.from_pretrained(transformers_model)
AutoModelForTokenClassification.from_pretrained(transformers_model)

conf_file = "/Users/thomasmoulin/Downloads/config_presidio_fr_transformer.yml"

provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["fr"],
)

result = analyzer.analyze(text="Je m'appelle Thomas Moulin", language="fr")


omri374 commented Feb 13, 2024

Thanks @thomas-moulin. Would you mind sharing your conf_file? Or is it standard?

@thomas-moulin

thomas-moulin commented Feb 13, 2024

Thanks for the quick reply! Sure!

nlp_engine_name: transformers
models:
  - lang_code: fr
    model_name:
      spacy: fr_core_news_sm
      transformers: Jean-Baptiste/camembert-ner-with-dates

ner_model_configuration:
  labels_to_ignore:
    - O
  aggregation_strategy: simple # "simple", "first", "average", "max"
  stride: 16
  alignment_mode: strict # "strict", "contract", "expand"
  model_to_presidio_entity_mapping:
    PER: PERSON
    LOC: LOCATION
  low_confidence_score_multiplier: 0.4


omri374 commented Feb 13, 2024

@thomas-moulin is the warning the only thing that gets outputted? I tried to reproduce this and got [type: PERSON, start: 13, end: 26, score: 0.992917537689209]

@thomas-moulin

Yes, in my case only the warning gets outputted. The result variable is an empty list.

@thomas-moulin

(screenshot of the empty result, 2024-02-13)

@omri374
Copy link
Contributor

omri374 commented Feb 13, 2024

If you change alignment_mode: strict to alignment_mode: expand, does it change the outcome?
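(For reference, that change would be a one-line edit to the ner_model_configuration section of the config shared above; "expand" widens misaligned model spans to the nearest token boundaries instead of dropping them:)

```yaml
ner_model_configuration:
  # was: alignment_mode: strict
  alignment_mode: expand # "strict", "contract", "expand"
```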

@thomas-moulin

Yes @omri374, it works!
Thank you very much for your help :)


omri374 commented Feb 13, 2024

Great. Leaving the issue open, as there could still be corner cases with wrong output.

@thomas-moulin

Yes! On longer inputs (OCR on French resumes) I still get some warnings, but not as many as before.


omri374 commented Feb 13, 2024

Warnings are inevitable (they come from spacy-huggingface-pipelines), but I'd be interested to see if there are missing predictions.
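(One way to tell harmless warnings apart from actual missing predictions is to diff the analyzer output against a small set of hand-labeled expected spans. A rough, hypothetical helper, not part of the Presidio API; here the expected spans and the stubbed results are made up for illustration, in practice `results` would come from `analyzer.analyze`:)

```python
from types import SimpleNamespace

def missing_spans(expected, results):
    """Return the expected (entity_type, start, end) triples that no
    analyzer result covers. `results` is any iterable of objects with
    entity_type/start/end attributes, like Presidio's RecognizerResult."""
    missing = []
    for exp_type, exp_start, exp_end in expected:
        covered = any(
            r.entity_type == exp_type and r.start <= exp_start and r.end >= exp_end
            for r in results
        )
        if not covered:
            missing.append((exp_type, exp_start, exp_end))
    return missing

# Toy check with stubbed results (real ones come from analyzer.analyze)
results = [SimpleNamespace(entity_type="PERSON", start=13, end=26)]
expected = [("PERSON", 13, 26), ("LOCATION", 40, 45)]
print(missing_spans(expected, results))  # [('LOCATION', 40, 45)]
```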


fml09 commented Mar 19, 2024

Hello @omri374, @VMD7 . I am currently experiencing the same issue. Below is the code that can reproduce the problem.

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine

model_config = [
    {
        "lang_code": "en",
        "model_name": {
            "spacy": "en_core_web_sm",
            "transformers": "lakshyakh93/deberta_finetuned_pii",
        },
    }
]

mapping = dict(
    USERNAME="USERNAME",
    EMAIL="EMAIL",
    KEY="KEY",
    PASSWORD="PASSWORD",
    IP_ADDRESS="IP_ADDRESS",
    FIRSTNAME="FIRSTNAME",
    LASTNAME="LASTNAME",
    MIDDLENAME="MIDDLENAME",
    IPV4="IP_ADDRESS",
    IPV6="IP_ADDRESS",
    IP="IP_ADDRESS",
    PHONE_NUMBER="PHONE_NUMBER",
    SSN="SSN",
    ACCOUNTNUMBER="ACCOUNTNUMBER",
    CREDITCARDNUMBER="CREDITCARDNUMBER",
    CREDITCARDISSUER="CREDITCARDISSUER",
    CREDITCARDCVV="CREDITCARDCVV",
)
ner_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
)
nlp_engine = TransformersNlpEngine(models=model_config, ner_model_configuration=ner_model_configuration)
engine = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=[
        "en",
    ],
)


print(
    engine.analyze(
        "My name is Clara and I live in Berkeley. this is my ip address : 175.5.0.1. this is my password: sad$f-j?ss11FF. credit card is 1231-1231-1451-2134",
        language="en",
    )
)
Error:
/spacy_huggingface_pipelines/token_classification.py:129: UserWarning: Skipping annotation, {'entity_group': 'CREDITCARDNUMBER', 'score': 0.9902746, 'word': '31-1231-1451-2134', 'start': 130, 'end': 147} is overlapping or can't be aligned for doc 'My name is Clara and I live in Berkeley. this is my ip address : 175.5.0.1. this is my password: sad...'
  warnings.warn(

It seems like there might be an issue with using spacy and transformers together.

Related Issue: explosion/spaCy#12998


omri374 commented Mar 19, 2024

@fml09 do you experience skipped entities or just warnings?


fml09 commented Mar 19, 2024

@omri374
Both of them. It skips entities and emits warning messages as well.

Result:

[type: PASSWORD, start: 97, end: 111, score: 0.9996715188026428, type: CREDITCARDNUMBER, start: 128, end: 132, score: 0.9856985211372375, type: IP_ADDRESS, start: 65, end: 74, score: 0.95, type: FIRSTNAME, start: 11, end: 16, score: 0.9181937575340271, type: IN_PAN, start: 101, end: 111, score: 0.05]


fml09 commented Mar 23, 2024

@omri374 any news?


omri374 commented Mar 23, 2024

Looking into this. If we can't find a resolution, we will likely remove the dependency on spacy-huggingface-pipelines and call transformers directly.


omri374 commented Mar 23, 2024

For your specific case, please try changing the aggregation strategy to max:

ner_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    aggregation_strategy="max"
)

It will result in the credit card number being fully identified.
