deanonymize(anonymize(text)) != text #1151
Hi @zizhong, thanks for reporting this. Would you mind adding the analyzer and anonymizer full results?
@omri374 My pleasure! Original text:
sanitized_results:
items:
desanitized_results:
items:
Result:
To add on to this, I'm running into the following error:
File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/analyzer_engine.py", line 189, in analyze
nlp_artifacts = self.nlp_engine.process_text(text, language)
File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/nlp_engine/spacy_nlp_engine.py", line 44, in process_text
doc = self.nlp[language](text)
File "/home/os/.venv/lib/python3.10/site-packages/spacy/language.py", line 1047, in __call__
error_handler(name, proc, [doc], e)
File "/home/os/.venv/lib/python3.10/site-packages/spacy/util.py", line 1724, in raise_error
raise e
File "/home/os/.venv/lib/python3.10/site-packages/spacy/language.py", line 1042, in __call__
doc = proc(doc, **component_cfg.get(name, {})) # type: ignore[call-arg]
File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/nlp_engine/transformers_nlp_engine.py", line 71, in __call__
doc.ents = ents
File "spacy/tokens/doc.pyx", line 796, in spacy.tokens.doc.Doc.ents.__set__
File "spacy/tokens/doc.pyx", line 833, in spacy.tokens.doc.Doc.set_ents
ValueError: [E1010] Unable to set entity information for token 28 which is included in more than one span in entities, blocked, missing or outside.
With the following code sample:
import transformers
from huggingface_hub import snapshot_download
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
transformers_model = "obi/deid_roberta_i2b2"
snapshot_download(repo_id=transformers_model)
# Instantiate to make sure it's downloaded during installation and not runtime
transformers.AutoTokenizer.from_pretrained(transformers_model)
transformers.AutoModelForTokenClassification.from_pretrained(transformers_model)
# Create configuration containing engine name and models
configuration = {
"nlp_engine_name": "transformers",
"models": [
{
"lang_code": "en",
"model_name": {
"spacy": "en_core_web_sm",
"transformers": transformers_model,
},
}
],
}
# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])
# Initialize the anonymizer and deanonymizer engines
# Possibly put these into a server to avoid reinitialization
anonymizer = AnonymizerEngine()
deanonymizer = DeanonymizeEngine()
text = """
During our recent meeting on February 23, 2023, at 10:30 AM, John Doe provided
me with his personal details. His email is [email protected] and his contact
number is 650-456-7890. He lives in New York City, USA, and belongs to the
American nationality with Christian beliefs and a leaning towards the Democratic party.
He mentioned that he recently made a transaction using his credit card 4111 1111 1111 1111
and transferred bitcoins to the wallet address 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa.
While discussing his European travels, he noted down his IBAN as GB29 NWBK 6016 1331 9268 19.
Additionally, he provided his website as https://johndoeportfolio.com. John also discussed some
of his US-specific details. He said his bank account number is 1234567890123456 and his drivers license
is Y12345678. His ITIN is 987-65-4321, and he recently renewed his passport, the number for
which is 123456789. He emphasized not to share his SSN, which is 669-45-6789.
Furthermore, he mentioned that he accesses his work files remotely through the IP 192.168.1.1
and has a medical license number MED-123456.
"""
analysis_results = analyzer.analyze(text=text, language="en")
I believe this should be related.
Hi @octaviansima, we are aware of this issue. Until we fix it (WIP), it is recommended to use the TransformersRecognizer approach and not the
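For readers hitting the E1010 error above: it is raised when the same token ends up in more than one entity span. A minimal sketch of one way to resolve such overlaps before assigning entities is shown below. It is not Presidio's fix; the function name filter_overlapping, the (start, end, label) tuple shape, and the example offsets are assumptions for illustration. The longest-first preference is similar in spirit to spaCy's spacy.util.filter_spans.

```python
from typing import List, Tuple

# Hypothetical span shape: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def filter_overlapping(spans: List[Span]) -> List[Span]:
    """Greedily keep the longest spans and drop any span overlapping a kept one."""
    # Longest first; on equal length, the span that starts earlier wins.
    by_length = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[Span] = []
    for start, end, label in by_length:
        # Two half-open ranges are disjoint if one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Illustrative offsets only, not taken from the model output in this thread.
spans = [(0, 3, "ORGANIZATION"), (3, 7, "ID"), (4, 10, "US_DRIVER_LICENSE")]
kept = filter_overlapping(spans)
print(kept)
```

With these example offsets, the ID span is dropped because it partially overlaps the longer US_DRIVER_LICENSE span, so no position is covered twice.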
Hi @zizhong, I tried to reproduce this but wasn't able to. Steps I've taken:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
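import spacy
# Note: TransformersRecognizer and BERT_DEID_CONFIGURATION are not imported in this
# snippet; they presumably come from the Presidio transformers recognizer sample code.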
model_path = "obi/deid_roberta_i2b2"
supported_entities = BERT_DEID_CONFIGURATION.get(
"PRESIDIO_SUPPORTED_ENTITIES")
transformers_recognizer = TransformersRecognizer(model_path=model_path,
supported_entities=supported_entities)
# This would download a large (~500Mb) model on the first run
transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)
# Add transformers model to the registry
registry = RecognizerRegistry()
registry.add_recognizer(transformers_recognizer)
registry.remove_recognizer("SpacyRecognizer")
# Use small spacy model, for faster inference.
if not spacy.util.is_package("en_core_web_sm"):
spacy.cli.download("en_core_web_sm")
nlp_configuration = {
"nlp_engine_name": "spacy",
"models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()
analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp_engine)
results = analyzer.analyze(text, language="en",
return_decision_process=True)
Where text = the text you provided.
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorResult, OperatorConfig
from presidio_anonymizer.operators import Decrypt
key="16charEncryptKey16charEncryptKey"
engine = AnonymizerEngine()
# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer)
# and an 'encrypt' operator to get an encrypted anonymization output:
anonymize_result = engine.anonymize(
text=text,
analyzer_results=results,
operators={"DEFAULT": OperatorConfig("encrypt", {"key": key})},
)
# Fetch the anonymized text from the result.
anonymized_text = anonymize_result.text
# Fetch the anonymized entities from the result.
anonymized_entities = anonymize_result.items
# Initialize the engine:
engine = DeanonymizeEngine()
# Invoke the deanonymize function with the text, anonymizer results
# and a 'decrypt' operator to get the original text as output.
deanonymized_result = engine.deanonymize(
text=anonymized_text,
entities=anonymized_entities,
operators={"DEFAULT": OperatorConfig("decrypt", {"key": key})},
)
deanonymized_result.text
We had a few contributions to the
Thanks!
Thanks! If you let us know what the issue was, that would be very helpful!
@omri374 Sure thing. I think that is intended for the use case where only anonymize() is used. However, it becomes a problem if deanonymize() is applied.
Describe the bug
deanonymize(anonymize(text)) != text
To Reproduce
Steps to reproduce the behavior:
1. Use obi/deid_roberta_i2b2 as the analyzer.
2. Anonymize a text containing: a medical license number MED-123456
3. The anonymized text is: a medical license number <ORGANIZATION><ID><US_DRIVER_LICENSE>. The <item> is the base64 encoded encrypted item.
4. The deanonymized text is: a medical license number MED-123123456
Expected behavior
deanonymize(anonymize(text)) == text
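One possible explanation for the extra 123 can be sketched without Presidio: if two detected spans partially overlap and each one is encrypted separately and the results are concatenated, decrypting restores the overlapping characters twice. Everything below is an assumption for illustration: the span offsets are hypothetical (the full analyzer output is not shown in this issue), toy_anonymize and toy_deanonymize are made-up helpers, and base64 stands in for the encrypt and decrypt operators.

```python
import base64
import re
from typing import List, Tuple


def _enc(value: str) -> str:
    # Reversible stand-in for the 'encrypt' operator.
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def _dec(value: str) -> str:
    # Reversible stand-in for the 'decrypt' operator.
    return base64.b64decode(value.encode("ascii")).decode("utf-8")


def toy_anonymize(text: str, spans: List[Tuple[int, int]]) -> str:
    """Replace every span with its own token, even when spans overlap."""
    out = []
    cursor = 0
    for start, end in sorted(spans):
        if start > cursor:
            out.append(text[cursor:start])
        # Each span encodes its full original text, overlap included.
        out.append("<" + _enc(text[start:end]) + ">")
        cursor = max(cursor, end)
    out.append(text[cursor:])
    return "".join(out)


def toy_deanonymize(text: str) -> str:
    """Decode every <token> back to the text it was created from."""
    return re.sub(r"<([A-Za-z0-9+/=]+)>", lambda m: _dec(m.group(1)), text)


text = "a medical license number MED-123456"
base = text.index("MED-123456")
# Hypothetical spans: "MED", "-123", and "123456" (the last two overlap on "123").
spans = [(base, base + 3), (base + 3, base + 7), (base + 4, base + 10)]

restored = toy_deanonymize(toy_anonymize(text, spans))
print(restored)
```

Under these assumptions the overlapping "123" is stored in two tokens, so the restored text no longer equals the input. Dropping or merging partially overlapping spans before anonymizing would avoid the duplication in this toy model.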