Enhancing the decision process text when working with images #1361

Open
NuiMrme opened this issue Apr 22, 2024 · 8 comments

Comments

@NuiMrme

NuiMrme commented Apr 22, 2024

Is your feature request related to a problem? Please describe.
The decision process output prints the entity_type, start_position, end_position and the score. When working with longer texts or with images, a line such as start = 204 end = 217 doesn't really mean anything on its own, and it is hard to see which part of the input it refers to.

Describe the solution you'd like
Add an entity_text field so that the text in question is also printed: start = 204 end = 217 entity_text = "Saint Antonio"

I solved this on my version by adding

entity_text: str,

to the __init__ function in recognizer_result.py, which also required changes in image_analyzer_engine.py, image_recognizer_results.py, spacy_recognizer.py and pattern_recognizer.py.
The output is much more readable, though.
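
A minimal sketch of the idea, not Presidio's actual code: the DetectedSpan class and the with_text helper below are hypothetical stand-ins for a result object that carries the matched text alongside its offsets. The sample sentence and score are made up for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DetectedSpan:
    """Hypothetical stand-in for a recognizer result (not Presidio's class)."""
    entity_type: str
    start: int
    end: int
    score: float
    entity_text: Optional[str] = None  # the extra field proposed above


def with_text(span: DetectedSpan, text: str) -> DetectedSpan:
    """Return a copy of the span with entity_text sliced from the source text."""
    return DetectedSpan(
        span.entity_type, span.start, span.end, span.score,
        text[span.start:span.end],
    )


text = "The chapel of Saint Antonio is closed."
start = text.index("Saint Antonio")  # offsets computed here, not detected
span = DetectedSpan("PERSON", start, start + len("Saint Antonio"), 0.85)

# The printed dict now shows the text next to the offsets.
print(asdict(with_text(span, text)))
```

Keeping entity_text optional means existing callers that only pass offsets would keep working in a design like this one.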

@NuiMrme
Author

NuiMrme commented Apr 23, 2024

While at it, I modified line 222 of analyzer_engine.py so that each result is printed on its own line, which is even more readable:
json.dumps([str(result.to_dict()) for result in results], indent=2),
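
To see what indent=2 changes, here is a small standalone sketch. The plain dicts stand in for whatever result.to_dict() returns; their keys and values are assumptions made up for illustration.

```python
import json

# Plain dicts standing in for result.to_dict() output (values are made up).
results = [
    {"entity_type": "PERSON", "start": 204, "end": 217, "score": 0.85},
    {"entity_type": "LOCATION", "start": 301, "end": 309, "score": 0.6},
]

# Without indent, the whole list is serialized on a single line.
one_line = json.dumps([str(r) for r in results])

# With indent=2, every list element goes on its own line.
one_per_line = json.dumps([str(r) for r in results], indent=2)

print(one_line)
print(one_per_line)
```

Because each dict is wrapped in str(), every result stays on a single line as a quoted string; passing the dicts directly would instead spread each field over its own line.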

@omri374
Contributor

omri374 commented Apr 23, 2024

@NuiMrme are you asking specifically for images, or for any text?

@omri374
Contributor

omri374 commented Apr 23, 2024

Does this help? #925 (comment)

@NuiMrme
Author

NuiMrme commented Apr 23, 2024

Sorry, that wasn't well explained. I'm not reporting a bug but rather sharing a feature I implemented in my version of Presidio that might help others too. When you work with images or a lot of text with log_decision_process=True, the log contains an entry for every detection and becomes unreadable. Note that this is printed automatically; no explicit print command is used, unlike in the comment you shared above.
With a one-line example that's fine: I can quickly see what the positions refer to. But when many of these entries are stacked together because they come from an image of a document, you no longer know what is what. So I made the modifications mentioned above to make the output more readable.

Every new case now begins on a new line, and there is now an 'entity_text' field that shows the detected text (I covered it in red for obvious reasons). You no longer have to guess which line of the image it was, what position, etc. This is more readable and helps with the analysis of the anonymization results.

before
image

after
image

@omri374
Contributor

omri374 commented Apr 24, 2024

One of the reasons we intentionally left out the actual identified text is that it is essentially PII you might not want to log or return. If you have a suggestion on how to allow this, perhaps not as a default setting, we'd be happy to hear it.

I totally agree that there are cases, especially with the images module, where returning or logging the actual text is needed.

@NuiMrme
Author

NuiMrme commented Apr 25, 2024

> One of the reasons we intentionally left out the actual identified text, is because it is essentially PII you might not want to log or return. If you have a suggestion on how to allow this, perhaps not as a default setting, we'd be happy to hear.
>
> I totally agree that there are cases, especially with the images module, where returning or logging the actual text is needed.

Well, they are already printed out at the beginning of the log anyway:
[2024-04-24 12:46:05,853][decision_process][INFO][None][nlp artifacts:{"entities": ["Travaux", "Forage D'Eau Du", ...

@omri374
Contributor

omri374 commented Apr 25, 2024

Good catch. I guess that for return_decision_process=True, it makes sense to be more verbose and return the actual values, but for the production version (where return_decision_process is likely disabled), it makes sense to omit it. Would you be interested in proposing a change through a pull request?
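
One way such an opt-in could look, as a rough sketch only: the describe_results function and its include_text flag are hypothetical names, not part of Presidio, and the result dicts are made-up stand-ins. The matched text is attached only when the caller explicitly asks for it, so the default output contains no PII.

```python
from typing import Dict, List


def describe_results(
    text: str, results: List[Dict], include_text: bool = False
) -> List[Dict]:
    """Copy each result dict; add the matched text only when include_text is True."""
    described = []
    for result in results:
        entry = dict(result)  # copy so the caller's dicts are not modified
        if include_text:
            entry["entity_text"] = text[result["start"]:result["end"]]
        described.append(entry)
    return described


text = "Contact Saint Antonio tomorrow."
results = [{"entity_type": "PERSON", "start": 8, "end": 21, "score": 0.85}]

print(describe_results(text, results))                     # offsets only
print(describe_results(text, results, include_text=True))  # offsets plus text
```

In a real change the flag would presumably follow the existing decision-process setting rather than be a separate parameter, but that wiring is left open here.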

@NuiMrme
Author

NuiMrme commented Apr 25, 2024

Absolutely
