-
Hi @Asterix45,

That is a surprisingly hard problem ;). Text extraction from general PDFs is a whole domain in itself, and very far outside the scope of pyHanko. Whether the text appears in a signature appearance or not makes things slightly easier (in that you know which content streams to analyse), but actually extracting the text and decoding it into something readable is not always trivial.

It would be doable to hack something together that works (somewhat) reliably on pyHanko output, because pyHanko is quite reasonable by default (it supplies a ToUnicode map for embedded fonts, reading order matches content stream order, etc.). But even for that, I think pulling in a library that properly supports text extraction is better. iText does this, among many others: https://github.com/itext/itext-java.

That said, I suspect that your question is actually an X/Y problem. Are you really trying to extract text, or do you simply want access to metadata about the signature and/or the signer? Because there are easier ways to go about that.

EDIT: I'm also converting this to a discussion.
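To give an idea of what "hacking something together" could look like, here is a minimal sketch in plain Python. It assumes you already have the decoded bytes of the appearance's content stream (how you obtain them is not shown), and that the text was written with a simple, non-embedded font as literal strings followed by the `Tj` operator. The function name `extract_simple_text` and the Latin-1 decoding are assumptions made for illustration; they are not part of pyHanko. Text set in an embedded font would additionally need the font's ToUnicode map to decode, which this sketch does not attempt.

```python
import re
from typing import List

# A PDF literal string "(...)" followed by the Tj (show text) operator.
# Escaped characters are allowed inside the string; nested unescaped
# parentheses and the TJ array operator are deliberately not handled.
_TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)

# Single-character escapes defined for PDF literal strings.
_ESCAPES = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"(": b"(", b")": b")", b"\\": b"\\",
}


def _unescape(raw: bytes) -> bytes:
    """Resolve backslash escapes in the body of a PDF literal string."""
    def repl(match):
        esc = match.group(1)
        if esc in _ESCAPES:
            return _ESCAPES[esc]
        if esc[:1] in b"01234567":
            # Octal character code, one to three digits.
            return bytes([int(esc.decode("ascii"), 8) & 0xFF])
        if esc == b"\n":
            # Backslash at end of line: line continuation, emits nothing.
            return b""
        # Unknown escape: the backslash is simply dropped.
        return esc

    return re.sub(rb"\\([0-7]{1,3}|.)", repl, raw, flags=re.DOTALL)


def extract_simple_text(content: bytes) -> List[str]:
    """Return the literal strings shown with Tj, in content stream order.

    Assumption: the bytes can be read as Latin-1, which is only a rough
    approximation of the encodings used by simple fonts.
    """
    return [
        _unescape(m.group(1)).decode("latin-1")
        for m in _TJ_LITERAL.finditer(content)
    ]


if __name__ == "__main__":
    sample = b"BT /F1 10 Tf 5 20 Td (Signed by: Alice \\(test\\)) Tj ET"
    print(extract_simple_text(sample))
```

Hex strings such as `<0012> Tj` are skipped on purpose, since their meaning depends on the font. That is the kind of case where a real text extraction library earns its keep.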
-
Hi Matthias, thanks for your reply. My company adopted this approach to add more information to a PDF document without changing its content, so that the additional information can easily be found on the first page and highlighted with different colors. This new signature always has the same author, and the other metadata are mostly the same too (i.e. my company is always the author of this particular signature and the signature provider is always the same), so I think I can answer your question by saying that I don't want the metadata. We work mainly in Python, and I would like to stay in this environment without switching to Java and starting from scratch. I don't know whether it is possible to integrate an external Java library into Python code.
-
Hello,
I'm trying to get the custom text of a signature from a PDF file. This is what I've done so far:
I opened the PDF file in this way
and then, inspecting the runtime variables, I saw that the signature I want to read is inside this variable
(my document has two signatures; I want to work on the second one).
Inside
embedded_signature[1]
I can find all the information about the signature's certificate, the provider and the owner of the signature, but I can't find the text inside the signature, and I would like to understand how to get this information. To be as clear as possible, in the documentation this feature is used
here
to generate this signature
In this case, when opening the signed PDF file, I would like to get the text