How to get custom text in sign #426

Asterix45 · 2024-04-22T12:45:14Z

Asterix45
Apr 22, 2024

Hello,

I'm trying to get the custom text of a signature from a PDF file. So far, I've opened the PDF file like this:

from pyhanko.pdf_utils.reader import PdfFileReader

with open('file.pdf', 'rb') as doc:
    r = PdfFileReader(doc)

Then, inspecting the runtime variables, I saw that the signature I want to read is in this variable:

r.embedded_signatures[1]

(My document has two signatures; I want to work on the second one.)

Inside embedded_signatures[1] I can find all the information about the signing certificate, the provider and the owner of the signature, but I can't find the text inside the signature, and I would like to understand how to get this information.

To be as clear as possible: in the documentation, this feature is used here to generate this signature.

In this case, when opening the signed PDF file, I would like to get the text

"This is custom text!
Signed by: Alice [email protected]
Time: 2021-06-24 08:00:00 CEST"

MatthiasValvekens · 2024-05-01T09:47:15Z

MatthiasValvekens
May 1, 2024
Maintainer

Hi @Asterix45,

That is a surprisingly hard problem ;). Text extraction from general PDFs is a whole domain in itself, and very far outside the scope of pyHanko. Whether the text appears in a signature appearance or not makes things slightly easier (in that you know which content streams to analyse), but actually extracting the text and decoding it into something readable is not always trivial.

It would be doable to hack something together that works (somewhat) reliably on pyHanko output, because pyHanko is quite reasonable by default (it supplies a ToUnicode map for embedded fonts, reading order matches content stream order, etc. etc.). But even for that, I think pulling in a library that properly supports text extraction is better. iText does this, among many others: https://github.com/itext/itext-java.

That said, I suspect that your question is actually an X/Y problem. Are you really trying to extract text, or do you simply want access to metadata about the signature and/or the signer? Because there are easier ways to go about that.

EDIT: I'm also converting this to a discussion.


Asterix45 · 2024-05-03T07:12:50Z

Asterix45
May 3, 2024
Author

Hi Matthias, thanks for your reply.

My company adopted this approach to add more information to a PDF document without changing its content, so that the additional information can easily be found on the first page and highlighted with different colors.

This new signature always has the same author, and the other metadata are mostly the same as well (i.e. my company is always the author of this particular signature, and the signature provider is always the same), so I think I can answer your question by saying that I don't want metadata.
The content that we add as a signature (the text I need to read) changes every time: we add a progressive number that is unique within the year, so that it's possible to have a formal reference to every document (document unique number 123456/2024).

We work mainly in Python, and I would like to stay in this environment without switching to Java and restarting from scratch. I don't know whether it's possible to integrate an external Java library into Python code.
In the same application I'm using pypdf to extract the usual text from the PDF, but it doesn't seem to me that signatures are exposed in a way that gives this kind of information.


MatthiasValvekens May 3, 2024
Maintainer

Ah, if you're creating the documents, why not add the data you need as a private extension to the signature field object in the PDF, instead of trying to extract it from the page content? The former is pretty easy, the latter is a lot harder :)

When writing the signature, you can just add sig_field['/MyCustomBlah'] = something something when preparing the field, and then retrieve that info later when reading the file.
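
For the reading side, a minimal sketch could look like the following. It assumes the signing software has stored the reference number under a private key in the signature field dictionary; the key name '/MyCustomBlah' is only the placeholder used above, and read_custom_entry and custom_entries_from_pdf are hypothetical helper names. It also assumes that each entry of r.embedded_signatures exposes the field dictionary as sig_field.

```python
def read_custom_entry(sig_field, key='/MyCustomBlah'):
    """Return the private entry stored under `key` as a string, or None if absent.

    `sig_field` is anything dict-like: a plain dict works for testing, and a
    PDF dictionary object should behave the same way for key lookups.
    """
    try:
        value = sig_field[key]
    except KeyError:
        return None
    return str(value)


def custom_entries_from_pdf(path, key='/MyCustomBlah'):
    """Collect the private entry from every signature field in the file at `path`."""
    # Third-party import kept local so the helper above works without pyHanko.
    from pyhanko.pdf_utils.reader import PdfFileReader

    with open(path, 'rb') as doc:
        r = PdfFileReader(doc)
        # Assumption: each embedded signature exposes its field as `sig_field`.
        # The file must still be open while the objects are being read.
        return [read_custom_entry(sig.sig_field, key)
                for sig in r.embedded_signatures]
```

With two signatures in the document, custom_entries_from_pdf('file.pdf')[1] would be the value attached to the second one, or None if that field has no such entry.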

Asterix45 May 4, 2024
Author

That could be a good point, but the BU that manages the signing of PDF files has its own software. I don't know which language they work in, and it seems a bit difficult to get any modifications. We are also planning to replace that software, so I don't know if we have the resources to develop a CR.

I'm studying PDF signatures. They are made with sha256 or sha256rsa; the key is inside the PDF, and it seems to me that I can get it from there. I'm thinking of getting the key and trying to recover the plain text... I don't know if this is possible.

MatthiasValvekens May 5, 2024
Maintainer

> I'm studying PDF signatures. They are made with sha256 or sha256rsa; the key is inside the PDF, and it seems to me that I can get it from there. I'm thinking of getting the key and trying to recover the plain text... I don't know if this is possible.

That's...not how this works. You have the payload that was signed available to you in plain text already, so diving into the cryptographic part of the story won't buy you anything useful. The problem is that extracting human-readable text from a PDF content stream is not trivial in the general case. So either you find a library that does that job for you, or you figure out a solution with your colleagues from that other BU ;). Or you implement an ad-hoc solution yourself, but unless you know what you're getting into, this will be much more expensive than either alternative--and it risks breaking when the signing code changes.
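
To illustrate the point about the payload: a PDF signature covers the parts of the file listed as (offset, length) pairs in the signature dictionary's /ByteRange array. Those bytes are readable as they are, and the signature is computed over a digest of them, which cannot be turned back into text. A minimal sketch with made-up bytes follows; signed_payload is a hypothetical helper name, and the sample "file" is not a real PDF.

```python
import hashlib


def signed_payload(pdf_bytes: bytes, byte_range: list[int]) -> bytes:
    """Concatenate the segments described by a /ByteRange-style array.

    `byte_range` is a flat list of offset/length pairs: [off1, len1, off2, len2, ...].
    """
    pairs = zip(byte_range[0::2], byte_range[1::2])
    return b''.join(pdf_bytes[off:off + length] for off, length in pairs)


# Made-up stand-in for a signed file: everything except the signature
# container itself is covered by the two ranges.
pdf_bytes = b'%PDF-head<SIGNATURE>tail%%EOF'
byte_range = [0, 9, 20, 9]

payload = signed_payload(pdf_bytes, byte_range)
digest = hashlib.sha256(payload).hexdigest()

print(payload)  # the covered bytes, readable without any key
print(digest)   # a fixed-size fingerprint; it does not contain the text
```

Nothing in this process hides the text: the appearance of the signature sits in the covered bytes like any other page content, so the difficulty is in interpreting that content, not in decrypting it.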

With the information I have, I maintain that trying to extract this info from the annotation's appearance stream is the wrong way to go about this, but you do you.

PS: The relevant parts of the standard (ISO 32000-2:2020) that you'll want to start reading are clauses 9.4 ("Text objects"), 9.10 ("Extraction of text content") and 12.5.5 ("Appearance streams"). I suspect that you'll see what I mean by "not trivial" pretty soon ;)
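
To give a feel for what an ad-hoc approach involves, here is a deliberately naive sketch that pulls literal strings shown with the Tj operator out of an already-decompressed content stream. Everything about it is an assumption for illustration: the helper names are hypothetical, the sample stream is made up, and it only handles the simplest case. Hex strings, TJ arrays, nested parentheses, font-specific encodings and ToUnicode maps are all ignored, and those are what real-world output commonly relies on.

```python
import re

# A literal string "( ... )" immediately followed by the Tj operator.
# Assumption: parentheses inside the string are always backslash-escaped.
_TJ = re.compile(rb'\((?P<s>(?:\\.|[^\\()])*)\)\s*Tj', re.S)

# Backslash escapes defined for PDF literal strings (line continuation omitted).
_ESCAPES = {b'n': b'\n', b'r': b'\r', b't': b'\t', b'b': b'\b',
            b'f': b'\f', b'(': b'(', b')': b')', b'\\': b'\\'}

_ESCAPE_RE = re.compile(rb'\\(?:(?P<oct>[0-7]{1,3})|(?P<ch>.))', re.S)


def _unescape(raw: bytes) -> bytes:
    """Resolve backslash escapes (named and octal) in a literal string body."""
    def repl(m):
        if m.group('oct') is not None:
            return bytes([int(m.group('oct'), 8) & 0xFF])
        ch = m.group('ch')
        return _ESCAPES.get(ch, ch)
    return _ESCAPE_RE.sub(repl, raw)


def show_text_strings(content: bytes) -> list[str]:
    """Return the strings passed to Tj, in content stream order.

    Assumption: bytes map to characters one-to-one (decoded as Latin-1 here),
    which only holds for simple fonts with a standard-like encoding.
    """
    return [_unescape(m.group('s')).decode('latin-1')
            for m in _TJ.finditer(content)]


sample = b'BT /F1 12 Tf 10 20 Td (Document no. 123456/2024) Tj ET'
print(show_text_strings(sample))
```

A stream that writes its text as hex strings of glyph IDs, for example `<00410042> Tj`, yields nothing with this sketch, which is exactly the kind of case where a proper text extraction library earns its keep.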
