EXIF-style sanitization #353

ofcaah · 2023-12-05T13:40:39Z

Hi!

Could we have an option to remove any superficial information from a PDF before certification? The PDFs produced contain unnecessary information about the software used to generate the initial document. I don't see any point in leaking this information to document recipients; all it does is help potential attacks on the generation pipeline. While I could probably filter this out in earlier steps, pyHanko seems like an ideal place for it, since it sits at the last step before the certification signature is added.

Thanks for considering.

MatthiasValvekens · 2023-12-07T20:55:53Z

Maintainer

Hello,

I'm not convinced that this library is the right place, to be honest. "Superficial information" is a pretty broad term: even if we take that to mean just "metadata", PDF documents can contain all sorts of metadata: there's the Info dictionary, XMP document metadata, metadata in embedded files, EXIF (or other) metadata embedded in images, font metadata, colour profile metadata, signature production metadata, yada yada yada. Oh, and I didn't even mention all the ways proprietary tools embed their own private data into files. Stripping all of that out consistently (a) is not exactly easy, (b) goes against the spirit of standards like PDF/A, so many people won't even want to do this, and (c) can even have copyright/licensing implications in some cases.

I don't fully follow your concern about attacks on the document generation pipeline, but let's accept that for the sake of the argument. Regardless, IMHO metadata stripping doesn't really relate to PDF signing, which puts it out of scope for pyHanko---and given that I'm the person maintaining this lib, I try to be vigilant about scope creep... :)

Of course, nothing stops you from rolling your own metadata stripper using the low-level PDF API exposed by pyHanko (or any other PDF library), but be warned: you'll have to dig quite deeply to get all of it out!

ofcaah · 2023-12-07T21:14:18Z

Author

I didn't mean clearing everything from everywhere, and especially not metadata in embedded files; only what is in the final PDF itself. Should any files be embedded in the document, those files should be "cleaned up" before they get embedded. I don't think anything other than the Info dictionary in the PDF should even be considered here, but if I'm the only one interested in this, then I'll obviously whip something up on my end :)

There's the internal document title, some paths from LibreOffice, LibreOffice's version, and pyHanko's version. The dangerous territories are PDF generation from an ODT template by LibreOffice, and then pyHanko's signature processing of PDFs coming in from the "outside world". I'd rather not advertise the presence of those tools, especially with their version numbers. Although it is the definition of security by obscurity, it's harder to craft tailored files if one doesn't know what software they are targeting.
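
For readers who land here with the same idea, the Info-only cleanup described in this comment could start from something like the sketch below. It assumes the Info dictionary has already been read into a plain Python dict keyed by PDF name; the function name `scrub_info` and the default set of keys to drop are hypothetical choices for illustration, not part of pyHanko. The key names themselves (`/Title`, `/Creator`, `/Producer`) are standard Info dictionary entries.

```python
# Info-dictionary keys that tend to reveal the document title and the
# generating toolchain. This default set is an illustrative assumption.
DEFAULT_DROP = frozenset({"/Title", "/Creator", "/Producer"})


def scrub_info(info, drop=DEFAULT_DROP):
    """Return a copy of an Info-style mapping without the keys in `drop`.

    The input mapping is left unmodified.
    """
    return {key: value for key, value in info.items() if key not in drop}
```

Calling `scrub_info` on a dict holding `/Title`, `/Producer` and `/CreationDate` would keep only `/CreationDate`. The same values may also appear in the XMP document metadata mentioned earlier in the thread, which this sketch does not touch.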

2 replies

MatthiasValvekens Dec 7, 2023
Maintainer

Sure, but arguably that involves cleaning data specifically generated by the toolchain that you happen to be using, so I'm still not getting why that belongs in a general-purpose signing toolkit :).

Sounds like you know what data you want to remove---nothing stops you from implementing the removal procedure yourself! Incidentally, you can suppress pyHanko's own metadata update operations by overriding _update_meta in your IncrementalPdfFileWriter to always return None.

ofcaah Dec 7, 2023
Author

Aye; the fact that not much can be changed (quietly) after certification led me to think this might be the right spot to also do some cleaning. Oh well, I'll do the preprocessing in the signing script with your hints, and should anyone have the same idea, they might stumble upon this thread. Thank you for the tips! :)
