-
Hello, I'm not convinced that this library is the right place, to be honest. "Superficial information" is a pretty broad term: even if we take that to mean just "metadata", PDF documents can contain all sorts of metadata: there's the Info dictionary, XMP document metadata, metadata in embedded files, EXIF (or other) metadata embedded in images, font metadata, colour profile metadata, signature production metadata, yada yada yada. Oh, and I didn't even mention all the ways proprietary tools embed their own private data into files.

Stripping all of that out consistently (a) is not exactly easy, (b) goes against the spirit of standards like PDF/A, so many people won't even want to do this, and (c) can even have copyright/licensing implications in some cases.

I don't fully follow your concern about attacks on the document generation pipeline, but let's accept that for the sake of the argument. Regardless, IMHO metadata stripping doesn't really relate to PDF signing, which puts it out of scope for pyHanko. Given that I'm the person maintaining this lib, I try to be vigilant about scope creep... :)

Of course, nothing stops you from rolling your own metadata stripper using the low-level PDF API exposed by pyHanko (or any other PDF library), but be warned: you'll have to dig quite deeply to get all of it out!
-
I didn't mean clearing everything from everywhere, and especially not metadata in embedded files; merely the metadata of the final PDF itself. Should any files be embedded in the document, those files should be "cleaned up" before they get embedded. I don't think anything other than the Info dictionary of the PDF should even be considered here, but if I'm the only one interested in this, then I'll obviously whip something up on my end :)

In my case the output contains the internal document title, some paths from LibreOffice, LibreOffice's version, and pyHanko's version. The dangerous territories are PDF generation from an ODT template by LibreOffice, and then pyHanko's signature processing of PDFs coming in from the "outside world". I'd like to not advertise the presence of those tools, especially with their version numbers. Although it is the definition of security by obscurity, it's harder to craft a targeted file if one doesn't know what software is being targeted.
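For anyone landing here with the same need, a minimal sketch of the Info-only filtering step might look like the following. It assumes the Info dictionary has already been read into a plain Python dict keyed by PDF name; `scrub_info` and `TOOL_KEYS` are hypothetical names, not part of pyHanko or any other library, and the sample values are made up.

```python
# Hypothetical helper: NOT part of pyHanko or any other PDF library.
# Assumption: the Info dictionary is modelled as a plain dict that maps
# PDF names (e.g. "/Producer") to string values.

# Standard Info keys that usually identify the generating software.
TOOL_KEYS = frozenset({"/Creator", "/Producer"})


def scrub_info(info, drop=TOOL_KEYS):
    """Return a copy of `info` without the keys listed in `drop`."""
    return {key: value for key, value in info.items() if key not in drop}


# Example values are invented for illustration only.
info = {
    "/Title": "Internal draft",
    "/Creator": "Writer",
    "/Producer": "LibreOffice 7.4",
    "/CreationDate": "D:20230101120000Z",
}

# Drop the tool-identifying keys and, in this case, the internal title too.
cleaned = scrub_info(info, drop=TOOL_KEYS | {"/Title"})
print(cleaned)
```

Reading the dictionary out of the file and writing the filtered copy back is left to whichever PDF library you use. Two caveats: XMP metadata can carry the same fields, and an incremental update leaves earlier revisions in the file, so the scrub would have to happen in a full rewrite before signing.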
-
Hi!
Could we have an option to remove any superficial information from the PDF before certification? The produced PDFs contain unnecessary information about the software used to generate the initial document; I don't see what purpose leaking this information to document recipients serves, other than enabling potential attacks on the generation pipeline. While I could probably filter this out in earlier steps, pyHanko seems like an ideal place for this, since it sits at the last step before the certification signature is added.
Thanks for considering.