2024 05 04 Experimental regex support (No rush to merge, proof of concept/feasibility discussion) #65

Conversation

NebularNerd
Contributor

@NebularNerd NebularNerd commented May 6, 2024

Could close #12, but likely not (skip to conclusion for why)

While trying to think of a way to improve matches further (it's a great cure for my insomnia at night), I wanted to try adding a REGEX matcher to PureMagic. This should allow for higher-confidence hits on files, especially those that share common markers such as PK and PAK. This may not be the definitive solution (I'll discuss why below), but it's a start towards making PureMagic more powerful.

How it works:

  1. Scan for regular magic bytes as normal
  2. Find a matching entry inside the multi-part data
  3. Scan either a defined block size from the start of the file where we can be certain it's somewhere in that region, or scan the whole file (where we have no idea of a fixed point).
  4. Find a match and add to results list

Example test entries in the .json:

	"504b030414000600" : [
	  ["776f72642f646f63756d656e742e786d6c", 3000, ".docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "###REGEX### MS Office Open XML Format Word Document"],
	  ["786c2f776f726b626f6f6b2e786d6c", 3000, ".xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "###REGEX### Microsoft Office 2007+ Open XML Format Excel Document file"],
	  ["786c2f76626150726f6a6563742e62696e", 0, ".xlsm", "application/vnd.ms-excel.sheet.macroEnabled.12", "###REGEX### Microsoft Excel - Macro-Enabled Workbook"]
	]

Both .docx and .xlsx have 3000 in the offset field because we can be 99% certain the matching bytes will be within the first 3000 bytes of the file. However, .xlsm has a 0: we know what we are looking for, but it could be anywhere in the file. As all three of these examples are essentially .zip files, we can cheat and just use the path/filenames we expect to find in the archive. As the structure is mostly rigid, we can assume (and my tests show):

  • 0x776f72642f646f63756d656e742e786d6c / word/document.xml will be in the first 3000 bytes for .docx
  • 0x786c2f776f726b626f6f6b2e786d6c / xl/workbook.xml will be in the first 3000 bytes for .xlsx
  • 0x786c2f76626150726f6a6563742e62696e / xl/vbaProject.bin will be somewhere in a .xlsm; as it comes after the spreadsheet data, we cannot predict a scan area.
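
The hex strings in these entries are simply the ASCII member names as they appear in the archive. A quick standard-library check of that mapping (the `needles` and `decoded` names are just for illustration):

```python
# Decode the hex needles from the example JSON entries back to the
# archive member names they represent.
needles = {
    ".docx": "776f72642f646f63756d656e742e786d6c",
    ".xlsx": "786c2f776f726b626f6f6b2e786d6c",
    ".xlsm": "786c2f76626150726f6a6563742e62696e",
}

decoded = {ext: bytes.fromhex(hex_str) for ext, hex_str in needles.items()}

for ext, name in decoded.items():
    print(ext, name.decode("ascii"))
```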

Implementation:

To save reinventing the wheel, I have leveraged the existing multi-part system; it works for this concept and only required minimal changes to the code. In the .json, entries are treated as before. The only difference is that the offset becomes a block size to scan from byte zero, or 0 if unknown. Additionally, we prefix the name with ###REGEX### (with a trailing space) as a nice clear trigger that would not have any real-world use.

In the code, if the trigger is found in the name, the matcher performs a regex search; otherwise it performs the normal string matching as before. I ran the code through Black, so the layout looks a bit wacky, but that's how it wants it to be.
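
A minimal sketch of that dispatch, not the code in this PR. `MagicRow` and `row_matches` are hypothetical stand-ins, and the non-regex branch assumes the existing multi-part check is a fixed-offset comparison:

```python
import re
from typing import NamedTuple

REGEX_TRIGGER = "###REGEX### "  # name prefix used in the example JSON entries


class MagicRow(NamedTuple):
    # Hypothetical stand-in for one multi-part entry, not PureMagic's real type.
    byte_match: bytes
    offset: int
    extension: str
    mime_type: str
    name: str


def row_matches(row: MagicRow, header: bytes) -> bool:
    """Return True if the row matches the bytes read from the file."""
    if row.name.startswith(REGEX_TRIGGER):
        # offset doubles as a block size from byte zero; 0 means "scan everything".
        scan_bytes = header[: row.offset] if row.offset else header
        # re.escape keeps the needle literal; a real pattern could be used instead.
        return re.search(re.escape(row.byte_match), scan_bytes) is not None
    # Assumed non-regex behaviour: the needle must sit exactly at the given offset.
    end = row.offset + len(row.byte_match)
    return header[row.offset : end] == row.byte_match
```

Keeping the trigger in the name field means the JSON schema stays unchanged, at the cost of overloading what `offset` means for those rows.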

PROS:

This should improve detection generally. While I used .docx, .xlsx and .xlsm as examples, we could fingerprint similar PK based files:

  • .jar = 0x4d4554412d494e462f4d414e49464553542e4d46 / META-INF/MANIFEST.MF
  • .apk = 0x416e64726f69644d616e69666573742e786d6c / AndroidManifest.xml
  • .odt = 0x6d696d65747970656170706c69636174696f6e2f766e642e6f617369732e6f70656e646f63756d656e742e74657874 / mimetypeapplication/vnd.oasis.opendocument.text (This may be a fixed string, need to investigate further but for the purpose of conversation I'll include it for now)

CONS:

While this will bring better results, there are some downsides/considerations:

  • Memory size / Speed: Trying to regex a whole file could be an issue on low-powered systems; equally, if the file is too big to read into memory in one go, it will break something. This could be mitigated with some clever maths to read the file in overlapping chunks that fit within memory. How to do that without relying on external libraries might need some sideways logic.
  • Confidence scores: The issue we now face is how to ensure we have a clear winner, especially for PK based files. In testing I can now generate a lot of 0.8's but we need to possibly change the logic so the longest match always wins and is presented as first match, see examples in Some common filetypes are not detected #12 (comment). (Should be fixed by 2024-05-06 Fix Confidence sorting #66)
  • Confidence clashes: This is slightly separate from the above. A lot of the PK based files have common roots and therefore share common files. .jar and .apk both contain META-INF/MANIFEST.MF, so an .apk would likely get the same score as a .jar. The same applies to .xlsx and .xlsm: they both have xl/workbook.xml in their file structure.
  • Casing: This again applies mostly to PK based files. While it should be safe to assume that all files will always use the same case for filenames, it's entirely possible for them not to. For matching purposes, we may need to look at better fuzzy logic in the regexes to ensure that META-INF/MANIFEST.MF, Meta-Inf/Manifest.Mf and meta-inf/manifest.mf are all matchable.
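
The memory and casing points can both be handled with the standard library alone. The sketch below is one possible approach, not code from this PR: the hypothetical `stream_search` reads fixed-size chunks and carries over the last `len(needle) - 1` bytes, so a match split across two reads is still found, and `re.IGNORECASE` covers the casing variants.

```python
import io
import re
from typing import BinaryIO


def stream_search(fh: BinaryIO, needle: bytes, chunk_size: int = 64 * 1024,
                  ignore_case: bool = False) -> bool:
    """Search a binary stream for a literal needle without loading it whole."""
    if not needle:
        raise ValueError("needle must not be empty")
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(needle), flags)
    overlap = len(needle) - 1  # enough to catch a match split across two reads
    tail = b""
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:
            return False  # end of stream, nothing found
        window = tail + chunk
        if pattern.search(window):
            return True
        # Keep only the bytes that could still start a match in the next read.
        tail = window[-overlap:] if overlap else b""


# Usage: a mixed-case manifest name, read in deliberately tiny chunks.
data = b"A" * 100 + b"Meta-Inf/Manifest.Mf" + b"B" * 100
found = stream_search(io.BytesIO(data), b"META-INF/MANIFEST.MF",
                      chunk_size=7, ignore_case=True)
```

Peak memory stays at roughly `chunk_size + len(needle)` bytes regardless of file size. This only works for literal needles of known length; a true variable-length pattern would need a larger, pattern-specific overlap.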

CONCLUSION:

As a result, this solution is likely not the definitive one but more of a starting point; a rule-based system like the one @cdgriffith proposes is still the better path. Along those lines, a better solution could be:

  • Perform initial magic.json match
  • Check if a matching 504b030414000600.py (PK) file lives inside a definitions folder
  • Process the rules inside, such as in this crude example:
      If "META-INF/MANIFEST.MF" and "AndroidManifest.xml" it's an .apk
      If "META-INF/MANIFEST.MF" and not "AndroidManifest.xml" it's a .jar
      etc...
  • Collate results and send back to main.py
  • Take those results, add them to the string matches and sort the confidence list in order of longest byte match first. The .apk match would have three sets of bytes (PK, MANIFEST and ANDROID), which would be a very long match; this should ensure we have a clearer winner.
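
As a rough illustration of what such a definitions file could contain (everything here is hypothetical: `classify_pk`, the `.zip` fallback and the scoring by matched byte count are assumptions, not an agreed design), the crude rules above might look like this:

```python
from typing import List, Tuple

PK = b"PK\x03\x04"
MANIFEST = b"META-INF/MANIFEST.MF"
ANDROID = b"AndroidManifest.xml"


def classify_pk(data: bytes) -> List[Tuple[str, int]]:
    """Return (extension, matched_byte_count) pairs, longest match first.

    Assumes the initial magic match has already identified a PK file.
    """
    has_manifest = MANIFEST in data
    has_android = ANDROID in data
    results = [(".zip", len(PK))]  # generic fallback: only the PK marker matched
    if has_manifest and has_android:
        results.append((".apk", len(PK) + len(MANIFEST) + len(ANDROID)))
    if has_manifest and not has_android:
        results.append((".jar", len(PK) + len(MANIFEST)))
    # Longest total byte match wins, so the most specific rule sorts first.
    return sorted(results, key=lambda item: item[1], reverse=True)


apk_like = PK + b"\x00" * 20 + MANIFEST + b"\x00" * 20 + ANDROID
jar_like = PK + b"\x00" * 20 + MANIFEST
best_apk = classify_pk(apk_like)[0][0]
best_jar = classify_pk(jar_like)[0][0]
```

Because the .apk rule accounts for three byte sequences and the .jar rule for two, the clash described under CONS resolves itself through the sort rather than through special cases.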

Wow that was a lot of writing, especially as we'll likely not use this in the long run 🤣 Thoughts and suggestions?

@NebularNerd
Contributor Author

Sorry about the million commits, still learning BLACK nuances

@cdgriffith cdgriffith changed the base branch from master to develop May 6, 2024 14:55
@cdgriffith
Owner

Thank you for all your work towards this!

I agree with the goals of this for improved detection, but think the current JSON file is getting too limited for these more advanced techniques.

As part of the 2.0 push (spurred by you, so thank you!) I am working on switching from reading the JSON file to putting the data in Python itself, possibly inside a graph instead of lists, so I'm unsure how that will change everything as of this moment.

I don't have any straight-up answers at the moment; I just wanted to actually reply for now to acknowledge all your hard work. Thank you again!

@NebularNerd
Contributor Author

Thanks @cdgriffith 🙂

I think if we both agree we can close this for now, I'll keep my branch open so we can borrow/steal some of the code later if we find a new home for it.

I'm glad you like the idea, but I agree we're asking a lot from the .json and it would make adding data trickier later having the mixed implementations all jumbled up.

My recent ideas in #68 and #69 regarding naming and the amount of matching we could do in a more advanced system could help push PureMagic to provide very robust and detailed confidences, it will be interesting to see how far we can go. 🙂

Comment on lines +165 to +168
    if not magic_row.offset == 0:
        scan_bytes = header[0 : magic_row.offset]
    else:
        scan_bytes = header
Contributor

@cclauss cclauss May 16, 2024

Suggested change
    - if not magic_row.offset == 0:
    -     scan_bytes = header[0 : magic_row.offset]
    - else:
    -     scan_bytes = header
    + scan_bytes = header[0 : magic_row.offset] if magic_row.offset else header

Contributor Author

Thanks for that, my Python skills are like me, ugly but functional. 🙂

This PR will likely close for reasons discussed above, but the chances are some of it will come back in one form or another in V2.0

@NebularNerd
Contributor Author

I shall close this for now as it's not something we are going to use. I'll keep the branch alive on my fork so we can re-use some aspects in v2.0 if needed.

Development

Successfully merging this pull request may close these issues.

Some common filetypes are not detected
3 participants