
Use checksum from storage server instead of calculating it always #20

Open · egabancho opened this issue Apr 29, 2020 · 5 comments

@egabancho (Member)

Right now, when we ask for the checksum of a file, we digest the file and calculate it application-side. It would be nice to return the value the storage server gives us directly, similar to https://github.com/inveniosoftware/invenio-xrootd/blob/master/invenio_xrootd/storage.py#L60
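For illustration, a minimal sketch of that idea, assuming boto3 and an S3-backed location; the function and the bucket/key parameters are hypothetical, not the invenio-s3 API:

```python
import boto3

s3 = boto3.client("s3")

def server_checksum(bucket, key):
    """Return the checksum the storage server already computed (the ETag)."""
    head = s3.head_object(Bucket=bucket, Key=key)
    # For single-part uploads S3's ETag is the hex MD5 of the object,
    # wrapped in double quotes, so no application-side digest is needed.
    return head["ETag"].strip('"')
```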

egabancho self-assigned this Apr 29, 2020
@ppanero (Member) commented Apr 29, 2020

Hello! One quick question: is this just for serving the file once it has been uploaded? Otherwise, what would happen in these two scenarios:

  • The storage somehow corrupts the file. Then it is not our responsibility (is it?).
  • The file gets corrupted on the wire (between Invenio and the storage) and we have no means of cross-checking if we do not calculate the checksum in Invenio.

@wgresshoff (Contributor)

I see just one problem with that approach, but perhaps I'm not aware of a possible solution or it's handled otherwise ;)
When using multipart uploads, a checksum is calculated for every part that's uploaded, and those are then combined into the final checksum. Is the result really usable? The normal checksum type in S3 is MD5, but I can't imagine that's correct for multipart uploads.
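For illustration, a sketch of how S3 derives the multipart ETag (the part size here is an assumption; AWS documents the scheme as the MD5 of the concatenated binary part digests, suffixed with the part count), which shows why it is not the plain MD5 of the object:

```python
import hashlib

def multipart_etag(data, part_size=8 * 1024 * 1024):
    """Reproduce S3's multipart ETag: md5(md5(part1) + md5(part2) + ...)-N."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        # Single PUT: the ETag is simply the hex MD5 of the whole object.
        return hashlib.md5(data).hexdigest()
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```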

@egabancho (Member, Author)

@wgresshoff you are definitely right, and I don't have an answer for that ☺️

@egabancho (Member, Author)

@ppanero The checksum that is stored in Invenio's database gets calculated at upload time, i.e. we digest the content and store the hex hash. So yes, it's just for doing integrity checks afterwards.

Now, if we want to verify the file integrity we could do two things: (i) ask the storage server for the checksum and compare it with the one we have stored (from the upload), or (ii) calculate the checksum on our end and compare it with the one we have stored.

The first option is currently only doable for smaller files, i.e. no multipart uploads; as soon as you upload a big file you get an ETag that is a combination of the hashes of each of its parts (what @wgresshoff pointed out).

The problem with the second option is that it's time-consuming: you have to read the entire file. On the other hand, it works for both small and big files. Plus, if you use a service, say AWS S3, you have to pay for the extra traffic.
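A minimal sketch of option (ii), with illustrative names: stream the file and digest it chunk by chunk, so memory stays bounded even for big files, at the cost of reading every byte back:

```python
import hashlib

def compute_checksum(stream, chunk_size=1024 * 1024):
    """Recompute the MD5 application-side by streaming the whole file."""
    md5 = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
    return md5.hexdigest()
```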

Perhaps "the middle way" might be the solution here: if we can get the checksum from the server, use it; otherwise calculate it...
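A hedged sketch of that middle way, reusing the two helpers sketched above (all names are illustrative, not the actual invenio-s3 interface): trust the server's ETag when it looks like a plain MD5, and fall back to recomputing only for multipart objects:

```python
def checksum(bucket, key, open_stream):
    """Prefer the server-side checksum; recompute only when we must."""
    etag = server_checksum(bucket, key)
    if "-" not in etag:
        # Single-part upload: the ETag is the object's MD5, use it directly
        # (the "md5:" prefix is assumed to match how Invenio stores checksums).
        return f"md5:{etag}"
    # Multipart upload: the ETag is not a plain MD5, so fall back to
    # digesting the file ourselves.
    with open_stream() as stream:
        return f"md5:{compute_checksum(stream)}"
```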

@ppanero (Member) commented Apr 30, 2020

The middle way seems like the best trade-off, thanks for the explanations :)
