Accept Iterable for Performance #31

Open
william-silversmith opened this issue Nov 19, 2021 · 6 comments
@william-silversmith

william-silversmith commented Nov 19, 2021

Hi, thanks so much for this very useful library! I'm using it to randomize keys for billions of objects to create condensed files that contain groups of thousands to millions of objects at a time.

https://github.com/google/neuroglancer/blob/056a3548abffc3c76c93c7a906f1603ce02b5fa3/src/neuroglancer/datasource/precomputed/sharded.md

It's not critical, but there is a bottleneck at the front of my Python processing pipeline, where the hash is applied to all object labels at once to figure out how to assign them for further processing. The hash function dominates this calculation.

Function: murmur at line 25 (times in microseconds)

```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           @profile
    26                                           def murmur(x):
    27     69734      74332.0      1.1      1.6    y = uint64(x).tobytes()
    28     69734    4381932.0     62.8     97.2    y = mmh3.hash64(y, x64arch=False)
    29     69733      52731.0      0.8      1.2    return uint64(y[0])
```

Total time: 5.44635 s
File: REDACTED
Function: compute_shard_location at line 145 (times in microseconds)

```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   145                                             @profile
   146                                             def compute_shard_location(self, key):
   147     69734      99805.0      1.4      1.8      chunkid = uint64(key) >> uint64(self.preshift_bits)
   148     69734    4703010.0     67.4     86.4      chunkid = self.hashfn(chunkid)
   149     69733      60072.0      0.9      1.1      minishard_number = uint64(chunkid & self.minishard_mask)
   150     69733      97034.0      1.4      1.8      shard_number = uint64((chunkid & self.shard_mask) >> uint64(self.minishard_bits))
   151     69733     312626.0      4.5      5.7      shard_number = format(shard_number, 'x').zfill(int(np.ceil(self.shard_bits / 4.0)))
   152     69733     102740.0      1.5      1.9      remainder = chunkid >> uint64(self.minishard_bits + self.shard_bits)
   153                                           
   154     69733      71060.0      1.0      1.3      return ShardLocation(shard_number, minishard_number, remainder)
```
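For readers unfamiliar with the sharded format, the profile above boils down to a preshift, a hash, and a few masks/shifts. A minimal stand-in sketch in plain Python ints follows; the parameter values are hypothetical (the real ones come from the shard spec file), and `blake2b` stands in for the murmur step that dominates the profile:

```python
import hashlib
import struct
from collections import namedtuple

ShardLocation = namedtuple(
    "ShardLocation", ["shard_number", "minishard_number", "remainder"]
)

# Hypothetical shard parameters; real values come from the sharding spec.
PRESHIFT_BITS = 3
MINISHARD_BITS = 6
SHARD_BITS = 10
MINISHARD_MASK = (1 << MINISHARD_BITS) - 1
SHARD_MASK = ((1 << SHARD_BITS) - 1) << MINISHARD_BITS


def standin_hash(x):
    # Stand-in for the profiled murmur step: any 64-bit mixer works here.
    digest = hashlib.blake2b(struct.pack("<Q", x), digest_size=8).digest()
    return struct.unpack("<Q", digest)[0]


def compute_shard_location(key):
    chunkid = key >> PRESHIFT_BITS
    chunkid = standin_hash(chunkid)
    minishard_number = chunkid & MINISHARD_MASK
    shard_number = (chunkid & SHARD_MASK) >> MINISHARD_BITS
    # Hex string padded to ceil(SHARD_BITS / 4) digits, as in the profile.
    shard_hex = format(shard_number, "x").zfill(-(-SHARD_BITS // 4))
    remainder = chunkid >> (MINISHARD_BITS + SHARD_BITS)
    return ShardLocation(shard_hex, minishard_number, remainder)
```

Note that only the `standin_hash` line is expensive per the profile; everything else is cheap integer arithmetic, which is why batching the hash itself is the attractive optimization.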

It would be possible to thread this processing, but Python has the GIL. Multiprocessing could work, though pickling/unpickling would also take some time. I was thinking that a neat way to increase throughput would be to process multiple hashes at once in C, i.e., accept both a scalar and an iterable as input to the function. This would let the compiler autovectorize and would also avoid per-call Python/C overhead. I'm currently getting ~66.5k hashes/sec on an Apple Silicon M1 (ARM64).

I'm thinking of an interface similar to the one below. The second call should return some buffer that is easy to read into numpy.

```python
(lower, upper) = mmh3.hash64(some_bytes, x64arch=False)
[l1, h1, l2, h2] = mmh3.hash64(iterable_containing_bytes, x64arch=False)
```

Thank you for your consideration and for all the effort you've put into this library!

@hajimes
Owner

hajimes commented Mar 24, 2023

Sorry for the delay in my response.

Currently my highest priority for the next update is to introduce a hashlib-compliant interface (#39). But I think your concern is important too, so I'll check whether your proposal can also be implemented at that time.

Thank you for elaborating on your ideas!

@hajimes hajimes self-assigned this Mar 24, 2023
@william-silversmith
Author

william-silversmith commented Mar 24, 2023

No worries! I ended up implementing this myself; please feel free to use this code or take inspiration from it (though I doubt you'll need to):

https://github.com/seung-lab/shard-computer/blob/main/MurmurHash3.cpp#L342-L391
https://github.com/seung-lab/shard-computer/blob/main/shardcomputer.cpp#L52-L78

@hajimes
Owner

hajimes commented Mar 25, 2023

I really appreciate your work! I'm somewhat reluctant to make mmh3 dependent on other modules (such as pybind11), but I'll try to do something similar.

Currently mmh3.hash_from_buffer accepts a bytearray (including numpy.ndarray) but just returns one value. Perhaps I could write a function that interprets an ndarray as a list of (fixed-size) objects and then returns a bytearray of hash results (by using the 128-bit version of mmh3).
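The idea described above could be sketched in pure Python like this. The function name `hash_records` and the default record size are assumptions for illustration, and `hashlib.blake2b` stands in for the 128-bit mmh3 function:

```python
import hashlib


def hash_records(buf, record_size=8):
    """Interpret `buf` as consecutive fixed-size records and return the
    concatenated 128-bit digests of each record as one bytearray."""
    if len(buf) % record_size:
        raise ValueError("buffer length is not a multiple of record_size")
    out = bytearray()
    for i in range(0, len(buf), record_size):
        # Stand-in for mmh3's 128-bit hash of a single record.
        out += hashlib.blake2b(buf[i:i + record_size], digest_size=16).digest()
    return out
```

The returned bytearray packs one 16-byte digest per record, so numpy could view it as a `(n, 2)` uint64 array without copying.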

@william-silversmith
Author

At least for my use case, the 64-bit function was mandated by a spec, so it might be good to have both versions if possible. Though my use case is taken care of, so don't worry about me specifically. Thank you for addressing this!

https://github.com/seung-lab/cloud-volume/blob/master/cloudvolume/datasource/precomputed/sharding.py#LL64C2-L65C2

@ruyimarone

> Currently mmh3.hash_from_buffer accepts a bytearray (including numpy.ndarray) but just returns one value. Perhaps I could write a function that interprets an ndarray as a list of (fixed-size) objects, then returns a bytearray of hash results (by using the 128-bit version of mmh3).

Really appreciate this library! Are there any upcoming plans to implement the functionality described above? It would be extremely handy for a project I'm working on.

@hajimes
Owner

hajimes commented Jul 26, 2023

Hi. Thanks so much for being interested in the library!

Right now I don't have time to make a major update to this project, but I'll try to do so in the second or third week of August (though I cannot guarantee it).
