lmdb: Reduce number of heap allocations of MDB_val #59
Comments
Hrm. Storing pointers to two MDB_vals in the Txn object sounds like it might speed things up a little. I don't think RawRead should really be an issue, though to be prudent it might be best to zero out the MDB_val after each operation in which it is used (e.g. …). Please go ahead and attempt this change. I am going to merge #57 when I fix the travis-ci job, but otherwise I would like to see what happens when this change is made.

Worth noting, though, that at some point in the future I believe this optimization will be unnecessary. I think the pointer-passing restrictions introduced in Go 1.6 should allow pointer arguments to be considered non-escaping. But I'm not sure when/if that is really going to happen. |
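For concreteness, here is a minimal sketch of the idea being discussed. The names (`Txn`, `newTxn`, the `_txn`/`key`/`val` fields) are hypothetical rather than lmdb-go's actual internals, and the cgo pointer-passing rules mentioned above are glossed over. The transaction allocates its two MDB_vals once, every Get reuses them, and both are zeroed before the call returns so no stale pointer survives the operation.

```go
package lmdbsketch

/*
#cgo LDFLAGS: -llmdb
#include <lmdb.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// Txn caches two MDB_vals so read operations do not allocate fresh ones.
type Txn struct {
	_txn *C.MDB_txn
	key  *C.MDB_val // allocated once with the Txn, reused by every operation
	val  *C.MDB_val
}

func newTxn(ptr *C.MDB_txn) *Txn {
	return &Txn{_txn: ptr, key: new(C.MDB_val), val: new(C.MDB_val)}
}

// Get looks up k and returns a copy of the stored value (the non-RawRead
// path). It reuses the cached MDB_vals and zeroes them before returning.
func (txn *Txn) Get(dbi C.MDB_dbi, k []byte) ([]byte, error) {
	txn.key.mv_size = C.size_t(len(k))
	txn.key.mv_data = unsafe.Pointer(&k[0]) // assumes len(k) > 0
	ret := C.mdb_get(txn._txn, dbi, txn.key, txn.val)

	var b []byte
	if ret == C.MDB_SUCCESS {
		b = C.GoBytes(txn.val.mv_data, C.int(txn.val.mv_size))
	}
	// Zero both vals so they never carry pointers past this operation.
	*txn.key = C.MDB_val{}
	*txn.val = C.MDB_val{}

	if ret != C.MDB_SUCCESS {
		return nil, fmt.Errorf("mdb_get: %s", C.GoString(C.mdb_strerror(ret)))
	}
	return b, nil
}
```

Under a scheme like this, repeated Gets within one transaction stop allocating; the cost is the zeroing writes plus the fixed per-transaction allocation of the two vals.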
An interesting side effect of the proposed change is that data races from concurrent Txn use (e.g. on a Readonly Txn) would be caught by the Go race detector. In the current implementation the only data races occur in C space, without write barriers, and are invisible to the race detector. |
@lmb are you working on putting together a pull request for this? Have you been having any difficulty that I could help with?

My curiosity got the best of me and I quickly threw together something that works as a proof of concept. Indeed it looks like it is possible to benefit from the suggested change, especially during transactions with lots of reads (e.g. a range scan). Turn on RawRead and you are flying, my friend (the current benchmark shows a 20% reduction in scan time). After this is settled I definitely think it will be worth revisiting the performance numbers using integer flags (#11).

edit: It's worth noting that write performance isn't really affected, because MDB_val structs for writing are stored on the C stack. |
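For readers unfamiliar with the read path being benchmarked, below is a sketch of a RawRead range scan using the public lmdb-go API; the env/dbi setup and the `scanAll` helper are assumptions made for illustration. This is the kind of read-heavy workload where reusing MDB_vals pays off most.

```go
package example

import "github.com/bmatsuo/lmdb-go/lmdb"

// scanAll walks every key/value pair in dbi inside a read-only transaction.
// With RawRead enabled the returned slices alias LMDB's memory map, so the
// wrapper avoids copying values; copy anything you need to keep.
func scanAll(env *lmdb.Env, dbi lmdb.DBI) (n int, err error) {
	err = env.View(func(txn *lmdb.Txn) error {
		txn.RawRead = true
		cur, err := txn.OpenCursor(dbi)
		if err != nil {
			return err
		}
		defer cur.Close()
		for {
			_, _, err := cur.Get(nil, nil, lmdb.Next)
			if lmdb.IsNotFound(err) {
				return nil
			}
			if err != nil {
				return err
			}
			n++
		}
	})
	return n, err
}
```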
Hi Bryan,
I was going to look into this today. Easter weekend got the best of me :)
Do you have your code on a branch somewhere? Happy to polish it up if that needs doing.
Nice that this gives such a significant speed up!
|
Oh, of course. How rude of me. Weekends and holidays especially can fail to register with me. Forgive me, and don't worry about it. I'm glad you enjoyed yourself 😄

I pushed a commit to the branch bmatsuo/reuse-mdb-val. It's the sum of about 30 minutes of work, so it is rough. All transactions allocate two MDB_vals when they are created. I think there is at least one important thing that could be improved. Specifically, update transactions which will only issue Put commands do not need to allocate these MDB_vals (though PutReserve is special and requires one val), and the benchmarks in which this occurs are slightly slower as I have written things.

It's also worth considering that transactions which are opened to Get exactly one key will not get any faster from this change (there is no chance to reuse the allocated MDB_vals). And if you attempt some kind of lazy allocation strategy to accommodate write-only transactions (as I mentioned in the previous paragraph), then you are going to make those single-Get transactions slower. So there is a balancing act that must be done here. I think you really just need to think it through a bit and potentially go through several permutations of ideas in the code.

Let me know if you have any thoughts or questions. Cheers. |
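To make the trade-off concrete, here is a small sketch of the lazy-allocation variant under discussion, again with assumed names rather than code from the branch: a write-only transaction never touches the cached vals and skips the allocation entirely, but every read now pays a nil check, and a transaction that does a single Get gains nothing because that first Get performs the allocation anyway.

```go
package lmdbsketch

/*
#include <lmdb.h>
*/
import "C"

// Txn defers allocating its reusable MDB_vals until a read needs them.
type Txn struct {
	_txn   *C.MDB_txn
	keyVal *C.MDB_val // nil until the first operation that needs it
	valVal *C.MDB_val
}

// vals returns the cached MDB_vals, allocating them on first use. Put-only
// transactions never call it, so they avoid the allocation; single-Get
// transactions still allocate here, just later than before.
func (txn *Txn) vals() (key, val *C.MDB_val) {
	if txn.keyVal == nil {
		txn.keyVal = new(C.MDB_val)
		txn.valVal = new(C.MDB_val)
	}
	return txn.keyVal, txn.valVal
}
```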
No worries, I'll take a look. Which invocation do you use for benchmarks? Re the single Get use case: in my opinion this will always be the worst case.
|
When I run the benchmarks I use ….

You are mostly correct re: a single Get, though I think it is actually quite common in practice. The overhead for any single-op transaction may outweigh the allocation overhead.

This has got me thinking, though: one thing I made no attempt to handle was subtransactions. It should actually be completely safe for subtransactions to use their parent's MDB_val objects (parents must never be used while they have an active child). Subtransactions provide a batching mechanism for updates that is potentially quite useful for some applications, and it would be nice for them to benefit from reduced allocations. |
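As an aside for readers who haven't used them, here is a sketch of that batching pattern with the public lmdb-go API; the `kv` type and `putBatches` helper are made up for illustration. Each Sub call runs a nested transaction whose failure rolls back only that batch, while the outer update transaction stays usable; under the proposed change the child could also borrow the parent's cached MDB_vals instead of allocating its own.

```go
package example

import (
	"log"

	"github.com/bmatsuo/lmdb-go/lmdb"
)

type kv struct{ Key, Val []byte }

// putBatches writes each batch of pairs in its own subtransaction. A batch
// that fails is rolled back on its own; the outer transaction and the
// remaining batches are unaffected.
func putBatches(env *lmdb.Env, dbi lmdb.DBI, batches [][]kv) error {
	return env.Update(func(txn *lmdb.Txn) error {
		for _, batch := range batches {
			err := txn.Sub(func(sub *lmdb.Txn) error {
				for _, item := range batch {
					if err := sub.Put(dbi, item.Key, item.Val, 0); err != nil {
						return err
					}
				}
				return nil
			})
			if err != nil {
				// Only this subtransaction was aborted; keep going.
				log.Printf("batch skipped: %v", err)
			}
		}
		return nil
	})
}
```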
Oh. And if you don't know about the benchcmp tool then you should be using that to compare benchmark results for different code versions. Just figured you should know if you didn't :) |
I implemented your change, which gets rid of the allocation cost for subtransactions. Regarding slowing down Get, this is what I get when benchmarking the same code twice on my OS X box:
Seems like the timing info is very unreliable. |
I also looked at reducing the number of allocations done in a Put, but no such luck. Do you know why there are 4 allocations happening? I would expect maybe 2 allocations, for the two unsafe.Pointer conversions. |
I believe that the unexpected allocations you see are happening in the BenchmarkTxn_Put function itself (in bench_test.go). It is unfortunate that it throws the numbers off. It's always good to verify tests still pass 😄 But indeed the timing is pretty erratic. And this doesn't seem too different from what I would expect. As long as relevant benchmarks go down, that is good. But you typically do see weird spikes on unrelated bits. Like this,
No idea why that should be ~300ns faster now. But you also see a similar reduction in scan time. That is good. As a general word of caution though, it can be good to take a look at the benchmark to know what is really being timed. Some things may not work as you expect them to (also evidenced by your question about Put allocations). In the case of |
Also out of curiosity, what os/arch are you testing on? And are you in a VM? |
OS X 10.11.4, x86_64, with the kernel pprof fix applied. How do you want to proceed with this? I can create a PR, but frankly you've done the work. I just added 4 lines for the subtxn case. I did look at the Put benchmark:
If I'm interpreting b.ResetTimer() correctly, only the allocations on lines 82 and 93 should be taken into account for the benchmem output. This means that something like 99% of the 4 allocations per Put are really allocated in Put? |
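A small, self-contained benchmark (unrelated to lmdb, written only to illustrate the accounting) shows the behaviour being discussed here: with `go test -bench=. -benchmem`, the reported allocs/op covers only work done after b.ResetTimer, whereas a -memprofile records everything the process allocates, including the setup before the timer starts.

```go
package sketch

import "testing"

var sink []byte

func BenchmarkResetTimerScope(b *testing.B) {
	// Setup allocations: visible in a -memprofile, but because they happen
	// before ResetTimer they are excluded from the allocs/op figure.
	setup := make([][]byte, 1024)
	for i := range setup {
		setup[i] = make([]byte, 64)
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		buf := make([]byte, 64) // counted: roughly one alloc per iteration
		copy(buf, setup[i%len(setup)])
		sink = buf // keep the allocation observable so it isn't optimized away
	}
}
```

Running it with something like `go test -run=NONE -bench=ResetTimerScope -benchmem -memprofile=mem.out` shows the two views side by side: the benchmark line reports about one allocation per iteration, while the profile also contains the 1024 setup slices.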
I think you can go ahead and open a PR with subtxn stuff. The commit I made is tagged with name/email. I don't think there's an attribution issue or anything. I'm confused about that pprof output. Is it a memprofile? I actually have never been able to make effective use of a memprofile output for some reason. |
Yeah, that's a memprofile output, with -alloc_objects passed to |
Oh, interesting. Maybe that flag is what has been keeping -memprofile from being useful to me... Thanks for the info. I think I understand what you meant in your comment now. But I don't think the interpretation makes sense given the data. If it is the correct interpretation then I suspect there is a bug in profiling related to cgo. Let's look at the following two lines.
So, makeBenchDBVal allocated 21 objects (not 21 bytes, 21 slices). And txn.Put allocated 20,000x more objects. But because one call always follows the other, that means txn.Put allocated 20,000 objects per iteration on average. Yet, as we've seen, the numbers say that the benchmark allocated 4 objects per iteration. So my conclusion has to be that somewhere an assumption about how the tools work is wrong, or there is a bug in one of the tools. The number for txn.Put isn't the sum of all calls to Txn.Put, is it? It is the sum of calls for which lmdb.BenchmarkTxn_Put.func1 was directly below it in the stack (potentially even more granular than that). Right? If I'm wrong about that then I'm not sure what to take away from the snippet you have provided 😕 |
Oh. I think pprof may be counting allocations using malloc in C as well. If you look at the numbers using |
The confusing bit is that the output from go test only includes allocations between ResetTimer and StopTimer, but the memprofile includes all allocations. I removed the makeBenchDBVal call from the benchmark loop, and … Maybe this should be a separate issue?
|
Umm. Wait… So you made a change which should have reduced the number of allocs/op as reported by the benchmark output, and the number of allocs/op did not change? This sounds like a bug in the "testing" package, then. No? |
This should probably be in a separate issue regardless. |
I created #63 for this issue |
I'm currently looking at allocation counts in the codebase I am working on, and I noticed that lmdb-go allocates short-lived MDB_val objects, e.g. on every call to Txn.Get.
Would it be possible / safe to cache one or two MDB_vals in Txn, and reuse them in Txn.Get and friends? I think there might be an issue with RawRead Txns, but I'm not sure I understand the code base sufficiently.
If this sounds like it could work I'd be happy to contribute a PR. It is not a straight-up performance win, but it should reduce the pressure on the GC when querying lots and lots.
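To illustrate the pattern being described: this is a hypothetical sketch, not the actual lmdb-go source, and the `wrapVal` name is made up. A wrapper that builds a fresh MDB_val for every key or value means one short-lived heap object per Get, which is exactly the GC pressure the issue is about.

```go
package lmdbsketch

/*
#include <lmdb.h>
*/
import "C"

import "unsafe"

// wrapVal allocates a brand new MDB_val for p on every call. Each call is a
// short-lived heap allocation that the garbage collector must later reclaim.
func wrapVal(p []byte) *C.MDB_val {
	return &C.MDB_val{
		mv_size: C.size_t(len(p)),
		mv_data: unsafe.Pointer(&p[0]), // assumes len(p) > 0
	}
}
```

Caching one or two of these in the Txn, as proposed above, turns the per-call allocation into a per-transaction one.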