lmdb: Reduce number of heap allocations of MDB_val #59

Closed · lmb opened this issue Mar 24, 2016 · 21 comments

lmb (Contributor) commented Mar 24, 2016

I'm currently looking at allocation counts in the codebase I am working on, and I noticed that lmdb-go allocates short-lived MDB_val objects, e.g. on every call to Txn.Get.

Would it be possible / safe to cache one or two MDB_vals in Txn and reuse them in Txn.Get and friends? I think there might be an issue with RawRead Txns, but I'm not sure I understand the codebase well enough.

If this sounds like it could work, I'd be happy to contribute a PR. It is not a straight-up performance win, but it should reduce pressure on the GC when issuing lots and lots of queries.
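
For concreteness, a rough sketch of the idea (the Txn fields and the operrno error helper here are made up for illustration, not lmdb-go's actual internals, and it assumes lmdb.h is visible to cgo):

package lmdb

/*
#include <lmdb.h>
*/
import "C"

import "unsafe"

// Txn caches two MDB_vals for the lifetime of the transaction so that
// Get and friends can reuse them instead of allocating on every call.
type Txn struct {
    _txn *C.MDB_txn
    key  *C.MDB_val // allocated once per Txn, reused by every Get
    val  *C.MDB_val
}

func (txn *Txn) Get(dbi C.MDB_dbi, key []byte) ([]byte, error) {
    txn.key.mv_size = C.size_t(len(key))
    txn.key.mv_data = unsafe.Pointer(&key[0])
    if ret := C.mdb_get(txn._txn, dbi, txn.key, txn.val); ret != 0 {
        return nil, operrno("mdb_get", ret) // hypothetical errno-to-error helper
    }
    // Copy out of the memory map before the cached val is reused.
    b := C.GoBytes(txn.val.mv_data, C.int(txn.val.mv_size))
    *txn.key = C.MDB_val{} // zero the cached vals so stale pointers don't linger
    *txn.val = C.MDB_val{}
    return b, nil
}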

bmatsuo (Owner) commented Mar 24, 2016

Hrm. Storing pointers to two MDB_vals in the Txn object sounds like it might speed things up a little. I don't think RawRead should really be an issue, though to be prudent it might be best to zero out the MDB_val after each operation in which it is used (e.g. *txn.key = C.MDB_val{}).

Please go ahead and attempt this change. I am going to merge #57 once I fix the Travis CI job, but otherwise I would like to see what happens when this change is made.

Worth noting, though, that at some point in the future I believe this optimization will become unnecessary. With the pointer-passing restrictions introduced in Go 1.6, I think pointer arguments should eventually be eligible to be treated as non-escaping. But I'm not sure when/if that is really going to happen.

bmatsuo (Owner) commented Mar 25, 2016

An interesting side effect of the proposed change is that data races from concurrent Txn use (e.g. on a Readonly Txn) would be caught by the Go race detector. In the current implementation the only data races occur in C space, without write barriers, and are invisible to the race detector.
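
A contrived pure-Go stand-in (hypothetical types, not lmdb-go code) showing the mechanism: once the cached val lives in Go memory, concurrent use becomes a write the race detector can observe.

package main

import "sync"

type mdbVal struct{ size uintptr } // stand-in for a cached C.MDB_val

type txn struct{ key mdbVal } // one cached val shared by every call

func (t *txn) get(k []byte) {
    t.key.size = uintptr(len(k)) // concurrent gets now race on Go memory
}

func main() {
    t := &txn{}
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            t.get([]byte("k")) // go run -race reports this write as a race
        }()
    }
    wg.Wait()
}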

bmatsuo (Owner) commented Mar 30, 2016

@lmb are you working on putting together a pull request for this? Have you been having any difficulty that I could help with?

My curiosity got the best of me and I quickly threw together something that works as a proof of concept. Indeed, it looks like the suggested change helps, especially during transactions with lots of reads (e.g. a range scan). Turn on RawRead and you are flying, my friend (the current benchmark shows a 20% reduction in scan time).

After this is settled I definitely think it will be worth revisiting the performance numbers using integer flags (#11).

edit: It's worth noting that write performance isn't really affected, because the MDB_val structs for writes are built on the C stack.
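
Roughly, the write path hands raw pointers to a small C shim that builds its MDB_vals on the C stack. A sketch of the shape (not the actual lmdbgo.c source):

package lmdb

/*
#include <lmdb.h>

// The key/val MDB_vals live on the C stack for the duration of the call,
// so the Go side never heap-allocates them.
static int put_sketch(MDB_txn *txn, MDB_dbi dbi,
        void *kdata, size_t kn, void *vdata, size_t vn, unsigned int flags) {
    MDB_val key = {kn, kdata};
    MDB_val val = {vn, vdata};
    return mdb_put(txn, dbi, &key, &val, flags);
}
*/
import "C"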

lmb (Contributor, Author) commented Mar 30, 2016 via email

bmatsuo (Owner) commented Mar 30, 2016

Oh, of course. How rude of me. Weekends and holidays especially can fail to register with me. Forgive me, and don't worry about it. I'm glad you enjoyed yourself 😄

I pushed a commit to the branch bmatsuo/reuse-mdb-val. It's the sum of about 30 minutes of work, so it is rough.

All transactions allocate two MDB_vals when they are created. I think there is at least one important thing that could be improved: update transactions which will only issue Put commands do not need to allocate these MDB_vals (though PutReserve is special and requires one val), and the benchmarks in which this occurs are slightly slower as I have written things.

It's also worth considering that transactions which are opened to Get exactly one key will not get any faster from this change (there is no chance to reuse the allocated MDB_vals). And if you attempt some kind of lazy allocation strategy to accommodate write-only transactions (as mentioned in the previous paragraph), then you are going to make those single-Get transactions slower.

So there is a balancing act to be done here (a sketch of the lazy approach follows below). I think you really just need to think it through a bit and potentially go through several permutations of ideas in the code. Let me know if you have any thoughts or questions.
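
For illustration, the lazy strategy would look something like this hypothetical helper (reusing the cached-val fields from the sketch earlier in the thread); the nil check plus first-use allocation is exactly what penalizes a transaction performing a single Get:

// vals allocates the cached MDB_vals on first use, so write-only
// transactions never pay for them.
func (txn *Txn) vals() (*C.MDB_val, *C.MDB_val) {
    if txn.key == nil {
        txn.key = new(C.MDB_val)
        txn.val = new(C.MDB_val)
    }
    return txn.key, txn.val
}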

Cheers.

lmb (Contributor, Author) commented Mar 30, 2016

No worries, I'll take a look. Which invocation do you use for benchmarks?

Re the single-Get use case: in my opinion this will always be the worst case, since you incur the allocation overhead for a tx every single time (this includes acquiring a semaphore and writing to the lockfile). While it's important not to make this dog slow, anyone trying to seriously use lmdb from Go has to batch operations anyway. Thoughts?

bmatsuo (Owner) commented Mar 30, 2016

When I run the benchmarks I use go test -run=None -bench=. (in this case, adding the -benchmem flag may be good too). If I happen to be feeling paranoid about runtime variance I will use -benchtime=2s (the default is 1s).

You are mostly correct re a single Get, though I think it is actually quite common in practice. The overhead of any single-op transaction may outweigh the allocation overhead.

This has got me thinking, though: one thing I made no attempt to handle was subtransactions. It should actually be completely safe for subtransactions to use their parent's MDB_val objects (a parent must never be used while it has an active child). Subtransactions provide a batching mechanism for updates that is potentially quite useful for some applications, and it would be nice for them to benefit from reduced allocations too (see the sketch below).
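
Something like this hypothetical sketch (again borrowing the cached-val fields from earlier), leaning on LMDB's rule that a parent must stay idle while it has a live child:

// beginSub hands the parent's cached MDB_vals to the child instead of
// allocating fresh ones; safe because the parent cannot be used while
// the child is alive.
func beginSub(parent *Txn, sub *C.MDB_txn) *Txn {
    return &Txn{
        _txn: sub,
        key:  parent.key,
        val:  parent.val,
    }
}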

bmatsuo (Owner) commented Mar 30, 2016

Oh, and if you don't know about the benchcmp tool, you should be using it to compare benchmark results for different code versions. Just figured you should know if you didn't :)
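
A typical workflow (assuming benchcmp is installed, e.g. via go get golang.org/x/tools/cmd/benchcmp):

$ go test -run=None -bench=. -benchmem ./lmdb > old.txt
$ git checkout bmatsuo/reuse-mdb-val
$ go test -run=None -bench=. -benchmem ./lmdb > new.txt
$ benchcmp old.txt new.txt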

lmb (Contributor, Author) commented Mar 30, 2016

I implemented your change, which gets rid of the allocation cost for subtransactions. Regarding slowing down Get, this is what I get when benchmarking the same code twice on my OS X box:

lorenz@antares ~/go/src/github.com/bmatsuo/lmdb-go (reuse-mdb-val)*
$ benchcmp new.txt new2.txt
benchmark                              old ns/op     new ns/op     delta
BenchmarkEnv_ReaderList-4              59673         58537         -1.90%
BenchmarkTxn_Put-4                     2073          2010          -3.04%
BenchmarkTxn_PutReserve-4              2805          1847          -34.15%
BenchmarkTxn_PutReserve_writemap-4     2101          1525          -27.42%
BenchmarkTxn_Put_writemap-4            2855          1632          -42.84%
BenchmarkTxn_Get_ro-4                  1423          1446          +1.62%
BenchmarkTxn_Get_raw_ro-4              1075          855           -20.47%
BenchmarkScan_ro-4                     7488830       5804052       -22.50%
BenchmarkScan_raw_ro-4                 1710047       1299006       -24.04%
BenchmarkCursor-4                      1118          829           -25.85%
BenchmarkCursor_Renew-4                198           196           -1.01%
BenchmarkTxn_Sub_commit-4              293740        149557        -49.09%
BenchmarkTxn_Sub_abort-4               281365        145835        -48.17%
BenchmarkTxn_abort-4                   19417         16261         -16.25%
BenchmarkTxn_commit-4                  17269         16284         -5.70%
BenchmarkTxn_ro-4                      122001        124584        +2.12%
BenchmarkTxn_unmanaged_abort-4         16617         16634         +0.10%
BenchmarkTxn_unmanaged_commit-4        16623         16808         +1.11%
BenchmarkTxn_unmanaged_ro-4            113743        120440        +5.89%
BenchmarkTxn_renew-4                   620           575           -7.26%
BenchmarkTxn_Put_append-4              789           808           +2.41%
BenchmarkTxn_Put_append_noflag-4       1013          1021          +0.79%

benchmark                              old allocs     new allocs     delta
BenchmarkEnv_ReaderList-4              113            113            +0.00%
BenchmarkTxn_Put-4                     4              4              +0.00%
BenchmarkTxn_PutReserve-4              2              2              +0.00%
BenchmarkTxn_PutReserve_writemap-4     2              2              +0.00%
BenchmarkTxn_Put_writemap-4            4              4              +0.00%
BenchmarkTxn_Get_ro-4                  3              3              +0.00%
BenchmarkTxn_Get_raw_ro-4              3              2              -33.33%
BenchmarkScan_ro-4                     8152           8127           -0.31%
BenchmarkScan_raw_ro-4                 5              5              +0.00%
BenchmarkCursor-4                      3              3              +0.00%
BenchmarkCursor_Renew-4                0              0              +0.00%
BenchmarkTxn_Sub_commit-4              3              3              +0.00%
BenchmarkTxn_Sub_abort-4               3              3              +0.00%
BenchmarkTxn_abort-4                   5              5              +0.00%
BenchmarkTxn_commit-4                  5              5              +0.00%
BenchmarkTxn_ro-4                      5              5              +0.00%
BenchmarkTxn_unmanaged_abort-4         5              5              +0.00%
BenchmarkTxn_unmanaged_commit-4        5              5              +0.00%
BenchmarkTxn_unmanaged_ro-4            5              5              +0.00%
BenchmarkTxn_renew-4                   0              0              +0.00%
BenchmarkTxn_Put_append-4              5              5              +0.00%
BenchmarkTxn_Put_append_noflag-4       5              5              +0.00%

benchmark                              old bytes     new bytes     delta
BenchmarkEnv_ReaderList-4              7408          7408          +0.00%
BenchmarkTxn_Put-4                     96            96            +0.00%
BenchmarkTxn_PutReserve-4              48            48            +0.00%
BenchmarkTxn_PutReserve_writemap-4     48            48            +0.00%
BenchmarkTxn_Put_writemap-4            96            96            +0.00%
BenchmarkTxn_Get_ro-4                  1149          1158          +0.78%
BenchmarkTxn_Get_raw_ro-4              66            65            -1.52%
BenchmarkScan_ro-4                     9106023       9202313       +1.06%
BenchmarkScan_raw_ro-4                 72            72            +0.00%
BenchmarkCursor-4                      33            33            +0.00%
BenchmarkCursor_Renew-4                0             0             +0.00%
BenchmarkTxn_Sub_commit-4              65            65            +0.00%
BenchmarkTxn_Sub_abort-4               65            65            +0.00%
BenchmarkTxn_abort-4                   97            97            +0.00%
BenchmarkTxn_commit-4                  97            97            +0.00%
BenchmarkTxn_ro-4                      97            97            +0.00%
BenchmarkTxn_unmanaged_abort-4         97            97            +0.00%
BenchmarkTxn_unmanaged_commit-4        97            97            +0.00%
BenchmarkTxn_unmanaged_ro-4            97            97            +0.00%
BenchmarkTxn_renew-4                   0             0             +0.00%
BenchmarkTxn_Put_append-4              104           104           +0.00%
BenchmarkTxn_Put_append_noflag-4       104           104           +0.00%

Seems like the timing info is very unreliable.

lmb (Contributor, Author) commented Mar 30, 2016

I also looked at reducing the number of allocations done in a Put, but no such luck. Do you know why 4 allocations are happening? I would expect maybe 2, for the two unsafe.Pointer conversions.

bmatsuo (Owner) commented Mar 30, 2016

I believe that the unexpected allocations you see are happening in the BenchmarkTxn_Put function itself (in bench_test.go). It is unfortunate that it throws the numbers off.

It's always good to verify tests still pass 😄 But indeed the timing is pretty erratic, and this doesn't seem too different from what I would expect. As long as the relevant benchmarks go down, that is good. But you do typically see weird spikes in unrelated bits. Like this,

BenchmarkCursor-4                      1118          829           -25.85%

No idea why that should be ~300ns faster now.

But you also see a similar reduction in scan time. That is good.

As a general word of caution though, it can be good to take a look at the benchmark source to know what is really being timed. Some things may not work as you expect them to (as evidenced by your question about Put allocations).

In the case of BenchmarkScan_raw_ro, transaction initialization and cleanup are not counted. It is worth knowing.

bmatsuo (Owner) commented Mar 30, 2016

Also out of curiosity, what os/arch are you testing on? And are you in a VM?

lmb (Contributor, Author) commented Mar 30, 2016

OS X 10.11.4, x86_64, with the kernel pprof fix applied. How do you want to proceed with this? I can create a PR, but frankly you've done the work; I just added 4 lines for the subtxn case.

I did look at the Put benchmark:

(pprof) list lmdb.BenchmarkTxn_Put.func1
Total: 506370
ROUTINE ======================== github.com/bmatsuo/lmdb-go/lmdb.BenchmarkTxn_Put.func1 in /Users/lorenz/go/src/github.com/bmatsuo/lmdb-go/lmdb/bench_test.go
         0     440425 (flat, cum) 86.98% of Total
         .          .     77:   err = env.Update(func(txn *Txn) (err error) {
         .          .     78:       b.ResetTimer()
         .          .     79:       defer b.StopTimer()
         .          .     80:       for i := 0; i < b.N; i++ {
         .          .     81:           k := ps[rand.Intn(len(ps)/2)*2]
         .         21     82:           v := makeBenchDBVal(&rc)
         .     440404     83:           err := txn.Put(dbi, k, v, 0)
         .          .     84:           if err != nil {
         .          .     85:               return err
         .          .     86:           }
         .          .     87:       }
         .          .     88:       return nil

If I'm interpreting b.ResetTimer() correctly, only the allocations on lines 82 & 83 should be taken into account for the -benchmem output. This means that something like 99% of the 4 allocations per Put really happen inside Put?
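
For reference, a minimal illustration of how ResetTimer/StopTimer bound what -benchmem counts (generic names, not from the lmdb-go suite):

package lmdb_test

import "testing"

var sink []byte

func BenchmarkExample(b *testing.B) {
    setup := make([]byte, 1<<20) // not counted: allocated before ResetTimer
    _ = setup
    b.ResetTimer()      // zeroes the elapsed time and allocation counters
    defer b.StopTimer() // allocations after this call are not counted either
    for i := 0; i < b.N; i++ {
        sink = make([]byte, 16) // counted in allocs/op and B/op
    }
}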

bmatsuo (Owner) commented Mar 31, 2016

I think you can go ahead and open a PR with subtxn stuff. The commit I made is tagged with name/email. I don't think there's an attribution issue or anything.

I'm confused about that pprof output. Is it a memprofile? I actually have never been able to make effective use of a memprofile output for some reason.

lmb (Contributor, Author) commented Mar 31, 2016

Yeah, that's memprofile output, with -alloc_objects passed to go tool pprof. It shows the number of allocations done per line.

bmatsuo (Owner) commented Mar 31, 2016

Oh, interesting. Maybe that flag is what has been keeping -memprofile from being useful to me... Thanks for the info.

I think I understand what you meant in your comment now, but I don't think that interpretation makes sense given the data. If it is the correct interpretation then I suspect there is a bug in the profiling of cgo code. Let's look at the following two lines.

         .         21     82:           v := makeBenchDBVal(&rc)
         .     440404     83:           err := txn.Put(dbi, k, v, 0)

So, makeBenchDBVal allocated 21 objects (not 21 bytes; 21 slices). And txn.Put allocated roughly 20,000x more objects. Because one call always follows the other, that means txn.Put allocated about 20,000 objects per iteration on average. But as we've seen, the benchmark output says 4 allocations per iteration. So my conclusion has to be that somewhere an assumption about how the tools work is wrong, or there is a bug in one of the tools.

The count for txn.Put isn't the sum of all calls to Txn.Put, is it? It is the sum of calls for which lmdb.BenchmarkTxn_Put.func1 was directly below it in the stack (potentially even more granular than that). Right? If I'm wrong about that then I'm not sure what to take away from the snippet you have provided 😕

bmatsuo (Owner) commented Mar 31, 2016

Oh. I think pprof may be counting allocations made with malloc in C as well. If you look at the numbers using -alloc_space, the space allocated in txn.Put is ~98MB. This may be approximately equal to the total amount of data in the database at the end of the benchmark. (edit: that is probably a bogus statement, but I think the C malloc thing may be true)

lmb (Contributor, Author) commented Mar 31, 2016

The confusing bit is that the output from go test only includes allocations between ResetTimer and StopTimer, but the memprofile includes all allocations.

I removed the makeBenchDBVal call from the benchmark loop, and go test still reports 4 allocations per op. Looking at the memory profile shows that all allocations happen in txn.Put. The numbers for txn.Put (in BenchmarkTxn_Put.func1) do not match up with Txn.Put overall, since we also populate the DB prior to running the benchmark.

Maybe this should be a separate issue?

lorenz@antares-3 ~/go/src/github.com/bmatsuo/lmdb-go (reuse-mdb-val)*
$ go test -run XXX -bench 'Put$' -benchmem -memprofile mem.profile -memprofilerate 1 ./lmdb
PASS
BenchmarkTxn_Put-4    200000         12771 ns/op          96 B/op          4 allocs/op
--- BENCH: BenchmarkTxn_Put-4
    bench_test.go:446: initializing random source data
ok      github.com/bmatsuo/lmdb-go/lmdb 3.695s
lorenz@antares-3 ~/go/src/github.com/bmatsuo/lmdb-go (reuse-mdb-val)*
$ go tool pprof -alloc_objects lmdb.test mem.profile
Entering interactive mode (type "help" for commands)
(pprof) list Txn_Put\.func1$
Total: 906361
ROUTINE ======================== github.com/bmatsuo/lmdb-go/lmdb.BenchmarkTxn_Put.func1 in /Users/lorenz/go/src/github.com/bmatsuo/lmdb-go/lmdb/bench_test.go
         4     840408 (flat, cum) 92.72% of Total
         .          .     73:       b.Errorf("populate db: %v", err)
         .          .     74:       return
         .          .     75:   }
         .          .     76:
         .          .     77:   err = env.Update(func(txn *Txn) (err error) {
         4          4     78:       v := make([]byte, 100)
         .          .     79:
         .          .     80:       b.ResetTimer()
         .          .     81:       defer b.StopTimer()
         .          .     82:       for i := 0; i < b.N; i++ {
         .          .     83:           k := ps[rand.Intn(len(ps)/2)*2]
         .          .     84:           // v := makeBenchDBVal(&rc)
         .     840404     85:           err := txn.Put(dbi, k, v, 0)
         .          .     86:           if err != nil {
         .          .     87:               return err
         .          .     88:           }
         .          .     89:       }
         .          .     90:       return nil
(pprof) list \(\*Txn\)\.Put$
Total: 906361
ROUTINE ======================== github.com/bmatsuo/lmdb-go/lmdb.(*Txn).Put in /Users/lorenz/go/src/github.com/bmatsuo/lmdb-go/lmdb/txn.go
    905940     905940 (flat, cum)   100% of Total
         .          .    314:   vn := len(val)
         .          .    315:   if vn == 0 {
         .          .    316:       val = []byte{0}
         .          .    317:   }
         .          .    318:
    905940     905940    319:   ret := C.lmdbgo_mdb_put2(
         .          .    320:       txn._txn, C.MDB_dbi(dbi),
         .          .    321:       unsafe.Pointer(&key[0]), C.size_t(kn),
         .          .    322:       unsafe.Pointer(&val[0]), C.size_t(vn),
         .          .    323:       C.uint(flags),
         .          .    324:   )

bmatsuo (Owner) commented Mar 31, 2016

Umm, wait... So you made a change which should have reduced the number of allocs/op as reported by the benchmark output, and the number of allocs/op did not change? This sounds like a bug in the "testing" package, then. No?

bmatsuo (Owner) commented Mar 31, 2016

This should probably be in a separate issue regardless.

bmatsuo (Owner) commented Mar 31, 2016

I created #63 for this issue.

bmatsuo modified the milestone: v1.6.0 (Mar 31, 2016)