-
-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor query and search for better performance #674
base: main
Are you sure you want to change the base?
Refactor query and search for better performance #674
Conversation
src/datahike/query.cljc
Outdated
|
||
(defn single-substitution-xform | ||
"Returns a transducer that substitutes the symbols for a single relation." | ||
[search-context relation-index subst-map filt-map] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should think about better names instead of subst-map
and filt-map
. This code is a bit complicated so better names would help for the sake of readability. And also provide more comments about what is happening algorithmically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that helps in general, but it is also ok if they are explained where they are introduced, or renamed to shorthands in internal scopes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good to me, the search changes make sense, but I still need another round of review to go through the query namespace and macros in more detail. Lmk if you would like to get feedback on something specific as well. I need to understand what is happening and you are right that the code is a bit dense and might get harder to pick up, so if we can make it clearer that is also helpful. Nonetheless I think this should already be mergeable, I just want to make sure that we get things right.
(.after ^Date (.-v d) ^Date time-point) | ||
(>= (.-tx d) time-point))) | ||
filtered-tx-ids (dbu/filter-txInstant datoms since-pred db)] | ||
(into [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be a problem that it is not lazy anymore now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point, but right now I cannot see that it would be a problem.
src/datahike/query.cljc
Outdated
(into [] (distinct-tuples) tuples)) | ||
([] | ||
(let [seen (HashSet.)] | ||
(filter #(.add ^HashSet seen (vec %)))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this was beneficial for performance compared to persistent set. The question is what happens if the returned lazy seq is shared between threads. I think it might be fine because adding to the set is monotonic, but it could be that it can return duplicates then. Alternatively we could ensure synchronization https://www.baeldung.com/java-concurrent-hashset-concurrenthashmap#thread-safe-hashset-using-collections-utility-class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point! I wanted this function to be usable also as a transducer constructed from the expression (distinct-tuples)
and this (filter ...)
expression returns a filtering transducer that has a stateful predicate. In the two places where we use this transducer we use it eagerly so I don't think we will have any issues because of that.
I thought using a HashSet
to implement this predicate was an elegant way of doing it. But I now realize it makes the code less portable to ClojureScript. So I can spend some time to see if we can have a pure Clojure solution without Java interop that does not sacrifice performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came up with a pure Clojure solution that internally uses clojure.core/distinct
.
src/datahike/query.cljc
Outdated
|
||
(defn single-substitution-xform | ||
"Returns a transducer that substitutes the symbols for a single relation." | ||
[search-context relation-index subst-map filt-map] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that helps in general, but it is also ok if they are explained where they are introduced, or renamed to shorthands in internal scopes.
src/datahike/query.cljc
Outdated
(sum-rel a) | ||
(sum-rel b)))))) | ||
(defn sum-rel | ||
([a] a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I don't think we need it.
I believe the reason why it was there is probably that I wanted to use sum-rel
as a step function for a transducer and that would require this arity. But I probably dropped that approach as it would have made this PR less focused.
Can you rebase this? I hope it is not too annoying. |
30537ad
to
47725f4
Compare
Thanks @whilo for the review so far! I have rebased and also improved the implementation of |
This code mainly refactors the namespaces
datahike.query
anddatahike.db.search
to improve performance of the query engine. This means that we remove the previous "relprod strategies" and the algorithm in this PR best resembles the "select-all" strategy that we had before. I believe there are many things that can contribute to the speed improvement, but this refactoring means that work is moved out from the innermost loop of the query engine and this loop now consists of a transduction insidesearch-batch-fn
. Using transducers probably avoid the creation of lots of short-lived intermediate values and lazy sequences and seem to contribute to the performance. In some cases I use macros to move work out of loops and that comes at a cost in terms of readability. I explored simpler ways of writing the code but in the places where I do use macros to generate code to avoid work at runtime, they seem to be justified, see for example the comments in the functionsingle-substitution-xform
.Here are the results. The results for this PR is labeled
Datahike new query/search
in the table below:The code to run the benchmark is found at https://gitlab.com/arbetsformedlingen/taxonomy-dev/backend/experimental/datahike-benchmark/.
I believe @whilo will want to review this code, he knows what it is about.