Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor query and search for better performance #674

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jonasseglare
Copy link
Collaborator

@jonasseglare jonasseglare commented May 9, 2024

This code mainly refactors the namespaces datahike.query and datahike.db.search to improve performance of the query engine. This means that we remove the previous "relprod strategies" and the algorithm in this PR best resembles the "select-all" strategy that we had before. I believe there are many things that can contribute to the speed improvement, but this refactoring means that work is moved out from the innermost loop of the query engine and this loop now consists of a transduction inside search-batch-fn. Using transducers probably avoid the creation of lots of short-lived intermediate values and lazy sequences and seem to contribute to the performance. In some cases I use macros to move work out of loops and that comes at a cost in terms of readability. I explored simpler ways of writing the code but in the places where I do use macros to generate code to avoid work at runtime, they seem to be justified, see for example the comments in the function single-substitution-xform.

Here are the results. The results for this PR is labeled Datahike new query/search in the table below:

TARGET                      ABS TIME (s)   REL TIME
Some other db                      4.471         6%
Datahike 0.6.1558                 74.294       100%
Datahike 0.6.1559                 51.827        70%
Datahike new query/search         26.377        36%

The code to run the benchmark is found at https://gitlab.com/arbetsformedlingen/taxonomy-dev/backend/experimental/datahike-benchmark/.

I believe @whilo will want to review this code, he knows what it is about.


(defn single-substitution-xform
"Returns a transducer that substitutes the symbols for a single relation."
[search-context relation-index subst-map filt-map]
Copy link
Collaborator Author

@jonasseglare jonasseglare May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should think about better names instead of subst-map and filt-map. This code is a bit complicated so better names would help for the sake of readability. And also provide more comments about what is happening algorithmically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that helps in general, but it is also ok if they are explained where they are introduced, or renamed to shorthands in internal scopes.

Copy link
Member

@whilo whilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me, the search changes make sense, but I still need another round of review to go through the query namespace and macros in more detail. Lmk if you would like to get feedback on something specific as well. I need to understand what is happening and you are right that the code is a bit dense and might get harder to pick up, so if we can make it clearer that is also helpful. Nonetheless I think this should already be mergeable, I just want to make sure that we get things right.

(.after ^Date (.-v d) ^Date time-point)
(>= (.-tx d) time-point)))
filtered-tx-ids (dbu/filter-txInstant datoms since-pred db)]
(into []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a problem that it is not lazy anymore now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point, but right now I cannot see that it would be a problem.

src/datahike/db/search.cljc Show resolved Hide resolved
(into [] (distinct-tuples) tuples))
([]
(let [seen (HashSet.)]
(filter #(.add ^HashSet seen (vec %))))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was beneficial for performance compared to persistent set. The question is what happens if the returned lazy seq is shared between threads. I think it might be fine because adding to the set is monotonic, but it could be that it can return duplicates then. Alternatively we could ensure synchronization https://www.baeldung.com/java-concurrent-hashset-concurrenthashmap#thread-safe-hashset-using-collections-utility-class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! I wanted this function to be usable also as a transducer constructed from the expression (distinct-tuples) and this (filter ...) expression returns a filtering transducer that has a stateful predicate. In the two places where we use this transducer we use it eagerly so I don't think we will have any issues because of that.

I thought using a HashSet to implement this predicate was an elegant way of doing it. But I now realize it makes the code less portable to ClojureScript. So I can spend some time to see if we can have a pure Clojure solution without Java interop that does not sacrifice performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came up with a pure Clojure solution that internally uses clojure.core/distinct.


(defn single-substitution-xform
"Returns a transducer that substitutes the symbols for a single relation."
[search-context relation-index subst-map filt-map]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that helps in general, but it is also ok if they are explained where they are introduced, or renamed to shorthands in internal scopes.

(sum-rel a)
(sum-rel b))))))
(defn sum-rel
([a] a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I don't think we need it.

I believe the reason why it was there is probably that I wanted to use sum-rel as a step function for a transducer and that would require this arity. But I probably dropped that approach as it would have made this PR less focused.

@whilo
Copy link
Member

whilo commented Jun 17, 2024

Can you rebase this? I hope it is not too annoying.

@jonasseglare
Copy link
Collaborator Author

Thanks @whilo for the review so far! I have rebased and also improved the implementation of distinct-tuples as you requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants