Skip to content
This repository has been archived by the owner on Nov 29, 2023. It is now read-only.

Use of Native Xpath - Faster calculate when big data #906

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mgogh
Copy link

@mgogh mgogh commented Jun 10, 2022

If you create a form with select_one_from_file based on big csv file (more than 3000 rows and 10 columns), then you make ten calculate for each columns, it will be very slow. (the select_one was filtered by an other question).

Exemple : In France, we have 34 000 communes, we can filter it with region and department.

Trying to always use Xpath native approach improve the value search when model are big.
If expr contains custom OpenRosa functions, it will use the fork as expected (jsEvaluate) which is slower than the native method.

@eyelidlessness
Copy link
Contributor

Hi @mgogh, thanks for this PR. I arrived at a similar approach in a project I'm using for exploring/prototyping a variety of performance improvements, but hadn't gotten around to getting it into a PR. In my prototype, I arrived at a few additional optimizations.

  1. I found that pre-compiling expressions with document.createExpression performs better than calling document.evaluate directly. (Surprisingly, it performs better even if you discard the expression and recompile on each evaluation, at least in Chrome.)

  2. Wrapping both cases in a class of the same shape performs better still, and catching the error on construction rather than evaluation performs much better still. This is partly because classes with a consistent shape are good JIT optimization targets, and partly because the try/catch branching is much more minimal and predictable.

I'll take some time to bring the pertinent prototype code in for a PR. In the meantime, would you be able to share a form like the one you described? I'd like to add it to my collection of performance-related forms.

@@ -1561,30 +1561,18 @@ FormModel.prototype.evaluate = function (
}

// try native to see if that works... (will not work if the expr contains custom OpenRosa functions)
if (
tryNative &&
Copy link
Member

@MartijnR MartijnR Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, it is unfortunately not safe to always try native because in this ecosystem we do things that can be evaluated by a native evaluator but return an incorrect result for ODK XForms. I'm surprised no tests failed though (or did they?).

The main (and perhaps only) things are comparisons and arithmetic with date/dateTime strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is a shame. I see a few other cases called out in the spec. This feels like something we could maybe revisit with the tree-sitter-xpath grammar, which is fast enough that we'd still see significant performance improvements.

Copy link
Member

@MartijnR MartijnR Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a huge shame. We've had discussions about a turbo mode that requires form designers to wrap such strings with date and date-time.

Yes, indeed, good find. There are some native XPath 1.0. functions that have deviating behavior so some of those would also be an issue.

Curious about tree-sitter! Will it be able to find out if the value of /path/to/node is a date string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alone, the tree-sitter grammar won't be able to do any kind of static type analysis, that's not available in the AST. Here's ways I'd imagine it could help for this case:

  • Rule out expressions with functions we know deviate (although we can skip some because they'll fail to compile, which should perform similarly).
  • Rule out expressions where we know an argument position is of date/date-time type.
  • Relate nodeset expressions to their bindings, to identify their type and rule out expressions with those nodesets1.
  • Rule out expressions with literals which would be treated as dates.

If all of this sounds like it has overlap with openrosa-xpath-evaluator's parsing responsibilities... it does, heh. But it would probably be a good fit for this case, because tree-sitter is exceptionally fast.

Footnotes

  1. This dovetails with other prototype work I've explored, identifying nodeset subexpressions to determine their dependencies. This already works for a huge set of expressions I pulled from openrosa-xpath-evaluator's tests, you can see the test fixtures used to validating that here. The grammar has proven pretty reliable so far. You can see example usage here and here to find nodeset sub-expressions. I have additional prototype work (currently only local) for resolving those sub-expressions to actual related nodes, which so far has worked for everything I've tried except with relative nested-subexpressions (e.g. in a predicate).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool stuff!

I have additional prototype work (currently only local) for resolving those sub-expressions to actual related nodes, which so far has worked for everything

I'm guessing the challenge might be to prevent this from becoming too costly (possibly negate performance improvements achieved by sending to the native evaluator). Will be awesome if that is possible!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s more than possible, it’s a reality! I’ll push up instructions for running and measuring the subexpression logic when I get a chance, but for now I’ll just say that finding subexpressions in all of the cases I pulled from the current evaluator test suite averages 1ms or less even when my computer is throttling under heavy load. In my local stress testing, the native evaluator is also generally ~1ms, the extended evaluator is generally 10+ms.

@jdugh
Copy link
Contributor

jdugh commented Jun 11, 2022

Hi @eyelidlessness ,
Here, a form with a list of avg 40 000 rows (French municipalities) and 11 calcuation.
https://ee.kobotoolbox.org/Cijpflhc
the form takes a long time to load the data (be patient)

The XLSForm and datas :
XLSFORM_big_list.xlsx
communes.csv
departement.csv
region.csv

Try a smaller list (~1 300 rows) :
https://ee.kobotoolbox.org/1CmdNbQ0
communes_light.csv
XLSFORM_small_list.xlsx
(same departement/region)

@eyelidlessness
Copy link
Contributor

Thank you @jdugh! I meant to reply earlier, but wound up on a yak shaving adventure trying to get the large CSV to load on my local enketo-express/ODK central setup.

That aside, this is an awesome case to add to my growing collection of performance stress tests.

Earlier today @lognaturel and I discussed a safer, more limited approach to this. Instead of always deferring to the native evaluator, or doing more complex analysis of queries, we'll likely start with a more naive analysis to optimize queries which are obviously straightforward (nodeset references, basic operators with non-ambiguous operands). This isn’t the end of the line for optimization potential I’m exploring, but it will be a big perf boost for a lot of common cases and a lot of the groundwork is already laid.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants