core-clp: Rewrite wildcard matching method and add systematic unit tests (fixes #427). #428

kirkrodrigues · 2024-06-07T00:09:41Z

Description

This PR:

rewrites clp::string_utils::wildcard_match_unsafe_case_sensitive to fix #427 ("clp::string_utils::wildcard_match_unsafe_case_sensitive fails to match with certain queries"), to slightly simplify the logic, and to add more comments explaining the algorithm;
completely rewrites the unit tests for wildcard matching so that they (hopefully) systematically test all cases;
replaces the wildcard performance test with a more realistic one that matches against several lines (rather than a single line) from an example log file.

Validation performed

Validated that the unit tests pass.

LinZhihao-723

Reviewed implementation. Overall I think it is now much easier to understand compared to the previous implementation.

LinZhihao-723 · 2024-06-07T06:15:13Z

components/core/src/clp/string_utils/string_utils.cpp

+    // Handle `tame` or `wild` being empty
+    if (wild.empty()) {
+        return tame.empty();
+    }
+    if (tame.empty()) {
+        return "*" == wild;
+    }


Why not move the empty check to the beginning of the function?

LinZhihao-723 · 2024-06-07T06:23:37Z

components/core/src/clp/string_utils/string_utils.cpp

+
+        // Handle boundary conditions
+        if (tame_end_it == tame_it) {
+            return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it);


This line is repeated in the next for loop. How about we make a helper, something like is_wild_reaching_end_or_trailing_star? (Maybe we can discuss and come up with a better name.)
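A minimal sketch of what such a helper could look like, assuming the iterators are `std::string_view::const_iterator`s (the PR does not confirm the iterator type, and the name is only the one floated in this comment):

```cpp
#include <string_view>

// Hypothetical helper; the name comes from the review suggestion and is not part of the PR.
// Returns true when the rest of the wildcard pattern is either empty or exactly one
// trailing '*', i.e. when it can match an exhausted `tame`.
inline bool is_wild_reaching_end_or_trailing_star(
        std::string_view::const_iterator wild_it,
        std::string_view::const_iterator wild_end_it
) {
    // The first operand short-circuits, so `*wild_it` is never read at the end.
    return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it);
}
```

Both call sites could then reduce to `return is_wild_reaching_end_or_trailing_star(wild_it, wild_end_it);`.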

LinZhihao-723 · 2024-06-07T06:25:29Z

components/core/src/clp/string_utils/string_utils.cpp

+        // Handle boundary conditions
+        if (tame_end_it == tame_it) {
+            return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it);
+        } else if (wild_end_it == wild_it) {


Do you think using a new if instead of else if would be clearer? Essentially, these are two different cases to handle.

LinZhihao-723 · 2024-06-07T06:33:29Z

components/core/src/clp/string_utils/string_utils.cpp

+        } else if (wild_end_it == wild_it) {
+            if (tame_end_it == tame_it) {


If tame_end_it == tame_it and wild_end_it == wild_it, we would already have returned in the if above, right?

LinZhihao-723 · 2024-06-07T06:39:53Z

components/core/src/clp/string_utils/string_utils.cpp

+            // Reset to bookmarks
+            tame_it = tame_bookmark_it + 1;
+            wild_it = wild_bookmark_it;
+            if (false
+                == advance_tame_to_next_match(tame_end_it, tame_it, tame_bookmark_it, wild_it))
+            {
+                return false;


Correct me if I'm wrong: if this branch is triggered, wild has already reached the end without consuming the entire tame. We should be handling the last group of tame after the last *. In this case, we only need to match the last n characters of tame (where n is determined by wild_it - wild_bookmark_it, properly counting escape chars in between), right? For example, if tame is "aaaaaaaa" and wild is "*a", we don't have to advance tame to match every single "a"; we can jump to match the last one.

You're right. Are you suggesting, in this case, that we should iterate backwards from the end of tame to see if it matches the last group in wild?

Actually, iterating backwards is non-trivial because of escaped characters. If we see a ? in wild, we have to check the character before it to know if it's an escape character. But even if it is, we don't know whether it's escaping the ? or is itself preceded by an escape. So it's easier to always iterate forwards.
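To illustrate the ambiguity, here is a rough sketch, assuming `\` is the escape character; the function name is hypothetical and not part of the PR. Deciding whether a single character is escaped means counting the whole run of backslashes before it, so a backward scan cannot classify a character by looking at just its immediate neighbour:

```cpp
#include <cstddef>
#include <string_view>

// Illustrative only; assumes '\\' is the escape character.
// Returns true if the character at `pos` in `wild` is escaped, i.e. it is preceded by an
// odd number of consecutive backslashes.
inline bool is_escaped_at(std::string_view wild, std::size_t pos) {
    std::size_t num_backslashes{0};
    // Walk left from `pos` while we keep seeing backslashes
    while (pos > num_backslashes && '\\' == wild[pos - num_backslashes - 1]) {
        ++num_backslashes;
    }
    return 1 == num_backslashes % 2;
}
```

For instance, the `?` in `\?` is escaped, while the `?` in `\\?` is a real wildcard preceded by an escaped backslash. A forward scan resolves this naturally because each escape is consumed together with the character it escapes.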

LinZhihao-723 · 2024-06-07T06:44:49Z

components/core/src/clp/string_utils/string_utils.cpp

+ * @param tame_it Returns `tame`'s updated iterator.
+ * @param tame_bookmark_it Returns `tame`'s updated bookmark.
+ * @param wild_it Returns `wild`'s updated iterator.
+ * @return true on success, false if `tame` cannot match `wild`.


Suggested change

- * @return true on success, false if `tame` cannot match `wild`.
+ * @return Whether `tame` can successfully match `wild`,

LinZhihao-723 · 2024-06-07T06:49:22Z

components/core/src/clp/string_utils/string_utils.cpp

+ *
+ * NOTE:
+ * - This method expects that `tame_it` < `tame_end_it`
+ * - This method should be inlined for performance.


I'm not sure if this is enforced. AFAIK, we are just suggesting that the compiler inline the method. But I'm also not sure whether we should use the always_inline attribute, since we may need to compile this file using non-GNU compilers.

It isn't enforced, but at the same time I don't know if forcing it to be inlined is necessary. The reason I said it should be inlined is because in past performance tests, the inline hint did make a difference. Nowadays though, it seems like gcc inlines it regardless of the hint.
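If forcing inlining ever did prove necessary, one common way to keep it portable is a small macro that picks the compiler-specific spelling. The sketch below is not from the PR: the macro name `EXAMPLE_FORCE_INLINE` and the function `is_wildcard` are made up for illustration.

```cpp
// Hypothetical macro; not part of the PR.
#if defined(_MSC_VER)
    // MSVC spelling
    #define EXAMPLE_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
    // GCC and Clang spelling
    #define EXAMPLE_FORCE_INLINE inline __attribute__((always_inline))
#else
    // Other compilers: fall back to the plain `inline` hint
    #define EXAMPLE_FORCE_INLINE inline
#endif

// Hypothetical function used only to show the macro in use
EXAMPLE_FORCE_INLINE bool is_wildcard(char c) {
    return '*' == c || '?' == c;
}
```

Given that GCC reportedly inlines the method regardless of the hint, the plain `inline` keyword may well be enough. The macro only matters if a performance test shows a compiler declining to inline.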

LinZhihao-723

Two high level questions:

The generation of wild and tame is deterministic. Can we pre-generate them and add the script used for generation? Or do we not really care about the time cost?
Without annotating any baseline runtime or run time comparison, how do we detect performance regression in the updated performance test?
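As a rough illustration of the second question, one possible approach is to time the workload and compare it against a recorded baseline with some tolerance. This is only a sketch under assumptions: the helper names, the baseline value, and the tolerance are all hypothetical, and wall-clock baselines vary between machines, so any real threshold would need to be calibrated per environment.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs `func` once and returns the elapsed wall-clock time in nanoseconds.
template <typename Func>
std::int64_t time_in_nanoseconds(Func&& func) {
    auto const begin = std::chrono::steady_clock::now();
    func();
    auto const end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();
}

// Hypothetical helper: returns true if `measured_ns` exceeds `baseline_ns` by more than
// `tolerance` (e.g., 0.2 means 20% slower than the baseline).
inline bool is_regression(std::int64_t baseline_ns, std::int64_t measured_ns, double tolerance) {
    return static_cast<double>(measured_ns)
           > static_cast<double>(baseline_ns) * (1.0 + tolerance);
}
```

A performance test could then fail when `is_regression(recorded_baseline_ns, time_in_nanoseconds(run_matches), 0.2)` returns true, where `recorded_baseline_ns` and `run_matches` are placeholders for a stored baseline and the matching workload.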

LinZhihao-723 · 2024-06-07T07:48:25Z