Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core-clp: Rewrite wildcard matching method and add systematic unit tests (fixes #427). #428

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

kirkrodrigues
Copy link
Member

Description

This PR:

  • rewrites clp::string_utils::wildcard_match_unsafe_case_sensitive both to fix clp::string_utils::wildcard_match_unsafe_case_sensitive fails to match with certain queries #427, to slightly simplify the logic, and to add more comments to explain the algorithm;
  • completely rewrites the unit tests for wildcard matching so that they (hopefully) systematically test all cases;
  • replaces the wildcard performance test to be more realistic by matching against several lines (rather than a single line) from an example log file.

Validation performed

Validated unit tests passed.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed implementation. Overall I think it is now much easier to understand compared to the previous implementation.

components/core/src/clp/string_utils/string_utils.cpp Outdated Show resolved Hide resolved
Comment on lines +237 to +243
// Handle `tame` or `wild` being empty
if (wild.empty()) {
return tame.empty();
}
if (tame.empty()) {
return "*" == wild;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not moving the empty check to the beginning of the function?


// Handle boundary conditions
if (tame_end_it == tame_it) {
return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is repeated in the next for loop. How about we make a helper, sth like is_wild_reaching_end_or_trailing_star? (maybe we can discuss to come up with a better name)

// Handle boundary conditions
if (tame_end_it == tame_it) {
return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it);
} else if (wild_end_it == wild_it) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think using a new if instead of else if would be more clear? Essentially these are two different cases to handle.

Comment on lines 324 to 325
} else if (wild_end_it == wild_it) {
if (tame_end_it == tame_it) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if tame_end_it == tame_it and wild_end_it == wild_it, we should already return in the above if right?

Comment on lines +329 to +335
// Reset to bookmarks
tame_it = tame_bookmark_it + 1;
wild_it = wild_bookmark_it;
if (false
== advance_tame_to_next_match(tame_end_it, tame_it, tame_bookmark_it, wild_it))
{
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong: if this branch is triggered, wild has already reached the end without consuming the entire tame. We should be handling the last group of tame after the last *. In this case, we only need to match the last n characters (determined by wild_it - wild_bookmark_it, and properly counting escape chars in between) in tame right? For example, if tame is "aaaaaaaa" and wild is "*a", we don't have to advance tame to match every single "a" but jump to match the last one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Are you suggesting, in this case, that we should iterate backwards from the end of tame to see if it matches the last group in wild?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, iterating backwards is non-trivial because of escaped characters. If we see a ? in wild, we have to check the character before it to know if it's an escape character. But even if it is, we don't know if it's escaping the ? or it's preceded by an escape itself. So it's easier to always iterate forwards.

* @param tame_it Returns `tame`'s updated iterator.
* @param tame_bookmark_it Returns `tame`'s updated bookmark.
* @param wild_it Returns `wild`'s updated iterator.
* @return true on success, false if `tame` cannot match `wild`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @return true on success, false if `tame` cannot match `wild`.
* @return Whether `tame` can successfully match `wild`,

*
* NOTE:
* - This method expects that `tame_it` < `tame_end_it`
* - This method should be inlined for performance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is enforced. afaik we are just suggesting the compiler to inline the method. But I'm also not sure whether we should use always_intline attribute since we may need to compile this file using none-gnu compilers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't enforced, but at the same time I don't know if forcing it to be inlined is necessary. The reason I said it should be inlined is because in past performance tests, the inline hint did make a difference. Nowadays though, it seems like gcc inlines it regardless of the hint.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two high level questions:

  1. The generation of the wild and tame is deterministic. Can we pre-generate them and add a script used for generation? Or we don't really care about the time cost
  2. Without annotating any baseline runtime or run time comparison, how do we detect performance regression in the updated performance test?

// possibilities---to generate this variety. For each wildcard string, we also generate one or
// more strings that can be matched by the wildcard string.

// The template below is meant to test 1-3 groups of WildcardStringAlphabets separated by '*'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the generated wild, * is added only at beginning three times (since i is set to iterate 3 times). Did we miss to add * in between when creating template_wildcard_str?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

return;
}

size_t const tame_size_before_modification = tame.size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t const tame_size_before_modification = tame.size();
auto const tame_size_before_modification = tame.size();


size_t const tame_size_before_modification = tame.size();
auto alphabet = chosen_alphabets.front();
auto next_chosen_alphabets = chosen_alphabets.subspan(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto next_chosen_alphabets = chosen_alphabets.subspan(1);
auto const next_chosen_alphabets = chosen_alphabets.subspan(1);

return;
}

size_t const wild_size_before_modification = wild.size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t const wild_size_before_modification = wild.size();
autoconst wild_size_before_modification = wild.size();

Comment on lines 210 to 211
string tame1{"a?c"};
string wild1{"*a\\??*"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use constexpr std::string_view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were committed accidentally. I removed them now.

REQUIRE(wildcard_match_unsafe_case_sensitive(tameString, wildString) == true);
}
TEST_CASE("wildcard_match_unsafe", "[wildcard]") {
string tame{"0!2#4%6&8(aBcDeFgHiJkLmNoPqRsTuVwXyZ"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use constexpr std::string_view?

REQUIRE(wildcard_match_unsafe(tameString, wildString, isCaseSensitive) == true);
SCENARIO("wildcard_match_unsafe_case_sensitive performance", "[wildcard performance]") {
auto const tests_dir = std::filesystem::path{__FILE__}.parent_path();
auto log_file_path = tests_dir / "test_network_reader_src" / "random.log";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto log_file_path = tests_dir / "test_network_reader_src" / "random.log";
auto const log_file_path = tests_dir / "test_network_reader_src" / "random.log";

GIVEN("All lower case tame and mixed lower and upper case wild") {
tameString = "abcde", wildString = "A?c*";
REQUIRE(wildcard_match_unsafe(tameString, wildString, isCaseSensitive) == true);
SCENARIO("wildcard_match_unsafe_case_sensitive performance", "[wildcard performance]") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By reading the doc, we're not using any of GIVEN, THEN, or WHEN. Can we just make it a normal TEST_CASE?

}
}
}
auto end_timestamp = high_resolution_clock::now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto end_timestamp = high_resolution_clock::now();
auto const end_timestamp = high_resolution_clock::now();

bool isCaseSensitive = false;
allPassed &= wildcard_match_unsafe("mississippi", "*issip*PI", isCaseSensitive);
REQUIRE(allPassed == true);
auto begin_timestamp = high_resolution_clock::now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto begin_timestamp = high_resolution_clock::now();
auto const begin_timestamp = high_resolution_clock::now();

@kirkrodrigues
Copy link
Member Author

  1. The generation of the wild and tame is deterministic. Can we pre-generate them and add a script used for generation? Or we don't really care about the time cost

True. I can move the generation code to a Python script and pregenerate them. With the generation bug fix, it takes something like 15 minutes to finish, so pregeneration is definitely worthwhile.

  1. Without annotating any baseline runtime or run time comparison, how do we detect performance regression in the updated performance test?

The previous performance test didn't detect regressions since it was comparing against itself. I'm a little reluctant to set a hard performance target since that's machine specific. For now, I hope everyone who modifies (and reviews) this code will benchmark the implementation before and after to ensure they aren't introducing regressions.

That said, I do think there's a way to speed up and simplify the implementation, but that's for another PR.

@LinZhihao-723
Copy link
Member

  1. The generation of the wild and tame is deterministic. Can we pre-generate them and add a script used for generation? Or we don't really care about the time cost

True. I can move the generation code to a Python script and pregenerate them. With the generation bug fix, it takes something like 15 minutes to finish, so pregeneration is definitely worthwhile.

  1. Without annotating any baseline runtime or run time comparison, how do we detect performance regression in the updated performance test?

The previous performance test didn't detect regressions since it was comparing against itself. I'm a little reluctant to set a hard performance target since that's machine specific. For now, I hope everyone who modifies (and reviews) this code will benchmark the implementation before and after to ensure they aren't introducing regressions.

That said, I do think there's a way to speed up and simplify the implementation, but that's for another PR.

For 1: sure. I'm ok to delay it to a new PR.
For 2: I think it's worth adding a comment explaining our expectations about benchmarking the before/after performance

@kirkrodrigues
Copy link
Member Author

For 1: sure. I'm ok to delay it to a new PR.

All the cases end up being a 41GiB file, so I decided to just simplify the test case. I think it still covers everything and it's performant enough imo.

For 2: I think it's worth adding a comment explaining our expectations about benchmarking the before/after performance

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clp::string_utils::wildcard_match_unsafe_case_sensitive fails to match with certain queries
2 participants