Handle embedded quote in mmcif #619

Open
wants to merge 3 commits into
base: main

Conversation

0ut0fcontrol
Contributor

@0ut0fcontrol 0ut0fcontrol commented Jun 30, 2024

fix #570

Use three regex patterns to match the fields in one line, so that embedded quotes in mmCIF files are handled:

  single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
  double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
  unquoted_pattern = r"([^\s]+)"

GPT-4's explanation of single_quote_pattern:

This regex single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)" is engineered to identify and extract substrings enclosed in single quotes from a larger text, with a particular sensitivity to handle internal apostrophes correctly. Let's dissect this expression to understand how it functions:

  1. ': This matches the opening single quote ' of the target substring.

  2. (?: ... ): This is a non-capturing group, which means it groups the contained pattern parts without storing the matched substring. This is used here mainly for grouping purposes without needing backreferences.

  3. '(?! ): This matches a single quote ' only if it is not immediately followed by a space; the (?! ) part is a negative lookahead assertion. This allows the regex to match apostrophes within words (like in contractions such as don't) without treating them as the end of the quoted substring.

  4. |: The logical OR operator presents an alternative within the non-capturing group. It separates the negative lookahead for internal apostrophes from the next part of the pattern.

  5. [^']: This is a negated character class that matches any character except a single quote '. This part of the expression ensures that the regex consumes all characters within the quotes until it encounters the next single quote, which might signify the end of the quoted substring.

  6. *: This quantifier applies to the non-capturing group, allowing the contained pattern to repeat any number of times — including zero times — thus enabling the regex to match quoted substrings of any length.

  7. ': Matches the closing single quote of the substring.

  8. (?:\s|$): Another non-capturing group that operates as a condition for what follows the closing quote. It matches either:

    • \s: A whitespace character, ensuring that the quoted substring is followed by whitespace, or
    • $: The end of a line or string, allowing for the quoted substring to appear at the end of the text.

The Key Points:

  • The pattern is designed to efficiently target substrings enclosed in single quotes within a larger string or document.
  • It smartly handles situations where an apostrophe is part of the enclosed text (like in contractions) without mistakenly recognizing it as the end of the quoted section.
  • By requiring the quoted substring to be followed by a space or the end of the text, it imposes a sensible boundary condition to identify discrete quoted substrings within a flow of text.

This regex could be particularly useful in text parsing applications where accurately distinguishing between quoted strings and regular text is crucial, such as in natural language processing tasks, data extraction, or in developing syntax highlighters for code editors.
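The behaviour described above can be checked with a few inputs. This is a minimal sketch that uses only the single_quote_pattern quoted in the PR description; the helper name first_quoted is hypothetical and exists only for this illustration. The second case suggests one thing worth testing: the lookahead rejects only a literal space, so a quote followed by a tab is treated as embedded even though the trailing group accepts any whitespace.

```python
import re

# Pattern copied from the PR description.
single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"


def first_quoted(text):
    """Hypothetical helper: return the quoted field at the start of text."""
    match = re.match(single_quote_pattern, text)
    return match.group(1) if match else None


# An apostrophe inside a word does not end the field.
assert first_quoted("'don't stop' next") == "'don't stop'"

# The lookahead only rejects a literal space, so a quote followed
# by a tab is still consumed as part of the field.
assert first_quoted("'a'\tb' c") == "'a'\tb'"

# Without a closing quote there is no match at all.
assert first_quoted("'no closing quote") is None
```

If quote-plus-tab should also terminate a field, the lookahead would presumably need to be (?!\s) instead of (?! ); whether that matters for real mmCIF files is not established here.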

@0ut0fcontrol
Contributor Author

@padix-key
I'm sorry, I've been too busy with work and haven't had much time to delve into regex.
Regex can be quite a headache.

Could you take a look at this solution?
I'm not sure if the test covers all scenarios, and I'm thinking of adding more tests.
Do you have any suggestions?

@0ut0fcontrol 0ut0fcontrol marked this pull request as ready for review June 30, 2024 09:54
Member

@padix-key padix-key left a comment

Hi, thank you for preparing the fix! I have put a few suggestions into the review.

src/biotite/structure/io/pdbx/cif.py
Comment on lines +1002 to +1005
elif len(value) >= 2 and value[0] == value[-1] == "'":
    return value
elif len(value) >= 2 and value[0] == value[-1] == '"':
    return value
Member

I think these lines lead to incorrect quoting (correct me if I am wrong). Let's say I want to literally write the string "'value'" into the file. These lines would cause this string to be written as ... 'value' ... in the CIF file. When these lines are deserialized, the string would become "value" (without the leading and trailing quotes).

Hence I think it would be correct to write the string as ... "'value'" ... into the CIF file, as done by the code lines below.
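The round trip described in this comment can be sketched as follows. Both helpers are hypothetical and are not the functions in cif.py; they are deliberately simplified and ignore multi-line text fields and other characters that CIF treats specially.

```python
def quote_value(value):
    """Hypothetical serializer: add quotes only where they are needed."""
    needs_quotes = (
        value == ""
        or value[0] in "'\""
        or any(char.isspace() for char in value)
    )
    if not needs_quotes:
        return value
    # Wrap in the quote character that does not occur in the value,
    # so a leading quote in the value is never mistaken for a delimiter.
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Assumption: values containing both quote characters are out of scope.
    raise ValueError("value contains both quote characters")


def unquote_value(field):
    """Hypothetical deserializer: strip one pair of enclosing quotes."""
    if len(field) >= 2 and field[0] == field[-1] and field[0] in "'\"":
        return field[1:-1]
    return field


literal = "'value'"
assert quote_value(literal) == "\"'value'\""
assert unquote_value(quote_value(literal)) == literal
```

Returning a value that already starts and ends with a quote unchanged would break the second assertion, because the enclosing quotes would be stripped on reading.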

Comment on lines +62 to +83
def test_embedded_quote():
    """
    Test whether values that have an embedded quote are properly escaped.
    """
    text_loop = (
        "loop_\n"
        "_entity.id\n"
        "_entity.type\n"
        "_entity.src_method\n"
        "_entity.pdbx_description\n"
        "_entity.formula_weight\n"
        "4 non-polymer syn 'HEXAETHYLENE GLYCOL' 282.331\n"
        """5 non-polymer syn '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' 510.816\n"""
    )
    test_category = pdbx.CIFCategory.deserialize(text_loop)
    assert test_category["pdbx_description"].as_array(str).tolist() == [
        'HEXAETHYLENE GLYCOL',
        """2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]""",
    ]
    text_single = """_entity.pdbx_description '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]'\n"""
    test_category = pdbx.CIFCategory.deserialize(text_single)
    assert test_category["pdbx_description"].as_item() == """2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]"""
Member

@padix-key padix-key Jun 30, 2024

This test is quite verbose, as it requires writing an entire CIFCategory. Hence, I propose as an alternative to test _split_one_line() directly, by inputting the line and comparing the result with the expected fields from the line. This way multiple cases can easily be tested with pytest.mark.parametrize(). For example:

@pytest.mark.parametrize(...)
def test_split_one_line(cif_line, expected_fields):
    """
    Test whether values that have an embedded quote are properly escaped.
    """
    assert _split_one_line(cif_line) == expected_fields

This solution is probably not ideal either, as a private function is tested, but it is shorter and ensures that all fields from a line are parsed correctly.

Additional test cases I can think of:

  • a line containing an empty field (i.e. ... "" ...)
  • a line with a field quoted by double quotes
  • a line with a literally quoted value
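The cases listed above could be written as a table of (cif_line, expected_fields) pairs, which is the shape pytest.mark.parametrize() expects for its argument values. The sketch below is an assumption-laden illustration: split_one_line is a hypothetical stand-in built from the three patterns in the PR description, not the actual _split_one_line() in cif.py, and a plain loop replaces the pytest decorator so the example runs on its own.

```python
import re

# Patterns copied from the PR description.
SINGLE_QUOTED = re.compile(r"('(?:'(?! )|[^'])*')(?:\s|$)")
DOUBLE_QUOTED = re.compile(r'("(?:"(?! )|[^"])*")(?:\s|$)')
UNQUOTED = re.compile(r"([^\s]+)")


def split_one_line(line):
    """Hypothetical stand-in for the private helper under test."""
    fields = []
    pos = 0
    while pos < len(line):
        if line[pos].isspace():
            pos += 1
            continue
        # Try the quoted patterns first; UNQUOTED always matches
        # at a non-whitespace position, so match is never None.
        for pattern in (SINGLE_QUOTED, DOUBLE_QUOTED, UNQUOTED):
            match = pattern.match(line, pos)
            if match is not None:
                break
        token = match.group(1)
        if pattern is not UNQUOTED:
            token = token[1:-1]  # drop the enclosing quotes
        fields.append(token)
        pos = match.end()
    return fields


# Each pair could become one parametrized test case.
CASES = [
    # empty field
    ('a "" b', ["a", "", "b"]),
    # field quoted by double quotes
    ('a "double quoted" b', ["a", "double quoted", "b"]),
    # literally quoted value
    ('''a "'value'" b''', ["a", "'value'", "b"]),
    # embedded single and double quotes
    ("""syn '2,2',2"-X' 1.0""", ["syn", """2,2',2"-X""", "1.0"]),
]

for cif_line, expected_fields in CASES:
    assert split_one_line(cif_line) == expected_fields
```

With pytest, the same table would be passed as @pytest.mark.parametrize("cif_line, expected_fields", CASES) on the test function proposed above.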

@0ut0fcontrol
Contributor Author

Thank you for your review. I will provide feedback as soon as possible. I will have time in the evening or during the weekend.

Successfully merging this pull request may close these issues.

Failed to deserialize category 'entity' with ValueError: No closing quotation
2 participants