-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle embedded quote in mmcif #619
base: main
Are you sure you want to change the base?
Conversation
…split with _split_one_line() using regex.
@padix-key Could you take a look at this solution? |
a09e8ab
to
23f4e2f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank you for preparing the fix! I have put a few suggestions into the review.
elif len(value) >= 2 and value[0] == value[-1] == "'": | ||
return value | ||
elif len(value) >= 2 and value[0] == value[-1] == '"': | ||
return value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these lines lead to incorrect quoting (correct me if I am wrong). Let's say I want to literally write the string "'value'"
into the file. These lines would lead to this string written as ... 'value' ...
in the CIF file. When these lines are deserialized, the string would become "value"
(without the leading and trailing quotes).
Hence I think it would be correct, if the string would be written as ... "'value'" ...
into the CIF file, as done by the code lines below.
def test_embedded_quote(): | ||
""" | ||
Test whether values that have an embedded quote are properly escaped. | ||
""" | ||
text_loop = ( | ||
"loop_\n" | ||
"_entity.id\n" | ||
"_entity.type\n" | ||
"_entity.src_method\n" | ||
"_entity.pdbx_description\n" | ||
"_entity.formula_weight\n" | ||
"4 non-polymer syn 'HEXAETHYLENE GLYCOL' 282.331\n" | ||
"""5 non-polymer syn '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' 510.816\n""" | ||
) | ||
test_category = pdbx.CIFCategory.deserialize(text_loop) | ||
assert test_category["pdbx_description"].as_array(str).tolist() == [ | ||
'HEXAETHYLENE GLYCOL', | ||
"""2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]""", | ||
] | ||
text_single = """_entity.pdbx_description '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]'\n""" | ||
test_category = pdbx.CIFCategory.deserialize(text_single) | ||
assert test_category["pdbx_description"].as_item() == """2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is quite verbose with the necessity of writing an entire CIFCategory
. Hence, I propose as alternative to test _split_one_line()
directly, by inputting the line and comparing with the expected fields from the line. This way multiple cases can easily tested with pytest.mark.parametrize()
. For example:
@pytest.mark.parametrize(...)
def test_split_one_line(cif_line, expected_fields):
"""
Test whether values that have an embedded quote are properly escaped.
"""
assert _split_one_line(cif_line) == expected_fields
Probably this solution is also not perfectly ideal, as a private function is tested, but it is shorter and ensures that all fields from a line are parsed correctly.
Additional test cases I can think of:
- a line containing an empty field (i.e.
... "" ...
) - a line with a field quote by double quotes
- a line with a literally quoted value
Thank you for your review. I will provide feedback as soon as possible. I will have time in the evening or during the weekend. |
fix #570
use 3 regex patterns to match fields in one line for handle embed quote in mmcif file:
GPT4 explain single_quote_pattern: