support different extraction modes #106
Draft
Currently, all whitespace characters in a textbox are merged into a single space character (`' '`). This makes it very difficult to extract tabular data.

Here, I propose to introduce an extraction mode parameter that allows the user to choose between three extraction modes:
- `:spaces` (default): all whitespace is handled as a single space character
- `:tabs`: non-space whitespace is handled as tab characters
- `:boxes`: text between non-space whitespace is split into several textboxes with their respective coordinates
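
To make the difference between the three modes concrete, here is a minimal sketch on a plain string. It is an illustration only, not the code in this PR: `split_by_mode` is a hypothetical helper, "non-space whitespace" is assumed to mean characters such as tab or newline, and the offset is simply the 1-based string index where a piece starts, standing in for the coordinates a real textbox would carry.

```julia
# Hypothetical helper, not part of the package: shows the three proposed
# modes on a plain string. The Int in each tuple is the 1-based string
# index where the piece starts (a stand-in for real page coordinates).

# A run of whitespace that contains at least one non-space whitespace char.
const GAP = r"\s*[^\S ]\s*"

function split_by_mode(text::AbstractString, mode::Symbol)
    if mode === :spaces
        # Every run of whitespace collapses to a single space.
        return [(replace(text, r"\s+" => " "), firstindex(text))]
    elseif mode === :tabs
        # Gaps become a tab; runs of plain spaces collapse to one space.
        s = replace(text, GAP => "\t")
        return [(replace(s, r" +" => " "), firstindex(text))]
    elseif mode === :boxes
        # Split at every gap and remember where each piece started.
        boxes = Tuple{String,Int}[]
        start = firstindex(text)
        for m in eachmatch(GAP, text)
            if m.offset > start
                piece = text[start:prevind(text, m.offset)]
                push!(boxes, (replace(piece, r" +" => " "), start))
            end
            start = m.offset + ncodeunits(m.match)
        end
        if start <= lastindex(text)
            push!(boxes, (replace(text[start:end], r" +" => " "), start))
        end
        return boxes
    else
        throw(ArgumentError("unknown extraction mode: $mode"))
    end
end
```

Under these assumptions, `:spaces` and `:tabs` always yield a single piece, while `:boxes` yields one piece per column-like chunk, which is what makes empty cells recoverable later from the offsets.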
For this purpose, `get_TextBox()` no longer returns a tuple `(text, w, h)` but a vector of tuples `(text, w, h, offset)`. During `evalContent!()`, the vector is iterated to return a `TextLayout` for each set of box parameters. For the modes `:spaces` and `:tabs`, `get_TextBox()` always returns a single-element vector, whereas in `:boxes` mode more than one `TextLayout` might be added to the output.

The `:spaces` mode reproduces the current extraction behavior.
The `:tabs` mode is suited for extracting "well-behaved" tabular data, i.e. tables with no empty cells, or whose empty cells contain at least a space character.
The `:boxes` mode is essential for extracting tables that contain empty cells. In that case further textbox treatment is necessary, which I would provide in a separate PR.

@sambitdash Please comment if this sounds like a desired feature to you.
If so, we can still discuss whether control via a global variable is the best choice or whether we'd rather implement a keyword arg which is passed through the text extraction function chain.
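
For the sake of discussion, the two control options could look roughly like this. Every name below (`normalize`, `EXTRACTION_MODE`, `extract_global`, `extract_inner`, `extract_kw`) is made up for the sketch and does not correspond to existing functions in the package.

```julia
# All names here are hypothetical; this only contrasts the two designs.

# Stand-in for the innermost routine that needs to know the mode.
normalize(text::AbstractString, mode::Symbol) =
    mode === :spaces ? replace(text, r"\s+" => " ") : String(text)

# Option A: a global setting that is read at the point of use.
const EXTRACTION_MODE = Ref{Symbol}(:spaces)
extract_global(text) = normalize(text, EXTRACTION_MODE[])

# Option B: a keyword argument handed down the call chain.
extract_inner(text; mode::Symbol = :spaces) = normalize(text, mode)
extract_kw(text; mode::Symbol = :spaces) = extract_inner(text; mode = mode)
```

With option A a caller would set `EXTRACTION_MODE[] = :tabs` once and every later extraction is affected; with option B the mode is chosen per call, e.g. `extract_kw(text; mode = :tabs)`, at the cost of touching each function signature in the chain.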