bug/<tables getting cut off at the edges when using hi res strategy> #3262

Closed
rchen19 opened this issue Jun 20, 2024 · 2 comments
Labels: bug (Something isn't working), pdf, table

Comments

rchen19 commented Jun 20, 2024

Describe the bug
I'm using unstructured to partition PDFs and extract tables. Even with the hi_res strategy and the yolox model, many images and tables are extracted with their edges cut off, which loses information. This affects both the tables extracted as images and the recognized text content: in the HTML table code block, the first letter of each row is missing.

I am aware of the environment variables EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD and EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD and have experimented with them. Adjusting them only affects the extracted image: with large padding values I was able to obtain complete screenshots of the tables, but the recognized text (the HTML table block) was still missing the first letter of each row. Is this the intended behavior, i.e., do these padding variables only affect the crop box used to cut the tables out as images, and not the text recognition?

I am aware of this issue but it does not address this problem.

For example, with both paddings set to 50:

the table extracted as an image looks like this:
image

and the recognized table text, as an HTML table rendered in a browser, looks like this:

image
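
For reference, a minimal sketch of the setup described above might look like the following. It is an assumption about how the pieces fit together, not the exact script used: `set_crop_padding` and `extract_tables` are hypothetical helper names, the PDF path and output directory are placeholders, and the `partition_pdf` keyword arguments are the ones I believe current releases accept (they may differ between versions).

```python
import os


def set_crop_padding(horizontal: int, vertical: int) -> None:
    """Set the two crop-padding environment variables mentioned above.

    Hypothetical helper; the variables are set before partitioning so the
    library can pick them up when it crops image blocks.
    """
    os.environ["EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD"] = str(horizontal)
    os.environ["EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD"] = str(vertical)


def extract_tables(pdf_path: str, image_dir: str = "extracted_blocks"):
    """Partition a PDF with hi_res + yolox and return (image_path, html) per table.

    The import is done lazily so the padding can be configured first.
    """
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        hi_res_model_name="yolox",
        infer_table_structure=True,                    # populates metadata.text_as_html
        extract_image_block_types=["Image", "Table"],  # save these blocks as images
        extract_image_block_output_dir=image_dir,
    )
    return [
        (el.metadata.image_path, el.metadata.text_as_html)
        for el in elements
        if el.category == "Table"
    ]


# Both paddings set to 50, as in the example above.
set_crop_padding(50, 50)
```

Calling `extract_tables("sample.pdf")` (placeholder file name) then gives, for each table, the path of the cropped image and the HTML string, which makes it easy to compare the two outputs side by side.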
@rchen19 rchen19 added the bug Something isn't working label Jun 20, 2024
@scanny scanny added the pdf label Jun 20, 2024
@christinestraub
Contributor

As with #2997, we recommend using the API, which provides access to higher-performance table extraction models.

@MthwRobinson
Contributor

Per @christinestraub's suggestion, please try our new serverless API if you need better performance for table extraction.
