bug/<tables getting cut off at the edges when using hi res strategy> #3262

rchen19 · 2024-06-20T18:15:54Z

Describe the bug
I'm using unstructured to partition PDFs and extract tables. I'm using the hi_res strategy with the yolox model, and many of the images and tables are still extracted with their edges cut off, which loses information. This affects both the tables extracted as images and the recognized text content: the first letter of each row is missing from the HTML table code block.

I am aware of the environment variables EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD and EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD and have experimented with them. Adjusting them only affects the extracted image: with large padding values I was able to obtain complete screenshots of the tables, but the recognized text (the HTML table block) was still missing the first letter of each row. Is this the intended behavior, i.e., do these padding variables only affect the crop box used to crop the tables out as images, and not the text recognition?
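For anyone trying to reproduce this, the setup described above might look roughly like the following. This is a minimal sketch, not my exact script: the helper names `set_crop_padding` and `extract_tables` are made up for illustration, and I'm assuming the padding variables are picked up if they are set before `partition_pdf` runs.

```python
import os


def set_crop_padding(horizontal: int, vertical: int) -> None:
    """Set the crop padding variables named in this issue (values in pixels)."""
    os.environ["EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD"] = str(horizontal)
    os.environ["EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD"] = str(vertical)


def extract_tables(pdf_path: str, image_dir: str):
    """Partition a PDF with hi_res + yolox and return (html, image_path) per table.

    Hypothetical helper; requires unstructured with the PDF extras installed.
    """
    # Imported here so the padding variables are already set when it loads.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        hi_res_model_name="yolox",
        infer_table_structure=True,            # fills metadata.text_as_html
        extract_image_block_types=["Table"],   # also save each table as an image
        extract_image_block_output_dir=image_dir,
    )
    return [
        (el.metadata.text_as_html, getattr(el.metadata, "image_path", None))
        for el in elements
        if el.category == "Table"
    ]


if __name__ == "__main__":
    set_crop_padding(50, 50)
    # "example.pdf" and "tables/" are placeholder paths.
    for html, image_path in extract_tables("example.pdf", "tables/"):
        print(image_path)
        print(html)
```

Comparing the saved image against the printed HTML for the same table is how the mismatch shows up: the image is complete with enough padding, while the HTML still drops the first character of each row.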

I am aware of this issue but it does not address this problem.

For example, with both padding values set to 50:

the extracted table as an image looks like this:

and the recognized table text, as an HTML table, looks like this when rendered in a browser:


christinestraub · 2024-06-20T19:03:22Z

Similar to #2997, recommend using the API for access to higher performance table extraction models.

MthwRobinson · 2024-07-01T15:32:08Z

Per @christinestraub's suggestion, please try our new serverless API if you need better performance for table extraction.

rchen19 added the bug label Jun 20, 2024

scanny added the pdf label Jun 20, 2024

christinestraub added the table label Jun 20, 2024

MthwRobinson closed this as completed Jul 1, 2024
