Describe the bug
I'm using unstructured to partition PDFs and extract tables. I'm using the `hi_res` strategy and the `yolox` model, and many of the images and tables are still extracted with their edges cut off, which loses information. This affects both the tables extracted as images and the recognized text content, where the first letter of each row is missing from the HTML table code block.
I am aware of the environment variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` and have experimented with them. Adjusting them only affects the extracted image: with large paddings I was able to obtain complete screenshots of the tables, but the recognized text (as an HTML table block) was still missing the first letter of each row. Is this intended behavior, i.e., do these padding env variables only affect the crop box used to crop the tables out as images, and not the text recognition?
I am aware of this issue but it does not address this problem.
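For reference, the setup described above can be sketched roughly as follows. This is a minimal reproduction sketch rather than the exact script used: the helper names `set_crop_padding` and `extract_tables` and the output directory are illustrative assumptions, and it assumes the padding variables are picked up from the process environment when partitioning runs.

```python
import os


def set_crop_padding(horizontal: int, vertical: int) -> None:
    """Set the crop-padding environment variables (values in pixels).

    Hypothetical helper: it only writes the two variables named in this
    issue into the current process environment.
    """
    os.environ["EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD"] = str(horizontal)
    os.environ["EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD"] = str(vertical)


def extract_tables(pdf_path: str, output_dir: str = "extracted_blocks"):
    """Partition a PDF with hi_res + yolox and return (image_path, html) per table.

    Hypothetical helper; unstructured is imported lazily so the padding
    variables can be set before the library is loaded.
    """
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        hi_res_model_name="yolox",
        infer_table_structure=True,  # populates metadata.text_as_html for tables
        extract_image_block_types=["Image", "Table"],
        extract_image_block_output_dir=output_dir,
    )
    # Pair each table's cropped image with its recognized HTML so the two
    # outputs can be compared side by side.
    return [
        (el.metadata.image_path, el.metadata.text_as_html)
        for el in elements
        if el.category == "Table"
    ]
```

Calling `set_crop_padding(50, 50)` and then `extract_tables(...)` on an affected PDF should give one cropped image and one HTML string per table, which makes it easy to check whether the padding changes only the image or also the recognized text.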
For example, with both paddings set to 50:
the extracted table as image looks like this:
![image](https://private-user-images.githubusercontent.com/29267183/341519760-966eca9c-f472-41c7-a15c-fcbcec934dbe.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk5ODQwNzMsIm5iZiI6MTcxOTk4Mzc3MywicGF0aCI6Ii8yOTI2NzE4My8zNDE1MTk3NjAtOTY2ZWNhOWMtZjQ3Mi00MWM3LWExNWMtZmNiY2VjOTM0ZGJlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzAzVDA1MTYxM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ0YzliOGU2ODY5MDNiMTE5ZDJmYWQ0MTkzMjBkNGU0Y2RlZTQwYjk1ZWM0YTYzMmRkZmJlNDc1ZWE2ZjY1NjcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.fAI4lMzCljyhOrFMqBzEHbl0m-Ld-hZLJs8YcHATpn8)
and the recognized table text, rendered in a browser as an HTML table, looks like this: