chore/change default split page behavior to true (#118)
* Set the split_pdf_page default to true and run `make client-generate`
locally.
* Update the readme, add another reference back to our docs
* Change some warning logs to info. The user should not be warned about
default behavior for non-PDF files.

# Testing
Use the client locally and verify that split mode is the default, and
that the client behavior is consistent with older versions.

* Set up (or activate) your pyenv for the client: `pyenv virtualenv 3.12
unstructured-client; pyenv activate unstructured-client`
* Check out this branch and install: `pip install -e .`
* Run this sample script in the top level of the client repo. Try
different files in `_sample_docs` and verify that the logging and
results look acceptable.

```
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared, operations

import json

api_key = "free-api-key"
filename = "_sample_docs/layout-parser-paper.pdf"

s = UnstructuredClient(
    api_key_auth=api_key,
)

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        strategy=shared.Strategy.AUTO
    ),
)

try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements, indent=4))
except Exception as e:
    print(e)
```
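
One way to check the logging change without calling the API is a small stand-alone reproduction. The sketch below is not the client's code: `looks_like_pdf`, `ListHandler`, `captured_levels`, and the logger name `split-demo` are hypothetical stand-ins that mirror the extension check in `pdf_utils.is_pdf` from this commit. It shows that a message emitted at INFO does not reach a logger configured at WARNING, which is the effect the log-level change aims for.

```python
import logging

# Hypothetical logger name, used only for this demo.
logger = logging.getLogger("split-demo")
logger.propagate = False  # keep demo output away from the root logger


class ListHandler(logging.Handler):
    """Collects log records in a list so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def looks_like_pdf(file_name: str) -> bool:
    """Stand-in for the extension check: log at INFO, not WARNING."""
    if not file_name.endswith(".pdf"):
        logger.info("Given file doesn't have '.pdf' extension, so splitting is not enabled.")
        return False
    return True


def captured_levels(file_name: str, level: int) -> list[str]:
    """Run the check with the logger set to `level` and return the level names seen."""
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        looks_like_pdf(file_name)
    finally:
        logger.removeHandler(handler)
    return [record.levelname for record in handler.records]


print(captured_levels("notes.txt", logging.WARNING))  # []
print(captured_levels("notes.txt", logging.INFO))     # ['INFO']
```

With the logger at WARNING, a non-PDF file produces no output; the message only appears once INFO logging is enabled.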
awalker4 committed Jun 17, 2024
1 parent c55e721 commit eabf116
Showing 10 changed files with 16 additions and 15 deletions.
6 changes: 3 additions & 3 deletions .speakeasy/gen.lock
@@ -1,12 +1,12 @@
lockVersion: 2.0.0
id: 8b5fa338-9106-4734-abf0-e30d67044a90
management:
-docChecksum: 5365c99c52e23b044ef9916ecf51b1a9
+docChecksum: c7e23b3b8242eb21eccb2091bcc57c72
docVersion: 1.0.35
speakeasyVersion: 1.308.1
generationVersion: 2.342.6
-releaseVersion: 0.23.5
-configChecksum: e210d7bff3afd386269cb7c6adeef630
+releaseVersion: 0.23.6
+configChecksum: 4e2e510c7f4b61e04b61acf7de2939a3
repoURL: https://github.com/Unstructured-IO/unstructured-python-client.git
repoSubDirectory: .
installationURL: https://github.com/Unstructured-IO/unstructured-python-client.git
5 changes: 3 additions & 2 deletions README.md
@@ -72,7 +72,9 @@ Refer to the [API parameters page](https://docs.unstructured.io/api-reference/ap

#### Splitting PDF by pages

-In order to speed up processing of long PDF files, `split_pdf_page` can be set to `True` (defaults to `False`). It will cause the PDF to be split at client side, before sending to API, and combining individual responses as single result. This parameter will affect only PDF files, no need to disable it for other filetypes.
+See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details.
+
+In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `split_pdf_page` can be set to `False` to disable this.

The amount of workers utilized for splitting PDFs is dictated by the `split_pdf_concurrency_level` parameter, with a default of 5 and a maximum of 15 to keep resource usage and costs in check. The splitting process leverages `asyncio` to manage concurrency effectively.
The size of each batch of pages (ranging from 2 to 20) is internally determined based on the concurrency level and the total number of pages in the document. Because the splitting process uses `asyncio` the client can encouter event loop issues if it is nested in another async runner, like running in a `gevent` spawned task. Instead, this is safe to run in multiprocessing workers (e.g., using `multiprocessing.Pool` with `fork` context).
@@ -83,7 +85,6 @@ req = shared.PartitionParameters(
files=files,
strategy="fast",
languages=["eng"],
-split_pdf_page=True,
split_pdf_concurrency_level=8
)
```
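
As a rough illustration of the batch sizing described in the README text above, the sketch below divides the pages evenly among the workers and clamps the result to the documented range of 2 to 20 pages. It is an assumption about how such sizing could work, not the client's actual implementation; `sketch_split_size` and `would_split` are hypothetical names. The `would_split` check mirrors the `split_size >= len(pdf.pages)` condition in `split_pdf_hook.py`.

```python
import math

# Values taken from the README text; the formula itself is an assumption.
MIN_PAGES_PER_SPLIT = 2
MAX_PAGES_PER_SPLIT = 20
DEFAULT_CONCURRENCY = 5
MAX_CONCURRENCY = 15


def sketch_split_size(num_pages: int, concurrency_level: int = DEFAULT_CONCURRENCY) -> int:
    """Hypothetical sizing: spread pages over the workers, clamped to 2..20."""
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    # Cap the worker count at the documented maximum.
    workers = max(1, min(concurrency_level, MAX_CONCURRENCY))
    pages_per_worker = math.ceil(num_pages / workers)
    return max(MIN_PAGES_PER_SPLIT, min(pages_per_worker, MAX_PAGES_PER_SPLIT))


def would_split(num_pages: int, concurrency_level: int = DEFAULT_CONCURRENCY) -> bool:
    """A document is only split when one batch does not already cover it."""
    return sketch_split_size(num_pages, concurrency_level) < num_pages


print(sketch_split_size(50))       # 10
print(sketch_split_size(1000, 5))  # 20
print(would_split(2))              # False
```

Under these assumptions, a two-page PDF is sent as a single request, while a long document is cut into batches of at most 20 pages.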
2 changes: 1 addition & 1 deletion _test_unstructured_client/unit/test_split_pdf_hook.py
@@ -276,7 +276,7 @@ def test_unit_is_pdf_invalid_extension(caplog):
"""Test is pdf method returns False for file with invalid extension."""
file = shared.Files(b"txt_content", "test_file.txt")

-with caplog.at_level(logging.WARNING):
+with caplog.at_level(logging.INFO):
result = pdf_utils.is_pdf(file)

assert result is False
2 changes: 1 addition & 1 deletion gen.yaml
@@ -10,7 +10,7 @@ generation:
auth:
oAuth2ClientCredentialsEnabled: false
python:
-version: 0.23.5
+version: 0.23.6
additionalDependencies:
dependencies:
deepdiff: '>=6.0'
2 changes: 1 addition & 1 deletion overlay_client.yaml
@@ -10,7 +10,7 @@ actions:
"type": "boolean",
"title": "Split Pdf Page",
"description": "This parameter determines if the PDF file should be split on the client side. It's an internal parameter for the Python client and is not sent to the backend.",
-"default": false,
+"default": true,
}
- target: $["components"]["schemas"]["partition_parameters"]["properties"]
update:
2 changes: 1 addition & 1 deletion setup.py
@@ -19,7 +19,7 @@

setuptools.setup(
name='unstructured-client',
-version='0.23.5',
+version='0.23.6',
author='Unstructured',
description='Python Client SDK for Unstructured API',
license = 'MIT',
2 changes: 1 addition & 1 deletion src/unstructured_client/_hooks/custom/pdf_utils.py
@@ -59,7 +59,7 @@ def is_pdf(file: shared.Files) -> bool:
True if the file is a PDF, False otherwise.
"""
if not file.file_name.endswith(".pdf"):
-logger.warning("Given file doesn't have '.pdf' extension. Continuing without splitting.")
+logger.info("Given file doesn't have '.pdf' extension, so splitting is not enabled.")
return False

try:
4 changes: 2 additions & 2 deletions src/unstructured_client/_hooks/custom/split_pdf_hook.py
@@ -135,7 +135,7 @@ def before_request(
or not isinstance(file, shared.Files)
or not pdf_utils.is_pdf(file)
):
-logger.warning("File could not be split. Partitioning without split.")
+logger.info("Partitioning without split.")
return request

starting_page_number = form_utils.get_starting_page_number(
@@ -160,7 +160,7 @@ def before_request(
logger.info("Determined optimal split size of %d pages.", split_size)

if split_size >= len(pdf.pages):
-logger.warning(
+logger.info(
"Document has too few pages (%d) to be split efficiently. Partitioning without split.",
len(pdf.pages),
)
@@ -83,7 +83,7 @@ class PartitionParameters:
r"""The document types that you want to skip table extraction with. Default: []"""
split_pdf_concurrency_level: Optional[int] = dataclasses.field(default=5, metadata={'multipart_form': { 'field_name': 'split_pdf_concurrency_level' }})
r"""When `split_pdf_page` is set to `True`, this parameter specifies the number of workers used for sending requests when the PDF is split on the client side. It's an internal parameter for the Python client and is not sent to the backend."""
-split_pdf_page: Optional[bool] = dataclasses.field(default=False, metadata={'multipart_form': { 'field_name': 'split_pdf_page' }})
+split_pdf_page: Optional[bool] = dataclasses.field(default=True, metadata={'multipart_form': { 'field_name': 'split_pdf_page' }})
r"""This parameter determines if the PDF file should be split on the client side. It's an internal parameter for the Python client and is not sent to the backend."""
starting_page_number: Optional[int] = dataclasses.field(default=None, metadata={'multipart_form': { 'field_name': 'starting_page_number' }})
r"""When PDF is split into pages before sending it into the API, providing this information will allow the page number to be assigned correctly. Introduced in 1.0.27."""
4 changes: 2 additions & 2 deletions src/unstructured_client/sdkconfiguration.py
@@ -29,9 +29,9 @@ class SDKConfiguration:
server: Optional[str] = ''
language: str = 'python'
openapi_doc_version: str = '1.0.35'
-sdk_version: str = '0.23.5'
+sdk_version: str = '0.23.6'
gen_version: str = '2.342.6'
-user_agent: str = 'speakeasy-sdk/python 0.23.5 2.342.6 1.0.35 unstructured-client'
+user_agent: str = 'speakeasy-sdk/python 0.23.6 2.342.6 1.0.35 unstructured-client'
retry_config: Optional[RetryConfig] = None

def __post_init__(self):