Some code I wrote while learning what Textract is and how to use it.
The project also contains a shell script which uses the AWS CLI v2 to perform the same task.
- Put your invoices in the `demo_data` directory. Here's an example of the directory structure:

      .
      ├── demo_data
      │   ├── invoice.pdf
      │   └── invoices
      │       ├── other_invoice.jpg
      │       └── and_one_more_invoice.png
      ├── readme.md
      └── src
          └── main.py
- Provide your AWS credentials as environment variables:

      $ export AWS_ACCESS_KEY_ID=your_access_key_id          # for example "AKIAIOSFODNN7EXAMPLE"
      $ export AWS_SECRET_ACCESS_KEY=your_secret_access_key  # for example "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      $ export AWS_REGION=region                             # for example "us-east-1"
      $ export AWS_BUCKET=bucket_name                        # for example "my-textract-study-bucket"
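
On the Python side, these four variables could be read and checked up front, so that a missing one fails fast with a clear message rather than surfacing later as an AWS error. This is only a minimal sketch, not the code in `src/main.py`; the `load_settings` helper and its error message are hypothetical.

```python
import os

# The variable names come from the export commands above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "AWS_BUCKET",
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` at the top of the script would then either return all four values or stop with the list of variables that still need to be exported.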
- Run the script:

      $ python src/main.py
- The report will be generated in `output/<uuid>/` (as `report.csv`, `report.json` and `report.xlsx`):

      .
      ├── demo_data
      │   ├── invoice.pdf
      │   └── invoices
      │       ├── other_invoice.jpg
      │       └── and_one_more_invoice.png
      ├── output
      │   └── 456af71d-f7b2-4bf8-87c7-bade21d843d4
      │       ├── report.csv
      │       ├── report.json
      │       └── report.xlsx
      ├── readme.md
      └── src
          └── main.py
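
The output layout shown above (one UUID-named directory per run, holding the report files) could be produced roughly like this. This is a sketch under assumptions, not the actual code from `src/main.py`: the `write_reports` helper and the row fields used in the example are made up, and the `.xlsx` file is left out because writing it needs a third-party library.

```python
import csv
import json
import uuid
from pathlib import Path


def write_reports(rows, output_root="output"):
    """Write rows (a list of dicts with identical keys) as CSV and JSON.

    Hypothetical helper: creates output_root/<uuid>/ and returns that path.
    """
    run_dir = Path(output_root) / str(uuid.uuid4())
    run_dir.mkdir(parents=True)

    with open(run_dir / "report.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)

    # Column order follows the keys of the first row.
    fieldnames = list(rows[0].keys()) if rows else []
    with open(run_dir / "report.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    return run_dir
```

For example, `write_reports([{"file": "invoice.pdf", "total": "12.50"}])` would create a fresh directory under `output` on every run, so earlier reports are never overwritten.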

- This script uses busy waiting for Textract job results (in the `retrieve_analyses` function). It is not optimal. In fact, it is pretty terrible for performance. Use notifications instead.
- The whole thing is just one file. Terrible for legibility but eh, it works.