Analysing expense reports/invoices with AWS Textract and boto3.

mycielski/textract_study

AWS Textract study

Code style: black

Some code I wrote while learning what Textract is and how to use it.

The project also contains a shell script which uses the AWS CLI v2 to perform the same task.

How to use it?

  1. Put your invoices in the demo_data directory. Here's an example of the directory structure:

    .
    ├── demo_data
    │   ├── invoice.pdf
    │   └── invoices
    │       ├── other_invoice.jpg
    │       └── and_one_more_invoice.png
    ├── readme.md
    └── src
        └── main.py
  2. Provide your AWS credentials as environment variables:

    $ export AWS_ACCESS_KEY_ID=your_access_key_id          # for example "AKIAIOSFODNN7EXAMPLE"
    $ export AWS_SECRET_ACCESS_KEY=your_secret_access_key  # for example "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    $ export AWS_REGION=region                             # for example "us-east-1"
    $ export AWS_BUCKET=bucket_name                        # for example "my-textract-study-bucket"
  3. Run the script:

    $ python src/main.py
  4. The reports will be written to output/<uuid>/ as report.csv, report.json and report.xlsx:

    .
    ├── demo_data
    │   ├── invoice.pdf
    │   └── invoices
    │       ├── other_invoice.jpg
    │       └── and_one_more_invoice.png
    ├── output
    │   └── 456af71d-f7b2-4bf8-87c7-bade21d843d4
    │       ├── report.csv
    │       ├── report.json
    │       └── report.xlsx
    ├── readme.md
    └── src
        └── main.py
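Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration of the setup, not the code in src/main.py: the helper names `load_config` and `find_invoices` are hypothetical, and the list of accepted file extensions is an assumption based on the example directory tree.

```python
import os
from pathlib import Path

# The four variables listed in step 2.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION", "AWS_BUCKET")

# Assumption: the extensions shown in the example tree, plus ".jpeg".
INVOICE_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png"}


def load_config(environ=os.environ):
    """Return the required settings, failing early if any are unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


def find_invoices(root="demo_data"):
    """Recursively collect invoice files under root, including subdirectories."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in INVOICE_SUFFIXES
    )
```

Calling `load_config()` before touching AWS gives a clear error message when a variable is missing, and `find_invoices()` picks up files in nested folders such as demo_data/invoices.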

Notes

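  • This script busy-waits for Textract job results (in the retrieve_analyses function). That is not optimal. In fact, it is pretty terrible for performance, because it keeps calling the API while the jobs are still running. Use Textract's completion notifications (published to Amazon SNS) instead.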
  • The whole thing is just one file. Terrible for legibility but eh, it works.
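A rough sketch of the notification-based alternative to busy waiting, assuming the jobs are started with Textract's asynchronous expense API (`start_expense_analysis` / `get_expense_analysis`) and that the SNS topic is subscribed to an SQS queue without raw message delivery. The function names are hypothetical, and the topic and role ARNs are placeholders you would have to create yourself; none of this is taken from src/main.py.

```python
import json


def start_expense_job(textract, bucket, key, sns_topic_arn, role_arn):
    """Start an async expense analysis that reports completion to an SNS topic.

    `textract` is expected to be a boto3 Textract client,
    e.g. boto3.client("textract", region_name=...).
    """
    response = textract.start_expense_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def parse_completion(sqs_body):
    """Return (job_id, status) from an SQS message body wrapping the SNS notification.

    Assumption: raw message delivery is off, so the Textract message is a JSON
    string stored under the "Message" key of the SNS envelope.
    """
    envelope = json.loads(sqs_body)
    message = json.loads(envelope["Message"])
    return message["JobId"], message["Status"]


def fetch_expense_documents(textract, job_id):
    """Collect every result page of a finished job by following NextToken."""
    documents = []
    kwargs = {"JobId": job_id}
    while True:
        page = textract.get_expense_analysis(**kwargs)
        documents.extend(page.get("ExpenseDocuments", []))
        token = page.get("NextToken")
        if not token:
            return documents
        kwargs["NextToken"] = token
```

With this layout the script only calls `get_expense_analysis` once a message with status SUCCEEDED has arrived, instead of asking repeatedly while the job is still in progress.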