Docker container image built with Jupyter Notebook and Tabula for PDF scraping

aeksco/jupyter-tabula

jupyter-tabula

Docker container image built with Jupyter Notebook and Tabula for PDF scraping. Includes an example notebook to help you get started.

Repository hosted on GitHub at aeksco/jupyter-tabula.

Docker container image hosted on Docker Hub at aeksco/jupyter-tabula.

Usage

Running the Jupyter Notebook server

docker run -it -p 8888:8888 aeksco/jupyter-tabula

Example Notebook

The Example_01.ipynb notebook opens a single-page PDF and parses the table it contains.
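
As a rough idea of what parsing a table from a PDF can look like, here is a minimal sketch using the tabula-py package. It assumes tabula-py and a Java runtime are available in the container. The function name first_table, the reader parameter, and the file name example.pdf are hypothetical and are not taken from Example_01.ipynb.

```python
from typing import Any, Callable, List, Optional


def first_table(pdf_path: str,
                reader: Optional[Callable[..., List[Any]]] = None) -> Any:
    """Return the first table detected on page 1 of a PDF.

    `reader` is a hypothetical hook for swapping in another extraction
    function; by default tabula-py's read_pdf is used.
    """
    if reader is None:
        # Imported lazily: tabula-py needs a Java runtime, so only load it
        # when it is actually going to be used (assumed to be installed).
        import tabula
        reader = tabula.read_pdf

    # With multiple_tables=True, read_pdf returns a list of DataFrames,
    # one per table found on the requested page.
    tables = reader(pdf_path, pages=1, multiple_tables=True)
    if not tables:
        raise ValueError(f"no table detected in {pdf_path}")
    return tables[0]


# In a notebook cell (example.pdf is a placeholder file name):
#   df = first_table("example.pdf")
#   df.head()
```

Passing the reader in as a parameter keeps the sketch easy to try without a PDF at hand, since a stand-in function can replace tabula-py.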

Building the Docker Image

Follow these steps if you want to modify this Docker image and push your own copy to Docker Hub.

  1. Create a new Docker Hub repository named jupyter-tabula.

  2. Build the Docker image by running the following command in the jupyter-tabula directory:

docker build -t jupyter-tabula .

  3. Test the image locally with the following command:

docker run -it --rm --pid=host -p 8888:8888 jupyter-tabula

  4. Tag the image and push it to Docker Hub, replacing bb38976d03cf with the image ID that docker images reports for your build:

docker tag bb38976d03cf your_docker_hub_username/jupyter-tabula:latest
docker push your_docker_hub_username/jupyter-tabula

Notes

I would recommend this container image for testing and hacking purposes. The image is a bit heavy (2.02 GB) because it extends the tensorflow/tensorflow container image. I did this mainly to save time, since the TensorFlow image already has all the relevant dependencies installed and configured.