vickyxu22/SCMA_invoice_extractor

SCMA_invoice_extractor

docx_extractor is a Python package for extracting publication text from Smith College Museum of Art invoice documents in DOCX format. The script parses each document to retrieve attributes such as author, title, format, and publication date. The extracted information is then merged into an Excel file for convenient use and import into the Mimsy database.
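
To give a rough idea of the kind of parsing described above, the sketch below pulls labeled fields out of lines of text. The "Label: value" line layout, the FIELDS tuple, and the parse_fields helper are all assumptions made for this illustration. They are not taken from the package, and real invoice documents may be laid out differently.

```python
import re

# Hypothetical field labels; the real invoices may use different wording.
FIELDS = ("author", "title", "format", "publication date")

# Matches lines such as "Author: Jane Doe" (label, colon, value).
LINE_RE = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$")


def parse_fields(lines):
    """Return a dict of recognized fields found in an iterable of text lines."""
    record = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        label = match.group(1).lower()
        # Keep only the first occurrence of each known label.
        if label in FIELDS and label not in record:
            record[label] = match.group(2)
    return record


# Made-up sample lines, for illustration only.
sample = [
    "Invoice 0001",
    "Author: Jane Doe",
    "Title: An Example Catalogue",
    "Format: Paperback",
    "Publication Date: 2020",
]
record = parse_fields(sample)
```

Each resulting dict could then serve as one row of the merged spreadsheet, with the field names as column headers.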

Installation

To install docx_extractor, use pip:

pip install docx-extractor

Usage

The package provides a function process_documents(folder_path) to extract information from .docx files within a specified folder.

Example usage:

from docx_extractor.extract import process_documents

# Replace 'folder_path' with the path to your folder containing .docx files
folder_path = '/path/to/your/folder'
process_documents(folder_path)
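
For readers curious about what walking a folder of .docx files involves, here is a minimal sketch that uses only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The function names read_docx_paragraphs and collect_paragraphs are hypothetical, and this is not necessarily how process_documents is implemented.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path):
    """Return the text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def collect_paragraphs(folder_path):
    """Map each .docx file name in folder_path to its list of paragraphs."""
    results = {}
    for path in sorted(Path(folder_path).glob("*.docx")):
        # Skip the "~$" lock files Word creates while a document is open.
        if path.name.startswith("~$"):
            continue
        results[path.name] = read_docx_paragraphs(path)
    return results
```

Calling collect_paragraphs('/path/to/your/folder') would return a dict keyed by file name, which a later step could parse into fields and write out as spreadsheet rows.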

License

This project is licensed under the MIT License - see the LICENSE file for details.
