solr-ocrhighlighting

This Solr plugin lets you put word-level OCR text into one or more of your documents' fields and then obtain structured highlighting data, with the text and its position on the page, at query time. It does this without storing the OCR data in the index itself; instead, it reuses the existing OCR files on disk.

It works by extending Solr's standard UnifiedHighlighter with support for loading external field values and determining OCR positions from those field values. This means that all options and query types supported by the UnifiedHighlighter are also supported for OCR highlighting. The plugin also works transparently with non-OCR fields and just lets the default implementation handle those.
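Because the plugin builds on the standard highlighter, a highlighting request looks like an ordinary Solr query with a few extra parameters. The following is a minimal sketch of how such a request URL might be assembled. The host, the core name `ocr`, and the field name `ocr_text` are hypothetical, and the `hl.ocr.fl` parameter name is an assumption that should be checked against the plugin's documentation.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, core, query, ocr_field):
    """Build a Solr select URL that asks for highlighting on an OCR field."""
    params = [
        ("q", f"{ocr_field}:{query}"),
        ("hl", "true"),            # standard Solr switch that enables highlighting
        ("hl.ocr.fl", ocr_field),  # assumed plugin parameter naming the OCR field
    ]
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names, used only for illustration.
url = build_highlight_query(
    "http://localhost:8983/solr", "ocr", "london", "ocr_text"
)
print(url)
```

Any other option supported by the UnifiedHighlighter could be appended to `params` in the same way.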

The plugin works with all Solr versions >= 7.5.

Features

Index hOCR, ALTO or MiniOCR directly without preprocessing
Retrieve all the information needed to render a highlighted snippet view directly from Solr, without postprocessing
Keep your index size manageable by not storing the OCR in the index

Installation

Download the latest JAR from the GitHub Releases Page
Drop the JAR into the core/lib/ directory for your Solr core
Refer to the Documentation for instructions on how to configure Solr and index documents

Compiling

If you want to use the latest bleeding-edge version, you can also compile the plugin yourself. For this you will need at least Java 8 and Maven:

$ mvn package

The JAR will be in target/solr-ocrhighlighting-$version.jar.

Running the example

The repository includes a full-fledged example setup based on the Google Books 1000 and the BNL L'Union Newspaper datasets. The Google Books dataset consists of 1000 volumes along with their OCRed text in the hOCR format and all book pages as full-resolution JPEG images. The BNL dataset consists of 2712 newspaper issues in the ALTO format and all pages as high-resolution TIF images.

The example ships with a search interface that allows querying the OCRed texts and displays the matching passages as highlighted image and text snippets. Also included is a small IIIF viewer that allows viewing the documents and searching for text within them.

To run:

cd example
docker-compose up -d
./ingest.py
Access http://localhost:8181 in your browser

For more information about the example setup, refer to the documentation.

Limitations

The supported file size is limited to 2 GiB, since Lucene uses 32-bit integers throughout for storing offsets.
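To spot files that would run into this limit before indexing, a small pre-flight check can help. This is a sketch that assumes the boundary is the largest signed 32-bit value (2^31 - 1) measured in bytes; the function names are hypothetical and not part of the plugin.

```python
import os

# Largest value a signed 32-bit integer can hold, just under 2 GiB.
MAX_INT32 = 2**31 - 1


def fits_offset_limit(size_in_bytes):
    """Return True if every offset within a file of this size fits in 32 bits."""
    return size_in_bytes <= MAX_INT32


def oversized_files(paths):
    """Return the subset of paths whose on-disk size exceeds the limit."""
    return [p for p in paths if not fits_offset_limit(os.path.getsize(p))]
```

Calling `oversized_files` on the list of OCR files before ingestion returns the ones that would need to be split or excluded.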

Contributing

Found a bug? Want a new feature? Make a fork, create a pull request.

For larger changes/features, it's usually wise to open an issue before starting the work, so we can discuss if it's a fit.

Support us!

We always appreciate it when users let us know how they're using our software and libraries. It helps us focus our efforts on our open source offerings, so we can create even more useful stuff for the community.

So don't hesitate to drop us a line at [email protected] if you could make use of the plugin :-)

License

MIT License
