OCR and spatial search #70

Open
legsak1mbo opened this issue Dec 10, 2019 · 8 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@legsak1mbo

This is more of a feature request than an issue, but it would be incredibly useful if the hOCR data could be used for querying as well as highlighting, for example to search for a word within a specific region of the document by its page and/or coordinates.

@jbaiter
Member

jbaiter commented Jan 28, 2020

Doing this "properly" for arbitrary regions is out of scope for this specific plugin, I'm afraid, since it does not store any information about the actual coordinates in the index and thus can't query on them (the way Solr's Spatial Search does, for example).

One hacky way to go about this would be to add a filterBbox parameter that is checked at highlighting time against the bounding boxes in the OCR file. All snippets falling outside of the queried bounding box would be filtered out. This shouldn't be too hard to implement, since we have access to the bounding box information at highlighting time and can filter on it very easily. This could be a good issue for a pull request from a new developer :-)
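To make the idea concrete, here is a minimal client-side sketch in Python rather than in the plugin's Java. It is not the plugin's API: the box layout (`ulx`, `uly`, `lrx`, `lry` keys), the `highlights` list on each snippet, and both function names are assumptions made for illustration.

```python
from typing import Dict, List

Box = Dict[str, int]  # assumed keys: ulx, uly, lrx, lry (pixel coordinates)


def inside(inner: Box, outer: Box) -> bool:
    """Return True if `inner` lies completely within `outer`."""
    return (
        inner["ulx"] >= outer["ulx"]
        and inner["uly"] >= outer["uly"]
        and inner["lrx"] <= outer["lrx"]
        and inner["lry"] <= outer["lry"]
    )


def filter_snippets(snippets: List[dict], query_box: Box) -> List[dict]:
    """Keep snippets that have at least one highlight box inside `query_box`.

    Each snippet is assumed to carry a flat `highlights` list of boxes.
    """
    return [
        s for s in snippets
        if any(inside(h, query_box) for h in s.get("highlights", []))
    ]
```

Requiring full containment is the strictest reading of "falling outside"; an overlap test would be more forgiving for boxes that straddle the edge of the queried region.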

There is, however, already support for filtering by a specific page in a document: check out the hl.ocr.pageId parameter in the documentation. Combined with an fq on the document id, it allows you to limit snippet generation to a single page in a single document. We use this to implement the IIIF Content Search API, which requires searching in a single page of a document.
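A request along those lines could be assembled as in the following sketch. Only `hl.ocr.pageId` and the use of `fq` come from the comment above; the Solr URL, the `ocr_text` and `id` field names, and the function name are assumptions, and `hl.ocr.fl` should be checked against the plugin documentation for your version.

```python
from urllib.parse import urlencode


def page_query_url(base_url: str, doc_id: str, page_id: str, terms: str) -> str:
    """Build a select URL that restricts OCR highlighting to one page of one document."""
    params = {
        "q": f"ocr_text:({terms})",   # hypothetical OCR field name
        "fq": f'id:"{doc_id}"',       # limit the search to a single document
        "hl": "true",
        "hl.ocr.fl": "ocr_text",      # field to highlight (assumed name)
        "hl.ocr.pageId": page_id,     # only generate snippets for this page
    }
    return f"{base_url}?{urlencode(params)}"


url = page_query_url(
    "http://localhost:8983/solr/ocr/select",  # hypothetical core
    doc_id="census_1901_0042",
    page_id="page_12",
    terms="Name",
)
```

The page identifier has to match whatever id the OCR markup assigns to the page, so the value used here is a placeholder.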

@jbaiter jbaiter added enhancement New feature or request good first issue Good for newcomers labels Jan 28, 2020
@legsak1mbo
Author

Sorry I've been so long coming back to this. The ideal would be if we could search for "the first instance of a term after the previous" and/or "a term X & Y away from an anchor term" where the anchor would be something like a chapter title. Java isn't really my forte but I'll certainly look into it.

@jbaiter
Member

jbaiter commented Jun 17, 2020

If you want to implement search inside of chapters, you could just index your documents at the chapter level by creating source pointers that point to the markup for each chapter. This is described in the documentation here: https://dbmdz.github.io/solr-ocrhighlighting/indexing/#one-or-more-partial-files-per-solr-document.

Otherwise this is hard to implement with Lucene/Solr and the plugin in its current form. You could try sloppy phrase queries like "<chapter_word> <term>"~20, which would yield all spans where <term> appears within 20 token positions of <chapter_word>, but this will also include cases where the term appears before the chapter word.
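For illustration, such a query string could be built like this. The helper name and the sample words are made up; the only thing taken from the comment above is the `"..."~N` proximity syntax.

```python
def sloppy_phrase(anchor: str, term: str, slop: int = 20) -> str:
    """Build a sloppy phrase query such as "Chapter whale"~20."""

    def esc(text: str) -> str:
        # Escape backslashes and double quotes so the phrase stays well-formed.
        return text.replace("\\", "\\\\").replace('"', '\\"')

    return f'"{esc(anchor)} {esc(term)}"~{slop}'


query = sloppy_phrase("Chapter", "whale")
```

The slop value bounds the distance in token positions, not pixels, so it says nothing about where on the page the two words sit.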

I'm not sure if the approach proposed in my first response is going to work for you, since you'd need to know the specific region on a given page where a match is allowed to occur. This could be useful for a feature like "search only in headers/footers" (if those headers/footers appear in the same positions every time), but that is not your use case if I understood you correctly?

@legsak1mbo
Author

legsak1mbo commented Jun 17, 2020

What I'm thinking of is something like an old census form where the scans are all slightly wonky. The idea would be that you could use something like "Name" as an anchor and search for the first instance of that. Then, knowing that the subject's name would be X & Y pixels from the anchor term, you could provide the actual name as the result.

So like a position-aware query but based on the actual OCR coordinates rather than the position of the term in the text.

@jbaiter
Member

jbaiter commented Jun 17, 2020

I see! A hacky and probably inefficient way to do this without changes to the plugin could be:

  1. Perform a query for Name to get candidate locations for the anchor
  2. Apply some heuristics to determine which of those candidates are actually anchors
  3. Based on the anchor location, determine one or more regions where the subject name is likely to be located
  4. Query for the terms on the anchor location's page (with the hl.ocr.pageId parameter) and throw away all snippets that don't overlap with the subject name regions
  5. You're hopefully left with matches for your subject name in close proximity to a Name field title.
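Steps 2 to 4 could be sketched on the client side roughly as follows. This assumes the anchor and term hits have already been fetched and reduced to plain boxes with `ulx`, `uly`, `lrx`, `lry` keys; the "topmost hit wins" heuristic, the offset parameters, and all function names are assumptions for illustration, not part of the plugin.

```python
from typing import Dict, List, Optional

Box = Dict[str, int]  # assumed keys: ulx, uly, lrx, lry (pixel coordinates)


def pick_anchor(candidates: List[Box]) -> Optional[Box]:
    """Step 2 (assumed heuristic): take the topmost, then leftmost candidate."""
    return min(candidates, key=lambda b: (b["uly"], b["ulx"]), default=None)


def target_region(anchor: Box, dx: int, dy: int,
                  width: int, height: int, pad: int = 0) -> Box:
    """Step 3: a box offset by (dx, dy) from the anchor's upper-left corner.

    `pad` widens the box on every side to tolerate skewed scans.
    """
    ulx = anchor["ulx"] + dx
    uly = anchor["uly"] + dy
    return {"ulx": ulx - pad, "uly": uly - pad,
            "lrx": ulx + width + pad, "lry": uly + height + pad}


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any area."""
    return (a["ulx"] < b["lrx"] and b["ulx"] < a["lrx"]
            and a["uly"] < b["lry"] and b["uly"] < a["lry"])


def matches_near_anchor(anchor_hits: List[Box], term_hits: List[Box],
                        dx: int, dy: int, width: int, height: int,
                        pad: int = 20) -> List[Box]:
    """Step 4: keep term hits that overlap the region derived from the anchor."""
    anchor = pick_anchor(anchor_hits)
    if anchor is None:
        return []
    region = target_region(anchor, dx, dy, width, height, pad)
    return [hit for hit in term_hits if overlaps(hit, region)]
```

The padding is the knob to tune for wonky scans: a larger value tolerates more skew but lets in more false positives from neighbouring form fields.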

@legsak1mbo
Author

Right, I see. But presumably that wouldn't work with a wildcard search (search for any name in the region) because you wouldn't get the highlighting for that?

@jbaiter
Member

jbaiter commented Jun 17, 2020

Yes, correct. If you're just interested in the general content of a region, you can replace steps 4 and 5 with parsing the OCR for the page and extracting the text in the subject name regions yourself.
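A rough sketch of that extraction for hOCR, using only the Python standard library, might look like the following. It assumes words are marked up as `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in their `title` attribute, as is common in hOCR output; the class and function names are made up for this example.

```python
import re
from html.parser import HTMLParser
from typing import List, Tuple

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
Bounds = Tuple[int, int, int, int]  # x0, y0, x1, y1


class WordCollector(HTMLParser):
    """Collects (text, bounds) for every `ocrx_word` span in an hOCR page."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, Bounds]] = []
        self._bbox = None  # bounds of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            match = BBOX_RE.search(attributes.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))


def text_in_region(hocr: str, region: Bounds) -> str:
    """Join, in document order, all words whose box overlaps `region`."""
    rx0, ry0, rx1, ry1 = region
    parser = WordCollector()
    parser.feed(hocr)
    return " ".join(
        word for word, (x0, y0, x1, y1) in parser.words
        if x0 < rx1 and rx0 < x1 and y0 < ry1 and ry0 < y1
    )
```

Because this reads the OCR markup directly, it also covers the wildcard case: no query term is needed, only the region derived from the anchor.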

@legsak1mbo
Author

Thanks. I'll come back with any progress I make.

2 participants