OCR and spatial search #70

legsak1mbo · 2019-12-10T12:27:04Z

More of a feature request than an issue, but it would be incredibly useful if the hOCR data could be used for querying as well as highlighting, for example to search for a word within a specific region of the document by its page and/or coordinates.

jbaiter · 2020-01-28T09:47:04Z

Doing this "properly" for arbitrary regions is out of scope for this specific plugin, I'm afraid: it does not store any information about the actual coordinates in the index and thus can't query for them (the way Solr's Spatial Search does, for example).

One hacky way to go about this would be to add a filterBbox parameter that is checked at highlighting time against the bounding boxes in the OCR file. All snippets falling outside of the queried bounding box would be filtered out. This shouldn't be too hard to implement, since the bounding box information is already available at highlighting time. This could be a good issue for a pull request from a new developer :-)
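To make the idea concrete, here is a minimal client-side sketch in Python of the same filtering logic, applied to already-parsed snippets rather than inside the plugin. The box keys (`ulx`, `uly`, `lrx`, `lry`) and the `regions` list are assumptions about what a parsed snippet might look like, and `inside` and `filter_snippets` are hypothetical helper names, not plugin API.

```python
from typing import Dict, List

Box = Dict[str, int]  # assumed keys: ulx, uly, lrx, lry (pixel coordinates)


def inside(inner: Box, outer: Box) -> bool:
    """Return True if `inner` lies completely within `outer`."""
    return (
        inner["ulx"] >= outer["ulx"]
        and inner["uly"] >= outer["uly"]
        and inner["lrx"] <= outer["lrx"]
        and inner["lry"] <= outer["lry"]
    )


def filter_snippets(snippets: List[dict], query_box: Box) -> List[dict]:
    """Keep only snippets whose regions all fall inside the queried box."""
    return [
        s for s in snippets
        if s.get("regions") and all(inside(r, query_box) for r in s["regions"])
    ]


# Hypothetical example data, not real plugin output.
snippets = [
    {"text": "header match", "regions": [{"ulx": 100, "uly": 50, "lrx": 400, "lry": 90}]},
    {"text": "body match", "regions": [{"ulx": 100, "uly": 900, "lrx": 400, "lry": 940}]},
]
header_area = {"ulx": 0, "uly": 0, "lrx": 2000, "lry": 200}
kept = filter_snippets(snippets, header_area)
print([s["text"] for s in kept])
```

Whether a snippet should be dropped when it only partially leaves the box is a design choice; this sketch takes the strict reading and requires full containment.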

There is, however, already support for filtering by a specific page in a document: check out the hl.ocr.pageId parameter in the documentation. Combined with an fq on the document id, it allows you to limit the snippet generation to a single page in a single document. We use this to implement the IIIF Content Search API, which requires searching in a single page of a document.
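As a rough illustration, the request parameters for such a page-restricted query could be assembled like this in Python. The field names `ocr_text` and `id`, the example values, and the helper name `build_page_query` are assumptions for the sketch; `hl.ocr.fl` is, to my understanding, the plugin's parameter for naming the OCR field to highlight, so check it against the documentation for your version.

```python
from urllib.parse import parse_qs, urlencode


def build_page_query(term: str, doc_id: str, page_id: str,
                     ocr_field: str = "ocr_text") -> str:
    """Build a query string that limits OCR snippets to one page of one document."""
    params = [
        ("q", f"{ocr_field}:{term}"),
        ("fq", f'id:"{doc_id}"'),       # restrict to a single document (assumed id field)
        ("hl", "true"),
        ("hl.ocr.fl", ocr_field),       # OCR field to highlight (assumed field name)
        ("hl.ocr.pageId", page_id),     # restrict snippets to a single page
    ]
    return urlencode(params)


# Hypothetical document and page identifiers.
query_string = build_page_query("Name", "census-1901", "page_12")
parsed = parse_qs(query_string)
print(parsed["hl.ocr.pageId"], parsed["fq"])
```

The resulting string would be appended to the select handler URL of your own core.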

legsak1mbo · 2020-06-17T11:41:16Z

Sorry I've been so long coming back to this. The ideal would be if we could search for "the first instance of a term after the previous" and/or "a term X & Y away from an anchor term", where the anchor would be something like a chapter title. Java isn't really my forte, but I'll certainly look into it.

jbaiter · 2020-06-17T11:50:31Z

If you want to implement search inside of chapters, you could just index your documents at the chapter level by creating source pointers that point to the markup for each chapter. This is described in the documentation here: https://dbmdz.github.io/solr-ocrhighlighting/indexing/#one-or-more-partial-files-per-solr-document.

Otherwise, this is hard to implement with Lucene/Solr and the plugin in its current form. You could try sloppy phrase queries like "<chapter_word> <term>"~20, which would yield all spans where <term> appears within 20 token positions of <chapter_word>, but this will also include cases where the term appears before the chapter.
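For reference, a small Python helper that builds such a sloppy phrase query string might look like the following. The name `sloppy_phrase` is hypothetical, and the escaping only covers backslashes and double quotes inside the phrase, which is an assumption about what your input may contain.

```python
def sloppy_phrase(anchor: str, term: str, slop: int = 20) -> str:
    """Build a Lucene-style sloppy phrase query such as "anchor term"~20."""
    if slop < 0:
        raise ValueError("slop must be non-negative")

    def clean(word: str) -> str:
        # Escape backslashes first, then double quotes, so the phrase stays intact.
        return word.replace("\\", "\\\\").replace('"', '\\"')

    return f'"{clean(anchor)} {clean(term)}"~{slop}'


# Hypothetical anchor and search term.
query = sloppy_phrase("Chapter", "whale")
print(query)
```

Since the match can fall on either side of the anchor, results would still need a check on the client that the term's position comes after the anchor's.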

I'm not sure if the approach proposed in my first response is going to work for you, since you'd need to know the specific region on a given page where a match is allowed to occur. This could be useful for a feature like "search only in headers/footers" (if those headers/footers appear in the same positions every time), but that is not your use case if I understood you correctly?

legsak1mbo · 2020-06-17T12:02:26Z

What I'm thinking of is something like an old census form where the scans are all slightly wonky. The idea would be that you could use something like "Name" as an anchor and search for the first instance of it. Then, knowing that the subject's name would be X & Y pixels from the anchor term, you could return the actual name as the result.

So like a position-aware query but based on the actual OCR coordinates rather than the position of the term in the text.

jbaiter · 2020-06-17T12:19:52Z

I see! A hacky and probably inefficient way to do this without changes to the plugin could be:

1. Perform a query for Name to get candidate locations for the anchor
2. Apply some heuristics to determine which of those candidates are actually anchors
3. Based on the anchor location, determine one or more regions where the subject name is likely to be located
4. Query for the terms on the anchor location's page (with the hl.ocr.pageId parameter) and throw away all snippets that don't overlap with the subject name regions
5. You're hopefully left with matches for your subject name in close proximity to a Name field title.
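The region-derivation and overlap-filtering parts of this procedure could be sketched in Python roughly as follows. Everything here is illustrative: the `Box` type, the offsets, and the sample coordinates are assumptions, and the candidate boxes would in practice come from the highlighting response.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    ulx: int
    uly: int
    lrx: int
    lry: int


def expected_region(anchor: Box, dx: int, dy: int, width: int, height: int) -> Box:
    """Place a target region at an offset from the anchor's upper-left corner."""
    x, y = anchor.ulx + dx, anchor.uly + dy
    return Box(x, y, x + width, y + height)


def overlaps(a: Box, b: Box) -> bool:
    """Return True if the two boxes share any area."""
    return a.ulx < b.lrx and b.ulx < a.lrx and a.uly < b.lry and b.uly < a.lry


def matches_near_anchor(anchor: Box, candidates: List[Tuple[str, Box]],
                        dx: int, dy: int, width: int, height: int) -> List[str]:
    """Discard every candidate match that does not overlap the expected region."""
    region = expected_region(anchor, dx, dy, width, height)
    return [text for text, box in candidates if overlaps(box, region)]


# Hypothetical coordinates: an anchor "Name" label and two matches on the same page.
anchor = Box(120, 300, 220, 340)
candidates = [
    ("John Smith", Box(330, 305, 450, 345)),    # right next to the label
    ("Smithfield", Box(330, 1200, 450, 1240)),  # far further down the page
]
near = matches_near_anchor(anchor, candidates, dx=200, dy=-10, width=400, height=60)
print(near)
```

Making the region somewhat larger than the expected field is one way to tolerate the skew in wonky scans, at the cost of more false positives.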

legsak1mbo · 2020-06-17T12:25:06Z

Right, I see. But presumably that wouldn't work with a wildcard search (search for any name in the region) because you wouldn't get the highlighting for that?

jbaiter · 2020-06-17T12:38:12Z

Yes, correct. If you're just interested in the general content of a region, you can replace steps 4 and 5 with parsing the OCR for the page and extracting the text in the subject name regions yourself.
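A minimal sketch of that do-it-yourself extraction, assuming hOCR input in which words are `ocrx_word` spans whose `title` attribute carries `bbox x0 y0 x1 y1`. The names `WordCollector` and `text_in_region` are hypothetical, the sample markup is made up, and the parser does not handle further spans nested inside a word.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs from hOCR `ocrx_word` spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def text_in_region(hocr: str, region) -> str:
    """Join, in document order, all words whose center lies inside `region`."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    rx0, ry0, rx1, ry1 = region
    picked = []
    for text, (x0, y0, x1, y1) in parser.words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            picked.append(text)
    return " ".join(picked)


# Made-up hOCR fragment for illustration.
sample = """
<div class='ocr_page' title='bbox 0 0 2000 3000'>
  <span class='ocr_line' title='bbox 100 300 800 345'>
    <span class='ocrx_word' title='bbox 100 300 200 340'>Name</span>
    <span class='ocrx_word' title='bbox 330 305 420 345; x_wconf 91'>John</span>
    <span class='ocrx_word' title='bbox 430 305 540 345; x_wconf 88'>Smith</span>
  </span>
</div>
"""
name_text = text_in_region(sample, (320, 290, 720, 350))
print(name_text)
```

Testing the word center rather than the full box is a deliberate choice here, so that words straddling the region edge on skewed scans are still assigned to one side.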

legsak1mbo · 2020-06-17T12:39:29Z

Thanks. I'll come back with any progress I make.

jbaiter added the "enhancement" (New feature or request) and "good first issue" (Good for newcomers) labels · Jan 28, 2020
