Can't index hOCR documents on Windows #174

Open
petr-fleischmann opened this issue Jun 11, 2021 · 4 comments
petr-fleischmann commented Jun 11, 2021

Some hOCR files can't be parsed (plugin version 0.6.0) because their content contains characters with diacritics. For example, the characters "ů" and "á" in the words "aráme" and "ků".
Example hOCR file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract v4.0.0.20181030' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='bbox 0 0 2488 3510; ppageno 0'>
   <div class='ocr_carea' id='block_1_4' title="bbox 2407 1654 2482 3505">
    <p class='ocr_par' id='par_1_4' lang='ces' title="bbox 2407 1654 2482 3505">
     <span class='ocr_line' id='line_1_4' title="bbox 2407 1654 2482 3505; textangle 90; x_size 35; x_descenders 7; x_ascenders 12">
      <span class='ocrx_word' id='word_1_13' title='bbox 2447 2681 2463 2757; x_wconf 0'>aráme</span>
      <span class='ocrx_word' id='word_1_15' title='bbox 2420 2481 2462 2530; x_wconf 66'>ků</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

Indexing this file throws the following error:

2021-06-11 08:42:33.557 ERROR (qtp1516500233-30) [   x:testOCR] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id ocrdoc-79 to the index; possible analysis error.
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:251)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:289)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:223)
	at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:507)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:145)
	at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:121)
	at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:84)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2578)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:780)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:566)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:423)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:505)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Failed to parse the OCR markup, make sure your files are well-formed and your regions start/end on complete tags! (Source was: c:/OCR/MNB_006_045/2ff558170f3aea11a96000155d02ad02.hocr)
	at de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:144)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilter.readNextWord(OcrCharFilter.java:29)
	at de.digitalcollections.solrocr.lucene.filters.OcrCharFilter.read(OcrCharFilter.java:125)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:675)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:898)
	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:148)
	at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:41)
	at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
	at org.apache.lucene.analysis.en.PorterStemFilter.incrementToken(PorterStemFilter.java:67)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:812)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
	at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:964)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:342)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:289)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:236)
	... 51 more
Caused by: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
 at [row,col {unknown-source}]: [1,1]
	at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:728)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3678)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:860)
	at de.digitalcollections.solrocr.formats.hocr.HocrParser.seekToNextWord(HocrParser.java:264)
	at de.digitalcollections.solrocr.formats.hocr.HocrParser.readNext(HocrParser.java:75)
	at de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:140)
	... 70 more
Caused by: com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
 at [row,col {unknown-source}]: [1,1]
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:98)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
	at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4647)
	at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4146)
	at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3720)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3676)
	... 74 more
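
In a trace of this length, the innermost "Caused by" entry is usually the most informative one. Here it is the Woodstox `WstxException` reporting that the reader returned 0 characters. The following is a minimal sketch for pulling the cause chain out of a pasted log, assuming Python 3 and only the standard library. The names `cause_chain` and `SAMPLE_TRACE` are hypothetical, and the sample is an abridged copy of the trace above, not the full log.

```python
# Abridged copy of the trace from this report (most frames omitted).
SAMPLE_TRACE = """\
org.apache.solr.common.SolrException: Exception writing document id ocrdoc-79 to the index; possible analysis error.
\tat org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:251)
Caused by: java.lang.RuntimeException: Failed to parse the OCR markup, make sure your files are well-formed and your regions start/end on complete tags!
\tat de.digitalcollections.solrocr.formats.OcrParser.next(OcrParser.java:144)
Caused by: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
\tat com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
Caused by: com.ctc.wstx.exc.WstxException: Reader (of type com.ctc.wstx.io.MergedReader) returned 0 characters, even when asked to read up to 4000
\tat com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:98)
"""

PREFIX = "Caused by: "


def cause_chain(trace):
    """Return the 'Caused by' headers of a Java stack trace, outermost first."""
    chain = []
    for line in trace.splitlines():
        stripped = line.strip()
        if stripped.startswith(PREFIX):
            chain.append(stripped[len(PREFIX):])
    return chain


chain = cause_chain(SAMPLE_TRACE)
# The last entry is the innermost (root) cause.
print("root cause:", chain[-1])
```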

The same hOCR without the diacritics "ů" and "á" is indexed without problems:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract v4.0.0.20181030' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='bbox 0 0 2488 3510; ppageno 0'>
   <div class='ocr_carea' id='block_1_4' title="bbox 2407 1654 2482 3505">
    <p class='ocr_par' id='par_1_4' lang='ces' title="bbox 2407 1654 2482 3505">
     <span class='ocr_line' id='line_1_4' title="bbox 2407 1654 2482 3505; textangle 90; x_size 35; x_descenders 7; x_ascenders 12">
      <span class='ocrx_word' id='word_1_13' title='bbox 2447 2681 2463 2757; x_wconf 0'>arme</span>
      <span class='ocrx_word' id='word_1_15' title='bbox 2420 2481 2462 2530; x_wconf 66'>k</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
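
The two samples differ only in the word contents: in UTF-8, "á" and "ů" each take two bytes, so the character count and the byte count of those words no longer match. The sketch below illustrates that difference on a reduced version of the failing sample. It assumes Python 3 and its standard library, `word_lengths` and `HOCR_SNIPPET` are hypothetical names, and it is not meant as a diagnosis of what the plugin does internally.

```python
import xml.etree.ElementTree as ET

# Reduced version of the failing sample. The diacritics are written as
# escapes so the code points are unambiguous: \u00e1 is "á", \u016f is "ů".
HOCR_SNIPPET = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <span class='ocr_line' id='line_1_4'>
   <span class='ocrx_word' id='word_1_13'>ar\u00e1me</span>
   <span class='ocrx_word' id='word_1_15'>k\u016f</span>
  </span>
 </body>
</html>"""

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def word_lengths(hocr_text):
    """Return (word, character count, UTF-8 byte count) for each ocrx_word."""
    root = ET.fromstring(hocr_text.encode("utf-8"))
    result = []
    for span in root.iter(XHTML_NS + "span"):
        if span.get("class") == "ocrx_word":
            word = span.text or ""
            result.append((word, len(word), len(word.encode("utf-8"))))
    return result


for word, chars, utf8_bytes in word_lengths(HOCR_SNIPPET):
    print(f"{word}: {chars} characters, {utf8_bytes} bytes in UTF-8")
```

For the version without diacritics ("arme", "k"), both counts are equal for every word.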

jbaiter self-assigned this Jun 11, 2021
jbaiter added the bug ("Something isn't working") label Jun 11, 2021

jbaiter commented Jun 11, 2021

Thank you for the detailed bug report. This should be enough to pinpoint the cause of the bug and hopefully find a fix. I will report back once I've gotten around to looking into it. That might be a while, since I'm currently on parental leave, i.e. it will happen when the little one has had a good night and I'm not too swamped with household stuff (-:


jbaiter commented Jun 14, 2021

So I just tried to reproduce the issue with the example document from the OP, but it indexes just fine for me 🤔

Can you share the file that causes the issue? I.e. the actual c:/OCR/MNB_006_045/2ff558170f3aea11a96000155d02ad02.hocr file on disk.

Also, could you try running the same setup with the same data inside a Docker container with a Linux system? The plugin has only been tested on Linux and uses a few low-level interfaces that might behave differently on Windows, so it would be good to verify whether that is the case here.
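
Since the error message asks to make sure the files are well-formed, it can also help to rule out the file on disk before comparing platforms. A minimal sketch of such a check, assuming Python 3 and its standard library; `check_hocr_file` is a hypothetical helper and not part of the plugin. It only tests strict UTF-8 decoding and XML well-formedness, and it does not validate against the hOCR specification.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_hocr_file(path):
    """Return a list of problems found in the file; an empty list means none."""
    raw = Path(path).read_bytes()
    problems = []

    # Strict decoding fails on any byte sequence that is not valid UTF-8.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8: {exc}")

    # Parse from bytes so the parser honours the encoding in the XML declaration.
    try:
        ET.fromstring(raw)
    except ET.ParseError as exc:
        problems.append(f"not well-formed XML: {exc}")

    return problems
```

Calling `check_hocr_file` with the path of the affected file should return an empty list if the file itself is fine, which would point towards the environment and away from the data.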

petr-fleischmann commented

Thanks for the quick reply.

My results are:

| OS | Solr version | Plugin version | Status |
| --- | --- | --- | --- |
| Windows 10 | 8.2 | 0.6.0 | NOK |
| Windows 10 | 8.2 | 0.5.0 | OK |
| Windows 10 (WSL2 Ubuntu + Docker, i.e. Linux) | 8.7 | 0.6.0 | OK |

I think you're right. The problem seems to be specific to Windows (with plugin version 0.6.0).

2ff558170f3aea11a96000155d02ad02.zip


jbaiter commented Jun 15, 2021

Thank you! I'll try setting up a Windows environment to reproduce and hopefully fix the issue. It might take a bit, though 😬

jbaiter changed the title from "Can't parse hOCR file with diacritics content" to "Can't index hOCR documents on Windows" Sep 14, 2021