[Bug]: Existing text is completely replaced with other characters #1337

david-sledge · 2024-06-18T23:51:47Z

Describe the bug

I found an issue with certain PDFs that already contain text: the text is replaced with other characters, which renders the PDFs unreadable. This happens with both the --redo-ocr and --skip-text flags. Attached are (a) a sample PDF, (b) the results of OCRing it, and (c) a tarball containing everything needed to reproduce the issue.

Steps to reproduce

1. Download the tarball to a Linux machine with Docker installed.
2. Run the following command chain: tar -xzf bad-pdf-example.tar.gz && cd bad-pdf-example && docker run --rm -v .:/root/test-files -it $(docker build -q -t ocrmypdf-test .) && docker rmi ocrmypdf-test:latest
3. Open test-redo-ocr-result.pdf and test-skip-text-result.pdf

Files

test.pdf
test-redo-ocr-result.pdf
test-skip-text-result.pdf
bad-pdf-example.tar.gz

How did you download and install the software?

Linux package manager (apt, dnf, etc.), Docker container

OCRmyPDF version

16.3.1

Relevant log output

tesseract 5.4.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.17
OCRmyPDF version:
16.3.1
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 skipping all processing on this page                                                                                                                                                                                      _pipeline.py:330
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                                                     ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                             _metadata.py:62
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                                                                    _pipeline.py:989
Total file size ratio: 0.06 savings: -1515.7%                                                                                                                                                                                   _pipeline.py:992
Output file is a PDF/A-2B (as expected)                                                                                                                                                                                           _common.py:441
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 redoing OCR                                                                                                                                                                                                               _pipeline.py:327
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                                                     ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                             _metadata.py:62
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                                                                    _pipeline.py:989
Total file size ratio: 0.06 savings: -1554.7%                                                                                                                                                                                   _pipeline.py:992
Output file is a PDF/A-2B (as expected)


jbarlow83 · 2024-06-21T23:02:28Z

The problem with this file is that it does not embed the fonts it uses, in this case Arial Bold and Arial Bold Italic. It was previously processed by Nitro Pro 13.

When Ghostscript (which OCRmyPDF uses) processes the file, it replaces the missing fonts with a substitute, "DroidSansFallback". The kerning of the substitute is different, so the PDF viewer sees spaces between letters. At least, that is what happens for me. I don't know how an Asian font was substituted in your version.
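To check whether a PDF has the non-embedded-font problem described above, one option is to inspect the report printed by `pdffonts` (from poppler-utils, assumed here to be installed and on PATH). The sketch below is not part of OCRmyPDF; the helper names `non_embedded_fonts` and `check_pdf` are made up for illustration, and the parsing assumes the usual `pdffonts` table layout, in which the `emb` column is the fifth field from the right.

```python
import subprocess


def non_embedded_fonts(report: str) -> list[str]:
    """Return names of fonts whose 'emb' column reads 'no' in a pdffonts report."""
    missing = []
    # Skip the header line and the row of dashes beneath it.
    for line in report.splitlines()[2:]:
        fields = line.split()
        if len(fields) < 8:
            continue  # blank line or a layout this sketch does not understand
        # Counting from the right: generation, object number, uni, sub, emb.
        # Indexing from the right copes with multi-word font types ("CID Type 0C").
        name, emb = fields[0], fields[-5]
        if emb == "no":
            missing.append(name)
    return missing


def check_pdf(path: str) -> list[str]:
    """Run pdffonts on `path` and list the fonts that are not embedded."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return non_embedded_fonts(result.stdout)
```

Calling `check_pdf("test.pdf")` on an affected file should return a non-empty list; an empty list suggests font substitution is not the cause. Font names containing spaces would confuse this simple parser.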

ocrmypdf --output-type pdf avoids Ghostscript, and produces a usable result.

Try running
gs -sDEVICE=pdfwrite -o output.pdf test.pdf
and see if you can reproduce the Japanese-Korean version; if so, please report it to Ghostscript. I won't report it myself because the test file potentially contains personal information that is not mine.
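For scripting the two suggestions above, here is a minimal sketch that only assembles the command lines; it assumes `ocrmypdf` and `gs` are on PATH. The function names and the output file names are placeholders, not anything defined by OCRmyPDF.

```python
import subprocess


def ocrmypdf_without_ghostscript(src: str, dst: str, mode: str = "--redo-ocr") -> list[str]:
    """Build an ocrmypdf command that uses --output-type pdf, as suggested above."""
    # Only the two modes mentioned in the report are accepted in this sketch.
    if mode not in ("--redo-ocr", "--skip-text"):
        raise ValueError(f"unsupported mode: {mode}")
    return ["ocrmypdf", "--output-type", "pdf", mode, src, dst]


def ghostscript_repro(src: str, dst: str) -> list[str]:
    """Build the Ghostscript command used to reproduce the font substitution."""
    return ["gs", "-sDEVICE=pdfwrite", "-o", dst, src]


def run(cmd: list[str]) -> None:
    """Execute one of the commands above, raising if it exits non-zero."""
    subprocess.run(cmd, check=True)
```

For example, `run(ocrmypdf_without_ghostscript("test.pdf", "fixed.pdf"))` applies the workaround, and `run(ghostscript_repro("test.pdf", "output.pdf"))` produces a file to inspect for the substituted glyphs.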

jbarlow83 · 2024-06-21T23:03:53Z

ocrmypdf --force-ocr would also fix this file completely, with or without Ghostscript.

I am considering adding a warning about Ghostscript font substitution, especially if someone else encounters this. Ghostscript has had several issues with mangling text recently.

beshtim · 2024-06-25T09:21:01Z

I think I have the same problem.

I am also running with --redo-ocr, and these hieroglyphs sometimes appear.

david-sledge added the "triage" label (Issue needs triage) on Jun 18, 2024

david-sledge assigned jbarlow83 on Jun 18, 2024

jbarlow83 added the "third party issue" label (Problem with a third party dependency) and removed the "triage" label (Issue needs triage) on Jun 21, 2024
