[Bug]: Existing text is completely replaced with other characters #1337

Open
david-sledge opened this issue Jun 18, 2024 · 3 comments
Labels
third party issue Problem with a third party dependency

@david-sledge

Describe the bug

With certain PDFs that already contain text, the existing text is replaced with other characters, which renders the PDFs unreadable. This happens with both the --redo-ocr and --skip-text flags. Attached are (a) a sample PDF, (b) the results of running OCR on it, and (c) a tarball containing everything needed to reproduce the issue.

Steps to reproduce

1. Download the tarball to a Linux machine with Docker installed.
2. Run the following command chain: tar -xzf bad-pdf-example.tar.gz && cd bad-pdf-example && docker run --rm -v .:/root/test-files -it $(docker build -q -t ocrmypdf-test .) && docker rmi ocrmypdf-test:latest
3. Open test-redo-ocr-result.pdf and test-skip-text-result.pdf

Files

test.pdf
test-redo-ocr-result.pdf
test-skip-text-result.pdf
bad-pdf-example.tar.gz

How did you download and install the software?

Linux package manager (apt, dnf, etc.), Docker container

OCRmyPDF version

16.3.1

Relevant log output

tesseract 5.4.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.17
OCRmyPDF version:
16.3.1
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 skipping all processing on this page                                                                                                                                                                                      _pipeline.py:330
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                                                     ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                             _metadata.py:62
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                                                                    _pipeline.py:989
Total file size ratio: 0.06 savings: -1515.7%                                                                                                                                                                                   _pipeline.py:992
Output file is a PDF/A-2B (as expected)                                                                                                                                                                                           _common.py:441
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 redoing OCR                                                                                                                                                                                                               _pipeline.py:327
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                                                     ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                             _metadata.py:62
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                                                                    _pipeline.py:989
Total file size ratio: 0.06 savings: -1554.7%                                                                                                                                                                                   _pipeline.py:992
Output file is a PDF/A-2B (as expected)
@david-sledge david-sledge added the triage Issue needs triage label Jun 18, 2024
@jbarlow83
Collaborator

The problem with this file is that it does not embed the fonts it uses, in this case Arial Bold and Arial Bold Italic. It was previously processed by Nitro Pro 13.

When Ghostscript (which OCRmyPDF uses) processes the file, it replaces the missing fonts with a substitute, "DroidSansFallback". The kerning of the substitute is different, so the PDF viewer sees spaces between letters. At least, that is what happens for me. I don't know how an Asian font was substituted in your version.
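To make the "fonts not embedded" condition concrete, here is a minimal sketch of the check. It assumes the font dictionaries have already been read out of the PDF into plain Python dicts keyed by PDF names (how they are extracted, for example with a PDF library, is not shown). The function names and the sample font names are hypothetical; the only PDF facts relied on are that an embedded font program lives under /FontFile, /FontFile2 or /FontFile3 in the font descriptor, that Type 0 fonts delegate to their descendant fonts, and that Type 3 fonts define their glyphs inline.

```python
# Keys that hold an embedded font program inside a PDF font descriptor.
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def is_embedded(font: dict) -> bool:
    """Return True if a font dictionary carries its own glyph data."""
    subtype = font.get("/Subtype")
    if subtype == "/Type3":
        # Type 3 fonts define their glyphs inline, so nothing can be missing.
        return True
    if subtype == "/Type0":
        # Composite fonts keep the descriptor on their descendant font(s).
        descendants = font.get("/DescendantFonts", [])
        return bool(descendants) and all(is_embedded(d) for d in descendants)
    descriptor = font.get("/FontDescriptor", {})
    return any(key in descriptor for key in FONT_FILE_KEYS)


def unembedded_fonts(fonts: list[dict]) -> list[str]:
    """List the /BaseFont names of fonts that have no embedded font program."""
    return [f.get("/BaseFont", "<unnamed>") for f in fonts if not is_embedded(f)]


# Hypothetical sample data, not taken from the attached test.pdf.
sample = [
    {"/Subtype": "/TrueType", "/BaseFont": "/Arial-BoldMT", "/FontDescriptor": {}},
    {"/Subtype": "/TrueType", "/BaseFont": "/Calibri",
     "/FontDescriptor": {"/FontFile2": b"font program bytes"}},
]
print(unembedded_fonts(sample))
```

Any name this returns is a font that a tool rewriting the PDF would have to substitute, which is the situation described above.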

ocrmypdf --output-type pdf avoids Ghostscript, and produces a usable result.
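For anyone scripting this workaround, a small sketch of a command builder follows. The helper name `ocrmypdf_command` and the file names are hypothetical; the flags are the ones named in this thread. It only assembles the argument list and leaves running it (for example with `subprocess.run`) to the caller.

```python
import shlex

# OCR modes; ocrmypdf accepts only one of these at a time.
MODES = ("--redo-ocr", "--skip-text", "--force-ocr")


def ocrmypdf_command(src: str, dst: str, mode: str = "--redo-ocr",
                     plain_pdf: bool = True) -> list[str]:
    """Build an ocrmypdf argument list without running anything."""
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode}")
    argv = ["ocrmypdf", mode]
    if plain_pdf:
        # Per the comment above, plain PDF output avoids Ghostscript.
        argv += ["--output-type", "pdf"]
    return argv + [src, dst]


print(shlex.join(ocrmypdf_command("test.pdf", "out.pdf")))
```

Returning a list rather than a shell string keeps file names with spaces safe when the result is passed to `subprocess.run`.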

Try doing
gs -sDEVICE=pdfwrite -o output.pdf test.pdf
and see if you can reproduce the Japanese-Korean version. If you can, please report it to Ghostscript. I won't report it myself because the test file potentially contains personal information that is not mine.
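One rough way to tell whether a Ghostscript run reproduced the mangled version is to compare the text extracted from the input and output files. The sketch below is only a heuristic and makes several assumptions: text extraction is done elsewhere and passed in as strings, the original document is mostly non-CJK, and the 0.2 threshold is an arbitrary illustrative value. The function names are hypothetical.

```python
import unicodedata

# Prefixes of Unicode character names used by Chinese, Japanese and Korean scripts.
CJK_NAME_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters whose Unicode name marks them as CJK."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(unicodedata.name(c, "").startswith(CJK_NAME_PREFIXES) for c in chars)
    return cjk / len(chars)


def looks_mangled(before: str, after: str, threshold: float = 0.2) -> bool:
    """True if `after` gained a large share of CJK characters relative to `before`."""
    return cjk_ratio(after) - cjk_ratio(before) > threshold
```

This catches only the CJK substitution; the extra spaces between letters described above would need a separate comparison.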

@jbarlow83 jbarlow83 added third party issue Problem with a third party dependency and removed triage Issue needs triage labels Jun 21, 2024
@jbarlow83
Collaborator

ocrmypdf --force-ocr would also fix this file completely, with or without Ghostscript.

I am considering adding a warning about Ghostscript font substitution, especially if someone else encounters this. Ghostscript has had several issues with mangling text recently.

@beshtim

beshtim commented Jun 25, 2024

I think I have the same problem.
[Screenshot 2024-06-25 121753]
[Screenshot 2024-06-25 122040]

I am also running with --redo-ocr, and these hieroglyphs sometimes appear.
