Why does Tesseract fail with "Empty page" with this image?

I have the following screenshot:
I want to extract the manuscript word count, 3.574 in this case, from that image (see red rectangle below).
To do this, I run the following script:
magick screenshot.png -crop 33x20+2+83 screenshot-cropped.png
tesseract screenshot-cropped.png screenshot-ocred -l eng
The first line cuts out the place with the word count and saves it in screenshot-cropped.png which looks like this:
tesseract screenshot-cropped.png screenshot-ocred -l eng is supposed to recognize the characters and save them as text in screenshot-ocred.txt.
However, it produces the following error:
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>ocr.bat
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>magick screenshot.png -crop 33x20+2+83 screenshot-cropped.png
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>tesseract screenshot-cropped.png screenshot-ocred -l eng
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Empty page!!
Empty page!!
How can I fix it, i.e. make Tesseract recognize 3.574 and save it in screenshot-ocred.txt?
Note: All of this runs on Windows. Here is the output of magick --version:
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>magick --version
Version: ImageMagick 7.0.10-7 Q16 x64 2020-04-20 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2018 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Visual C++: 180040629
Features: Cipher DPC Modules OpenCL OpenMP(2.0)
Delegates (built-in): bzlib cairo flif freetype gslib heic jng jp2 jpeg lcms lqr lzma openexr pangocairo png ps raw rsvg tiff webp xml zlib

Adding --psm 7 to the Tesseract call solved the problem (tesseract screenshot-cropped.png screenshot-ocred -l eng --psm 7).
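Page segmentation mode 7 tells Tesseract to treat the image as a single text line, which suits a tiny 33x20 crop better than the default automatic page layout analysis. The same call can be scripted; the following is a minimal sketch, assuming the tesseract executable is on PATH. The helper names tesseract_command and run_ocr are hypothetical and not part of any Tesseract API.

```python
import subprocess


def tesseract_command(image, out_base, lang="eng", psm=7):
    """Build the argument list for a single-line OCR run (hypothetical helper)."""
    # Flag and value are separate list items, mirroring "-l eng --psm 7".
    return ["tesseract", image, out_base, "-l", lang, "--psm", str(psm)]


def run_ocr(image, out_base):
    """Run Tesseract; raises CalledProcessError if it exits with an error."""
    # Assumption: the tesseract executable can be found on PATH.
    subprocess.run(tesseract_command(image, out_base), check=True)
```

Calling run_ocr("screenshot-cropped.png", "screenshot-ocred") would then write the result to screenshot-ocred.txt, as in the batch file.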


Why are the binaries generated by using Meson / Ninja much larger than those compiled by plain valac?

Both builds use the same source file.
First, compiling directly with valac:
⭕ valac --pkg gtk+-3.0 -X -lm --pkg libcanberra src/Application.vala
⭕ ls Application
-rwxrwxr-x 1 eexpss 48K 05-13 19:59 Application
Here is part of my meson.build:
project('com.github.eexpress.cairo-timer', 'vala', 'c')
# i18n = import('i18n')
executable(
meson.project_name(),
'src/Application.vala',
dependencies: [
dependency('gtk+-3.0'),
# dependency('cairo'),
dependency('libcanberra')
],
# link_args : '-X',
# link_args : '-lm',
link_args : ['-X', '-lm',],
install: true
)
Then I use ninja to compile it:
⭕ cd build; ninja
⭕ ls com.github.eexpress.cairo-timer
-rwxrwxr-x 1 eexpss 98K 05-13 17:02 com.github.eexpress.cairo-timer
So the binary built with Meson is much larger than the one built with valac. Why?
Because you didn't enable debug information for valac, whereas Meson enables it by default. Add -g to the valac command and the output sizes should be nearly equal.
To see how ninja and valac invoke the underlying tools, pass -v to both commands for verbose output.
The remaining small size difference comes, I assume, from the file names embedded in the debug information. Compare the outputs of, for example, readelf --debug-dump=line hello to see the difference.

Tesseract does not recognize german "für"

I use Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.
I pass the option -l=deu to tell Tesseract that the text is in "deutsch" (German).
Still, the result for the German word "für" is poor, even though the word is very common (it means "for" in English).
Tesseract often detects "fiir" or "fur".
What can I do to improve this?
Reproducible example:
docker run --name self.container_name --rm \
--volume $PWD:/pwd \
tesseractshadow/tesseract4re \
tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu
Result:
cat die-fuer-das.png.ocr-result.txt
die fur das
Image die_fuer_das.png:
I found the solution. The option must be written as -l deu; otherwise the German language data is not used. I had accidentally written -l=deu.
Works:
===> tesseract die-fuer-das.png out -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das
Wrong language:
===> tesseract die-fuer-das.png out -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das
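When the command line is assembled by a script, one way to guard against this mistake is to split any "-l=deu" style argument into a separate flag and value before calling Tesseract. The following is a minimal sketch; split_flag and normalize_args are hypothetical helper names, and the sketch assumes that only single-dash short options need this treatment.

```python
def split_flag(arg):
    """Split a '-l=deu' style argument into ['-l', 'deu']; leave others alone."""
    # Only single-dash short options are touched (an assumption of this sketch).
    if arg.startswith("-") and not arg.startswith("--") and "=" in arg:
        flag, value = arg.split("=", 1)
        return [flag, value]
    return [arg]


def normalize_args(args):
    """Apply split_flag to every argument and flatten the result."""
    normalized = []
    for arg in args:
        normalized.extend(split_flag(arg))
    return normalized
```

For example, normalize_args(["tesseract", "die-fuer-das.png", "out", "-l=deu"]) yields the working form with "-l" and "deu" as separate arguments.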

Pandoc filter tikz.py returns "not a valid json value"

I'm using the tikz.py pandocfilter to turn latex tikz code in a markdown file into images in the output html file. I'm running pandoc version 1.17.0.2, python 2.7.10 and ImageMagick 6.9.6-4. My pandoc command is:
pandoc -s -c --mathjax -i -t slidy tik.md --filter tikz.py -o tik.html
(but I've tried simpler commands without slidy or mathjax and they give the same issue)
Where tik.md contains a simple tikz environment:
\begin{tikzpicture}
\draw (0,0) -- (4,0) -- (4,4) -- (0,4) -- (0,0);
\end{tikzpicture}
tikz.py runs well and it seems to successfully generate the image:
$ pandoc -s -c --mathjax -i -t slidy tik.md --filter tikz.py -o tik.html
This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015)
(preloaded format=pdflatex) restricted \write18 enabled.
entering extended mode
(./tikz.tex
LaTeX2e <2015/01/01>
Babel ...
.....
[1] (./tikz.aux) )
Output written on tikz.pdf (1 page, 1077 bytes).
Transcript written on tikz.log.
Created image tikz-images/53200b26dfa2c05d2b92647ef74211f7a2ce0c0e.png
pandoc: Error in $: Failed reading: not a valid json value
I am using an unaltered tikz.py so it is not clear to me where the source of the problem could be. Any thoughts?
I just ran into the same thing in a Ruby filter, and finally figured out that it was because I was trying to write debugging output to standard output. Since a filter has to read standard input and write to standard output, any debugging output on standard output will blow things up.
In my Ruby filter I had to change puts 'debug stuff' to STDERR.puts 'debug stuff'.
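The same rule applies to a filter written in Python. The following is a minimal sketch of a pass-through filter, not the actual tikz.py: it reads the JSON document, sends its diagnostics to standard error, and writes only JSON to standard output. The function name run_filter is hypothetical.

```python
import json
import sys


def run_filter(stdin, stdout, stderr):
    """Pass a pandoc JSON document through unchanged, logging to stderr."""
    doc = json.load(stdin)
    # Diagnostics go to stderr: pandoc parses everything on stdout as JSON,
    # so a stray debug line there triggers "not a valid json value".
    print("debug: filter received a document", file=stderr)
    json.dump(doc, stdout)
```

A real filter script would end with run_filter(sys.stdin, sys.stdout, sys.stderr) and transform doc before writing it back.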

Tweak tesseract for better detection of URLs in image

I have an image that I can't get tesseract to recognise as text. All my input text will be URLs.
As you can see, the image is as clear as it can be.
When running tesseract test2.png stdout it returns http:II11111111111111111111111111111111111
1111111111111111111.coml
Which is close, but not correct.
When setting the tessedit_char_whitelist parameter to htp:/1.com it recognises the string correctly (but I want more general recognition of URLs as well).
Passing in a pattern file that looks like below using command line tesseract test2.png stdout --user-patterns ./patterns.txt
\n\*://\n\*
http://\n\*
\n\*.com
doesn't help with recognition. It still prefers I over /. (More details about the pattern file.)
I have also tried setting the parameter ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for trailing /), but it doesn't seem to work. Specifying --user-words doesn't help either (with the "words" being http://).
Is there a way of specifying char priority for tesseract?
Version info:
$ tesseract -v
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
You can achieve this by adding the following line to your unicharambigs file:
3 : I I 3 : / / 1
Extract the unicharambigs file with combine_tessdata -e eng.traineddata eng.unicharambigs.
Edit the unicharambigs file, e.g. with nano eng.unicharambigs (make sure to use tabs after both 3s and the second /).
Overwrite the unicharambigs file in the traineddata file with the edited version: combine_tessdata -o eng.traineddata eng.unicharambigs.
Output using the amended traineddata file:
$ tesseract test2.png stdout
http://11111111111111111111111111111111111
1111111111111111111.coml
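Because the tabs in the unicharambigs entry are easy to lose in an editor, the line can also be generated. The following is a minimal sketch, assuming every field is tab-separated (source count, source characters, target count, target characters, type flag), as in Tesseract's documented unicharambigs layout. Here ambig_line is a hypothetical helper, and the trailing 1 is taken to mark the replacement as mandatory.

```python
def ambig_line(source, target, mandatory=True):
    """Format one unicharambigs entry with tab-separated fields."""
    fields = [
        str(len(source)), " ".join(source),   # e.g. "3" and ": I I"
        str(len(target)), " ".join(target),   # e.g. "3" and ": / /"
        "1" if mandatory else "0",            # assumed: 1 = mandatory, 0 = optional
    ]
    return "\t".join(fields)


# The entry from the answer: replace ": I I" with ": / /".
line = ambig_line([":", "I", "I"], [":", "/", "/"])
```

The resulting string can be appended to the extracted eng.unicharambigs before repacking it with combine_tessdata -o.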

Thales Payshield command "JS"

Does anyone know the format of the "JS" command (ARQC Verification and/or ARPC Generation - CUP) in license HSM9-LIC031?
The "JS" command is one of the China Union Pay commands.
I am using the HSM, but I don't know the command format.
The command looks like Command Code 2A JS.
You can read more about the JS command in the Thales HSM documentation:
payShield 9000 v2.3b
Host Command Reference Manual Addendum for Optional License LIC031 (CUP Commands)
pages 4 and 5
That section also describes the full format of the JS command.