Tesseract does not recognize German "für"

I use Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.
I pass the option -l=deu to tell Tesseract that the text is in German ("Deutsch").
Still, the result for the German word "für" is poor, even though the word is very common (it means "for" in English).
Tesseract often detects "fiir" or "fur" instead.
What can I do to improve this?
Reproducible example:
docker run --name self.container_name --rm \
--volume $PWD:/pwd \
tesseractshadow/tesseract4re \
tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu
Result:
cat die-fuer-das.png.ocr-result.txt
die fur das
Image die_fuer_das.png:

I found the solution: the option must be written -l deu (with a space); otherwise the German language data does not get used. I had accidentally written -l=deu.
Works:
===> tesseract die-fuer-das.png out -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das
Wrong language:
===> tesseract die-fuer-das.png out -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das
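
For callers that assemble the command line in a script, the same fix can be sketched in Python. The helper names normalize_lang_flag, build_tesseract_command and run_ocr are hypothetical and not part of Tesseract; the only Tesseract behavior assumed is what the output above shows (-l deu selects German, -l=deu does not).

```python
import subprocess


def normalize_lang_flag(args):
    """Rewrite a mistaken '-l=LANG' argument as the two tokens '-l', 'LANG'."""
    fixed = []
    for arg in args:
        if arg.startswith("-l=") and len(arg) > 3:
            # Split the fused form into flag and value.
            fixed.extend(["-l", arg[3:]])
        else:
            fixed.append(arg)
    return fixed


def build_tesseract_command(image, out_base, lang="deu"):
    """Return the argument list for 'tesseract IMAGE OUTBASE -l LANG'."""
    return ["tesseract", image, out_base, "-l", lang]


def run_ocr(image, out_base, lang="deu"):
    """Run Tesseract; needs the binary and the language data installed."""
    subprocess.run(build_tesseract_command(image, out_base, lang), check=True)
```

Calling run_ocr("die-fuer-das.png", "out") would then write out.txt, as in the working call above.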

Related

iOS simulator does not change language after editing .GlobalPreferences.plist in GitHub Actions

I'm creating a CI flow that uses Appium and the iOS simulator on macos-latest. My app changes its language to match the simulator's language. I read that editing the .GlobalPreferences.plist file and then booting the simulator should switch it to Japanese, but the simulator still uses the default language (en).
Node.js: 16
Java: 11
Appium: 1.22.3
macOS: latest
iOS runtime: 12.4
Device: iPhone X (simulator)
xcrun simctl create TestiPhone com.apple.CoreSimulator.SimDeviceType.iPhone-X com.apple.CoreSimulator.SimRuntime.iOS-12-4 > deviceid.txt
DEVICEUUID=`cat deviceid.txt`
echo $DEVICEUUID
plutil -p ~/Library/Developer/CoreSimulator/Devices/$DEVICEUUID/data/Library/Preferences/.GlobalPreferences.plist
plutil -replace AppleLocale -string "ja_US" ~/Library/Developer/CoreSimulator/Devices/$DEVICEUUID/data/Library/Preferences/.GlobalPreferences.plist
plutil -replace AppleLanguages -json "[ \"ja\" ]" ~/Library/Developer/CoreSimulator/Devices/$DEVICEUUID/data/Library/Preferences/.GlobalPreferences.plist
echo "Verify locale and language ~ JP"
plutil -p ~/Library/Developer/CoreSimulator/Devices/$DEVICEUUID/data/Library/Preferences/.GlobalPreferences.plist
xcrun simctl boot $DEVICEUUID
xcrun simctl bootstatus $DEVICEUUID
xcrun simctl install booted /Users/runner/work/appiumclonetest/appiumclonetest/BuildFiles/mobile.app
When I use iOS 15.0, the .GlobalPreferences.plist file does not exist in ~/Library/Developer/CoreSimulator/Devices/$DEVICEUUID/data/Library/Preferences. Where can I find it?
Can I change the simulator language by editing the .GlobalPreferences.plist file, or do I need to change something else to make it work? I also searched for similar discussions, but had no luck.
Thanks

Why does Tesseract fail with "Empty page" with this image?

I have the following screenshot:
I want to extract the manuscript word count, 3.574 in this case, from that image (see red rectangle below).
To do this, I run the following script:
magick screenshot.png -crop 33x20+2+83 screenshot-cropped.png
tesseract screenshot-cropped.png screenshot-ocred -l eng
The first line cuts out the place with the word count and saves it in screenshot-cropped.png which looks like this:
tesseract screenshot-cropped.png screenshot-ocred -l eng is supposed to recognize the characters and save them as text in screenshot-ocred.txt.
However, it produces the following error:
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>ocr.bat
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>magick screenshot.png -crop 33x20+2+83 screenshot-cropped.png
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>tesseract screenshot-cropped.png screenshot-ocred -l eng
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Empty page!!
Empty page!!
How can I fix it, i.e. make Tesseract recognize 3.574 and save it in screenshot-ocred.txt?
Note: All of this runs on Windows. Here is the output of magick --version:
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>magick --version
Version: ImageMagick 7.0.10-7 Q16 x64 2020-04-20 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2018 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Visual C++: 180040629
Features: Cipher DPC Modules OpenCL OpenMP(2.0)
Delegates (built-in): bzlib cairo flif freetype gslib heic jng jp2 jpeg lcms lqr lzma openexr pangocairo png ps raw rsvg tiff webp xml zlib

Adding --psm 7 to the Tesseract call solved the problem (tesseract screenshot-cropped.png screenshot-ocred -l eng --psm 7).
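
For context, --psm 7 sets Tesseract's page segmentation mode to "treat the image as a single text line", which suits a small crop like this one. A minimal Python sketch of the two-step script follows; crop_command, ocr_line_command and run_pipeline are hypothetical helper names, and the geometry and file names are the ones from the question.

```python
import subprocess


def crop_command(src, dst, width, height, x, y):
    """ImageMagick crop: 'magick SRC -crop WxH+X+Y DST'."""
    geometry = "{}x{}+{}+{}".format(width, height, x, y)
    return ["magick", src, "-crop", geometry, dst]


def ocr_line_command(image, out_base, lang="eng"):
    """Tesseract call with page segmentation mode 7 (single text line)."""
    return ["tesseract", image, out_base, "-l", lang, "--psm", "7"]


def run_pipeline():
    """Crop the word-count area, then OCR it; needs magick and tesseract."""
    subprocess.run(
        crop_command("screenshot.png", "screenshot-cropped.png", 33, 20, 2, 83),
        check=True,
    )
    subprocess.run(
        ocr_line_command("screenshot-cropped.png", "screenshot-ocred"),
        check=True,
    )
```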

Pandoc filter tikz.py returns "not a valid json value"

I'm using the tikz.py pandocfilter to turn LaTeX TikZ code in a Markdown file into images in the output HTML file. I'm running pandoc version 1.17.0.2, Python 2.7.10 and ImageMagick 6.9.6-4. My pandoc command is:
pandoc -s -c --mathjax -i -t slidy tik.md --filter tikz.py -o tik.html
(but I've tried simpler commands without slidy or mathjax and they give the same issue)
Where tik.md contains a simple tikz environment:
\begin{tikzpicture}
\draw (0,0) -- (4,0) -- (4,4) -- (0,4) -- (0,0);
\end{tikzpicture}
tikz.py runs well and it seems to successfully generate the image:
$ pandoc -s -c --mathjax -i -t slidy tik.md --filter tikz.py -o tik.html
This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015)
(preloaded format=pdflatex) restricted \write18 enabled.
entering extended mode
(./tikz.tex
LaTeX2e <2015/01/01>
Babel ...
.....
[1] (./tikz.aux) )
Output written on tikz.pdf (1 page, 1077 bytes).
Transcript written on tikz.log.
Created image tikz-images/53200b26dfa2c05d2b92647ef74211f7a2ce0c0e.png
pandoc: Error in $: Failed reading: not a valid json value
I am using an unaltered tikz.py so it is not clear to me where the source of the problem could be. Any thoughts?

I just ran into the same thing in a Ruby filter, and finally figured out that it was because I was trying to write debugging output to standard output. Since a filter has to read standard input and write to standard output, any debugging output on standard output will blow things up.
In my Ruby filter I had to change puts 'debug stuff' to STDERR.puts 'debug stuff'.
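
The same rule applies to a Python filter. Below is a minimal Python 3 sketch, not the actual tikz.py: the function names debug, run_filter and main are hypothetical, and the filter passes the document through unchanged. The point it illustrates is that diagnostics go to standard error while only JSON goes to standard output.

```python
import json
import sys


def debug(message, stream=None):
    """Write diagnostics to stderr so stdout carries nothing but JSON."""
    if stream is None:
        stream = sys.stderr
    print(message, file=stream)


def run_filter(source, sink, log=None):
    """Read a pandoc JSON document from source and write it to sink unchanged."""
    document = json.load(source)
    debug("filter: read {} top-level elements".format(len(document)), log)
    json.dump(document, sink)


def main():
    """Entry point when pandoc runs the script via --filter."""
    run_filter(sys.stdin, sys.stdout)
```

A real filter script would end by calling main(). A print("debug stuff") without file=sys.stderr anywhere in the filter would end up in the JSON stream and trigger the "not a valid json value" error.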

Tweak tesseract for better detection of URLs in image

I have an image that I can't get tesseract to recognise as text. All my input text will be URLs.
As you can see, the image is as clear as it can be.
When running tesseract test2.png stdout it returns http:II11111111111111111111111111111111111
1111111111111111111.coml
Which is close, but not correct.
When setting the tessedit_char_whitelist parameter to htp:/1.com it recognises the string correctly (but I want more general recognition of URLs as well).
Passing in a pattern file that looks like below using command line tesseract test2.png stdout --user-patterns ./patterns.txt
\n\*://\n\*
http://\n\*
\n\*.com
doesn't help with recognition. It still prefers I over /. (More details about the pattern file)
I have also tried setting the parameter ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for a trailing /), but it doesn't seem to work. Specifying --user-words doesn't help either (with the "words" being http://).
Is there a way of specifying char priority for tesseract?
Version info:
$ tesseract -v
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

You can achieve this by adding the following line to your unicharambigs file:
3 : I I 3 : / / 1
Extract the unicharambigs file with combine_tessdata -e eng.traineddata eng.unicharambigs
Edit the unicharambigs file, e.g. with nano eng.unicharambigs (make sure to use tabs after both 3s and the second /).
Overwrite the unicharambigs file in the traineddata file with the edited version combine_tessdata -o eng.traineddata eng.unicharambigs
Output using the amended traineddata file:
$ tesseract test2.png stdout
http://11111111111111111111111111111111111
1111111111111111111.coml

Launch modelsim from Libero TCL command

I'm working on a VHDL project in Microsemi Libero.
When I click "Simulate" in the Libero GUI, ModelSim starts up and I get to see the results of my simulation.
I'd like to get the same response from a Tcl command.
I can do "Execute Script..." and point Libero at a .tcl file containing the single line
run_tool -name {SIM_PRESYNTH}
...and this appears to work just fine (I get messages like "Starting Simulation...Simulation completed...The Execute Script succeeded"), except that no ModelSim window opens to show me my simulation results.
How do I get ModelSim to open at the end of a simulation using a Tcl command?
many thanks

Just a guess, 7 months late.
In Libero ISE, if I want Synplify to pop up in the IDE, I click "Project", then "Profiles", and set the synthesis tool not to run in batch mode.
Perhaps you can do the same for the simulator, or add a profile, which in Tcl would look like this:
add_profile \
-name {Synplify_b} \
-type {synthesis} \
-tool {Synplify} \
-location {somewhere} \
-args {-batch} \
-batch 1
select_profile -name {Synplify_b}
