Tesseract training achieves a worse result than without training - ocr

I used the Windows installer of tesseract-ocr 3.02.02 (I didn't find a newer one for 3.04). My image is a JPEG with a resolution of 600dpi (3507x4960) which is a scanned blank "certificate of incapacity for work". The OCR result without training is much more accurate than after training. So what am I doing wrong?
This is how I build my box file:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg %TESSLANG% -l deu batch.nochop makebox
Using jTessBoxEditor I fixed every box by hand. Then I started the training:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg %TESSLANG% -l deu nobatch box.train
unicharset_extractor %TESSLANG%.box
shapeclustering -F font_properties -U unicharset %TESSLANG%.tr
mftraining -F font_properties -U unicharset -O %LANG%.unicharset %TESSLANG%.tr
cntraining %TESSLANG%.tr
MOVE inttemp %LANG%.inttemp
MOVE normproto %LANG%.normproto
MOVE pffmtable %LANG%.pffmtable
MOVE shapetable %LANG%.shapetable
combine_tessdata %LANG%.
COPY %LANG%.traineddata %TESSERACT_HOME%\tessdata /Y
The OCR without training (which achieves the best results) is done like this:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg without_training -l deu
Using the traineddata:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg with_training -l %LANG%
Maybe I am wrong, but I expected a perfect result (I use the same JPEG for training and for OCR).
Here is the first part of without_training.txt:
Paul Albrechts Verlag, 22952 Lütjensee Bei verspäteter Vorlage droht Krankengeldverlust!
Krankenkasse bzw. Kostenträger
Name, Vorname des Versicherten
geb. am
Kassen—Nr. Versicherten—Nr. Status
Betriebsstätten-Nr. Arzt—Nr. Datum
And the first part of with_training.txt:
Pau/A/brechrs Ver/ag, 22952 Lüfjensee Be! verspäteter vor!age droht Krankenge!dver!ust!
Krankenkasse bzw. Kostenträger
Name, Vorname des Versicherten
geb. am
Kassen-Nr. Versicherten-Nr. status
Betriebsstätten-Nr. Arzt-Nr. Datum

In my case adding the language "deu" did the trick:
tesseract %TESSLANG%.jpg with_training -l %LANG%+deu
instead of
tesseract %TESSLANG%.jpg with_training -l %LANG%
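A minimal sketch of how the two invocations differ, assuming you build the command line from a script. The helper name build_ocr_command is hypothetical; the only Tesseract behaviour it relies on is that the -l option accepts several languages joined with "+".

```python
def build_ocr_command(image, output_base, languages):
    """Return the tesseract argument list for OCR with one or more languages."""
    if not languages:
        raise ValueError("at least one language is required")
    # Several languages are passed to -l as a single "+"-separated value.
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # File and language names are taken from the question above.
    command = build_ocr_command(
        "arbeitsunfaehigkeit.hausarzt.exp0.jpg",
        "with_training",
        ["arbeitsunfaehigkeit", "deu"],
    )
    print(" ".join(command))
```

The resulting list can be handed to subprocess.run once Tesseract and both traineddata files are installed.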

Related

Failed in generating Tesseract traineddata

I'm using Tesseract v5.0.1.20220118 on Windows 10 to train a font that only has the letters "P" and "Q".
When I get to the step
mftraining -F font_properties.txt -U unicharset -O normal.unicharset pq.normal.exp0.tr
The pffmtable file is not generated.
And when I run cntraining pq.normal.exp0.tr
it shows me:
Reading pq.normal.exp0.tr ...
Clustering ...
N == sizeof(Cluster->Mean):Error:Assert failed:in file ../../../src/classify/cluster.cpp, line 2526
Why does it go wrong? How can I fix it?
Only inttemp and shapetable are generated, but the tutorial says there should be four files: shapetable, inttemp, pffmtable and normproto. I wonder whether this is because the font only has the letters "P" and "Q", but I have no idea how to solve it.
Please read the docs:
https://tesseract-ocr.github.io/tessdoc/#training-for-tesseract-5
Use the right tools:
https://github.com/tesseract-ocr/tesstrain

Issue training tesseract-OCR 4 - Empty shape table

I am trying to train Tesseract 4 with particular pictures (to read multimeters with 7-segment displays).
Please note that I am aware of the already trained data from Arthur Augusto at https://github.com/arturaugusto/display_ocr but I need to train Tesseract on my own data.
To train Tesseract, I followed different tutorials (such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 or https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/)
but I always run into a problem when running the shapeclustering command with my own data.
(With example data such as https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972, everything works fine.)
When I run the shapeclustering command, I get this output (screenshot).
My shape table then ends up empty, so the training cannot be effective.
With the example data it works fine and the shape table is properly filled.
I am guessing that I have an issue with box file generation. Here is my process for creating the box file:
I use the
tesseract imageFileName.tif imageFileName batch.nochop makebox
command to generate the box file and then I edit it with jTessBoxEditor.
So I can't see what is wrong with my .box/.tif pair.
Have a good day and thanks for helping me,
Adrien
Here is my full batch script for training, run after generating and editing the box files.
set name=sev7.exp0
set shortName=sev7
echo Run Tesseract for Training..
tesseract.exe %name%.tif %name% nobatch box.train
echo Compute the Character Set..
unicharset_extractor.exe %name%.box
shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering..
cntraining.exe %name%.tr
echo Rename Files..
rename normproto %shortName%.normproto
rename inttemp %shortName%.inttemp
rename pffmtable %shortName%.pffmtable
rename shapetable %shortName%.shapetable
echo Create Tessdata..
combine_tessdata.exe %shortName%.
echo. & pause
OK, so I finally managed to train Tesseract.
The solution is to add a --psm parameter to the command
tesseract.exe %name%.tif %name% nobatch box.train
as
tesseract.exe %name%.%typeFile% %name% --psm %psm% nobatch box.train
Note that the possible psm values are:
REM pagesegmode values are:
REM 0 = Orientation and script detection (OSD) only.
REM 1 = Automatic page segmentation with OSD.
REM 2 = Automatic page segmentation, but no OSD, or OCR
REM 3 = Fully automatic page segmentation, but no OSD. (Default)
REM 4 = Assume a single column of text of variable sizes.
REM 5 = Assume a single uniform block of vertically aligned text.
REM 6 = Assume a single uniform block of text.
REM 7 = Treat the image as a single text line.
REM 8 = Treat the image as a single word.
REM 9 = Treat the image as a single word in a circle.
REM 10 = Treat the image as a single character.
REM 11 = Sparse text. Find as much text as possible in no particular order.
REM 12 = Sparse text with OSD.
REM 13 = Raw line. Treat the image as a single text line bypassing hacks that are Tesseract-specific.
Found at https://github.com/tesseract-ocr/tesseract/issues/434
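As a sketch of the fix above, the box.train command can be assembled with the page segmentation mode validated against the range listed (0 to 13). The function name build_box_train_command is hypothetical, and the value 7 in the example is only a placeholder: the post does not say which mode worked for the seven-segment images.

```python
VALID_PSM = range(0, 14)  # modes 0..13, as listed above


def build_box_train_command(image, output_base, psm):
    """Return the tesseract argument list for the box.train step with an explicit --psm."""
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return ["tesseract", image, output_base, "--psm", str(psm), "nobatch", "box.train"]


if __name__ == "__main__":
    # psm 7 (single text line) is an assumed example value, not the author's setting.
    print(" ".join(build_box_train_command("sev7.exp0.tif", "sev7.exp0", 7)))
```

Trying a few modes and checking whether the shape table is filled after each run is probably the quickest way to find a suitable value for your own images.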

How to compress an ICO file with ImageMagick or another tool

I created a 32x32 PNG in Photoshop and exported it at 500 bytes.
I converted it to ICO using
magick convert .\favicon.png favicon.ico
and it became 5 KB.
Is there a compression flag in ImageMagick, or another way to compress favicon.ico?
I just tried @dan-mašek's suggestion and it definitely works better than ImageMagick.
With the PNGs I was working with, ImageMagick gave me a 5.4K .ico file despite asking for it to use .ico's support for embedded PNGs for the compression (apparently it ignores you for sizes under 256x256) while Pillow got it down to 1.8K.
Here's how I went about crunching down my favicons based on an existing PNG-optimizing shell script I cooked up years ago:
#!/bin/sh
optimize_png() {
    # "$@" expands to all file arguments ("$#" would only be their count)
    for X in "$@"; do
        echo "---- Using pngcrush to strip irrelevant chunks from $X ----"
        # Because I don't know if OptiPNG considers them all "metadata"
        pngcrush -ow -q -rem alla -rem cHRM -rem gAMA -rem iCCP -rem sRGB \
            -rem time "$X" | egrep -v '^[ \|]\|'
    done
    echo "---- Using OptiPNG to optimize delta filters ----"
    # ...and strip all "metadata"
    optipng -clobber -o7 -zm1-9 -strip all -- "$@" 2>&1 | grep -v "IDAT size ="
    echo "---- Using AdvanceCOMP to zopfli-optimize DEFLATE ----"
    advpng -z4 "$@"
}
optimize_png 16.png 32.png
python3 << EOF
from PIL import Image
i16 = Image.open('16.png')
i32 = Image.open('32.png')
i32.save('src/favicon.ico', sizes=[(16, 16), (32, 32)], append_images=[i16])
EOF
Just be aware that:
pngcrush and advpng don't take -- as arguments, so you have to prefix ./ onto relative paths which might start with -.
.save in PIL must be called on the largest image so, if you have a dynamic list of images, you probably want something like this:
images.sort(key=lambda x: x.size)
images[-1].save('favicon.ico', sizes=[x.size for x in images], append_images=images)
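To check whether a tool really stored the icons as embedded PNGs, the ICO directory can be read directly. This is a small standard-library sketch, not part of Pillow or ImageMagick; the function name list_ico_entries is hypothetical. It relies only on the ICO layout: a 6-byte header followed by one 16-byte directory entry per image.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def list_ico_entries(data):
    """Return (width, height, byte_size, is_png) for each image stored in ICO bytes."""
    reserved, kind, count = struct.unpack_from("<HHH", data, 0)
    if reserved != 0 or kind != 1:
        raise ValueError("not an ICO file")
    entries = []
    for index in range(count):
        # Directory entries are 16 bytes each and start right after the header.
        width, height, _colors, _reserved, _planes, _bpp, size, offset = (
            struct.unpack_from("<BBBBHHII", data, 6 + 16 * index)
        )
        is_png = data[offset:offset + 8] == PNG_SIGNATURE
        # A stored width or height of 0 means 256 pixels.
        entries.append((width or 256, height or 256, size, is_png))
    return entries


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, list_ico_entries(handle.read()))
```

Running it on the ImageMagick and Pillow outputs side by side should show which entries are PNG-compressed and how many bytes each one takes.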

How to Create Traineddata file For Tesseract 4.1.0

I want to recognise the characters on number plates.
How do I train tesseract-ocr for number plates on Ubuntu 16.04?
Since I am not familiar with training, please help me create a 'traineddata' file for recognizing number plates.
I have 1000 images of number plates.
Please look into it.
Any help would be appreciated.
So I have tried the following commands
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
tesseract eng.arial.plate3655.png eng.arial.plate3655 batch.nochop makebox
But it gives an error:
Tesseract Open Source OCR Engine v4.1.0-rc1-56-g7fbd with Leptonica
Error, cannot read input file eng.arial.plate3655.png: No such file or directory
Error during processing.
After that I tried
tesseract plate4.png eng.arial.plate4 batch.nochop makebox
It works, but only for some plates.
Now in step 2 I am getting an error.
Screenshots are attached:
Plate 4 image for training
Step 1 and step 2 displayed in the terminal
Files generated after step 1 and step 2
Content of the files generated after step 1 and step 2
Creating .traineddata for Tesseract 4
{*Note: After installing Tesseract, open cmd and do the following.}
Step 1:
Make box files for images that we want to train
Syntax:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
Eg:
tesseract own.arial.exp0.jpg own.arial.exp0 batch.nochop makebox
{*Note: After making the box files, we have to correct the wrongly identified characters in them.}
Step 2:
Create the .tr file (combining the image file and the box file)
Syntax:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] box.train
Eg:
tesseract own.arial.exp0.jpg own.arial.exp0 box.train
Step 3:
Extract the charset from the box files (Output for this command is unicharset file)
Syntax:
unicharset_extractor [langname].[fontname].[expN].box
Eg:
unicharset_extractor own.arial.exp0.box
Step 4:
Create a font_properties file based on our needs.
Syntax:
echo "[fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]" > font_properties
Eg:
echo "arial 0 0 1 0 0" > font_properties
Step 5:
Training the data.
Syntax:
mftraining -F font_properties -U unicharset -O [langname].unicharset [langname].[fontname].[expN].tr
Eg:
mftraining -F font_properties -U unicharset -O own.unicharset own.arial.exp0.tr
Step 6:
Generate the normproto file (character normalization prototypes).
Syntax:
cntraining [langname].[fontname].[expN].tr
Eg:
cntraining own.arial.exp0.tr
{*Note: After steps 5 and 6, four files have been created (shapetable, inttemp, pffmtable, normproto).}
Step 7:
Rename the four files (shapetable, inttemp, pffmtable, normproto) to ([langname].shapetable, [langname].inttemp, [langname].pffmtable, [langname].normproto)
Syntax:
rename filename1 filename2
Eg:
rename shapetable own.shapetable
rename inttemp own.inttemp
rename pffmtable own.pffmtable
rename normproto own.normproto
Step 8:
Create .traineddata file
Syntax:
combine_tessdata [langname].
Eg:
combine_tessdata own.
{*Note: I used only one image (exp0) to create the traineddata. If you want to train on more than one image, add further ones as exp1, exp2, ..., expN.}
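The steps above can be sketched as a function that only builds the command lines, so the naming convention is easy to check before anything is run. The function name legacy_training_commands is hypothetical. It assumes the box file is corrected by hand between the first two commands and that font_properties (step 4) already exists.

```python
def legacy_training_commands(lang, font, exp, image_ext):
    """Return (commands, renames) for the box-file training steps described above."""
    base = f"{lang}.{font}.exp{exp}"
    image = f"{base}.{image_ext}"
    tr_file = f"{base}.tr"
    commands = [
        ["tesseract", image, base, "batch.nochop", "makebox"],  # step 1
        ["tesseract", image, base, "box.train"],                # step 2
        ["unicharset_extractor", f"{base}.box"],                # step 3
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", tr_file],                  # step 5
        ["cntraining", tr_file],                                # step 6
        ["combine_tessdata", f"{lang}."],                       # step 8
    ]
    # Step 7: these renames must happen before combine_tessdata is run.
    renames = [(name, f"{lang}.{name}")
               for name in ("shapetable", "inttemp", "pffmtable", "normproto")]
    return commands, renames


if __name__ == "__main__":
    commands, renames = legacy_training_commands("own", "arial", 0, "jpg")
    for command in commands[:-1]:
        print(" ".join(command))
    for old, new in renames:
        print(f"rename {old} {new}")
    print(" ".join(commands[-1]))
```

Printing the plan first makes it easier to spot a mismatch between the image name and the [langname].[fontname].[expN] pattern, which is what caused the "cannot read input file" error in the question.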

Can aspell output line number and not offset in pipe mode?

Can aspell output the line number instead of the offset in pipe mode for HTML and XML files? I can't feed it the file line by line, because then aspell can't identify a closing tag that sits on the next line.
This will output all occurrences of misspelt words with line numbers:
# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |
# Process the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done
Where:
my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt is the output of aspell in pipe mode (ispell style)
result.txt is a final results file
aspell.ignore.txt example:
personal_ws-1.1 en 500
foo
bar
example results.txt output (for an en_GB dictionary):
238:color
302:writeable
355:backends
433:dataonly
You can also print the whole line by changing the last grep -on into grep -n.
This is just an idea, I haven't really tried it yet (I'm on a windows machine :(). But maybe you could pipe the html file through head (with byte limit) and count newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.
cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"
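The same idea as a short sketch: count the newlines that occur before a byte offset. It assumes the offset is measured from the start of the whole file, which is what the pipeline above needs as well; the function name line_number_at_offset is hypothetical.

```python
def line_number_at_offset(data, offset):
    """Return the 1-based line number of the byte at `offset` in `data` (bytes)."""
    if not 0 <= offset <= len(data):
        raise ValueError("offset is outside the data")
    # Every newline before the offset pushes the position one line further down.
    return data.count(b"\n", 0, offset) + 1


if __name__ == "__main__":
    sample = b"first line\nsecond line\nthird line\n"
    print(line_number_at_offset(sample, sample.index(b"third")))
```

For a real file, read it in binary mode so that the offsets are byte offsets rather than character offsets.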
I use the following script to perform spell-checking and to work-around the awkward output of aspell -a / ispell. At the same time, the script also works around the problem that ordinals like 2nd aren't recognized by aspell by simply ignoring everything that aspell reports which is not a word of its own.
#!/bin/bash
set +o pipefail
if [ -t 1 ] ; then
color="--color=always"
fi
! for file in "$@" ; do
<"$file" aspell pipe list -p ./dict --mode=html |
grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |
grep '[[:alpha:]]\+' -o |
while read word ; do
grep $color -n "\<$word\>" "$file"
done
done | grep .
You even get colored output if the stdout of the script is a terminal, and you get an exit status of 1 in case the script found spelling mistakes, otherwise the exit status of the script is 0.
Also, the script protects itself from pipefail, which is a somewhat popular option to set, e.g. in a Makefile, but which doesn't work for this script. Last but not least, this script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like German äöüÄÖÜß and others. [a-zA-Z] does so too, but that comes as something of a surprise.
aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).
Demonstration printing the line number with awk:
$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'
produces this output:
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4
with testFile.txt
iinternational
I say this reelly.
hello
here is sometypo.
(Still not as nice as hunspell -u (https://stackoverflow.com/a/10778071/4124767). But hunspell misses some command line options I like.)
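The blank-line counting from the awk demonstration can also be sketched in Python, which makes it easy to keep the word and its offset within the line together with the line number. The function name misspellings_with_lines is hypothetical. It handles the "&" lines (misspelling with suggestions) and "#" lines (misspelling without suggestions) of the ispell-style output and ignores everything else.

```python
def misspellings_with_lines(pipe_output_lines):
    """Return (line_number, word, offset_in_line) for each miss in aspell pipe output."""
    results = []
    line_number = 1
    for raw in pipe_output_lines:
        line = raw.rstrip("\n")
        if line == "":
            # One empty line is printed after the results of each input line.
            line_number += 1
        elif line.startswith("& "):
            # Format: "& word count offset: suggestion, suggestion, ..."
            head = line.split(":", 1)[0].split()
            results.append((line_number, head[1], int(head[3])))
        elif line.startswith("# "):
            # Format: "# word offset" (no suggestions available)
            fields = line.split()
            results.append((line_number, fields[1], int(fields[2])))
    return results


if __name__ == "__main__":
    import sys

    # Usage: aspell pipe < testFile.txt | python3 this_script.py
    for number, word, offset in misspellings_with_lines(sys.stdin):
        print(f"{number}:{offset}:{word}")
```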
For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.
ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"
for file in "$@"; do
for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
done | sort -n
done
This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.
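The lookup half of that loop, finding the lines on which each reported word occurs as a whole word, can be sketched without calling aspell at all. The function name find_word_lines is hypothetical. Unlike grep -no, it reports each word once per line rather than once per match.

```python
import re


def find_word_lines(text, words):
    """Return (line_number, word) for every line of `text` containing `word` as a whole word."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for word in words:
            # \b plays the role of grep's \< and \> word boundaries.
            if re.search(rf"\b{re.escape(word)}\b", line):
                hits.append((number, word))
    return hits


if __name__ == "__main__":
    filtered = "helo world\n\nthis is reelly helo\n"
    for number, word in find_word_lines(filtered, ["helo", "reelly"]):
        print(f"{number}:{word}")
```

Because splitlines keeps empty lines as empty strings, the numbering stays aligned with the original file as long as the filtered text preserves its line breaks, which is the property the answer above relies on.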