How to create a traineddata file for Tesseract 4.1.0

I want to recognise the characters on number plates.
How can I train tesseract-ocr for number plates on Ubuntu 16.04?
I am not familiar with training, so please help me create a 'traineddata' file for recognizing number plates.
I have 1000 images of number plates.
Please look into it.
Any help would be appreciated.
So far I have tried the following commands:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
tesseract eng.arial.plate3655.png eng.arial.plate3655 batch.nochop makebox
But it gives an error:
Tesseract Open Source OCR Engine v4.1.0-rc1-56-g7fbd with Leptonica
Error, cannot read input file eng.arial.plate3655.png: No such file or directory
Error during processing.
After that I tried:
tesseract plate4.png eng.arial.plate4 batch.nochop makebox
This works, but only for some plates.
Now I am getting an error in Step 2.
A screenshot is attached.
Plate 4 image used for training
Step 1 and Step 2 output in the terminal
Files generated after Step 1 and Step 2
Content of the files generated after Step 1 and Step 2

Creating .traineddata for Tesseract 4
{*Note: After installing Tesseract, open a command prompt and do the following.}
Step 1:
Make box files for images that we want to train
Syntax:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
Eg:
tesseract own.arial.exp0.jpg own.arial.exp0 batch.nochop makebox
{*Note: After making the box files, correct any wrongly identified characters in them.}
Step 2:
Create the .tr file (combining the image file and the box file)
Syntax:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] box.train
Eg:
tesseract own.arial.exp0.jpg own.arial.exp0 box.train
Step 3:
Extract the charset from the box files (the output of this command is the unicharset file)
Syntax:
unicharset_extractor [langname].[fontname].[expN].box
Eg:
unicharset_extractor own.arial.exp0.box
Step 4:
Create a font_properties file based on our needs.
Syntax:
echo "[fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]" > font_properties
Eg:
echo "arial 0 0 1 0 0" > font_properties
Step 5:
Training the data.
Syntax:
mftraining -F font_properties -U unicharset -O [langname].unicharset [langname].[fontname].[expN].tr
Eg:
mftraining -F font_properties -U unicharset -O own.unicharset own.arial.exp0.tr
Step 6:
Syntax:
cntraining [langname].[fontname].[expN].tr
Eg:
cntraining own.arial.exp0.tr
{*Note: After Step 5 and Step 6, four files have been created (shapetable, inttemp, pffmtable, normproto).}
Step 7:
Rename the four files (shapetable, inttemp, pffmtable, normproto) to ([langname].shapetable, [langname].inttemp, [langname].pffmtable, [langname].normproto)
Syntax:
rename filename1 filename2
(rename is the Windows cmd command; on Linux use mv filename1 filename2 instead.)
Eg:
rename shapetable own.shapetable
rename inttemp own.inttemp
rename pffmtable own.pffmtable
rename normproto own.normproto
Step 8:
Create .traineddata file
Syntax:
combine_tessdata [langname].
Eg:
combine_tessdata own.
{*Note: I used only one image (exp0) to create the traineddata. If you want to train with more than one image, you can add them as exp1, exp2, ..., expN.}
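The command-line steps above can be collected in one place. The following is a minimal Python sketch, not part of the original answer: it only builds the command lines as argument lists and does not run anything. The helper names training_base, font_properties_line and legacy_training_commands are made up for illustration. Step 4 (writing font_properties) and Step 7 (renaming the four output files) are file operations rather than commands, so they are not in the returned list.

```python
import shlex
from typing import List


def training_base(lang: str, font: str, exp: int) -> str:
    """Build the [langname].[fontname].[expN] base name, e.g. own.arial.exp0."""
    return f"{lang}.{font}.exp{exp}"


def font_properties_line(font: str, italic: int = 0, bold: int = 0,
                         monospace: int = 0, serif: int = 0,
                         fraktur: int = 0) -> str:
    """Build one font_properties line (Step 4): font name plus five 0/1 flags."""
    flags = (italic, bold, monospace, serif, fraktur)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("every font flag must be 0 or 1")
    return " ".join([font, *map(str, flags)])


def legacy_training_commands(lang: str, font: str, exp: int,
                             ext: str) -> List[List[str]]:
    """Return the commands of Steps 1, 2, 3, 5, 6 and 8 as argument lists."""
    base = training_base(lang, font, exp)
    image = f"{base}.{ext}"
    tr_file = f"{base}.tr"
    return [
        ["tesseract", image, base, "batch.nochop", "makebox"],   # Step 1
        ["tesseract", image, base, "box.train"],                 # Step 2
        ["unicharset_extractor", f"{base}.box"],                 # Step 3
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", tr_file],                   # Step 5
        ["cntraining", tr_file],                                 # Step 6
        ["combine_tessdata", f"{lang}."],                        # Step 8
    ]


if __name__ == "__main__":
    print(font_properties_line("arial", monospace=1))
    for command in legacy_training_commands("own", "arial", 0, "jpg"):
        print(shlex.join(command))
```

Each list could be passed to subprocess.run(command, check=True) from the directory that holds the image. Note that the image must already exist under the [langname].[fontname].[expN] name; the "cannot read input file" error in the question appears when no file with that name is present.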

Related

Failed in generating Tesseract traineddata

I'm using Tesseract v5.0.1.20220118 on Windows 10 to train a font that only has the letters "P" and "Q".
When I get to the step
mftraining -F font_properties.txt -U unicharset -O normal.unicharset pq.normal.exp0.tr
the pffmtable file is not generated.
And when I run cntraining pq.normal.exp0.tr, it shows:
Reading pq.normal.exp0.tr ...
Clustering ...
N == sizeof(Cluster->Mean):Error:Assert failed:in file ../../../src/classify/cluster.cpp, line 2526
Why does it go wrong? How can I fix it?
Only inttemp and shapetable are generated, but the tutorial says there should be four files: shapetable, inttemp, pffmtable and normproto. I wonder whether this is because the font only has the letters "P" and "Q", but I have no idea how to solve it.
Please read the docs:
https://tesseract-ocr.github.io/tessdoc/#training-for-tesseract-5
Use the right tools:
https://github.com/tesseract-ocr/tesstrain

Issue training Tesseract OCR 4 - empty shape table

I am trying to train Tesseract 4 on specific pictures (to read multimeters with 7-segment displays).
Please note that I am aware of the already trained data from Arthur Augusto at https://github.com/arturaugusto/display_ocr, but I need to train Tesseract on my own data.
To train Tesseract, I followed different tutorials (such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 or https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/),
but I always run into a problem when running the shapeclustering command with my own data.
(With example data such as https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972, everything works fine.)
When I run the shapeclustering command, I get the output shown in the screenshot.
My shape_table is then empty, so the training cannot be effective.
With the example data it works fine and the shape_table is properly filled.
I suspect the problem is in my box file generation. Here is my process for creating the box file:
I use the
tesseract imageFileName.tif imageFileName batch.nochop makebox
command to generate the box file and then edit it with jTessBoxEditor.
So I can't see what is wrong with my .box/.tif pair.
Have a good day and thanks for helping me.
Adrien
Here is my full batch script for training, run after generating and editing the box files.
set name=sev7.exp0
set shortName=sev7
echo Run Tesseract for Training..
tesseract.exe %name%.tif %name% nobatch box.train
echo Compute the Character Set..
unicharset_extractor.exe %name%.box
shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering..
cntraining.exe %name%.tr
echo Rename Files..
rename normproto %shortName%.normproto
rename inttemp %shortName%.inttemp
rename pffmtable %shortName%.pffmtable
rename shapetable %shortName%.shapetable
echo Create Tessdata..
combine_tessdata.exe %shortName%.
echo. & pause
OK, I finally managed to train Tesseract.
The solution is to add a --psm parameter to the command
tesseract.exe %name%.tif %name% nobatch box.train
so that it becomes
tesseract.exe %name%.%typeFile% %name% --psm %psm% nobatch box.train
The possible psm values are:
REM pagesegmode values are:
REM 0 = Orientation and script detection (OSD) only.
REM 1 = Automatic page segmentation with OSD.
REM 2 = Automatic page segmentation, but no OSD, or OCR
REM 3 = Fully automatic page segmentation, but no OSD. (Default)
REM 4 = Assume a single column of text of variable sizes.
REM 5 = Assume a single uniform block of vertically aligned text.
REM 6 = Assume a single uniform block of text.
REM 7 = Treat the image as a single text line.
REM 8 = Treat the image as a single word.
REM 9 = Treat the image as a single word in a circle.
REM 10 = Treat the image as a single character.
REM 11 = Sparse text. Find as much text as possible in no particular order.
REM 12 = Sparse text with OSD.
REM 13 = Raw line. Treat the image as a single text line bypassing hacks that are Tesseract-specific.
Found at https://github.com/tesseract-ocr/tesseract/issues/434
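As an illustration of the fix, here is a minimal Python sketch that builds the box.train command with an explicit --psm value and rejects values outside the 0 to 13 range listed above. The function name box_train_command is hypothetical, and the value 7 in the example is only a placeholder; the post does not say which mode suited the multimeter images.

```python
from typing import List

# Range of page segmentation modes listed in the post above.
PSM_MIN, PSM_MAX = 0, 13


def box_train_command(image: str, base: str, psm: int) -> List[str]:
    """Build the 'tesseract ... --psm N nobatch box.train' argument list."""
    if not PSM_MIN <= psm <= PSM_MAX:
        raise ValueError(f"psm must be between {PSM_MIN} and {PSM_MAX}, got {psm}")
    return ["tesseract", image, base, "--psm", str(psm), "nobatch", "box.train"]


if __name__ == "__main__":
    # psm 7 (single text line) is an assumed example value, not from the post.
    print(" ".join(box_train_command("sev7.exp0.tif", "sev7.exp0", 7)))
```

Building the argument list first makes it easy to try several psm values in a loop and compare the resulting .tr files before running the rest of the training script.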

tesseract 5.0 bazaar + user-words config doesn't work

I tried to force Tesseract to use only my word list when performing OCR.
First, I copied the bazaar file to /usr/share/tesseract-ocr/5/tessdata/configs/. This is my bazaar file:
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
Then I created eng.user-words in /usr/share/tesseract-ocr/5/tessdata. This is my user-words file:
Items
VAT
included
CASH
Then I performed OCR on this image with the command: tesseract -l eng --oem 2 test_small.jpg stdout bazaar
This is my result:
2 Item(s) (VAT includsd) 36,000
casH 40,000
CHANGE 4. 000
As you can see, 'includsd' is not in my user-words file; it should be 'included'. Besides, I get the same result even without the bazaar config in the command. It looks like my bazaar config and eng.user-words file have no effect on the OCR output. How can I use the bazaar and user-words config to get the desired result?
All you needed to do was up-sample the image.
If you up-sample it by a factor of two, the output now reads:
2 Item(s) (VAT included) 36,000
CASH 40,000
CHANGE 4,000
Code:
import cv2
import pytesseract
# Load the image
img = cv2.imread("4nGXo.jpg")
# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Up-sample
gry = cv2.resize(gry, (0, 0), fx=2, fy=2)
# OCR
print(pytesseract.image_to_string(gry))
# Display
cv2.imshow("", gry)
cv2.waitKey(0)
user_words_suffix does not seem to work with --oem 2.
A workaround is to use user_words_file instead, set to the path of your user-words file.
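A minimal sketch of that workaround, assuming the config keeps the two dawg settings from the question and swaps user_words_suffix for user_words_file. The function name user_words_config is hypothetical, the path in the example is the one from the question, and whether the word list then influences the output still depends on the engine mode and the image.

```python
def user_words_config(user_words_path: str) -> str:
    """Render a Tesseract config file that points at an explicit word list."""
    lines = [
        "load_system_dawg F",                    # from the question's bazaar file
        "load_freq_dawg F",                      # from the question's bazaar file
        f"user_words_file {user_words_path}",    # replaces user_words_suffix
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(user_words_config("/usr/share/tesseract-ocr/5/tessdata/eng.user-words"), end="")
```

The rendered text can be saved as a config file in the tessdata/configs directory and named as the last argument of the tesseract command, the same way bazaar is used in the question.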

Tesseract training achieves a worse result than without training

I used the Windows installer of tesseract-ocr 3.02.02 (I didn't find a newer one for 3.04). My image is a JPEG with a resolution of 600dpi (3507x4960) which is a scanned blank "certificate of incapacity for work". The OCR result without training is much more accurate than after training. So what am I doing wrong?
This is how I build my box file:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg %TESSLANG% -l deu batch.nochop makebox
Using jTessBoxEditor I fixed every box by hand. Then I started the training:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg %TESSLANG% -l deu nobatch box.train
unicharset_extractor %TESSLANG%.box
shapeclustering -F font_properties -U unicharset %TESSLANG%.tr
mftraining -F font_properties -U unicharset -O %LANG%.unicharset %TESSLANG%.tr
cntraining %TESSLANG%.tr
MOVE inttemp %LANG%.inttemp
MOVE normproto %LANG%.normproto
MOVE pffmtable %LANG%.pffmtable
MOVE shapetable %LANG%.shapetable
combine_tessdata %LANG%.
COPY %LANG%.traineddata %TESSERACT_HOME%\tessdata /Y
The OCR without training (achieving the best results) is done like this:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg without_training -l deu
Using the traineddata:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg with_training -l %LANG%
Maybe I am wrong but I expect a perfect result (I use the same JPEG for training and OCRing).
Here is the first part of without_training.txt:
Paul Albrechts Verlag, 22952 Lütjensee Bei verspäteter Vorlage droht Krankengeldverlust!
Krankenkasse bzw. Kostenträger
Name, Vorname des Versicherten
geb. am
Kassen—Nr. Versicherten—Nr. Status
Betriebsstätten-Nr. Arzt—Nr. Datum
And the first part of with_training.txt:
Pau/A/brechrs Ver/ag, 22952 Lüfjensee Be! verspäteter vor!age droht Krankenge!dver!ust!
Krankenkasse bzw. Kostenträger
Name, Vorname des Versicherten
geb. am
Kassen-Nr. Versicherten-Nr. status
Betriebsstätten-Nr. Arzt-Nr. Datum
In my case adding the language "deu" did the trick:
tesseract %TESSLANG%.jpg with_training -l %LANG%+deu
instead of
tesseract %TESSLANG%.jpg with_training -l %LANG%
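The -l option accepts several languages joined with "+", which is what the fix above relies on. A small Python sketch of building that argument; the helper names lang_argument and ocr_command are hypothetical.

```python
from typing import List


def lang_argument(*langs: str) -> str:
    """Join language names with '+', the form expected by tesseract -l."""
    cleaned = [lang.strip() for lang in langs if lang.strip()]
    if not cleaned:
        raise ValueError("at least one language is required")
    return "+".join(cleaned)


def ocr_command(image: str, output_base: str, *langs: str) -> List[str]:
    """Build a tesseract OCR command that uses all given languages."""
    return ["tesseract", image, output_base, "-l", lang_argument(*langs)]


if __name__ == "__main__":
    print(" ".join(ocr_command("arbeitsunfaehigkeit.hausarzt.exp0.jpg",
                               "with_training", "arbeitsunfaehigkeit", "deu")))
```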

Tesseract Assert failed trainingsampleset.cpp line 622 with mftraining

When mftraining is executed on my training files, I get the following error message:
PS > mftraining -F font_properties -U unicharset -O lang.unicharset .\eng.ds-digital.exp0.box.tr .\eng.ds-digitalb.exp0.box.tr .\eng.ds-digitali.exp0.box.tr
Warning: No shape table file present: shapetable
Reading .\eng.ds-digital.exp0.box.tr ...
Reading .\eng.ds-digitalb.exp0.box.tr ...
Reading .\eng.ds-digitali.exp0.box.tr ...
Font id = -1/0, class id = 1/12 on sample 0
font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file ..\..\classify\trainingsampleset.cpp, line 622
A dialog from Windows also appears stating "feature training for Tesseract has stopped working". There are several posts around the net addressing this issue, but none of them (that I have tried so far) seems to offer a solution that gets my data set through.
The folder where the mftraining command is executed at contains the following files:
eng.ds-digital.exp0.box
eng.ds-digital.exp0.box.tr
eng.ds-digital.exp0.box.txt
eng.ds-digital.exp0.tif
eng.ds-digitalb.exp0.box
eng.ds-digitalb.exp0.box.tr
eng.ds-digitalb.exp0.box.txt
eng.ds-digitalb.exp0.tif
eng.ds-digitali.exp0.box
eng.ds-digitali.exp0.box.tr
eng.ds-digitali.exp0.box.txt
eng.ds-digitali.exp0.tif
font_properties
unicharset
And the font_properties has the following content (It also ends with a newline as the documentation states):
ds-digital 0 0 0 0 0
ds-digitalb 0 1 0 0 0
ds-digitali 1 0 0 0 0
I've also tried different naming conventions for the font name in font_properties (although the documentation is quite clear that it is the font name of the file and not the file name, some people around the net seem to claim otherwise), and renaming the files so that the .tr files follow the pattern eng.ds-digital*.exp0.tr, to no avail.
Edit: I am running on Tesseract 3.02
I had the same issue and resolved it by checking that the font name in eng.ds-digital.exp0.box.tr is the same as the one given in the font_properties file.
Example:
echo "ds-digital 0 0 0 0 0" > font_properties
Then eng.ds-digital.exp0.box.tr should contain the font name ds-digital.
another easy way to train tesseract link.
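The font name check can be partly automated. The sketch below is a rough approximation, not the check from the answer itself: it assumes the font name is the second dot-separated part of each training file name (lang.fontname.expN...) and compares it with the first column of font_properties. It does not open the .tr files, so the name stored inside them still has to be checked by hand. All function names are hypothetical.

```python
from pathlib import PureWindowsPath
from typing import Iterable, List, Set


def fonts_in_properties(font_properties_text: str) -> Set[str]:
    """Collect the font names (first column) from font_properties content."""
    return {line.split()[0]
            for line in font_properties_text.splitlines()
            if line.strip()}


def font_from_filename(filename: str) -> str:
    """Assumption: names follow lang.fontname.expN..., so take the second part."""
    # PureWindowsPath accepts both '\\' and '/' separators, e.g. '.\\eng.x.exp0.tr'.
    parts = PureWindowsPath(filename).name.split(".")
    if len(parts) < 3:
        raise ValueError(f"unexpected training file name: {filename}")
    return parts[1]


def missing_fonts(tr_files: Iterable[str], font_properties_text: str) -> List[str]:
    """Return font names used in file names but absent from font_properties."""
    known = fonts_in_properties(font_properties_text)
    used = {font_from_filename(name) for name in tr_files}
    return sorted(used - known)


if __name__ == "__main__":
    properties = "ds-digital 0 0 0 0 0\nds-digitalb 0 1 0 0 0\n"
    files = [".\\eng.ds-digital.exp0.box.tr", ".\\eng.ds-digitali.exp0.box.tr"]
    print(missing_fonts(files, properties))
```

An empty result means every file name maps to an entry in font_properties; any name in the result is a candidate cause for the font_id assertion.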