Tesseract Assert failed trainingsampleset.cpp line 622 with mftraining - ocr

When mftraining is executed on my training files, I get the following error message:
PS > mftraining -F font_properties -U unicharset -O lang.unicharset .\eng.ds-digital.exp0.box.tr .\eng.ds-digitalb.exp0.box.tr .\eng.ds-digitali.exp0.box.tr
Warning: No shape table file present: shapetable
Reading .\eng.ds-digital.exp0.box.tr ...
Reading .\eng.ds-digitalb.exp0.box.tr ...
Reading .\eng.ds-digitali.exp0.box.tr ...
Font id = -1/0, class id = 1/12 on sample 0
font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file ..\..\classify\trainingsampleset.cpp, line 622
A dialog from Windows also appears stating "feature training for Tesseract has stopped working". There are several posts around the net addressing this issue, but none of the solutions I have tried so far gets my data set through.
The folder where the mftraining command is executed contains the following files:
eng.ds-digital.exp0.box
eng.ds-digital.exp0.box.tr
eng.ds-digital.exp0.box.txt
eng.ds-digital.exp0.tif
eng.ds-digitalb.exp0.box
eng.ds-digitalb.exp0.box.tr
eng.ds-digitalb.exp0.box.txt
eng.ds-digitalb.exp0.tif
eng.ds-digitali.exp0.box
eng.ds-digitali.exp0.box.tr
eng.ds-digitali.exp0.box.txt
eng.ds-digitali.exp0.tif
font_properties
unicharset
The font_properties file has the following content (it also ends with a newline, as the documentation states):
ds-digital 0 0 0 0 0
ds-digitalb 0 1 0 0 0
ds-digitali 1 0 0 0 0
I've also tried different naming conventions for the font name in font_properties (although the documentation is quite clear that it is the font name and not the file name, some people around the net seem to claim otherwise), and renaming the files so that the .tr files follow the pattern eng.ds-digital*.exp0.tr, to no avail.
Edit: I am running on Tesseract 3.02

I was getting the same issue and resolved it by checking that the font name in eng.ds-digital.exp0.box.tr is the same as the one given in the font_properties file.
Example:
echo "ds-digital 0 0 0 0 0" > font_properties
Then eng.ds-digital.exp0.box.tr should contain the font name ds-digital.
Another easy way to train Tesseract: link.
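The check described in this answer can be sketched in Python. This is only a rough sketch: it assumes the Tesseract 3.x .tr layout in which each sample begins with a header line of the form `<fontname> <character> <left> <top> <right> <bottom> <pagenum>`, and the helper names are hypothetical, not part of Tesseract.

```python
def fonts_in_properties(text):
    """Return the font names listed in font_properties (first column)."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def fonts_in_tr(text):
    """Return the font names found on sample header lines of a .tr file.

    Assumption: a header line has exactly seven whitespace-separated
    fields and the last five (box coordinates, page number) are integers.
    """
    fonts = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            for value in fields[2:]:
                int(value)
        except ValueError:
            continue  # feature rows hold floats, so they are skipped
        fonts.add(fields[0])
    return fonts


def missing_fonts(properties_text, tr_text):
    """Font names used in the .tr data but absent from font_properties."""
    return fonts_in_tr(tr_text) - fonts_in_properties(properties_text)
```

Reading font_properties and one .tr file into strings and passing them to `missing_fonts` should return an empty set when the names agree; any name it returns is a candidate for the failed font_id assertion.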

Related

Failed in generating Tesseract traineddata

I'm using Tesseract v5.0.1.20220118 on Windows 10, training a font that only has the letters "P" and "Q".
When I get to the step
mftraining -F font_properties.txt -U unicharset -O normal.unicharset pq.normal.exp0.tr
The pffmtable file is not generated.
And when I run cntraining pq.normal.exp0.tr
it shows:
Reading pq.normal.exp0.tr ...
Clustering ...
N == sizeof(Cluster->Mean):Error:Assert failed:in file ../../../src/classify/cluster.cpp, line 2526
Why does it go wrong? How can I fix it?
Only inttemp and shapetable are generated, but the tutorial says there should be four files: shapetable, inttemp, pffmtable and normproto. I wonder whether this is because the font only has the letters "P" and "Q", but I have no idea how to solve it.
Please read the docs:
https://tesseract-ocr.github.io/tessdoc/#training-for-tesseract-5
Use the right tools:
https://github.com/tesseract-ocr/tesstrain

Cannot generate synthetic image samples data from Docker Isaac sim Object Detection Training

Creating data folder /workspace/tlt-experiments/data WARNING: flashing images!! ***** Don’t visit the following URL if you are sensitive to flashing lights ******* Go to http://localhost:3000 see the generated images being generated
Generated 0 samples
Generated 0 samples
Generated 0 samples
Generated 0 samples
Generated 0 samples
This issue can occur in Isaac 20 and 21, because the start.sh script expects the version to be 19. To fix it, modify start.sh by commenting out the version check, like this:
#if [[ $IS_19 == 1 ]]; then
NV_FLAG="--gpus=all"
#else
# NV_FLAG="--runtime=nvidia -e CUDA_VISIBLE_DEVICES=all"
#fi
and keep only NV_FLAG="--gpus=all".

How to Create Traineddata file For Tesseract 4.1.0

I want to recognise the characters on number plates.
How do I train tesseract-ocr for number plates on Ubuntu 16.04?
I am not familiar with training, so please help me create a 'traineddata' file for recognising number plates.
I have 1000 images of number plates.
Any help would be appreciated.
So far I have tried the following commands:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
tesseract eng.arial.plate3655.png eng.arial.plate3655 batch.nochop makebox
But it gives an error:
Tesseract Open Source OCR Engine v4.1.0-rc1-56-g7fbd with Leptonica
Error, cannot read input file eng.arial.plate3655.png: No such file or directory
Error during processing.
After that I tried
tesseract plate4.png eng.arial.plate4 batch.nochop makebox
It works, but only for some plates.
Now I am getting an error in step 2.
Screenshots are attached:
Plate 4 image used for training
Terminal output of step 1 and step 2
Files generated after step 1 and step 2
Content of the files generated after step 1 and step 2
Creating .traineddata for Tesseract 4
{*Note: After installing Tesseract, open cmd and do the following.}
Step 1:
Make box files for images that we want to train
Syntax:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
Eg:
tesseract own.arial.exp0.jpg own.arial.exp0 batch.nochop makebox
{*Note: After making the box files, we have to correct any wrongly identified characters in them.}
Step 2:
Create the .tr file (combining the image file and the box file)
Syntax:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] box.train
Eg:
tesseract own.arial.exp0.jpg own.arial.exp0 box.train
Step 3:
Extract the charset from the box files (Output for this command is unicharset file)
Syntax:
unicharset_extractor [langname].[fontname].[expN].box
Eg:
unicharset_extractor own.arial.exp0.box
Step 4:
Create a font_properties file based on our needs.
Syntax:
echo "[fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]" > font_properties
Eg:
echo "arial 0 0 1 0 0" > font_properties
Step 5:
Training the data.
Syntax:
mftraining -F font_properties -U unicharset -O [langname].unicharset [langname].[fontname].[expN].tr
Eg:
mftraining -F font_properties -U unicharset -O own.unicharset own.arial.exp0.tr
Step 6:
Syntax:
cntraining [langname].[fontname].[expN].tr
Eg:
cntraining own.arial.exp0.tr
{*Note: After step 5 and step 6, four files have been created (shapetable, inttemp, pffmtable, normproto).}
Step 7:
Rename four files (shapetable,inttemp,pffmtable,normproto) into ([langname].shapetable,[langname].inttemp,[langname].pffmtable,[langname].normproto)
Syntax:
rename filename1 filename2
Eg:
rename shapetable own.shapetable
rename inttemp own.inttemp
rename pffmtable own.pffmtable
rename normproto own.normproto
Step 8:
Create .traineddata file
Syntax:
combine_tessdata [langname].
Eg:
combine_tessdata own.
{*Note: I used only one image (exp0) to create the traineddata. If you want to train with more than one image, repeat the steps for exp1, exp2, ..., expN.}
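To make the naming pattern in the steps above easier to follow, here is a small Python sketch that only builds the command lines for one training image; it runs nothing. The function names are hypothetical, and the commands and arguments are the ones listed in the steps.

```python
def font_properties_line(font, italic=0, bold=0, monospace=0, serif=0, fraktur=0):
    """Step 4: one line of the font_properties file."""
    return f"{font} {italic} {bold} {monospace} {serif} {fraktur}"


def training_commands(lang, font, exp, ext):
    """Return the command lines of steps 1-3 and 5-8 as argument lists."""
    base = f"{lang}.{font}.exp{exp}"
    image = f"{base}.{ext}"
    commands = [
        ["tesseract", image, base, "batch.nochop", "makebox"],  # Step 1
        ["tesseract", image, base, "box.train"],                 # Step 2
        ["unicharset_extractor", f"{base}.box"],                 # Step 3
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],              # Step 5
        ["cntraining", f"{base}.tr"],                            # Step 6
    ]
    # Step 7: prefix the four generated files with the language name.
    for name in ("shapetable", "inttemp", "pffmtable", "normproto"):
        commands.append(["rename", name, f"{lang}.{name}"])
    commands.append(["combine_tessdata", f"{lang}."])            # Step 8
    return commands
```

For example, `training_commands("own", "arial", 0, "jpg")` yields the same commands as the "Eg:" lines above, and `font_properties_line("arial", monospace=1)` gives `arial 0 0 1 0 0`. The box file still has to be corrected by hand between step 1 and step 2.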

Windows .bat file 0< not sure from where the 0 is coming from

I have a strange problem with my Windows .bat files: a 0 appears before < when the file is executed, and I don't know where it comes from. Below are the contents of the batch file date1.bat:
set mysql="C:\Program Files\MySQL\MySQL Server 8.0\bin\mysql.exe"
set progDir="D:\BigData\14.Nodejs\3.Firebase"
set dataDir=D:\BigData\14.Nodejs\3.Firebase\data
%mysql% -ualpha -pbeta test < "%dataDir%\LatestData - Q -201811 - INSERT DMLs.sql"
The issue I am referring to occurs in the line
%mysql% -ualpha -pbeta test < "%dataDir%\LatestData - Q -201811 - INSERT DMLs.sql"
Below is the output
D:\BigData\14.Nodejs\3.Firebase>date1
D:\BigData\14.Nodejs\3.Firebase>set mysql="C:\Program Files\MySQL\MySQL Server 8.0\bin\mysql.exe"
D:\BigData\14.Nodejs\3.Firebase>set progDir="D:\BigData\14.Nodejs\3.Firebase"
D:\BigData\14.Nodejs\3.Firebase>set dataDir=D:\BigData\14.Nodejs\3.Firebase\data
D:\BigData\14.Nodejs\3.Firebase>"C:\Program Files\MySQL\MySQL Server 8.0\bin\mysql.exe" -ualpha -pbeta test 0<"D:\BigData\14.Nodejs\3.Firebase\data\LatestData - Q -201811 - INSERT DMLs.sql"
In the last line you can see a "0<", and I am not sure where that 0 comes from. Is there a way to avoid it?
I am just trying to run DMLs in multiple files via windows batch.
0 means standard input. 0< myfile means send the contents of myfile to standard input. < myfile is shorthand for 0< myfile. The 0 is doing no harm and you don't need to get rid of it.
What you are seeing is the echo of the commands as the interpreter evaluates them.
Handle 0 is stdin; a < redirection is interpreted as reading from handle 0, i.e. 0<.
Handle 1 is stdout; a > redirection is interpreted as writing from handle 1, i.e. 1>, and >&1 redirects to handle 1.
Handle 2 is stderr; 2> redirects from handle 2, and >&2 redirects to handle 2.
Handles 3 to 9 are auxiliary handles specific to batch files.
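The same idea can be observed outside of batch files. The following Python sketch (the function name is hypothetical) attaches a file to a child process's standard input and has the child read from handle 0 directly, which is what `< file` or `0< file` asks the shell to do.

```python
import os
import subprocess
import sys
import tempfile


def read_via_handle_zero(text):
    """Write text to a temp file, attach it to a child's handle 0,
    and return what the child read from that handle."""
    # The child reads raw bytes from file descriptor 0 (stdin).
    child_code = "import os, sys; sys.stdout.write(os.read(0, 4096).decode())"
    with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        with open(path, "rb") as source:
            result = subprocess.run(
                [sys.executable, "-c", child_code],
                stdin=source,  # comparable to `0< path` in the batch echo
                capture_output=True,
                check=True,
            )
    finally:
        os.remove(path)
    return result.stdout.decode()
```

Calling `read_via_handle_zero("SELECT 1;")` returns the text back, because the child's handle 0 is the file rather than the keyboard.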

Squid StoreId rewrite

I am trying to configure my proxy to de-duplicate some cached files.
Some sites add a query string at the end of the URL, so the same file is cached multiple times. For example:
http://download.oracle.com/otn-pub/java/jdk/7u75-b13/jre-7u75-linux-x64.tar.gz?AuthParam=kjzeghfhrehbfgjernf
http://download.oracle.com/otn-pub/java/jdk/7u75-b13/jre-7u75-linux-x64.tar.gz?AuthParam=jzehrguihegeijhpijf
I would like to create a rewrite rule for StoreId like this:
^http:\/\/download\.oracle\.com\/otn\-pub\/java\/([a-zA-Z0-9\/\.\-\_]+\.(tar\.gz)) http://download.oracle.com/otn-pub/java/$1
but I haven't found documentation about how to do that.
OK, after a long search I have found the answer to my question. I am writing it here in case someone else has the same question.
First of all, I installed Squid 3.4, the first version which supports StoreId rewrite.
Second, after reading the StoreId documentation:
wiki.squid-cache.org/Features/StoreID
wiki.squid-cache.org/Features/StoreID/DB
and a lot of Google searching, I found this Perl program: http://pastebin.ca/2422099. It takes a database file as its first argument; you can find examples at the second link above. In that file I added a line like the one above:
^http:\/\/download\.oracle\.com\/otn\-pub\/java\/([a-zA-Z0-9\/\.\-\_]+\.(tar\.gz)) http://download.oracle.com/otn-pub/java/$1
Third, in my squid.conf, I added these lines:
store_id_program /usr/local/squid/store-id.pl /usr/local/squid/store_id_db
store_id_children 5 startup=1
store_id_program is the path to the Perl file, with the database file as its argument.
store_id_children sets the number of subprocesses allowed for the program: a maximum of 5, with 1 started at the beginning.
In the same squid.conf I replaced this line:
refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
by
refresh_pattern -i cgi-bin 0 0% 0
to allow caching of URLs with a query string.
Last, I made sure that store-id.pl has the 'x' (execute) permission.
Hope this helps :)
PS: One tip: in the db file, the two columns must be separated by a tab (not a space). To be sure, you can use this command (found in the docs):
cat dbfile | sed -r -e 's/\s+/\t/g' |sed '/^\#/d' >cleaned_db_file
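To see what the rule does to the two example URLs, here is a minimal Python sketch. It does not reproduce the Perl helper; it only mimics how one pattern/replacement pair from the db file could map a URL to a store ID. The function name `store_id` and the handling of `$1` are assumptions made for illustration.

```python
import re

# The two tab-separated columns of the db line quoted above.
PATTERN = r"^http:\/\/download\.oracle\.com\/otn\-pub\/java\/([a-zA-Z0-9\/\.\-\_]+\.(tar\.gz))"
REPLACEMENT = "http://download.oracle.com/otn-pub/java/$1"


def store_id(url, pattern=PATTERN, replacement=REPLACEMENT):
    """Return the store ID for url, or None when the rule does not match."""
    match = re.search(pattern, url)
    if match is None:
        return None
    # Replace Perl-style $N references with the captured groups.
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), replacement)
```

Both example URLs, which differ only in their AuthParam value, map to the same ID, http://download.oracle.com/otn-pub/java/jdk/7u75-b13/jre-7u75-linux-x64.tar.gz, which is what lets the cache treat them as one object.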